A REFERENCE-SET APPROACH TO INFORMATION EXTRACTION FROM
UNSTRUCTURED, UNGRAMMATICAL DATA SOURCES
by
Matthew Michelson
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2009
Copyright 2009 Matthew Michelson
Object Description
| Title | A reference-set approach to information extraction from unstructured, ungrammatical data sources |
| Author | Michelson, Matthew |
| Author email | matt.michelson@gmail.com; michelso@isi.edu |
| Degree | Doctor of Philosophy |
| Document type | Dissertation |
| Degree program | Computer Science |
| School | Viterbi School of Engineering |
| Date defended/completed | 2008-11-03 |
| Date submitted | 2009 |
| Restricted until | Unrestricted |
| Date published | 2009-01-26 |
| Advisor (committee chair) | Knoblock, Craig A. |
| Advisor (committee member) | Knight, Kevin; Shahabi, Cyrus; O'Leary, Daniel |
| Abstract | This thesis investigates information extraction from unstructured, ungrammatical text on the Web such as classified ads, auction listings, and forum postings. Since the data is unstructured and ungrammatical, this information extraction precludes the use of rule-based methods that rely on consistent structures within the text or natural language processing techniques that rely on grammar. Instead, I describe extraction using a "reference set," which I define as a collection of known entities and their attributes. A reference set can be constructed from structured sources, such as databases, or scraped from semi-structured sources such as collections of Web pages. In some cases, as I show in this thesis, a reference set can even be constructed automatically from the unstructured, ungrammatical text itself. This thesis presents methods to exploit reference sets for extraction using both automatic techniques and machine learning techniques. The automatic technique provides a scalable and accurate approach to extraction from unstructured, ungrammatical text. The machine learning approach provides even higher-accuracy extractions and deals with ambiguous extractions, although at the cost of requiring human effort to label training data. The results demonstrate that reference-set based extraction outperforms the current state-of-the-art systems that rely on structural or grammatical clues, which are not appropriate for unstructured, ungrammatical text. Even the fully automatic case, which constructs its own reference set for automatic extraction, is competitive with the current state-of-the-art techniques that require labeled data. Reference-set based extraction from unstructured, ungrammatical text allows a whole category of sources to be queried, and thus included in data integration systems that were previously limited to structured and semi-structured sources. |
| Keyword | information extraction; unstructured data sources |
| Language | English |
| Part of collection | University of Southern California dissertations and theses |
| Publisher (of the original version) | University of Southern California |
| Place of publication (of the original version) | Los Angeles, California |
| Publisher (of the digital version) | University of Southern California. Libraries |
| Provenance | Electronically uploaded by the author |
| Type | texts |
| Legacy record ID | usctheses-m1957 |
| Rights | Michelson, Matthew |
| Repository name | Libraries, University of Southern California |
| Repository address | Los Angeles, California |
| Repository email | http://www.usc.edu/isd/libraries/services/ask_a_librarian/email/ |
| Filename | etd-Michelson-2565 |
| Archival file | uscthesesreloadpub_Volume29/etd-Michelson-2565.pdf |

