FACTORIZING INFORMATION EXTRACTION FROM TEXT
CORPORA
by
Donghui Feng
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2007
Copyright 2007 Donghui Feng
Object Description
| Title | Factorizing information extraction from text corpora |
| Author | Feng, Donghui |
| Author email | dfeng@usc.edu |
| Degree | Doctor of Philosophy |
| Document type | Dissertation |
| Degree program | Computer Science |
| School | Viterbi School of Engineering |
| Date defended/completed | 2007-10-08 |
| Date submitted | 2007 |
| Restricted until | Unrestricted |
| Date published | 2007-11-08 |
| Advisor (committee chair) | Hovy, Eduard |
| Advisor (committee member) | Shahabi, Cyrus; Burns, Gully; O'Leary, Daniel |
| Abstract | Automatic information extraction (IE) from unstructured text is critical for solving the information overload problem. As text formats become ever more varied and IE requirements become more demanding, target information becomes harder to define and extract. IE procedures therefore become more complex and generally require an iterative cycle involving multiple factors, so that the simple traditional IE framework is no longer adequate. However, a new paradigm of IE that addresses these issues has not yet been formalized. In this thesis, we develop a framework to formalize a more general and powerful procedure of IE. Our formalization provides a method to rapidly define and structure new IE tasks. For this procedure, we analyze the role and impact of each factor. This thesis makes two kinds of contributions: first, a new high-level and expressive framework that shows the relationships between activities such as domain knowledge modeling and representation, annotation, system building, evaluation, and feedback adjustment; and second, several new approaches to performing IE tasks at various levels. We start with a simple IE task, extracting biographical facts from the web. Here a traditional, straightforward, one-pass problem-solving procedure, consisting of definition, learning, and testing, is sufficient. Our system automatically learns surface text patterns from dynamic flat web corpora for answering biographical queries. In addition, sentence fragments can serve as knowledge indicators to guide the handling of queries. We then demonstrate the new IE framework using two more complex tasks. First, we investigate the problem of extracting data records (individual experiments) from the biomedical research literature. In conformance with the elaborated IE framework, we have developed approaches to labeling individual sentences, grouping fields into individual meaningful objects, and scaling the results up to large corpora. We design a novel solution to segment semantic objects based on semantic analysis, for cases where traditional word-similarity-based text segmentation approaches do not work. Second, IE procedures need to be adapted and extended for IE problems emerging from new media formats. We address the task of extracting the most informative message(s) from discussion threads. Since the relationship between thread messages characterizes how knowledge is spread, the IE framework has to accommodate the analysis of source structure. We describe a novel way to classify message and thread topics that requires zero annotation for a supervised approach, using ontological knowledge induced from a canonical text. We also invent a novel HITS-style algorithm with link generation functions to extract the conversation focus of discussion threads. |
| Keyword | information extraction; information extraction framework; unstructured text; data records extraction; threaded discussion analysis |
| Language | English |
| Part of collection | University of Southern California dissertations and theses |
| Publisher (of the original version) | University of Southern California |
| Place of publication (of the original version) | Los Angeles, California |
| Publisher (of the digital version) | University of Southern California. Libraries |
| Type | texts |
| Legacy record ID | usctheses-m914 |
| Rights | Feng, Donghui |
| Repository name | Libraries, University of Southern California |
| Repository address | Los Angeles, California |
| Repository email | http://www.usc.edu/isd/libraries/services/ask_a_librarian/email/ |
| Filename | etd-Feng-20071108 |
| Archival file | uscthesesreloadpub_Volume48/etd-Feng-20071108.pdf |