EXPLOITING COMPARABLE CORPORA by Dragos Stefan Munteanu A Dissertation Proposal Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) December 2006 Copyright 2006 Dragos Stefan Munteanu
Object Description
Title | Exploiting comparable corpora |
Author | Munteanu, Dragos Stefan |
Author email | dragos@isi.edu |
Degree | Doctor of Philosophy |
Document type | Dissertation |
Degree program | Computer Science |
School | Viterbi School of Engineering |
Date defended/completed | 2006-10-16 |
Date submitted | 2006 |
Restricted until | Unrestricted |
Date published | 2006-12-05 |
Advisor (committee chair) | Marcu, Daniel |
Advisor (committee member) | Hovy, Eduard; Rosenbloom, Paul S.; Knight, Kevin; Narayanan, Shrikanth S. |
Abstract | One of the major bottlenecks in the development of Statistical Machine Translation systems for most language pairs is the lack of bilingual parallel training data. Currently available parallel corpora span relatively few language pairs and very few domains; building new ones of sufficiently large size and high quality is time-consuming and expensive. In this thesis, I propose methods that enable automatic creation of parallel corpora by exploiting a rich, diverse, and readily available resource: comparable corpora. Comparable corpora are bilingual texts that, while not parallel in the strict sense, are somewhat related and convey overlapping information. Such texts exist in large quantities on the Web; good examples are the multilingual news feeds produced by news agencies such as Agence France Presse, CNN, and BBC. I present novel methods for extracting parallel data of good quality from such comparable collections. I show how to detect parallelism at various granularity levels, and thus find parallel documents (if there are any in the collection), parallel sentences, and parallel sub-sentential fragments. In order to demonstrate the validity of this approach, I use my method to extract data from large-scale comparable corpora for various language pairs, and show that the extracted data helps improve the end-to-end performance of a state-of-the-art machine translation system. |
Keyword | machine translation; parallel corpora; comparable corpora |
Language | English |
Part of collection | University of Southern California dissertations and theses |
Publisher (of the original version) | University of Southern California |
Place of publication (of the original version) | Los Angeles, California |
Publisher (of the digital version) | University of Southern California. Libraries |
Type | texts |
Legacy record ID | usctheses-m217 |
Contributing entity | University of Southern California |
Rights | Munteanu, Dragos Stefan |
Repository name | Libraries, University of Southern California |
Repository address | Los Angeles, California |
Repository email | cisadmin@lib.usc.edu |
Filename | etd-Munteanu-20061205 |
Archival file | uscthesesreloadpub_Volume14/etd-Munteanu-20061205.pdf |