IMPROVED WORD ALIGNMENTS FOR STATISTICAL MACHINE
TRANSLATION
by
Alexander Fraser
Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2007
Copyright 2007 Alexander Fraser
Object Description
| Title | Improved word alignments for statistical machine translation |
| Author | Fraser, Alexander M. |
| Author email | alexfraser@gmail.com |
| Degree | Doctor of Philosophy |
| Document type | Dissertation |
| Degree program | Computer Science |
| School | Viterbi School of Engineering |
| Date defended/completed | 2007-07-20 |
| Date submitted | 2007 |
| Restricted until | Unrestricted |
| Date published | 2007-10-05 |
| Advisor (committee chair) | Marcu, Daniel |
| Advisor (committee member) | Hovy, Eduard; James, Gareth; Knight, Kevin; Rosenbloom, Paul |
| Abstract | All state of the art statistical machine translation systems and many example-based machine translation systems depend on an annotation of word-level translational correspondence between sets of parallel sentences. Such an annotation of two parallel sentences is called a "word alignment". The largest number of manually annotated word alignments currently available to the research community for any pair of languages consists of alignments for only thousands of parallel sentences, even though there are several orders of magnitude more parallel sentences available. For instance, for the task of translating Chinese news articles to English, there are currently on the order of 10 million parallel sentences. This is too many for manual alignment, so they must be automatically word aligned. Unsupervised word alignment systems generate poor quality alignments, often using statistical word alignment models developed over 10 years ago, but most recent research into improving word alignments has not led to improved translation. There are several reasons for this: 1. There is no good metric which can be used to automatically measure word alignment quality for the translation task. 2. Statistical word alignment models are based on assumptions about the structure of the problem which are incorrect. 3. It is difficult to add new sources of linguistic knowledge because many current systems must be completely reengineered for each new knowledge source. 4. Statistical models of word alignment are most often learned in an unsupervised training process which is unable to take advantage of annotated data. This thesis remedies these problems by making contributions in the following three areas: 1. We have found a new method for automatically measuring alignment quality using an unbalanced F-Measure metric (Fraser & Marcu, 2007b). We have validated that this metric adequately measures alignment quality for the translation task. We have shown that the metric can be used to derive a loss function for discriminative training approaches, and it is useful for measuring progress during the development of new word alignment procedures. 2. We have designed a new statistical model for word alignment called LEAF (Fraser & Marcu, 2007a), which directly models the word alignment structure as it is used for machine translation, in contrast with previous models which make unreasonable structural assumptions. 3. We have developed a semi-supervised training algorithm, the EMD algorithm (Fraser & Marcu, 2006), which automatically takes advantage of whatever quantity of manually annotated data can be obtained. The use of the EMD algorithm allows for the introduction of new knowledge sources with minimal effort. We have shown that these contributions improve state of the art statistical machine translation systems in experiments on challenging large data sets. |
| Keyword | word alignment; machine translation; semi-supervised learning; statistical machine translation; discriminative models; generative models |
| Language | English |
| Part of collection | University of Southern California dissertations and theses |
| Publisher (of the original version) | University of Southern California |
| Place of publication (of the original version) | Los Angeles, California |
| Publisher (of the digital version) | University of Southern California. Libraries |
| Type | texts |
| Legacy record ID | usctheses-m839 |
| Rights | Fraser, Alexander M. |
| Repository name | Libraries, University of Southern California |
| Repository address | Los Angeles, California |
| Repository email | http://www.usc.edu/isd/libraries/services/ask_a_librarian/email/ |
| Filename | etd-Fraser-20071005 |
| Archival file | uscthesesreloadpub_Volume51/etd-Fraser-20071005.pdf |