DECIPHERING NATURAL LANGUAGE
by
Sujith Ravi
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2011
Copyright 2011 Sujith Ravi
Dedication
To my parents, R. R. Pillai and Anitha Kumari.
Acknowledgments
I would like to begin by acknowledging my advisor, Kevin Knight, for his valuable guid-
ance in bringing this work to fruition. Kevin has been a great source of inspiration for
me and I believe he has bestowed upon me the best advice that any Ph.D. student can
get from his mentor. I am indebted to him for the endless support, help and opportuni-
ties that he has provided me throughout my research career. I am also grateful to other
members of my committee including Daniel Marcu, David Chiang, Shri Narayanan and
Shang-Hua Teng for providing valuable feedback. David and Daniel also deserve special
mention for giving me advice on other aspects of my research career.
My research has benefited greatly from discussions and collaborations with many
people including but not limited to Jason Baldridge, Bo Pang, Marius Pasca, Deepak
Ravichandran, Evgeniy Gabrilovich, Andrei Broder, Vanja Josifovski, Sandeep Pandey,
Jihie Kim, Stacy Marsella, Jonathan Graehl, David Pynadath, Regina Barzilay, Hal
Daume III, and Adam Pauls. I would also like to express my gratitude to USC faculty
members including Yolanda Gil, Ed Hovy, Jerry Hobbs, Milind Tambe, Aram Galstyan,
Rajiv Maheswaran, and Erin Shaw.
Thank you, Zornitsa Kozareva, for all the research discussions and feedback, and most
of all for being a great friend and supporting me. I have enjoyed interacting with my ISI
colleagues Ashish Vaswani, Steve DeNeefe, Jonathan May, Jason Riesa, Ulf Hermjakob,
Rutu Mulkar, Victoria Fossum, Dirk Hovy, Stephen Tratz, Yao-Yi Chiang, Rahul Bhagat,
and Oana Nicolov. A special thanks goes to Kary Lau, Alma Nava, Peter Zamar and
Erika Barragan-Nunez for helping me with administrative issues at ISI.
I want to thank all my friends for their support and the great times we spent together.
Lastly, I wish to thank my parents for their eternal love, belief in my abilities and support
offered at every stage of my life. They have always encouraged me to pursue my dreams
and I would not be here if it were not for them. I want to thank my brother, Sarath,
for his love and friendship which I greatly value, and the newest member of my family,
Chithra, for her affection.
Table of Contents

Dedication ii
Acknowledgments iii
List Of Tables viii
List Of Figures ix
Abstract xiv

Chapter 1: Introduction 1
1.1 Motivation 1
1.2 Decipherment in the Past 3
1.2.1 Cryptanalysis and Decipherment 4
1.2.2 Decipherment of Ancient Languages and Scripts 5
1.3 This Thesis 6
1.4 Thesis Contributions 11
1.5 Thesis Overview 12

Chapter 2: Deciphering Simple Letter and Syllable Substitution Ciphers 14
2.1 Introduction 15
2.2 Statistical Language Models for Decipherment 18
2.3 Previous Work 20
2.4 Probabilistic Decipherment 21
2.4.1 Model 22
2.4.2 Inference: Modified EM Algorithm 22
2.4.3 Experiments and Results 24
2.5 Integer Linear Programming (ILP) Decipherment 29
2.5.1 Decipherment Objective 29
2.5.2 ILP Formulation and Optimization 30
2.5.3 Experiments and Results 34
2.6 An Empirical Study of Shannon's Decipherment Theory 36
2.6.1 Shannon Equivocation 36
2.6.2 Unicity Distance 39
2.7 Conclusion 42

Chapter 3: Deciphering Complex Letter Substitution Ciphers 44
3.1 Introduction 45
3.2 Homophonic Substitution Ciphers 45
3.3 Previous Work 49
3.4 Bayesian Decipherment 50
3.4.1 Model 51
3.4.2 Inference 53
3.5 Experiments and Results 56
3.6 Conclusion 60

Chapter 4: Small Decipherment Models for Natural Language Problems 61
4.1 Introduction 62
4.2 Unsupervised Part-of-Speech Tagging 65
4.2.1 Previous Work 67
4.2.2 Decipherment using IP+EM Approach 68
4.2.3 Experiments and Results 72
4.3 Unsupervised Supertagging 74
4.3.1 Previous Work 78
4.3.2 Grammar-Informed Initialization for Supertagging 79
4.3.3 Minimized Models for Supertagging 80
4.3.4 Combining Minimization with Grammar-Informed Initialization 84
4.3.5 Experiments and Results 85
4.4 Unsupervised Word / Sub-word Alignment 89
4.4.1 Previous Work 92
4.4.2 Minimized Models for Word Alignment 92
4.4.3 Minimized Models for Sub-Word Alignment 96
4.5 Occam's Razor and MDL 101
4.6 Fast, Approximate Algorithms for Model Minimization 102
4.6.1 Model Minimization Formulated as a Path Problem 103
4.6.2 Greedy Model Minimization 105
4.6.3 Experiments and Results 109
4.7 Conclusion 114

Chapter 5: Phonetic Decipherment (Transliteration) 116
5.1 Introduction 116
5.2 Previous Work on Machine Transliteration 118
5.3 Probabilistic Machine Transliteration 120
5.4 Transliterating without Parallel Data using Phonetic Decipherment 122
5.5 Experiments and Results 127
5.6 Comparable versus Non-Parallel Corpora 129
5.7 Conclusion 130

Chapter 6: Deciphering Foreign Language: Statistical Language Translation without Parallel Data 131
6.1 Introduction 133
6.2 Previous Work 135
6.3 Word Substitution Decipherment 136
6.3.1 Problem Formulation 136
6.3.2 Iterative EM Decipherment 140
6.3.3 Bayesian Decipherment 141
6.3.4 Experiments and Results 143
6.4 Machine Translation as a Decipherment Task 145
6.4.1 Problem Formulation 147
6.4.2 EM Decipherment 148
6.4.3 Bayesian Method 149
6.4.4 MT Experiments and Results 152
6.5 Conclusion 156

Chapter 7: Conclusions and Future Work 157
7.1 Contributions 159
7.2 Future Work 161

Bibliography 164
List Of Tables

2.1 Decipherment error rates on a 98-letter English cipher using the original EM objective function (Knight et al., 2006). All the LMs are trained on 1.4 million letters of English data, and use 10 random restarts per point. 23

2.2 Decipherment error rates on a 98-letter English cipher using (a) Original EM objective function (Knight et al., 2006) (b) New objective function (square-rooting LM). All the LMs are trained on 1.4 million letters of English data, and use 10 random restarts per point. 27

2.3 Decipherment error rates on two different Uesugi ciphers using (a) Original EM objective function (b) New objective function (square-rooting LM) 28

2.4 Decipherment error rates on various ciphers with a 2-gram English language model using (a) EM method (Knight et al., 2006), and (b) IP method (Ravi & Knight, 2008). 34

4.1 Statistics for the training data used to extract lexicons for CCGbank and CCG-TUT. Distinct: # of distinct lexical categories; Max: # of categories for the most ambiguous word; Type ambig: per word type category ambiguity; Tok ambig: per word token category ambiguity. 78

4.2 Supertagging accuracy for CCGbank sections 22-24. Accuracies are reported for four settings: (1) ambiguous word tokens in the test corpus, (2) ambiguous word tokens, ignoring punctuation, (3) all word tokens, and (4) all word tokens except punctuation. 87

4.3 Comparison of supertagging results for CCG-TUT. Accuracies are for ambiguous word tokens in the test corpus, ignoring punctuation. 88
List Of Figures

1.1 Decipherment process for converting an observed ciphertext into a plaintext solution. An example is shown for letter substitution decipherment. 7

1.2 Decipherment approach for various cryptanalysis and natural language processing applications. 10

2.1 Original Uesugi cipher key in Japanese 16

2.2 Transliterated version of the checkerboard-style key used for encrypting the Uesugi cipher 17

2.3 Ciphertext sequence and the original English plaintext message for a 98-letter Simple Substitution Cipher. 24

2.4 Relationship between LM memory size and LM entropy for English letter substitution decipherment. Plotted points represent language models trained on different amounts of data and different n-gram orders. 25

2.5 LM entropy vs. decipherment error rate (using various n-gram order LMs on a 98-letter English cipher) 25

2.6 Effect of random restarts on decipherment on a 98-letter English cipher using a 2-gram language model. Each point measures the result of a single EM run, in terms of EM's objective function (x-axis) and decipherment error (y-axis). 26

2.7 LM entropy vs. decipherment error rate (using various n-gram order LMs on a 577-syllable Uesugi cipher) 28

2.8 A decipherment network. The beginning of the ciphertext is shown at the top of the figure (underscores represent spaces). Any left-to-right path through the network constitutes a potential decipherment. The bold path corresponds to the decipherment "decade". The dotted path corresponds to the decipherment "ababab". Given a cipher length of n, the network has 27 × 27 × (n - 1) links and 27^n paths. Each link corresponds to a named variable in our integer program. Three links are shown with their names in the figure. 31

2.9 Summary of how to build an integer program for any given ciphertext c_1...c_n. Solving the integer program will yield the decipherment of highest probability. 33

2.10 Average decipherment error using integer programming vs. cipher length, for 1-gram, 2-gram and 3-gram models of English. Error bars indicate 95% confidence intervals. 35

2.11 Equivocation for simple substitution on English, for human-level language model (Shannon, 1949). 38

2.12 Average key equivocation observed (bits) vs. cipher length (letters), for 1-gram, 2-gram and 3-gram models of English. 40

2.13 Average message equivocation observed (bits) vs. cipher length (letters), for 1-gram, 2-gram and 3-gram models of English. 40

3.1 Samples from the ciphertext sequence, corresponding English plaintext message and output from Bayesian decipherment (using word+3-gram LM) for three different ciphers: (a) Simple Substitution Cipher (top), (b) Homophonic Substitution Cipher with spaces (middle), and (c) Zodiac-408 Cipher (bottom). 57

3.2 Comparison of decipherment accuracies for EM versus Bayesian method when using different language models of English on the three substitution ciphers: (a) 414-letter Simple Substitution Cipher, (b) 414-letter Homophonic Substitution Cipher (with spaces), and (c) the famous Zodiac-408 Cipher. 58
4.1 Comparison of decipherment error versus model size (i.e., number of plaintext-ciphertext mappings left in the model after training) for letter substitution decipherment when using (1) EM method, and (2) IP method, with a 2-gram letter-based language model. For EM decipherment, we compute the model size counting only those mappings which occur with probability ≥ 0.0001. 63
4.2 Previous results on unsupervised POS tagging using a dictionary (Merialdo, 1994) on the full 45-tag set. 67

4.3 Integer Programming formulation for finding the smallest grammar that explains a given word sequence. Here, we show a sample word sequence and the corresponding IP network generated for that sequence. 69

4.4 Examples of tagging obtained from different systems for prepositions in and on. 71

4.5 Percentage of word tokens tagged correctly by different models. The observed sizes and model sizes of grammar (G) and dictionary (D) produced by these models are shown in the last two columns. 73

4.6 Two-stage IP method for selecting minimized models for supertagging. 83

4.7 Sample sentence pairs from an English-Spanish bilingual corpus: original sentence pairs (left), and after they have been word-aligned (right). 91

4.8 Gold alignment samples. The induced bilingual dictionary has 28 distinct entries, including garcia/garcia, are/son, are/estan, not/no, the/los, etc. 93

4.9 IP alignment samples. The induced bilingual dictionary has 28 distinct entries. 93

4.10 Comparison of different word alignment systems in terms of dictionary size and alignment accuracy. 95

4.11 Relationship between IP objective (x-axis = size of induced bilingual dictionary) and alignment accuracy (y-axis = f-score). Small alignment dictionaries tend to produce better results. 96

4.12 Two Turkish-English sentence pairs. 96

4.13 A Turkish-English corpus produced by an English grammar pipelined with an English-to-Turkish tree-to-string transducer. 97

4.14 Corpus statistics showing percentage of Turkish word types and tokens occurring with different morpheme counts. 98

4.15 Sample gold and (initial) IP sub-word alignments on our Turkish-English corpus. Dashes indicate where the IP search has decided to break Turkish words in the process of aligning. For example, the word magazaya has been broken into magaza- and -ya. 99

4.16 Comparison of different systems on the Turkish-English sub-word alignment task in terms of dictionary size, count of Turkish sub-word tokens and f-score accuracy. 101

4.17 Graph instantiation for the MinTagPath problem. 104

4.18 Graph constructed with tag bigrams chosen in Phase 1 of the MIN-GREEDY method. 108

4.19 Results for unsupervised English POS tagging with a dictionary for different data sizes when using a set of 45 tags. (IP method does not scale to large data). 110
4.20 Comparison of the MIN-GREEDY versus IP approach in terms of efficiency (running time in seconds) for different data sizes. All the experiments were run on a single machine with a 64-bit, 2.4 GHz AMD Opteron 850 processor. 111
4.21 Comparison of observed grammar size (# of tag bigram types) in the final English tagging output from EM, IP and MIN-GREEDY. 112

4.22 Speedup versus Optimality ratio computed for the model minimization step (when using MIN-GREEDY over IP) on different English datasets. 113

4.23 Results for unsupervised Italian POS tagging with a dictionary using a set of 90 tags. 113

5.1 Model used for back-transliteration of Japanese katakana names and terms into English. The model employs a four-stage cascade of weighted finite-state transducers (Knight & Graehl, 1998). 120

5.2 Samples from the phonemic substitution table learnt from 3343 parallel English/Japanese phoneme string pairs. English phonemes are in uppercase, Japanese in lowercase. Mappings with P(j|e) > 0.01 are shown. 121

5.3 Some Japanese phoneme sequences generated from the monolingual katakana corpus using WFST D. 123

5.4 Results on name transliteration obtained when using the phonemic substitution model trained under different scenarios: (1) parallel training data, (2a-e) using only monolingual resources. 125

5.5 Results for end-to-end name transliteration. This figure shows the correct answer, the answer obtained by training mappings on parallel data (Knight & Graehl, 1998), and various answers obtained by deciphering non-parallel data. Method 1 uses a 2-gram P(e), Method 2 uses a 3-gram P(e), and Method 3 uses a word-based P(e). 128

6.1 MT with/without parallel data 135

6.2 Comparison of word substitution decipherment results using (1) Iterative EM, and (2) Bayesian method. For the Transtac corpus, decipherment performance is also shown for different training data sizes (9k versus 100k cipher tokens). 144

6.3 Comparison of the original (O) English plaintext with output from Bayesian word substitution decipherment (D) for a few sample cipher (C) sentences from the Transtac corpus. 145

6.4 Comparison of Spanish/English MT performance on the Time and OPUS test corpora achieved by various MT systems trained under (1) parallel settings: (a) MOSES, (b) IBM 3 without distortion, and (2) decipherment settings: (a) EM, (b) Bayesian. The scores reported here are normalized edit distance values with BLEU scores shown in parentheses. 154

6.5 Comparison of training data size versus MT accuracy in terms of BLEU score under different training conditions: (1) Parallel training: (a) MOSES, (b) IBM Model 3 without distortion, and (2) Decipherment without parallel data using EM method (from Section 6.4.2). 156
Abstract
Most state-of-the-art techniques used in natural language processing (NLP) are supervised
and require labeled training data. For example, statistical language translation requires
huge amounts of bilingual data for training translation systems. But such data does not
exist for all language pairs and domains. Using human annotation to create new bilingual
resources is not a scalable solution. This raises a key research challenge: How can we
circumvent the problem of limited labeled resources for NLP applications? Interestingly,
cryptanalysts and archaeologists have tackled similar challenges in solving decipherment
problems.
This thesis work aims to bring together techniques from classical cryptography, NLP
and machine learning. We introduce a novel approach called natural language decipher-
ment that can solve natural language problems without labeled (parallel) data. A wide
variety of NLP problems can be formulated as decipherment tasks; for example, in statis-
tical language translation one can view the foreign-language text as a cipher for English.
Instead of relying on parallel training data, decipherment uses knowledge of the tar-
get language (e.g., English) and large quantities of readily available monolingual source
(cipher) data to induce bilingual connections between the source and target languages.
Using decipherment techniques, we make headway in attacking a hierarchy of problems
ranging from letter substitution decipherment to sequence labeling problems (such as
part-of-speech tagging) to language translation. Along the way, we make several key
contributions: novel unsupervised algorithms that search for minimized models during
decipherment and achieve state-of-the-art results on a number of important natural lan-
guage tasks. Unlike conventional approaches, these decipherment methods can be easily
extended to multiple domains and languages (especially resource-poor languages), thereby
helping to spread the impact and benets of NLP research.
Chapter 1
Introduction
1.1 Motivation
One of the major challenges involved in artificial intelligence (AI) research is to deal with
uncertainty. In natural language, this relates to the problem of ambiguity which can
manifest in many forms.
For example, consider the following sentence:
The dog saw the cat on the mat.
Syntactically, the word saw can take Noun or Verb form, but given the specific context
the correct syntactic category here is Verb. Humans are very good at resolving ambiguity
using context and their knowledge of the world. On the other hand, computer systems
do not have such knowledge and face many problems when dealing with various natural
language tasks.
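To make this kind of ambiguity concrete, the short sketch below (purely illustrative; the words, tags, and dictionary entries are invented for this example and are not data from the thesis) shows how a word/tag dictionary records several possible categories for a word such as "saw", leaving the choice among them to context:

# Illustrative word/tag dictionary (hypothetical entries, not from the thesis).
tag_dictionary = {
    "the": {"DET"},
    "dog": {"NOUN"},
    "cat": {"NOUN"},
    "mat": {"NOUN"},
    "on":  {"PREP"},
    "saw": {"NOUN", "VERB"},  # ambiguous: a cutting tool, or the past tense of "see"
}

sentence = "the dog saw the cat on the mat".split()

# Without context, each word is constrained only to its dictionary tags;
# resolving "saw" to VERB in this sentence requires contextual knowledge.
for word in sentence:
    print(word, sorted(tag_dictionary[word]))

Unsupervised tagging with such a dictionary, where the system must pick one tag per token without any labeled examples, is revisited in Chapter 4.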
Recently, statistical techniques have become popular and have been widely applied for
solving many problems in natural language processing. These methods typically require
large annotated (labeled) corpora for training supervised models to perform different
tasks. In this scenario, labeled data is provided as a form of supervision which helps
the system automatically learn how to resolve ambiguities. For example, in statistical
language translation (where the task is to translate between two languages automatically)
huge amounts of bilingual (parallel) data containing source/target language sentence pairs
is used for training translation systems.
Unfortunately, the dependency on labeled data for many of these NLP tasks limits their application to specific domains or language pairs for which a lot of training data is readily available. But if we want to perform NLP tasks in new domains or translate between new language pairs automatically, we would first have to collect a lot of labeled (parallel) data to train our models, and this is a costly as well as time-intensive operation. For such tasks, the development of novel unsupervised approaches that do not require labeled data for training can enable their application to new domains and potentially broaden the impact and benefits of NLP research to wider areas. This leads to several
challenging research questions:
How can we overcome the limitation of labeled resources?
Can we rely on unsupervised learning algorithms to tackle fundamental
problems in NLP and related fields without the use of labeled data? For
example, can we train a statistical language translation system without parallel
data?
These are the research challenges that we address in this thesis work. We present a
novel framework for doing unsupervised learning using a decipherment approach. Unlike
previous approaches, our methods do not require any labeled data, and we demonstrate their
applicability on a wide range of tasks in diverse research areas such as cryptanalysis and
natural language processing. We also perform extensive evaluations on multiple tasks and
show that this approach yields state-of-the-art results on several fundamental problems
in NLP.
In the next section, we discuss the origins of decipherment and some of its previous
applications. Following this, in Section 1.3 we talk about the main ideas underlying this
thesis work, and briefly describe a new unsupervised learning methodology along with
several decipherment applications. Section 1.4 lists the major contributions of this thesis
and finally, Section 1.5 outlines how the remainder of this thesis is organized.
1.2 Decipherment in the Past
In the previous section, we discussed the limitations of using labeled data, and to address
this problem, we posed an interesting research question: Is it possible to train NLP
systems without any labeled data at all?
As daunting as this challenge may seem, we note that a similar problem has been
tackled by cryptographers and archaeologists in a different context: for decipherment purposes. For such problems, parallel data relating the ciphertext and plaintext is rarely available, yet cryptographers and archaeologists have
attempted to solve such tasks using various decipherment techniques along with other
non-parallel sources of information (such as linguistic resources, etc.).
1.2.1 Cryptanalysis and Decipherment
For thousands of years, people have used ciphers as a method of communicating messages
that can be read only by intended recipients, and no one else. Throughout history, kings,
queens and generals have relied on such secret codes to send and receive messages, without
revealing vital information to their enemies. The process of converting a message in
original form (referred to as plaintext) into a cipher message (referred to as ciphertext)
is called encipherment. The transformation happens via a key, which is a sequence of
operations (substitution, transposition, etc.) that, when applied to the original message,
converts it to ciphertext form. The reverse process of trying to recover the original
message from a given ciphertext is called decipherment. Decipherment can proceed with
or without the use of a key. An intended recipient of a ciphertext message usually has a
copy of the original key used for encipherment and can apply it to the ciphertext message
to retrieve the original plaintext. However, if the message falls into the hands of a person
who is not the intended recipient and does not possess the key, he can still attempt to
break the code using other means: mathematical operations, linguistic knowledge,
or information about the sender or message topic.
Ciphers were used in ancient Rome, and one of the earliest ciphers was developed
by Julius Caesar and named after him. The Caesar substitution cipher is a code in which every letter in the plaintext alphabet is replaced by the letter three places down in position. In this scheme, a=D, b=E, c=F, etc., and so the plaintext message "hello" will be enciphered as "KHOOR". Following simple substitution ciphers such as these,
more complex ciphers began to emerge such as polyalphabetic ciphers, where one letter
in the plaintext message could be mapped to multiple ciphertext characters depending
on its position in the text. With the advent of more complex ciphers, cryptanalysts
(people who analyze ciphertexts in an attempt to break them) also started developing
more sophisticated techniques for deciphering these ciphers. In the 16th century, a secret
cipher message from Mary (Queen of Scots) was intercepted and deciphered, revealing a
plot to assassinate Queen Elizabeth of England, which eventually led to the conspirators standing trial. Cryptanalysis operations during World War I led to the decipherment of many important messages and had a huge impact on the United States entering the war.
war. But by far, the most important incident in the history of cryptanalysis, and possibly
critical to the outcome of World War II, was the breaking of the German Enigma machine
cipher by British and American scientists.
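Returning to the Caesar cipher described above, the following short sketch (a minimal illustration written for this summary, not code from the thesis) enciphers a plaintext by shifting each letter three places down the alphabet:

import string

def caesar_encipher(plaintext: str, shift: int = 3) -> str:
    # Replace each letter with the one `shift` places further along the alphabet;
    # ciphertext is written in uppercase, as in the "hello" -> "KHOOR" example.
    out = []
    for ch in plaintext.lower():
        if ch in string.ascii_lowercase:
            idx = (string.ascii_lowercase.index(ch) + shift) % 26
            out.append(string.ascii_uppercase[idx])
        else:
            out.append(ch)
    return "".join(out)

print(caesar_encipher("hello"))  # prints KHOOR

Deciphering such a cipher only requires trying the 26 possible shifts; the simple substitution ciphers studied in Chapter 2 allow arbitrary one-for-one letter mappings, which makes the space of possible keys vastly larger.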
1.2.2 Decipherment of Ancient Languages and Scripts
The problem faced by a cryptanalyst trying to break a cipher code is similar to one faced
by an archaeologist attempting to decipher a long-forgotten language, or an ancient writ-
ing script. This is a harder problem in many cases. A cryptanalyst has some idea about the original plaintext language (for example, German) and can use existing plaintext information to find patterns in the cipher and help break the code. An archaeologist, on the other hand, has to work with a very limited collection of text, and often the original plaintext language is an ancient language that is unknown or extinct. Deciphering
ancient texts may seem like a hopeless pursuit, given the lack of parallel text and the very limited information available. Yet, archaeologists have successfully deciphered documents written in ancient scripts and interpreted the writings and pictures inscribed
on tablets from ancient civilizations. The Rosetta Stone, an ancient Egyptian artifact
inscribed in 196 B.C., contains the same text written in three different scripts, and has
contributed greatly to the deciphering of hieroglyphic writing. Other important scripts
that have been deciphered in the 20th century include Linear B and the Maya script.
Other historical manuscripts (e.g., the Voynich manuscript, and the Phaistos Disc) con-
taining writings or drawings have also caught the attention of many decipherers around
the world. Some of them are yet to be deciphered.
Next, we return to the original motivation underlying this thesis: solving natural language problems (and other unsupervised tasks) without using labeled data. We present
a decipherment approach to tackle this challenge and discuss the main ideas presented in
this thesis.
1.3 This Thesis
In this thesis work, we combine the two ideas (decipherment and unsupervised learning for NLP problems) and present a unified decipherment-based approach for modeling
a variety of NLP tasks, where there is no parallel (or manually annotated) data available
for training.
Figure 1.1 illustrates the decipherment process for attacking a simple decipherment
problem such as English letter substitution decipherment. Given a ciphertext sequence,
we build a stochastic generative story (in the direction of encipherment) starting with
the production of a plaintext sequence and then using a key to transform the generated
plaintext into a ciphertext. The decipherment process attempts to model this generative
process and learn the transformation key using only monolingual plaintext data and
the observed ciphertext sequence. Once the key is learnt via decipherment training, we
proceed to decode the ciphertext using this key and generate possible plaintext solutions.

[Figure 1.1: Decipherment process for converting an observed ciphertext into a plaintext solution. An example is shown for letter substitution decipherment.]
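For the letter substitution case, this generative story can be summarized as a noisy-channel model (a sketch only; the precise models, priors, and notation used in later chapters may differ). With a plaintext language model P(p) estimated from monolingual data and a channel model P(c_i | p_i) parameterized by the key, the probability of an observed ciphertext c_1...c_n is

  P(c_1 \dots c_n) = \sum_{p_1 \dots p_n} P(p_1 \dots p_n) \prod_{i=1}^{n} P(c_i \mid p_i)

Decipherment training adjusts the channel parameters to make the observed ciphertext likely while the language model stays fixed, and decoding then selects

  \hat{p} = \arg\max_{p_1 \dots p_n} P(p_1 \dots p_n) \prod_{i=1}^{n} P(c_i \mid p_i)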
Decipherment Key Characteristics: The main objective of the decipherment process
is to model the key, which maps the ciphertext to plaintext, and subsequently use it to
decode the ciphertext. As part of this thesis work, we first identify various properties associated with the key which affect the decipherment process. Designing the key and its properties has implications for the complexity of the decipherment problem and hence for the algorithms designed to solve it. In particular, modifying a key characteristic can affect the type or class of decipherment problem that can be solved under the particular
key settings. At the same time, if we come up with a method to solve a particular
decipherment problem which has certain key characteristics, then the same method can
be applied for solving any decipherment problem which shares the same key properties.
Here, we list a set of key characteristics that are used to model the mapping between
ciphertext and plaintext within the decipherment process:
Key Characteristics (and Choices):
1. Is the key deterministic in the enciphering direction? (Yes/No)
2. Is the key deterministic in the deciphering direction? (Yes/No)
3. Does the key substitute one-for-one (symbol for symbol) in the enciphering direction? (Yes/No)
4. Does the key substitute one-for-one in the deciphering direction? (Yes/No)
5. What linguistic unit is substituted? (Letter / Syllable / Phoneme / Word / Phrase)
6. Does the key involve transposition (re-ordering)? (Yes/No)
7. Other key characteristics specific to the problem (Is parallel data available? Are some key "hints" provided? ...)
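For concreteness, this checklist can be read as a small specification attached to each decipherment problem. The sketch below (illustrative only; the field names are mine, not notation used in the thesis) instantiates it for the simple letter substitution cipher of Figure 1.1:

from dataclasses import dataclass, field
from typing import List

@dataclass
class KeySpec:
    # Key characteristics from the checklist above (field names are illustrative).
    deterministic_enciphering: bool
    deterministic_deciphering: bool
    one_for_one_enciphering: bool
    one_for_one_deciphering: bool
    unit: str                      # "letter", "syllable", "phoneme", "word", or "phrase"
    transposition: bool
    notes: List[str] = field(default_factory=list)  # other problem-specific characteristics

# A simple (1:1) letter substitution cipher, as in Figure 1.1:
simple_letter_substitution = KeySpec(
    deterministic_enciphering=True,
    deterministic_deciphering=True,
    one_for_one_enciphering=True,
    one_for_one_deciphering=True,
    unit="letter",
    transposition=False,
    notes=["no parallel data available"],
)

Problems that share the same specification can, in principle, be attacked with the same decipherment method, which is the point made in the surrounding text.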
As we progress through different chapters in this thesis, we will formulate various NLP
problems as decipherment tasks and describe methods and algorithms to solve these tasks.
In addition, we will also show how the decipherment key characteristics vary for each of
the problems we tackle.
Overall, the contributions of this thesis can be categorized into three main areas:
Decipherment for Cryptanalysis: We develop several novel methods to decipher
simple as well as complex letter substitution ciphers (including the famous Zodiac-
408 cipher).
Deciphering Natural Language without Labeled Data: We show that a wide
range of natural language problems (from sequence labeling tasks such as part-of-
speech tagging to word alignment to language translation) can be formulated as
decipherment tasks involving substitution (and other) operations at the level of
morphemes, phonemes, words and phrases. We develop novel decipherment algo-
rithms to solve these without using any labeled data and show empirical results on
several existing NLP problems.
Minimized Models for Decipherment: We introduce novel unsupervised al-
gorithms based on the idea of searching for "minimized (or small-sized) models"
during decipherment and achieve state-of-the-art results on a number of important
natural language tasks.
Figure 1.2 illustrates some of the NLP problems that we solve as part of this thesis
work using a decipherment approach, and for each problem, it shows (1) the ciphertext
sequence, (2) sample key mappings, and (3) the plaintext sequence.
We start with simple problems like letter substitution (e.g., solving different types of
ciphers such as the Zodiac-408 cipher), and move on to more complex tasks in NLP which
involve word substitution operations in the keyspace (for example, unsupervised part-
of-speech-tagging, unsupervised supertagging, word alignment, etc.) and finally tackle
9
Word Substitution Ciphers
Word Substitution+Transposition Ciphers
(Machine Translation)
Ciphertext Key
Simple Letter/Syllable
Substitution Decipherment
noeei timel
A/x
B/y
:
E/o
:
:
H/n
L/e
:
Z/w
Phonetic Decipherment
(Machine Transliteration)
ABRAHAM
(English name)
EY/e
EY/e e
:
B/b
B/b u
:
L/r
L/r u
EY B R AH HH AE M
(English pronunciation)
phonemeinventories. Wecanframethistaskasaphoneticdeciphermentproblem,wherelearn-
ing a ciphertext-to-plaintext key mapping corresponds to learning sound translation patterns
betweenlanguages.
For example, in order to transliterate names from Japanese to English such as the one shown
below:
エーブラハム ↔ ABRAHAM
(pronounced as: e e b u r a h a m u EY B R AH HH AE M )
someofthesekeymappingslearntduringthedeciphermentmightlooklikethis:
EnglishEY → Japanese{e, e e, ...}
English B → Japanese{b, b u, ...}
English L → Japanese{r, r u, ...}
Decipherment under the conditions of transliteration is substantially more difficult than
solving letter-substitution ciphers (Knight et al., 2006; Ravi & Knight, 2008; Ravi & Knight,
2009b)orphoneme-substitutionciphersdescribedin(Knight&Yamada,1999). Thisisbecause
the target table contains significant non-determinism, and because each symbol has multiple
possiblefertilities,whichintroducesuncertaintyaboutthelengthofthetargetstring.
3.2 PriorWorkinMachineTransliteration
Transliteration refers to the transport of names and terms between languages with different
writingsystemsandphonemeinventories. Recentlytherehasbeenalargeamountofinteresting
workinthisarea,andtheliteraturehasoutgrownbeingcitableinitsentirety. Muchofthiswork
focuseson back-transliteration,whichtriestorestoreanameortermthathasbeentransported
intoaforeignlanguage. Here,thereisoftenonlyonecorrecttargetspelling—forexample,given
jyon.kairu(thenameofaU.S.SenatortransportedtoJapanese),wemustoutput“JonKyl”,
not“JohnKyre”oranyothervariation.
There are many techniques for transliteration and back-transliteration, and they vary along
anumberofdimensions:
• phonemesubstitutionvs.charactersubstitution
• heuristicvs.generativevs.discriminativemodels
• manualvs.automaticknowledgeacquisition
Weexplorethethirddimension,whereweseeseveraltechniquesinuse:
• Manually-constructedtransliterationmodels,e.g.,(Hermjakobetal.,2008).
20
(Japanese katakana)
e e b u r a h a m u
(Japanese pronunciation)
Plaintext
(English text)
HELLO WORLD
Enciphering direction
Deciphering direction
Unsupervised Tagging
with a dictionary
(e.g., POS Tagging)
they -> {DT}
can -> {V,AUX}
fish -> {N,V}
DT AUX V
(Part-Of-Speech tag sequence)
DT/they
AUX/can
V/fish
:
they can fish
(word/tag
dictionary)
(word sequence)
Word/Sub-word
Alignment
(English/
Spanish
sentence pair)
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
...
...
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
...
...
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
Word
Alignment
Figure4.7: SamplesentencepairsfromanEnglish-Spanishbilingualcorpus—originalsentence
pairs(left),andaftertheyhavebeenword-aligned(right).
While people are able to make good alignment decisions, it is not clear what function they are
maximizing,ifany.
Wordalignmenthasseveraldownstreamconsumers. Oneismachinetranslation(MT).MT
systemsroutinelyextractphrase-pairsand translationrulesfromword-alignedcorpora(Och&
Ney, 2004; Galley et al., 2004; Chiang, 2007; Quirk et al., 2005). This extraction generally
wantsasingle,hardalignmentforeachsentencepair,thoughthereareexceptions. Otherdown-
stream processes need the probabilistic dictionary derived during alignment, for example, to
translatequeriesincross-lingualIR(Sch¨ onhofenetal.,2008)orre-scorecandidateMToutputs
(Ochetal.,2004).
Prior Work: In the past, researchers have explored various methods for automatic word
alignmentusingmachines. GenerativemodelslikeIBM1-5(Brownetal.,1993),HMM(Vogel
et al., 1996), ITG (Wu, 1997), and LEAF (Fraser & Marcu, 2007a) define formulas for P(f | e)
orP(e,f),withhiddenalignmentvariables. EMalgorithmssetdictionaryandotherprobabilities
in order to maximize those quantities. Dictionary probabilities and alignment choices are soft,
but one can ask for the Viterbi alignment, which maximizes P(a | e, f). Discriminative models,
e.g. (Taskar et al., 2005), set parameters instead to maximize alignment accuracy against a
hand-aligneddevelopmentset. EMDtraining(Fraser&Marcu,2006)combinesgenerativeand
discriminative elements. Different accuracy metrics have been proposed, e.g., (Och & Ney,
2003;Fraser&Marcu,2007b;Ayan&Dorr,2006).
Alignment accuracy is still low for many language pairs, and most practitioners still use
1990s algorithms toaligntheir data. Itstands toreasonthatwe havenot yetseen thelastword
in alignment models. Another weakness of current systems is that they only align full words.
With few exceptions, e.g. (Snyder & Barzilay, 2008), they do not align at the sub-word level,
makingthemmuchlessusefulforagglutinativelanguages.
Minimized Models for Word Alignment: We present a new objective function for word
alignment—we search for the legal alignment that minimizes the size of the induced bilingual
39
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
...
...
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
...
...
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
Word
Alignment
Figure4.7: SamplesentencepairsfromanEnglish-Spanishbilingualcorpus—originalsentence
pairs(left),andaftertheyhavebeenword-aligned(right).
While people are able to make good alignment decisions, it is not clear what function they are
maximizing, if any.
Word alignment has several downstream consumers. One is machine translation (MT). MT
systems routinely extract phrase-pairs and translation rules from word-aligned corpora (Och &
Ney, 2004; Galley et al., 2004; Chiang, 2007; Quirk et al., 2005). This extraction generally
wants a single, hard alignment for each sentence pair, though there are exceptions. Other down-
stream processes need the probabilistic dictionary derived during alignment, for example, to
translate queries in cross-lingual IR (Schönhofen et al., 2008) or re-score candidate MT outputs
(Och et al., 2004).
Prior Work: In the past, researchers have explored various methods for automatic word
alignment using machines. Generative models like IBM 1-5 (Brown et al., 1993), HMM (Vogel
et al., 1996), ITG (Wu, 1997), and LEAF (Fraser & Marcu, 2007a) define formulas for P(f | e)
or P(e, f), with hidden alignment variables. EM algorithms set dictionary and other probabilities
in order to maximize those quantities. Dictionary probabilities and alignment choices are soft,
but one can ask for the Viterbi alignment, which maximizes P(a | e, f). Discriminative models,
e.g. (Taskar et al., 2005), set parameters instead to maximize alignment accuracy against a
hand-aligned development set. EMD training (Fraser & Marcu, 2006) combines generative and
discriminative elements. Different accuracy metrics have been proposed, e.g., (Och & Ney,
2003; Fraser & Marcu, 2007b; Ayan & Dorr, 2006).
Alignment accuracy is still low for many language pairs, and most practitioners still use
1990s algorithms to align their data. It stands to reason that we have not yet seen the last word
in alignment models. Another weakness of current systems is that they only align full words.
With few exceptions, e.g. (Snyder & Barzilay, 2008), they do not align at the sub-word level,
making them much less useful for agglutinative languages.
Minimized Models for Word Alignment: We present a new objective function for word
alignment: we search for the legal alignment that minimizes the size of the induced bilingual
[Figure 1.2: Decipherment approach for various cryptanalysis and natural language pro-
cessing applications. The figure illustrates several decipherment settings: a letter cipher and a
word cipher over English text, foreign-language text (e.g., Arabic) viewed as a cipher for its
English translation, and a homophonic cipher without spaces encoding "HELLOWORLD".]
language translation tasks such as machine transliteration (phonetic substitution) and
machine translation (word substitution + transposition + other key operations).
Decipherment techniques are used to model natural language transformations in these
tasks and we present algorithms and results for tackling many of these problems. We show
that we can obtain high task accuracies for many problems of interest in NLP using only
non-parallel data.
1.4 Thesis Contributions
The contributions of this thesis are:
1. We provide novel methods for solving letter and syllable substitution ciphers, which
yield accuracy improvements ranging from 10-64% over existing methods, and we
also compare empirical results with Shannon's mathematical theory of decipher-
ment.
2. We introduce the idea of using "small model sizes" for decipherment. Following this
notion, we present novel methods for solving unsupervised problems like part-of-
speech (POS) tagging with a dictionary, supertagging with a dictionary and word
alignment. These methods explicitly search for small models during decipherment.
For unsupervised POS tagging, the new proposed approach yields a very high 92.3%
tagging accuracy for English, which is the best reported result so far on this task. On
the supertagging task, we achieve 3-4% improvement over existing state-of-the-art
approaches on multiple languages. We also show that we can achieve significant ac-
curacy (f-measure) improvements ranging from 9% to 63% when using our approach
for the word alignment and sub-word alignment tasks over existing approaches.
3. We tackle phonetic ciphers and apply techniques to an existing NLP task: machine
transliteration. In comparison to current transliteration approaches (which use
parallel data for training), we provide a decipherment approach to perform machine
transliteration using only monolingual resources. The method is not constrained by
the availability of parallel resources, and hence can work with any language pair.
We show that using this method, it is possible to achieve good performance (26%
lower accuracy than a parallel-trained system) on a standard Japanese/English
name-transliteration task without using any parallel data.
4. We present efficient decipherment algorithms that can scale to large vocabulary (and
data) sizes and demonstrate their effectiveness on a word substitution decipherment
task, achieving over 80% accuracy.
5. We develop novel decipherment techniques to tackle machine translation without
parallel corpora. We present empirical results for automatic language translation
in two different domains. Compared to existing parallel-trained systems, our meth-
ods yield good translations without relying on any bilingual resources at all. In
addition, we provide empirical studies which show how different factors (such as
monolingual training data sizes, parallel versus non-parallel corpora, etc.) affect
the decipherment process and have a bearing on the end-accuracies for the transla-
tion task.
1.5 Thesis Overview
The remainder of the thesis is organized as follows:
Chapter 2 discusses simple letter and syllable substitution decipherment, and an
empirical study of Shannon's mathematical theory of decipherment.
Chapter 3 discusses the famous Zodiac cipher and other homophonic ciphers (a
more complex cipher system compared to simple substitution ciphers) and presents
decipherment techniques for solving such ciphers.
Chapter 4 discusses small models for decipherment and applications to various NLP
tasks such as sequence labeling (e.g., part-of-speech tagging) and word alignment.
Chapter 5 presents phonetic decipherment and application to machine translitera-
tion.
Chapter 6 presents the work on deciphering foreign language and application to
machine translation without parallel data.
Chapter 7 concludes the thesis, summarizing some of the results from previous
chapters and outlining several directions for future work.
Chapter 2
Deciphering Simple Letter and Syllable Substitution
Ciphers
In this chapter, we present novel methods for deciphering letter and syllable substitution
ciphers. To break these ciphers, our goal is two-fold: (a) use as little knowledge about
the plaintext language as possible, and (b) solve the ciphers as accurately as possible.
First, we describe novel decipherment methods using a probabilistic approach based on
the Expectation Maximization (EM) algorithm. This includes a new objective function for
decipherment which produces better results than the previous EM method from Knight
et al. (2006) on both Japanese and English substitution ciphers.
In the second part, we introduce a novel method for solving substitution ciphers
based on integer programming. The method is exact, not heuristic, and unlike many
previous approaches, requires only minimal knowledge of the plaintext language (using
only low-order letter n-gram models) for decipherment. The decipherment accuracies
obtained from the new methods far exceed some of the previous methods, with accuracy
improvements ranging from 10%-64% on different ciphers.
In the final part, we analyze our decipherment work in the context of Shannon's math-
ematical theory of decipherment (Shannon, 1949), and show how Shannon's theoretical
predictions compare against our empirical results for decipherment.
2.1 Introduction
The table shown below lists some important properties of simple letter and syllable sub-
stitution ciphers with respect to different key characteristics (introduced in the previous
chapter).
Key Characteristics                                                  Simple sub
Is the key deterministic in the enciphering direction?               Yes
Is the key deterministic in the deciphering direction?               Yes
Does the key substitute one-for-one (symbol for symbol)
  in the enciphering direction?                                      Yes
Does the key substitute one-for-one in the deciphering direction?    Yes
What linguistic unit is substituted?                                 Letter / Syllable
Does the key involve transposition (re-ordering)?                    No
Simple substitution ciphers (such as the ones described in this chapter) exhibit a one-
to-one correspondence between ciphertext symbols and the plaintext units they encode.
We shall study two types of ciphers: (a) letter substitution ciphers, and (b) syllable
substitution ciphers.
Letter Substitution Ciphers: In a letter substitution cipher, every letter in the natural
language (plaintext) sequence is replaced by a cipher token, according to some substitu-
tion key.
Figure 2.1: Original Uesugi cipher key in Japanese
For example, an English plaintext:
"HELLO WORLD ..."
may be enciphered as:
"NOEEI TIMEL ..."
according to the key:
P: ABCDEFGHIJKLMNOPQRSTUVWXYZ
C: XYZLOHANBCDEFGIJKMPQRSTUVW
If the recipients of the ciphertext message have the substitution key, they can use it
(in reverse) to recover the original plaintext. The goal of cryptanalysis is to guess the
original plaintext from the given cipher without any knowledge of the substitution key.
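As a concrete illustration of how such a key is applied, here is a minimal sketch (not part of the
decipherment methods developed later) that enciphers and deciphers text with the example key
shown above:

    # Minimal sketch: enciphering and deciphering with a simple substitution key.
    PLAIN  = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    CIPHER = "XYZLOHANBCDEFGIJKMPQRSTUVW"

    encipher_key = dict(zip(PLAIN, CIPHER))   # plaintext letter -> cipher letter
    decipher_key = dict(zip(CIPHER, PLAIN))   # cipher letter -> plaintext letter

    def encipher(text):
        # Spaces (and any other non-letter) are preserved, as in the ciphers used here.
        return "".join(encipher_key.get(ch, ch) for ch in text)

    def decipher(text):
        return "".join(decipher_key.get(ch, ch) for ch in text)

    print(encipher("HELLO WORLD"))   # -> NOEEI TIMEL
    print(decipher("NOEEI TIMEL"))   # -> HELLO WORLD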
Syllable Substitution Ciphers: We also attack a different type of substitution cipher: a
Japanese syllable substitution cipher (the Uesugi cipher [1]), where every syllable in the
plaintext (Japanese) is substituted with a ciphertext token.

[1] A particular Japanese syllable-substitution cipher from the Warring States Period, said to be employed
by General Uesugi Kenshin.
1 2 3 4 5 6 7
1 i ro ha ni ho he to
2 ti ri nu ru wo wa ka
3 yo ta re so tu ne na
4 ra mu u <> no o ku
5 ya ma ke hu ko e te
6 a sa ki yu me mi si
7 <> hi mo se su n <>
Figure 2.2: Transliterated version of the checkerboard-style key used for encrypting the
Uesugi cipher
This is a more difficult cipher system than English letter substitution ciphers. The Ue-
sugi cipher has more characters than English, it has no word boundaries, and even the cor-
rect key yields multiple decipherment candidates. This cipher employs the checkerboard-
style key shown in Figures 2.1 and 2.2. To encode a message, the sender looks up each
syllable in the key, and replaces it with a two-digit number. The first digit is the column
index, and the second digit is the row index.
For example, the plaintext:
"wa ta ku si ..."
is enciphered as:
"62 23 74 76 ..."
The goal of cryptanalysis is to take an intercepted number sequence and guess a
plaintext for it, without the benefit of the key.
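The sketch below enciphers a romanized syllable sequence with the checkerboard key of Figure 2.2
(the table is transcribed from that figure); it only illustrates the column-then-row encoding, not
the solver developed later in this chapter.

    # Minimal sketch: enciphering with the Uesugi checkerboard key (Figure 2.2).
    KEY_ROWS = [
        ["i",  "ro", "ha", "ni", "ho", "he", "to"],   # row 1
        ["ti", "ri", "nu", "ru", "wo", "wa", "ka"],   # row 2
        ["yo", "ta", "re", "so", "tu", "ne", "na"],   # row 3
        ["ra", "mu", "u",  None, "no", "o",  "ku"],   # row 4 (None = empty cell)
        ["ya", "ma", "ke", "hu", "ko", "e",  "te"],   # row 5
        ["a",  "sa", "ki", "yu", "me", "mi", "si"],   # row 6
        [None, "hi", "mo", "se", "su", "n",  None],   # row 7
    ]

    # Build syllable -> "column digit + row digit" lookup.
    encode = {}
    for row_idx, row in enumerate(KEY_ROWS, start=1):
        for col_idx, syllable in enumerate(row, start=1):
            if syllable is not None:
                encode[syllable] = f"{col_idx}{row_idx}"

    def encipher(syllables):
        return " ".join(encode[s] for s in syllables)

    print(encipher(["wa", "ta", "ku", "si"]))   # -> 62 23 74 76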
The rest of the chapter mostly deals with letter substitution ciphers: prior work, new
methods for attacking such ciphers, and also a mathematical study of the decipherment
process. However, the same properties apply to syllable substitution ciphers and we also
show some empirical results on Japanese syllable substitution ciphers in Section 2.4.3.
2.2 Statistical Language Models for Decipherment
We use language models to attack letter and syllable substitution ciphers. For attacking
letter substitution ciphers, where the original plaintext is English, we look for the key
(among 26! possible ones) that, when applied to the ciphertext, yields the most English-
like result. We take "English-like" to mean most probable according to some statistical
language model, whose job is to assign a probability to any sequence of letters. According
to a 1-gram model of English, the probability of a plaintext p_1 ... p_n is given by:

P(p_1 ... p_n) = P(p_1) P(p_2) ... P(p_n)
This model assigns a probability to any letter sequence, and the probabilities of all letter
sequences sum to one. We collect letter probabilities (including space) from 50 million
words of text available from the Linguistic Data Consortium (Graff & Finch, 1994). We
also estimate 2- and 3-gram models using the same resources:

P(p_1 ... p_n) = P(p_1 | START) P(p_2 | p_1) P(p_3 | p_2) ... P(p_n | p_{n-1}) P(END | p_n)

P(p_1 ... p_n) = P(p_1 | START) P(p_2 | START p_1) P(p_3 | p_1 p_2) ... P(p_n | p_{n-2} p_{n-1}) P(END | p_{n-1} p_n)
For example, the probability of the plaintext phrase "the fox" according to a 1-gram,
2-gram and 3-gram language model of English is shown below: [2]

P_1gram(the fox) = P(t) P(h) P(e) P(_) P(f) P(o) P(x)

P_2gram(the fox) = P(t | START) P(h | t) P(e | h) P(_ | e) P(f | _) P(o | f) P(x | o) P(END | x)

P_3gram(the fox) = P(t | START) P(h | START t) P(e | t h) P(_ | h e) P(f | e _) P(o | _ f) P(x | f o) P(END | o x)

Unlike the 1-gram model, the 2-gram model will assign a low probability to the se-
quence "fx" because the probability P(x | f) is low. Of course, all these models are fairly
weak, as already known by (Shannon, 1949). We can further estimate the probability of
a whole English sentence or phrase in this manner. In comparison to the 1-gram model,
higher order n-gram models, which take context into account, are able to distinguish
between English and non-English phrases better, and can lead to better plaintext deci-
pherments. We also apply interpolation smoothing techniques when building higher order
n-gram language models.

[2] Underscore (_) represents the space character in plaintext and ciphertext messages.
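As a rough sketch of how such a model scores candidate plaintexts, the code below evaluates a
letter sequence under a 2-gram model; the probability table is a tiny hypothetical stand-in,
whereas the models used in our experiments are estimated from millions of words of English.

    import math

    # Hypothetical 2-gram letter probabilities P(current | previous); "_" is the space.
    bigram = {
        ("START", "t"): 0.16, ("t", "h"): 0.35, ("h", "e"): 0.30, ("e", "_"): 0.20,
        ("_", "f"): 0.03, ("f", "o"): 0.10, ("o", "x"): 0.002, ("x", "END"): 0.05,
    }

    def logprob_2gram(letters, floor=1e-7):
        """Return log2 P(letters) under the 2-gram model (floor for unseen letter pairs)."""
        symbols = ["START"] + list(letters) + ["END"]
        return sum(math.log2(bigram.get(pair, floor)) for pair in zip(symbols, symbols[1:]))

    print(logprob_2gram("the_fox"))   # higher (less negative) = more English-like
    print(logprob_2gram("tfx_ohe"))   # implausible letter pairs score much lower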
2.3 Previous Work
There has been no previous work on deciphering syllable ciphers until now. However, a
number of papers have explored algorithms for automatically solving letter substitution
ciphers. Some use heuristic methods to search for the best deterministic key (Peleg &
Rosenfeld, 1979; Ganesan & Sherman, 1993; Jakobsen, 1995; Olson, 2007), often using
word dictionaries to guide that search. Others use Expectation Maximization (EM) to
search for the best probabilistic key using letter n-gram models (Knight et al., 2006). In
the following sections, we present our contributions and findings and compare these with
the results presented in (Knight et al., 2006). Our decipherment methods can be applied
to both letter as well as syllable substitution ciphers and we show empirical results on
both types of ciphers. Before we describe the new methods, let us first take a look at
how the EM method works.
EM Decipherment: Knight et al. (2006) propose a noisy-channel model of cipher-
text production. First, a plaintext e is produced according to probability P(e). Then,
the plaintext is encoded as ciphertext c, according to probability P(c|e). They estimate
an n-gram model for P(e) on separate plaintext data. They then adjust the P(c|e) pa-
rameter values in order to maximize the probability of the observed (ciphertext) data.
This probability can be written as:

P(c) = Σ_e P(e) P(c|e)    (2.1)

The P(c|e) quantity is the product of the probabilities of the individual token substi-
tutions that transform e into c, i.e., the substitution probabilities make up the guessed
key. Knight et al. (2006) propose the EM algorithm (Dempster et al., 1977) as a method
to guess the best probabilistic values for the key. EM is an iterative algorithm that im-
proves P(c) from one iteration to the next, until convergence. Once the key is settled on,
the Viterbi algorithm can search for the best decipherment:

arg max_e P(e|c) = arg max_e P(e) P(c|e)    (2.2)

Knight et al. (2006) also develop a technique for improving decipherment accuracy.
Prior to decoding, they stretch out the channel model probabilities, using the Viterbi
algorithm to instead search for:

arg max_e P(e|c) = arg max_e P(e) P(c|e)^3    (2.3)
Finally, they provide an evaluation metric (number of guessed plaintext tokens that
match the original message), and they report results on English letter substitution deci-
pherment.
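To make the E-step/M-step structure of this approach concrete, here is a heavily simplified sketch
of EM decipherment with a 1-gram plaintext model; it is only an illustration under that simplifying
assumption (with a unigram table supplied by the caller), not the finite-state implementation used
in the work described below.

    from collections import defaultdict

    PLAIN = "abcdefghijklmnopqrstuvwxyz"

    def em_decipher(ciphertext, unigram, iterations=50):
        """Simplified EM for a substitution cipher with a 1-gram plaintext model P(e).
        ciphertext: string of cipher symbols; unigram: {letter: P(letter)}."""
        cipher_types = sorted(set(ciphertext))
        # Initialize the channel P(c|e) uniformly over observed cipher types.
        channel = {e: {c: 1.0 / len(cipher_types) for c in cipher_types} for e in PLAIN}
        counts_c = defaultdict(int)
        for c in ciphertext:
            counts_c[c] += 1

        for _ in range(iterations):
            expected = {e: defaultdict(float) for e in PLAIN}
            # E-step: with a unigram source model, positions decouple, so the posterior
            # over plaintext letters for each cipher type is P(e)P(c|e), renormalized.
            for c, n_c in counts_c.items():
                z = sum(unigram[e] * channel[e][c] for e in PLAIN)
                for e in PLAIN:
                    expected[e][c] += n_c * unigram[e] * channel[e][c] / z
            # M-step: renormalize expected counts into a new channel table P(c|e).
            for e in PLAIN:
                total = sum(expected[e].values()) or 1.0
                for c in cipher_types:
                    channel[e][c] = expected[e][c] / total

        # Decode: pick the most probable plaintext letter for each cipher type.
        key = {c: max(PLAIN, key=lambda e: unigram[e] * channel[e][c]) for c in cipher_types}
        return "".join(key.get(c, c) for c in ciphertext)

With a 2-gram or higher-order LM, the E-step instead requires summing over whole letter sequences
(e.g., with forward-backward over a finite-state composition), which is what the setup described in
the next section provides.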
2.4 Probabilistic Decipherment
In (Ravi & Knight, 2009d), we present an improved EM method for deciphering letter
substitution ciphers. We show that this method can also be applied to syllable substi-
tution ciphers, and present decipherment results on the Uesugi cipher (a particular type
of Japanese syllable cipher). We follow a probabilistic approach, as suggested by Knight
et al. (2006), and illustrate results from an empirical study which shows how varying
different conditions in the probabilistic decipherment process can affect the decipherment
accuracy. We also present a new objective function which produces better decipherment
results than the original EM method (Knight et al., 2006) on both English and Japanese
substitution ciphers.
2.4.1 Model
Following the probabilistic approach in (Knight et al., 2006), we represent the plaintext
LM model P(e) as a weighted finite-state acceptor (WFSA) and the substitution model
P(c|e) as a weighted finite-state transducer (WFST). We compose the two models and
run EM training to find the best P(c|e) table using the Carmel finite-state transducer
package (Graehl, 1997), a toolkit with an algorithm for EM training of weighted finite-
state transducers.
2.4.2 Inference: Modified EM Algorithm
We find that higher-order n-gram models paradoxically generate worse decipherments,
according to the method of (Knight et al., 2006). Table 2.1 shows best decipherment
error rates associated with different LM orders (trained on the same data) for a 98-letter
English cipher using the original EM objective function. We observe that as we increase
the n-gram order of the LM (for example, 2-gram LM versus 7-gram LM), we find that
decipherment surprisingly gets worse.
LM model EM(original) error-rate
2-gram 0.41
3-gram 0.59
5-gram 0.53
7-gram 0.65
Table 2.1: Decipherment error rates on a 98-letter English cipher using the original EM
objective function (Knight et al., 2006). All the LMs are trained on 1.4 million letters of
English data, and use 10 random restarts per point.
The consequence of the 7-gram model being so wrongly opinionated is that the sub-
stitution model probabilities learn to become more fuzzy than usual, in order to accom-
modate the LM's desire to produce certain strings. The learnt substitution table is much
more non-deterministic than the true substitution table (key).
We invent an adjustment to the EM objective function that solves this problem. Recall
from Equation 2.1 that EM's objective function is:

P(c) = Σ_e P(e) P(c|e)

Here, the P(e) factor carries too much weight, so in order to reduce the "vote" that
the LM contributes to EM's objective function, we create a new objective function:

P(c) = Σ_e P(e)^0.5 P(c|e)    (2.4)

Note that this is substantially different from the proposal of (Knight et al., 2006) to
stretch out the substitution probabilities after decipherment has finished. Instead, we
actually modify the objective function EM uses during decipherment itself. Following
EM training, we use the Viterbi algorithm to search for the best decipherment. Using the
Plaintext: T H E A V E R A G E E N G L I S H M A N H A S S O D E E P A
R E V E R E N C E F O R A N T I Q U I T Y T H A T H E W O U L D
R A T H E R B E W R O N G T H A N R E C E N T P E T E R M C A R T H U R
Ciphertext: q i y m f y s m f y y f f y h e i a m f i m e e d g y y w m
s y f y s y f g y s d s m f q h w h h q d q i m q i y y d h y g
s m q i y s z y y s d f f q i m f s y g y f q w y q y s a g m s q i h s
Figure 2.3: Ciphertext sequence and the original English plaintext message for a 98-letter
Simple Substitution Cipher.
new EM method, we achieve much more accurate decipherments, especially on shorter
ciphers (empirical results are presented in the next section).
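Relative to a simple EM implementation like the sketch shown earlier, the change in Equation 2.4
only touches the place where the language model's probability enters the posterior computation;
a minimal, self-contained illustration (with hypothetical argument names) is:

    def lm_weighted_posterior(lm_prob, channel, c, lm_power=0.5):
        """Posterior over plaintext letters for cipher symbol c, with the LM's vote
        raised to lm_power (0.5 implements the square root in Equation 2.4)."""
        scores = {e: (lm_prob[e] ** lm_power) * channel[e][c] for e in lm_prob}
        z = sum(scores.values())
        return {e: s / z for e, s in scores.items()}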
2.4.3 Experiments and Results
Letter Substitution Decipherment: We work with a 98-letter simple substitution
cipher. We create a ciphertext sequence by encrypting an original English plaintext
message with a randomly generated simple substitution key. During the encipherment
process, we preserve spaces between words. Figure 2.3 shows the entire ciphertext along
with the original plaintext used to create the cipher. Next, we present results from letter
substitution decipherment experiments on this cipher.
Effect of Language Models on Decipherment: We attack cipher lengths that are not
solved by low-order language models. In (Ravi & Knight, 2009d), we present an empir-
ical study of training sizes, perplexities, and memory requirements for n-gram language
models, and we relate language-model perplexity to decipherment accuracy.
Figure 2.4 shows the relationship between memory requirements (number of WFSA
transitions) and LM entropy in bits/character (perplexity = 2^entropy) of various LMs on
some held-out English plaintext data. Note that for any particular memory size a machine
may have, we can select the LM order that gives the best perplexity.
Figure 2.4: Relationship between LM memory size and LM entropy for English letter
substitution decipherment. Plotted points represent language models trained on different
amounts of data and different n-gram orders. (Axes: LM size, number of WFSA transitions,
vs. LM quality as entropy -log2 P(e)/N; one curve per n-gram order from 2-gram to 7-gram.)
Figure 2.5: LM entropy vs. decipherment error rate (using various n-gram order LMs on
a 98-letter English cipher). (Axes: LM entropy -log2 P(e)/N vs. decipherment error rate;
one point set per n-gram order from 2-gram to 7-gram.)
Figure 2.6: Effect of random restarts on decipherment on a 98-letter English cipher using
a 2-gram language model. Each point measures the result of a single EM run, in terms
of EM's objective function (x-axis: entropy -log2 P(c)/N, where the EM objective function
value = 2^entropy) and decipherment error (y-axis).
Figure 2.5 shows a nice correlation between LM entropy and end-to-end decipherment
accuracy for experiments on the 98-letter English cipher. It shows that using better
language models (in terms of LM entropy) leads to lower error rates for decipherment.
Random Restarts for EM: Knight et al. (2006) employ uniform starting conditions for
P(c|e) during EM training. We find that different starting points for the EM training
result in radically different decipherment strings and accuracies. Figure 2.6 shows the
result of the uniform starting condition along with 29 random restarts using a 2-gram
English LM. Each point in the scatter-plot represents the results of EM from one starting
point. The x-axis gives the entropy in bits/character obtained at the end of an EM run,
and the y-axis gives the accuracy of the resulting decipherment.
LM model    EM(original) error-rate    EM(new objective fn.) error-rate
2-gram      0.41                       0.43
3-gram      0.59                       0.16
5-gram      0.53                       0.11
7-gram      0.65                       0.11
Table 2.2: Decipherment error rates on a 98-letter English cipher using (a) Original EM
objective function (Knight et al., 2006) (b) New objective function (square-rooting LM).
All the LMs are trained on 1.4 million letters of English data, and use 10 random restarts
per point.
We observe a general trend: when we locate a model with a good P(c), that model
also tends to generate a more accurate decipherment. This is good, because it means that
EM is maximizing something that is of extrinsic value. As seen in Figure 2.6, the best
P(c) does not guarantee the best accuracy. However, we are able to significantly improve
on the uniform-start decipherment through random restarts.
New Decipherment Objective Function: Table 2.2 (third column) shows the improve-
ments we get from the new objective function (described in Section 2.4.2) at various
n-gram orders for the 98-letter English cipher. The results are much more in accord
with what we believe should happen, and get us to where better n-gram models give us
lower decipherment error. In addition, the new objective function substantially reduces
decipherment error compared to the methods of Knight et al. (2006).
Syllable Substitution Decipherment: Here, we work with the Uesugi cipher (a
Japanese syllable cipher) described in Section 2.1. We apply the same probabilistic deci-
pherment method with our new objective function (described in Section 2.4.2).
Figure 2.7 shows a nice correlation between LM entropy and end-to-end decipher-
ment accuracy for experiments on a 577-syllable Uesugi cipher. It shows that using
better Japanese language models (in terms of LM entropy) leads to lower error rates for
decipherment.
Figure 2.7: LM entropy vs. decipherment error rate (using various n-gram order LMs on
a 577-syllable Uesugi cipher). (Axes: LM entropy -log2 P(e)/N vs. decipherment error rate;
curves for 2-gram, 3-gram and 5-gram LMs.)
LM model    Long Cipher (577 syllables)              Short Cipher (298 syllables)
            EM(original)  EM(new objective fn.)      EM(original)  EM(new objective fn.)
2-gram      0.46          0.29                       0.89          0.78
3-gram      0.12          0.05                       0.96          0.78
5-gram      0.09          0.02                       0.95          0.61
Table 2.3: Decipherment error rates on two different Uesugi ciphers using (a) Original
EM objective function (b) New objective function (square-rooting LM)
Table 2.3 shows the improvements we get from the new objective function in com-
parison to the original EM objective function at various n-gram orders for two Japanese
ciphers (a 577-syllable cipher and another shorter 298-syllable cipher).
To summarize, we showed how different factors in the decipherment process affect the
end results. We also presented a new objective function for probabilistic decipherment
that produces much better decipherments than the previous EM-based approach on both
letter as well as syllable substitution ciphers.
2.5 Integer Linear Programming (ILP) Decipherment
The EM-based decipherment techniques discussed earlier have proven to be effective for
solving long ciphers and yield high decipherment accuracies. But these methods have
some drawbacks: (1) they do not perform equally well on shorter ciphers, especially
when using low-order n-gram LMs, (2) EM always learns probabilistic key mappings
whereas the true key for simple substitution ciphers is deterministic (i.e., there is a 1-
to-1 correspondence between ciphertext and plaintext units), and (3) EM is an iterative
optimization technique which can easily get stuck in local minima, so we need to perform
random restarts. Next, we propose a novel non-probabilistic decipherment method based
on integer programming that yields better results for letter substitution decipherment
than the EM methods described in Sections 2.3 and 2.4.2.
In (Ravi & Knight, 2008), we introduce an exact method for solving substitution
ciphers that works well even with low-order letter n-gram models. This method enforces
global constraints using integer programming, and it guarantees that no decipherment key
is overlooked. We carry out extensive empirical experiments on English letter substitution
ciphers showing how decipherment accuracy varies as a function of cipher length and n-
gram order, and show that our accuracy rates far exceed those of EM-based methods.
2.5.1 Decipherment Objective
Given a ciphertext c_1 ... c_n, we search for the key that yields the most probable plaintext
p_1 ... p_n. For English letter substitution ciphers, there are 26! possible keys, too many to
enumerate. However, we can still find the best one in a guaranteed fashion. We do this
by taking our most-probable-plaintext problem and casting it as an integer programming
problem.
Here is a sample integer linear programming problem:

variables: x, y
minimize: -2x + y
subject to:
    x + y < 6.9
    y - x < 2.5
    y > 1.1

We require that x and y take on integer values. A solution can be obtained by typing
this integer program into the publicly available lp_solve program, or the commercially
available CPLEX program, which yields the result: x = 4, y = 2. For more information
on integer and linear programming, see (Schrijver, 1998).
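For illustration, the same toy program can be handed to an off-the-shelf solver; the sketch below
uses the open-source PuLP modeler (with its bundled CBC solver) instead of lp_solve or CPLEX,
and writes the objective as -2x + y, the reading consistent with the stated solution x = 4, y = 2.
PuLP has no strict inequalities, but with integer variables the non-strict versions give the same
optimum.

    import pulp

    prob = pulp.LpProblem("toy_example", pulp.LpMinimize)
    x = pulp.LpVariable("x", cat="Integer")
    y = pulp.LpVariable("y", cat="Integer")

    prob += -2 * x + y          # objective
    prob += x + y <= 6.9        # constraints
    prob += y - x <= 2.5
    prob += y >= 1.1

    prob.solve()
    print(pulp.value(x), pulp.value(y))   # -> 4.0 2.0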
2.5.2 ILP Formulation and Optimization
Suppose we want to decipher with a 2-gram language model, i.e., we want to find the key
that yields the plaintext of highest 2-gram probability. Given the ciphertext c_1 ... c_n, we
create an integer programming problem as follows. First, we set up a network of possible
decipherments (Figure 2.8). Each of the 27 × 27 × (n − 1) links in the network is a binary
variable in the integer program: it must be assigned a value of either 0 or 1. We name
Figure 2.8: A decipherment network. The beginning of the ciphertext is shown at the top
of the figure (underscores represent spaces). Any left-to-right path through the network
constitutes a potential decipherment. The bold path corresponds to the decipherment
"decade". The dotted path corresponds to the decipherment "ababab". Given a cipher
length of n, the network has 27 × 27 × (n − 1) links and 27^n paths. Each link corresponds
to a named variable in our integer program. Three links are shown with their names in
the figure.
these variables link_XYZ, where X indicates the column of the link's source, and Y and
Z represent the rows of the link's source and destination (e.g. variables link_1aa, link_1ab,
link_5qu, ...).
Each distinct left-to-right path through the network corresponds to a different de-
cipherment. For example, the bold path in Figure 2.8 corresponds to the decipherment
"decade". Decipherment amounts to turning some links "on" (assigning value 1 to the
link variable) and others "off" (assigning value 0). Not all assignments of 0's and 1's to
link variables result in a coherent left-to-right path, so we must place some "subject to"
constraints in our integer program. We also need to ensure that the chosen path imitates
the repetition pattern of the ciphertext. While the bold path in Figure 2.8 represents the
fine plaintext choice "decade", the dotted path represents the choice "ababab", which
is not consistent with the repetition pattern of the cipher "QWBSQW". To make sure
our substitutions obey a consistent key, we set up 27 × 27 = 729 new key_xy variables to
represent the choice of key. These new variables are also binary, taking on values 0 or 1.
If variable key_aQ = 1, that means the key maps plaintext a to ciphertext Q.
We set up an expression for the "minimize" part of the integer program. Recall that
we want to select the plaintext p_1 ... p_n of highest probability. For the 2-gram language
model, the following are equivalent:

(a) Maximize P(p_1 ... p_n)
(b) Maximize log2 P(p_1 ... p_n)
(c) Minimize -log2 P(p_1 | START) - log2 P(p_2 | p_1) - ... - log2 P(p_n | p_{n-1}) - log2 P(END | p_n)

We can guarantee this last outcome if we construct our minimization function as a sum
of 27 × 27 × (n − 1) terms, each of which is a link_XYZ variable multiplied by -log2 P(Z|Y):

Minimize   link_1aa · -log2 P(a|a)
         + link_1ab · -log2 P(b|a)
         + link_1ac · -log2 P(c|a)
         + ...
variables:
  link_ipr   1 if the ith cipher letter is deciphered as plaintext letter p AND the (i+1)th cipher
             letter is deciphered as plaintext letter r
             0 otherwise
  key_pq     1 if decipherment key maps plaintext letter p to ciphertext letter q
             0 otherwise
minimize:
  Σ_{i=1..n-1} Σ_{p,r} link_ipr · -log P(r|p)   (2-gram probability of chosen plaintext)
subject to:
  for all p: Σ_r key_pr = 1   (each plaintext letter maps to exactly one ciphertext letter)
  for all p: Σ_r key_rp = 1   (each ciphertext letter maps to exactly one plaintext letter)
  key__ = 1                   (cipher space character maps to plain space character)
  for (i=1...n-2), for all r: [ Σ_p link_ipr = Σ_p link_(i+1)rp ]
                              (chosen links form a left-to-right path)
  for (i=1...n-1), for all p: Σ_r link_irp = key_p,c_{i+1}
                              (chosen links are consistent with chosen key)

Figure 2.9: Summary of how to build an integer program for any given ciphertext c_1 ... c_n.
Solving the integer program will yield the decipherment of highest probability.
         + link_5qu · -log2 P(u|q)
         + ...

When we assign value 1 to link variables along some decipherment path, and 0 to all
others, this function computes the negative log probability of that path.
We also specify a set of constraints in our integer program which are applicable to
letter substitution decipherment. For example, we add constraints to enforce that the key
chosen by the integer programming method is deterministic, i.e., every plaintext letter
must map to exactly one ciphertext letter, and vice versa. Figure 2.9 summarizes the
integer program that we construct from a given ciphertext c_1 ... c_n. [3] Variations on the
decipherment network yield 1-gram and 3-gram decipherment capabilities.

[3] More details are described in (Ravi & Knight, 2008) on how to construct a decipherment integer
program from a given ciphertext.
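To see the construction end to end, here is a compact sketch of the 2-gram integer program of
Figure 2.9, written with the open-source PuLP modeler rather than CPLEX; the alphabet handling,
the bigram table supplied by the caller, and the helper name are illustrative assumptions, not the
exact program generated in our experiments.

    import math
    import pulp

    # Plaintext and (assumed) cipher alphabet: 26 letters plus "_" for space.
    ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + ["_"]

    def build_decipherment_ip(ciphertext, bigram_prob, floor=1e-9):
        """ciphertext: string over ALPHABET; bigram_prob[(y, z)] approximates P(z | y)."""
        n = len(ciphertext)
        prob = pulp.LpProblem("decipherment", pulp.LpMinimize)

        # link[i][y][z] = 1 iff cipher position i deciphers to y and position i+1 to z.
        link = pulp.LpVariable.dicts("link", (range(n - 1), ALPHABET, ALPHABET), cat="Binary")
        # key[p][c] = 1 iff plaintext letter p maps to cipher symbol c.
        key = pulp.LpVariable.dicts("key", (ALPHABET, ALPHABET), cat="Binary")

        # Objective: negative log 2-gram probability of the chosen plaintext path.
        prob += pulp.lpSum(-math.log(bigram_prob.get((y, z), floor), 2) * link[i][y][z]
                           for i in range(n - 1) for y in ALPHABET for z in ALPHABET)

        # The key is deterministic (1-to-1) in both directions; space maps to space.
        for p in ALPHABET:
            prob += pulp.lpSum(key[p][c] for c in ALPHABET) == 1
            prob += pulp.lpSum(key[c][p] for c in ALPHABET) == 1
        prob += key["_"]["_"] == 1

        # Chosen links form a single left-to-right path ...
        for i in range(n - 2):
            for r in ALPHABET:
                prob += (pulp.lpSum(link[i][p][r] for p in ALPHABET)
                         == pulp.lpSum(link[i + 1][r][p] for p in ALPHABET))
        # ... and are consistent with the chosen key at every position.
        for p in ALPHABET:
            prob += pulp.lpSum(link[0][p][r] for r in ALPHABET) == key[p][ciphertext[0]]
            for i in range(n - 1):
                prob += pulp.lpSum(link[i][r][p] for r in ALPHABET) == key[p][ciphertext[i + 1]]

        return prob, key

Calling prob.solve() and reading off the key variables whose value is 1 (pulp.value(key[p][c]) > 0.5)
then recovers the highest-probability deterministic key under the 2-gram model.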
Cipher Length    EM method    IP method
52 letters       85%          21%
414 letters      10%          0.5%
Table 2.4: Decipherment error rates on various ciphers with a 2-gram English language
model using (a) EM method (Knight et al., 2006), and (b) IP method (Ravi & Knight,
2008).
Once an integer program is generated by machine, we ask the commercially-available
CPLEX software [4] to solve it, and then we note which key_XY variables are assigned value
1. Computing the optimal key with CPLEX is not fast: integer programming in the
general case is an NP-Hard problem, and solution complexities for specific decipherment
cases depend on the number of variables and constraints encoded in the integer program.
However, it is possible to obtain less-than-optimal keys faster by interrupting the solver.
2.5.3 Experiments and Results
We create 50 ciphers each of lengths 2, 4, 8, ..., 256. We solve these with 1-gram, 2-gram,
and 3-gram language models. We record the average percentage of ciphertext tokens
decoded incorrectly. 50% error means half of the ciphertext tokens are deciphered wrong,
while 0% means perfect decipherment.
Figure 2.10 shows our automatic decipherment results on ciphers of various lengths.
We note that the integer programming solution obtained for the given objective function
is exact, not heuristic, so the decipherment error is not due to search error. Our use
of global key constraints also leads to accuracy that is superior to the EM method from
(Knight et al., 2006) described in Section 2.3. Table 2.4 compares the error rates obtained
when deciphering with a 2-gram English LM using (a) previous EM-based method, and
[4] http://www.ilog.com/products/cplex
Figure 2.10: Average decipherment error using integer programming vs. cipher length,
for 1-gram, 2-gram and 3-gram models of English. Error bars indicate 95% confidence
intervals. (Axes: decipherment error (%) vs. cipher length in letters, from 2 to 1024.)
(b) our IP method. On long ciphers (414-letter cipher), the IP method gets near-perfect
decipherment, as compared to 10% error obtained when using the EM method. At shorter
cipher lengths, we observe much higher improvements (a difference of almost 64%) when
using our method.
We see (from Figure 2.10) that deciphering with 3-grams works well on ciphers of
length 64 or more. This confirms that such ciphers can be attacked with very limited
knowledge of English (no words or grammar) and little custom programming. The 1-gram
model works badly in this scenario, which is consistent with Bauer's (2006) observation
that for short texts, mechanical decryption on the basis of individual letter frequencies
does not work.
In the next section, we turn to an information-theoretic study of cipher systems. For
this study, it is important that we have a solver that makes no search errors; otherwise
we may wonder whether our results bear on the model itself (as we intend) or are due to
deficiencies in the search procedure. Our integer programming method guarantees that
the solution to any cipher will be optimal under the given model.
2.6 An Empirical Study of Shannon's Decipherment Theory
Next, we empirically explore the concepts in Shannon's (1949) paper on information
theory as applied to cipher systems. We provide quantitative plots for uncertainty in
decipherment, including the famous unicity distance, which estimates how long a cipher
must be to virtually eliminate such uncertainty.
2.6.1 Shannon Equivocation
Very short ciphers are hard to solve accurately. Shannon (1949) pinpointed an inherent
difficulty with short ciphers, one that is independent of the solution method or type of
language model used; the cipher itself may not contain enough information for its proper
solution. For example, given a short cipher like XYYX, we can never be sure if the answer
is peep, noon, anna, etc. Shannon defined a mathematical measure of our decipherment
uncertainty, which he called equivocation (now called entropy).
Shannon's Predictions for Message and Key Equivocations: Let C be a cipher,
M be the plaintext message it encodes, and K be the key by which the encoding takes
place. Before even seeing C, we can compute our uncertainty about the key K by noting
that there are 26! equiprobable keys: [5]

H(K) = -(26!) · (1/26!) · log2 (1/26!) = 88.4 bits

That is, any secret key can be revealed in 89 bits. When we actually receive a cipher C,
our uncertainty about the key and the plaintext message is reduced. Shannon described
our uncertainty about the plaintext message, letting m range over all decipherments:

H(M|C) = equivocation of plaintext message
       = -Σ_m P(m|C) · log2 P(m|C)
P(m|C) is the probability of plaintext m (according to the language model) divided by
the sum of probabilities of all plaintext messages that obey the repetition pattern of C.
Shannon also described H(K|C), the equivocation of key. This uncertainty is typically
larger than H(M|C), because a given message M may be derived from C via more than
one key, in case C does not contain all 26 letters of the alphabet.
Shannon (1949) used analytic means to roughly sketch the curves for H(K|C) and
H(M|C), which we reproduce in Figure 2.11. Shannon's curve is drawn for a human-level
language model, and the y-axis is given in "decimal digits" instead of bits.
[5] (Shannon, 1948) The entropy associated with a set of possible events whose probabilities of occurrence
are p_1, p_2, ..., p_n is given by H = -Σ_{i=1..n} p_i log2 (p_i).
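As a quick numerical check of the 88.4-bit figure (an illustration, not code from the thesis), the
key entropy is simply log2(26!):

    import math

    # H(K) for 26! equiprobable keys: -(26!)(1/26!)log2(1/26!) = log2(26!)
    print(round(math.log2(math.factorial(26)), 1))   # -> 88.4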
Figure 2.11: Equivocation for simple substitution on English, for human-level language
model (Shannon, 1949). (The two curves show key equivocation and message equivocation.)
Empirical Equivocations and Comparison with Shannon's Theoretical Predic-
tions: In (Ravi & Knight, 2008; Ravi & Knight, 2009a), we propose a method to compute
the equivocation measures empirically. While integer programming gives us a method to
find the most probable decipherment without enumerating all keys, we do not know of an
efficient method to compute a fully summed equivocation without enumerating all keys.
Therefore, we sample up to 100,000 plaintext messages in the neighborhood of the most
probable decipherment and compute H(M|C) over that subset.
We compute H(K|C) by letting r(C) be the number of distinct letters in C, and
letting q(C) be (26 - r(C))!. Letting i range over our sample of plaintext messages, we
get:
H(K|C) = equivocation of key
       = -Σ_i q(C) · (P(i)/q(C)) · log2 (P(i)/q(C))
       = -Σ_i P(i) · log2 (P(i)/q(C))
       = -Σ_i P(i) · (log2 P(i) - log2 q(C))
       = -Σ_i P(i) log2 P(i) + Σ_i P(i) log2 q(C)
       = H(M|C) + log2 q(C)
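A minimal sketch of this empirical computation is given below; it assumes the language-model
scores for the sampled plaintexts are already available (passed in as a dictionary), which in our
setting come from the neighborhood of the IP solution, and the helper name is purely illustrative.

    import math

    def empirical_equivocations(sample_probs, ciphertext):
        """sample_probs: {plaintext: LM probability} for sampled decipherments of ciphertext.
        Returns (H(M|C), H(K|C)) estimated over the sample, following the derivation above."""
        total = sum(sample_probs.values())
        h_m_given_c = 0.0
        for prob in sample_probs.values():
            p = prob / total                      # P(m|C), renormalized over the sample
            h_m_given_c -= p * math.log2(p)

        r = len(set(ch for ch in ciphertext if ch.isalpha()))   # distinct cipher letters r(C)
        q = math.factorial(26 - r)                              # q(C) = (26 - r(C))!
        return h_m_given_c, h_m_given_c + math.log2(q)          # H(K|C) = H(M|C) + log2 q(C)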
For comparison to Shannon's analytic predictions, we plot in Figures 2.12 and 2.13
the average equivocations as we empirically observe them using our 1-, 2-, and 3-gram
language models. The shape of the key equivocation curve in Figure 2.12 follows Shannon,
except that it is curved from the start, rather than straight. The message equivocation
curve in Figure 2.13 follows Shannon's prediction, rising then falling. Because very short
ciphers have relatively few solutions (for example, a one-letter cipher has only 26), the
overall uncertainty is not that high. As the cipher gets longer, message equivocation
rises. At some point, it then decreases, as the cipher begins to reveal its secret through
patterns of repetition.
2.6.2 Unicity Distance
Shannon's analytic model (described in Section 2.6.1) also predicts a decline of message
equivocation towards zero as the cipher length increases. He defines the unicity distance
Figure 2.12: Average key equivocation observed (bits) vs. cipher length (letters), for
1-gram, 2-gram and 3-gram models of English.
Figure 2.13: Average message equivocation observed (bits) vs. cipher length (letters), for
1-gram, 2-gram and 3-gram models of English.
(U) as the cipher length at which we have virtually no more uncertainty about the
plaintext.
Shannon's Predictions for Unicity Distance: Using analytic means (and various
approximations), he gives a formula for unicity distance (U):
U = H(K) / (A - B)

where:
A = bits per character of a 0-gram language model = 4.7
B = bits per character of the language model used to decipher

For a human-level language model (B ≈ 1.2), he concludes U ≈ 25, which is
confirmed by practice. For our language models, the formula gives:

U = 173 (1-gram)
U = 74 (2-gram)
U = 50 (3-gram)

These numbers are in the same ballpark as Bauer (2006), who gives 167, 74, and 59.
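The formula is easy to evaluate; the snippet below plugs in H(K) = 88.4 bits, and the per-character
entropies B shown for our models are back-of-the-envelope values chosen to reproduce the distances
quoted above (they are not measurements taken from the thesis).

    def unicity_distance(h_key=88.4, a=4.7, b=1.2):
        """Shannon's unicity distance U = H(K) / (A - B)."""
        return h_key / (a - b)

    print(round(unicity_distance(b=1.2)))                # human-level LM: ~25
    for order, b in [("1-gram", 4.19), ("2-gram", 3.51), ("3-gram", 2.93)]:
        print(order, round(unicity_distance(b=b)))       # -> 173, 74, 50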
Comparison of Empirical Unicity Distance with Shannon's Predictions: We
note that these predicted unicity distances are a bit too rosy, according to our empirical
message equivocation curves. It can be seen that for real ciphers, the empirical unicity
point (from Figure 2.13) does not match the predicted numbers (above). The empirical
message equivocation does not reduce to zero at the predicted cipher lengths. For ex-
ample, the formula-based unicity distance for a 3-gram language model is 50, whereas in
practice for real ciphers this value can be as large as 128 (the 3-gram message equivoca-
tion curve is close to zero at this point in Figure 2.13). We believe that this difference in
unicity values exists due to certain assumptions made by Shannon in the unicity compu-
tation for random ciphers (Shannon, 1949), which do not hold in practice for real ciphers.
Given a cipher that is longer than the unicity distance, we often observe two plaintext
solutions such that one is optimal under the given language model, and the other almost
matches the first, except for a few letters. In this case, the second key lies close to the
optimum in the solution space, and therefore introduces uncertainty about the plaintext.
2.7 Conclusion
To summarize, in this chapter, we proposed novel methods for solving letter and syllable
substitution ciphers. We outlined new methods to improve decipherments produced by
the probabilistic EM method from Knight et al. (2006), that yield considerably higher
accuracies on both English letter substitution ciphers (30% improvement) as well as
Japanese syllable ciphers (7% improvement on long ciphers, and 30% improvement on
short ciphers). We also introduced a novel method for deciphering letter substitution
ciphers with low-order models of English. This method, based on integer programming
(IP), requires very little custom coding and can perform an optimal search over the key
space. Using a 2-gram LM, the IP method achieves a much better performance than the
previous EM-based approach (Knight et al., 2006), achieving up to 10% improvement
in decipherment accuracy on a long cipher, and a much higher (64%) improvement on
shorter ciphers. We also empirically explored the concepts from Shannon's (1949) paper
on information theory as applied to cipher systems.
Chapter 3
Deciphering Complex Letter Substitution Ciphers
In the previous chapter, we tackled simple letter substitution ciphers. We now turn our
attention to other types of ciphers such as homophonic letter substitution ciphers. These
ciphers are more complex and present several new challenges for decipherment. In order
to solve such ciphers, we need to go beyond the simple n-gram language models employed
by our previous approaches and utilize additional information from external sources (for
example, English word dictionaries). We also require efficient inference algorithms that
can deal with the complexities involved in such decipherment tasks.
We present a new probabilistic decipherment approach using Bayesian inference with
sparse priors, which can be used to solve different types of substitution ciphers. Our
method (Ravi & Knight, 2011a) uses a decipherment model which combines information
from letter n-gram language models as well as word dictionaries. Bayesian inference is
performed on our model using an efficient sampling technique. We evaluate the quality of
the Bayesian decipherment output on simple and homophonic letter substitution ciphers
and show that unlike a previous approach, our method consistently produces almost 100%
accurate decipherments. The new method can be applied on more complex substitution
ciphers and we demonstrate its utility by cracking the famous Zodiac-408 cipher in a fully
automated fashion, which has never been done before.
3.1 Introduction
In letter substitution ciphers, (English) plaintext letters are replaced with cipher symbols
in order to generate the ciphertext sequence. Most decipherment work in this area has
focused on solving simple substitution ciphers and in Chapter 2, we already discussed
several decipherment techniques for tackling such ciphers. But there are other vari-
ants of substitution ciphers, such as homophonic ciphers, which display increasing levels
of difficulty and present significant challenges for decipherment (Section 3.2 details the
properties of such ciphers). The famous Zodiac serial killer used one such cipher system
for communication. In 1969, the killer sent a three-part cipher message to newspapers
claiming credit for recent shootings and crimes committed near the San Francisco area.
The 408-character message (Zodiac-408) was manually decoded by hand in the 1960's.
Solving such ciphers automatically using computers is a challenging task.
3.2 Homophonic Substitution Ciphers
A homophonic cipher uses a substitution key that replaces each plaintext letter with a
variety of substitutes: to encode a particular plaintext character, the key may contain
more than one cipher symbol choice.
For example, the English plaintext:
"H E L L O W O R L D ..."
may be enciphered as:
"65 82 51 84 05 60 54 42 51 45 ..."
according to the key:
A: 09 12 33 47 53 67 78 92
B: 48 81
...
E: 14 16 24 44 46 55 57 64 74 82 87 98
...
L: 51 84
...
Z: 02
Notice the non-determinism involved in the enciphering direction: the English letter
"L" is substituted using different symbols (51, 84) at different positions in the ciphertext.
These ciphers are more complex than simple substitution ciphers. Homophonic ci-
phers are generated via a non-deterministic encipherment process: the key is 1-to-many
in the enciphering direction. The number of potential cipher symbol substitutes for a
particular plaintext letter is often proportional to the frequency of that letter in the
plaintext language; for example, the English letter "E" is assigned more cipher symbols
than "Z". The objective of this is to flatten out the frequency distribution of ciphertext
symbols, making a frequency-based cryptanalysis attack difficult.
The substitution key is, however, deterministic in the decipherment direction: each
ciphertext symbol maps to a single plaintext letter. Since the ciphertext can contain
more than 26 types, we need a larger alphabet system; we use a numeric substitution
alphabet in our experiments.
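A small sketch of this non-deterministic encipherment is shown below; the partial key lists only
the substitutes needed for the running example (derived from the example ciphertext above), and
each plaintext letter is replaced by a randomly chosen symbol from its list.

    import random

    # Minimal sketch: homophonic encipherment with a partial key (running example only).
    key = {
        "H": ["65"], "E": ["14", "16", "82", "87", "98"], "L": ["51", "84"],
        "O": ["05", "54"], "W": ["60"], "R": ["42"], "D": ["45"],
    }

    def encipher(plaintext):
        # Each letter maps to a randomly chosen substitute; spaces are dropped here.
        return " ".join(random.choice(key[ch]) for ch in plaintext if ch != " ")

    print(encipher("HELLO WORLD"))   # e.g. -> 65 82 51 84 05 60 54 42 51 45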
The table shown below compares the properties of the two types of substitution ciphers
(simple and homophonic) with respect to different key characteristics.

Key Characteristics                                                  Simple sub          Homophonic sub
Is the key deterministic in the enciphering direction?               Yes                 No
Is the key deterministic in the deciphering direction?               Yes                 Yes
Does the key substitute one-for-one (symbol for symbol)
  in the enciphering direction?                                      Yes                 Yes
Does the key substitute one-for-one in the deciphering direction?    Yes                 Yes
What linguistic unit is substituted?                                 Letter / Syllable   Letter
Does the key involve transposition (re-ordering)?                    No                  No
Homophonic Ciphers without spaces
In the previous two cipher systems, the word-boundary information was preserved in the
cipher. We now consider a more difficult homophonic cipher by removing space characters
from the original plaintext.
Consider the English plaintext from the previous example with word boundaries re-
moved. It now looks like this:
"HELLOWORLD ..."
and the corresponding ciphertext is:
"65 82 51 84 05 60 54 42 51 45 ..."
Without the word boundary information, typical dictionary-based decipherment at-
tacks fail on such ciphers.
Zodiac-408 Cipher: Homophonic ciphers without spaces have been used extensively
in the past to encrypt secret messages. One of the most famous homophonic ciphers in
history was used by the infamous Zodiac serial killer in the 1960's. The killer sent a
series of encrypted messages to newspapers and claimed that solving the ciphers would
reveal clues to his identity. The identity of the Zodiac killer remains unknown to date.
However, the mystery surrounding this has sparked much interest among cryptanalysis
experts and amateur enthusiasts.
The Zodiac messages include two interesting ciphers: (1) a 408-symbol homophonic
cipher without spaces (which was solved manually by hand), and (2) a similar looking
340-symbol cipher that has yet to be solved.
Here is a sample of the Zodiac-408 cipher message:
...
and the corresponding section from the original English plaintext message:
I L I K E K I L L I N G P E O P L
E B E C A U S E I T I S S O M U C
H F U N I T I S M O R E F U N T H
A N K I L L I N G W I L D G A M E
I N T H E F O R R E S T B E C A U
S E M A N I S T H E M O S T D A N
G E R O U E A N A M A L O F A L L
T O K I L L S O M E T H I N G G I
...
Besides the difficulty with missing word boundaries and non-determinism associated
with the key, the Zodiac-408 cipher poses several additional challenges which make
it harder to solve than any standard homophonic cipher. There are spelling mistakes
in the original message (for example, the English word "PARADISE" is misspelt as
"PARADICE") which can divert a dictionary-based attack. Also, the last 18 characters
of the plaintext message do not seem to make any sense ("EBEORIETEMETHHPITI").
3.3 Previous Work
The literature contains a lot of work on automatic decipherment methods for solving
simple letter-substitution ciphers. In Chapter 2, we already discussed many of these
along with our related contributions in this area. However, there has been very little work
published on deciphering homophonic ciphers automatically. Oranchak (2008) presents
a method for solving the Zodiac-408 cipher automatically with a dictionary-based attack
using a genetic algorithm. However, his method relies on using plaintext words from the
known solution to solve the cipher, which departs from a strict decipherment scenario.
Next, we describe our new Bayesian decipherment approach for tackling substitution
ciphers.
3.4 Bayesian Decipherment
Bayesian inference methods have become popular in natural language processing (Gold-
water & Griffiths, 2007; Finkel et al., 2005; Blunsom et al., 2009; Chiang et al., 2010).
Snyder et al. (2010) proposed a Bayesian approach in an archaeological decipherment sce-
nario. These methods are attractive for their ability to manage uncertainty about model
parameters and allow one to incorporate prior knowledge during inference. A common
phenomenon observed while modeling natural language problems is sparsity. For simple
letter substitution ciphers, the original substitution key exhibits a 1-to-1 correspondence
between the plaintext letters and cipher types. It is not easy to model such information
using conventional methods like EM. But we can easily specify priors that favor sparse
distributions within the Bayesian framework.
Objective: Given a ciphertext message c_1 ... c_n, the goal of decipherment is to uncover
the hidden plaintext message p_1 ... p_n. The size of the keyspace (i.e., number of possible key
mappings) that we have to navigate during decipherment is huge: a simple substitution
cipher has a keyspace size of 26!, whereas a homophonic cipher such as the Zodiac-408
cipher has 26^54 possible key mappings.
Here, we propose a novel approach for deciphering substitution ciphers using Bayesian
inference. Rather than enumerating all possible keys (26! for a simple substitution
cipher), our Bayesian framework requires us to sample only a small number of keys
during the decipherment process.
3.4.1 Model
Our decipherment method follows a noisy-channel approach. We are faced with a cipher-
text sequence c = c_1 ... c_n and we want to find the (English) letter sequence p = p_1 ... p_n
that maximizes the probability P(p|c).
We first formulate a generative story to model the process by which the ciphertext
sequence is generated.

1. Generate an English plaintext sequence p = p_1 ... p_n, with probability P(p).
2. Substitute each plaintext letter p_i with a ciphertext token c_i, with probability
P(c_i | p_i) in order to generate the ciphertext sequence c = c_1 ... c_n.

We build a statistical English language model (LM) for the plaintext source model
P(p), which assigns a probability to any English letter sequence. Our goal is to estimate
the channel model parameters θ in order to maximize the probability of the observed
ciphertext c:

arg max_θ P_θ(c) = arg max_θ Σ_p P_θ(p, c)                          (3.1)
                 = arg max_θ Σ_p P(p) · P_θ(c | p)                  (3.2)
                 = arg max_θ Σ_p P(p) · Π_{i=1..n} P_θ(c_i | p_i)   (3.3)
We estimate the parameters using Bayesian learning. In our decipherment frame-
work, a Chinese Restaurant Process (CRP) formulation is used to model both the source
and channel. The detailed generative story using CRPs is shown below:

1. i ← 1
2. Generate the English plaintext letter p_1, with probability P_0(p_1)
3. Substitute p_1 with cipher token c_1, with probability P_0(c_1 | p_1)
4. i ← i + 1
5. Generate English plaintext letter p_i, with probability

   ( α · P_0(p_i | p_{i-1}) + C_1^{i-1}(p_{i-1}, p_i) ) / ( α + C_1^{i-1}(p_{i-1}) )

6. Substitute p_i with cipher token c_i, with probability

   ( β · P_0(c_i | p_i) + C_1^{i-1}(p_i, c_i) ) / ( β + C_1^{i-1}(p_i) )

7. With probability P_quit, quit; else go to Step 4.
This defines the probability of any given derivation, i.e., any plaintext hypothesis
corresponding to the given ciphertext sequence. The base distribution P_0 represents prior
knowledge about the model parameter distributions. For the plaintext source model, we
use probabilities from an English language model and for the channel model, we specify a
uniform distribution (i.e., a plaintext letter can be substituted with any given cipher type
with equal probability). C_1^{i-1} represents the count of events occurring before plaintext
letter p_i in the derivation (we call this the "cache"). α and β represent Dirichlet prior
hyperparameters over the source and channel models respectively. A large prior value
implies that words are generated from the base distribution P_0, whereas a smaller value
biases words to be generated with reference to previous decisions inside the cache (favoring
sparser distributions).
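The cache-based probability in step 5 is easy to state in code; the sketch below computes it for the
source side (the channel side in step 6 is analogous, with β and plaintext-to-cipher counts), and all
variable names and example values are illustrative rather than taken from the thesis implementation.

    from collections import defaultdict

    def crp_source_prob(p_i, p_prev, base_bigram, pair_counts, context_counts, alpha):
        """( alpha * P0(p_i | p_prev) + C(p_prev, p_i) ) / ( alpha + C(p_prev) ), where the
        C(.) counts come from events generated earlier in the derivation (the cache)."""
        numerator = alpha * base_bigram.get((p_prev, p_i), 0.0) + pair_counts[(p_prev, p_i)]
        return numerator / (alpha + context_counts[p_prev])

    # Hypothetical cache state: three bigram events generated so far.
    pair_counts = defaultdict(int, {("h", "e"): 1, ("e", "l"): 1, ("l", "l"): 1})
    context_counts = defaultdict(int, {"h": 1, "e": 1, "l": 1})
    base_bigram = {("l", "o"): 0.04, ("l", "l"): 0.06}

    print(crp_source_prob("o", "l", base_bigram, pair_counts, context_counts, alpha=0.1))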
3.4.2 Inference
Efficient inference via type sampling: We use a Gibbs sampling (Geman & Geman,
1984) method for performing inference on our model. We could follow a point-wise
sampling strategy, where we sample plaintext letter choices for every cipher token, one
at a time. But we already know that the substitution ciphers described here exhibit
determinism in the deciphering direction, [1] i.e., although we have no idea about the key
mappings themselves, we do know that there exists only a single plaintext letter mapping
for every cipher symbol type in the true key. So sampling plaintext choices for every
cipher token separately is not an efficient strategy: our sampler may spend too much
time exploring invalid keys (which map the same cipher symbol to different plaintext
letters).

[1] This assumption does not strictly apply to the Zodiac-408 cipher where a few cipher symbols exhibit
non-determinism in the decipherment direction as well.
entire plaintext hypothesis (i.e., plaintext letters at all corresponding positions) to reflect
this change. For example, if we sample a new choice p_new for a cipher symbol which
occurs at positions 4, 10, 18, then we update plaintext letters p_4, p_10 and p_18 with the new
choice p_new.
Using the property of exchangeability, we derive an incremental formula for re-scoring
the probability of a new derivation based on the probability of the old derivation: when
sampling at position i, we pretend that the area affected (within a context window around
i) in the current plaintext hypothesis occurs at the end of the corpus, so that both the
old and new derivations share the same cache. [2] While we may make corpus-wide changes
to a derivation in every sampling step, exchangeability allows us to perform scoring in an
efficient manner.
Combining letter n-gram language models with word dictionaries: Many ex-
isting probabilistic approaches use statistical letter n-gram language models of English
to assign P(p) probabilities to plaintext hypotheses during decipherment. Other decryp-
tion techniques rely on word dictionaries (using words from an English dictionary) for
attacking substitution ciphers.

Unlike previous approaches, our decipherment method combines information from
both sources, letter n-grams and word dictionaries. We build an interpolated word+n-
gram LM and use it to assign P(p) probabilities to any plaintext letter sequence p_1 ... p_n.
We set the interpolation weights for the word and n-gram LM as (0.9, 0.1); the word-based
LM is constructed from a dictionary consisting of 9,881 frequently occurring words collected
from Wikipedia articles, and the letter n-gram LM is trained on 50 million words of English
text available from the Linguistic Data Consortium.
The advantage is that it helps direct the sampler towards plaintext hypotheses that
resemble natural language: high probability letter sequences which form valid words
such as "H E L L O" instead of sequences like "T X H R T". But in addition to this,
using letter n-gram information makes our model robust against variations in the original
plaintext (for example, unseen words or misspellings as in the case of the Zodiac-408 cipher)
which can easily throw off dictionary-based attacks. Also, it is hard for a point-wise (or
type) sampler to "find words" starting from a random initial sample, but easier to "find
n-grams".
Sampling for ciphers without spaces: For ciphers without spaces, dictionaries are
hard to use because we do not know where words start and end. We introduce a new
sampling operator which counters this problem and allows us to perform inference using
the same decipherment model described earlier. In a first sampling pass, we sample from
26 plaintext letter choices (e.g., "A", "B", "C", ...) for every cipher symbol type as
before. We then run a second pass using a new sampling operator that iterates over
adjacent plaintext letter pairs p_{i-1}, p_i in the current hypothesis and samples from two
choices: (1) add a word boundary (space character " ") between p_{i-1} and p_i, or (2)
remove an existing space character between p_{i-1} and p_i.

For example, given the English plaintext hypothesis "... A B O Y ...", there
are two sampling choices for the letter pair A,B in the second step. If we decide to add
a word boundary, our new plaintext hypothesis becomes "... A B O Y ..." with a space
character inserted between A and B.
We compute the derivation probability of the new sample using the same efficient
scoring procedure described earlier. The new strategy allows us to apply Bayesian deci-
pherment even to ciphers without spaces. As a result, we now have a new decipherment
method that consistently works for a range of different types of substitution ciphers.
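A schematic sketch of this second-pass boundary operator follows; it is an assumption rather than the thesis code, and score_derivation is a hypothetical scorer using the interpolated word+n-gram LM and the channel model:

import random

def render(letters, boundaries):
    """Insert a space after position j for every j in boundaries."""
    out = []
    for j, ch in enumerate(letters):
        out.append(ch)
        if j in boundaries:
            out.append(" ")
    return "".join(out)

def resample_boundary(letters, boundaries, j, score_derivation):
    """Resample whether a word boundary appears between letters[j] and letters[j+1],
    in proportion to the two derivation scores."""
    with_b = boundaries | {j}
    without_b = boundaries - {j}
    s_with = score_derivation(render(letters, with_b))
    s_without = score_derivation(render(letters, without_b))
    keep = random.random() * (s_with + s_without) < s_with
    return with_b if keep else without_b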
Decoding the ciphertext: After the sampling run has finished, we choose the final
sample as our English plaintext decipherment output. For letter substitution decipherment
we want to keep the language model probabilities fixed during training, and hence we set
the prior on that model to be high (α = 10^4). We use a sparse prior for the channel
(β = 0.01). We instantiate a key which matches frequently occurring plaintext letters to
frequent cipher symbols, use this to generate an initial sample for the given ciphertext, and
run the sampler for 5000 iterations. We use a linear annealing schedule during sampling,
decreasing the temperature from 10 to 1.
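As an illustration of the annealing schedule just mentioned, here is a small sketch (an assumption, not the thesis code) of linearly decreasing the temperature and sharpening a sampling distribution by raising scores to the power 1/T:

import random

def anneal_temperatures(n_iters, start=10.0, end=1.0):
    steps = max(n_iters - 1, 1)
    return [start + (end - start) * i / steps for i in range(n_iters)]

def sample_choice(scores, temperature):
    """Sample an index in proportion to scores sharpened by the current temperature."""
    weights = [s ** (1.0 / temperature) for s in scores]
    return random.choices(range(len(scores)), weights=weights)[0]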
3.5 Experiments and Results
Data: We run decipherment experiments on different types of letter substitution ciphers
(described in Sections 3.1 and 3.2). In particular, we work with the following three
ciphers:

(a) Simple Substitution Cipher: Here, we work with a 414-letter simple substitution
cipher. We encrypt an original English plaintext message using a randomly generated
simple substitution key to create the ciphertext. During the encipherment process,
we preserve spaces between words and use this information for decipherment, i.e.,
plaintext character " " maps to ciphertext character " ". Figure 3.1 (top) shows a
portion of the ciphertext along with the original plaintext used to create the cipher.
(b) Homophonic Cipher (with spaces): For our decipherment experiments on homo-
phonic ciphers, we use the same 414-letter English plaintext used in (a). We encrypt
this message using a homophonic substitution key (available from
http://www.simonsingh.net/The_Black_Chamber/homophoniccipher.htm). As before, we
preserve spaces between words in the ciphertext. Figure 3.1 (middle) displays a section
of the homophonic cipher (with spaces) and the original plaintext message used in our
experiments.
Plaintext:  D E C I P H E R M E N T I S T H E A N A L Y S I S O F D O C U M E N T S
            W R I T T E N I N A N C I E N T L A N G U A G E S W H E R E T H E ...
Ciphertext: i n g c m p n q s n w f c v f p n o w o k t v c v h u i h g z s n w f v
            r q c f f n w c w o w g c n w f k o w a z o a n v r p n q n f p n ...
Bayesian solution: D E C I P H E R M E N T I S T H E A N A L Y S I S O F D O C U M E N T S
            W R I T T E N I N A N C I E N T L A N G U A G E S W H E R E T H E ...

Plaintext:  D E C I P H E R M E N T I S T H E A N A L Y S I S
            O F D O C U M E N T S W R I T T E N I N ...
Ciphertext: 79 57 62 93 95 68 44 77 22 74 59 97 32 86 85 56 82 67 59 67 84 52 86 73 11
            99 10 45 90 13 61 27 98 71 49 19 60 80 88 85 20 55 59 32 91 ...
Bayesian solution: D E C I P H E R M E N T I S T H E A N A L Y S I S
            O F D O C U M E N T S W R I T T E N I N ...

Ciphertext: (Zodiac-408 symbols, not reproduced here)
Plaintext:  (Zodiac-408 plaintext)
Bayesian solution (final decoding): I L I K E K I L L I N G P E O P L E B E C A U S E
            I T I S S O M U C H F U N I T I A M O R E F U N T
            H A N K I L L I N G W I L D G A M E I N T H E F O
            R R E S T B E C A U S E M A N I S T H E M O A T D
            A N G E R T U E A N A M A L O F A L L ...
(with spaces shown): I L I K E K I L L I N G P E O P L E B E C A U S E
            I T I S S O M U C H F U N I T I A M O R E
            F U N T H A N K I L L I N G W I L D G A M E I N
            T H E F O R R E S T B E C A U S E M A N I S T H E
            M O A T D A N G E R T U E A N A M A L O F A L L ...

Figure 3.1: Samples from the ciphertext sequence, corresponding English plaintext mes-
sage and output from Bayesian decipherment (using word+3-gram LM) for three different
ciphers: (a) Simple Substitution Cipher (top), (b) Homophonic Substitution Cipher with
spaces (middle), and (c) Zodiac-408 Cipher (bottom).
Method       LM            Accuracy (%) on 414-letter    Accuracy (%) on 414-letter        Accuracy (%) on
                           Simple Substitution Cipher    Homophonic Substitution Cipher    Zodiac-408 Cipher
                                                         (with spaces)
1. EM        2-gram        83.6                          30.9
             3-gram        99.3                          32.6                              0.3
                                                                                           (28.8 with 100 restarts)
2. Bayesian  3-gram        100.0                         95.2                              23.0
             word+2-gram   100.0                         100.0
             word+3-gram   100.0                         100.0                             97.8

Figure 3.2: Comparison of decipherment accuracies for EM versus Bayesian method when
using different language models of English on the three substitution ciphers: (a) 414-
letter Simple Substitution Cipher, (b) 414-letter Homophonic Substitution Cipher (with
spaces), and (c) the famous Zodiac-408 Cipher.
(c) Zodiac-408 Cipher: Figure 3.1 (bottom) displays the Zodiac-408 cipher (consisting
of 408 tokens, 54 symbol types) along with the original plaintext message.
Methods: For each cipher, we run and compare the output from two different decipher-
ment approaches:

1. EM Method using letter n-gram LMs following the approach of Knight et al. (2006).
They use the EM algorithm to estimate the channel parameters θ during decipher-
ment training. The given ciphertext c is then decoded by using the Viterbi algorithm
to choose the plaintext decoding p that maximizes P(p) · P_θ(c|p)^3, stretching the
channel probabilities.
2. Bayesian Decipherment method using word+n-gram LMs (novel approach de-
scribed in Section 3.4).
Evaluation: We evaluate the quality of a particular decipherment as the percentage of
cipher tokens that are decoded correctly.
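Concretely, the metric can be computed as in the following minimal sketch (an illustration, not the thesis code):

def decipherment_accuracy(decoded, gold):
    """Percentage of cipher tokens whose decoded plaintext matches the gold plaintext."""
    assert len(decoded) == len(gold)
    correct = sum(1 for d, g in zip(decoded, gold) if d == g)
    return 100.0 * correct / len(gold)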
Results: Figure 3.2 compares the decipherment performance for the EM method with
Bayesian decipherment (using type sampling and sparse priors) on three different types
of substitution ciphers. Results show that our new approach (Bayesian) outperforms
the EM method on all three ciphers, solving them completely. Even with a 3-gram
letter LM, our method yields a +63% improvement in decipherment accuracy over EM
on the homophonic cipher with spaces. We observe that the word+3-gram LM proves
highly effective when tackling more complex ciphers and cracks the Zodiac-408 cipher.
Figure 3.1 shows samples from the Bayesian decipherment output for all three ciphers.
For ciphers without spaces, our method automatically guesses the word boundaries for
the plaintext hypothesis.
For the Zodiac-408 cipher, we compare the performance achieved by Bayesian deci-
pherment under different settings:

Letter n-gram versus Word+n-gram LMs: Figure 3.2 shows that using a word+3-
gram LM instead of a 3-gram LM results in a +75% improvement in decipherment
accuracy.

Sparse versus Non-sparse priors: We find that using a sparse prior for the channel
model (β = 0.01 versus 1.0) helps for such problems and produces better decipher-
ment results (97.8% versus 24.0% accuracy).

Type versus Point-wise sampling: Unlike point-wise sampling, type sampling quickly
converges to better decipherment solutions. After 5000 sampling passes over the en-
tire data, decipherment output from type sampling scores 97.8% accuracy compared
to 14.5% for the point-wise sampling run (both sampling runs were seeded with the
same random initial sample).
We also perform experiments on shorter substitution ciphers. On a 98-letter simple
substitution cipher, EM using 3-gram LM achieves 41% accuracy, whereas the method
from Ravi and Knight (2009d) scores 84% accuracy. Our Bayesian method performs the
best in this case, achieving 100% with word+3-gram LM.
3.6 Conclusion
In this chapter, we presented a novel Bayesian decipherment approach that can effectively
solve a variety of substitution ciphers. Unlike previous approaches, our method combines
information from letter n-gram language models and word dictionaries and provides a ro-
bust decipherment model. We empirically evaluated the method on different substitution
ciphers and achieved perfect decipherments on all of them. Using Bayesian decipherment,
we can successfully solve the Zodiac-408 cipher, the first time this has been achieved by a
fully automatic method in a strict decipherment scenario.
Chapter 4
Small Decipherment Models for Natural Language
Problems
In this chapter, we analyze decipherment approaches to various unsupervised problems
and compare them with respect to decipherment accuracy versus size of the model used
for decipherment. We show that for many unsupervised tasks related to natural language,
it is advantageous to use small models for decipherment.
First, we revisit letter substitution decipherment, a cryptanalysis problem that was
discussed previously in Chapter 2. We analyze some of the decipherment models that
were introduced to solve this problem, and correlate the decipherment performance with
the model size.
We then shift our focus from cryptanalysis to natural language problems and tackle un-
supervised tasks such as part-of-speech tagging (Section 4.2), supertagging (Section 4.3)
and word alignment (Section 4.4). For these natural language tasks, we introduce novel
decipherment methods that explicitly search for minimized models using integer program-
ming. We also present an alternative model minimization strategy: a fast and efficient
greedy approximation algorithm that can scale well to large data and problem sizes while
yielding high task accuracies.
On all the tasks discussed, our methods, which rely on small models, outperform ex-
isting state-of-the-art systems and produce good decipherment results. For letter substi-
tution decipherment, we show performance improvements ranging between 10-64% when
we apply our idea of keeping the model size small. On the unsupervised part-of-speech
tagging task, our novel method, which is based on the same idea, yields a very high 92.3%
tagging accuracy, the best reported result on the task so far. For the supertagging
task, we achieve state-of-the-art performance, yielding a 3-4% improvement over a
previous approach. We also show significant improvements in accuracy (f-measure) on
word and sub-word alignment tasks, where a small-dictionary based approach yields
performance improvements ranging from 9% to 63% over existing approaches.
4.1 Introduction
In Chapter 2, we discussed in detail two methods for solving simple letter substitution
ciphers:
1. EM method: A probabilistic decipherment method, where the EM algorithm is used
to search for the key that maximizes the probability of seeing the given ciphertext
P(c). The final substitution model P(c|e) that is learnt from decipherment training
is non-deterministic, i.e., a plaintext letter can be mapped to multiple ciphertext
letters and vice-versa.
Decipherment Method     Mappings left in model (after training)     Decipherment Error
                        Long Cipher       Short Cipher              Long Cipher     Short Cipher
                        (414 letters)     (52 letters)              (414 letters)   (52 letters)
1. EM Decipherment      76                54                        10%             85%
2. IP Decipherment      23                19                        0.5%            21%

Figure 4.1: Comparison of decipherment error versus model size (i.e., number of plaintext-
ciphertext mappings left in the model after training) for letter substitution decipherment
when using (1) EM method, and (2) IP method, with a 2-gram letter-based language
model. For EM decipherment, we compute the model size counting only those mappings
which occur with probability above 0.0001.
2. IP method: A method based on integer programming, which explicitly searches for
a deterministic key (i.e., only a single ciphertext mapping exists for each plaintext
letter, and vice-versa) via constraints in the integer program.
If we count the mappings present in the models learnt by these two methods (shown in
Figure 4.1), we see that in the case of the IP method, explicitly searching for a determin-
istic key keeps the models small. IP searches over a smaller model space in comparison
to the EM method, which produces a probabilistic key (with three times more mappings
than IP) at the end of decipherment training. Comparing the accuracies on long and
short ciphers, we observe that the IP method performs better than EM. There are a few
reasons why the decipherment results from IP are better, but we make an important
observation: the size of the model chosen by IP is considerably smaller than the model
obtained after EM training, and this definitely has an effect on the decipherment end
results.

We see that when the system picks a smaller model (in terms of number of parameters
learnt), the resulting accuracy on the decipherment task also improves.
Now, let us look at some other unsupervised problems that fit into our decipherment
framework, and where explicitly minimizing the model size helps. For the remainder of
this chapter (and in the coming chapters), we shift our focus from cryptanalysis decipher-
ment tasks to solving natural language problems.
NLP without Labeled Data: A Decipherment Approach
We note that many unsupervised problems in NLP can be modeled using similar deci-
pherment principles as described earlier (i.e., substitution operations on letters or words);
the only difference is that the search space is much larger for NLP tasks. For example,
in part-of-speech (POS) tagging, decipherment involves substituting (cipher) words in
a sentence with their appropriate (plaintext) syntactic categories such as Noun, Verb,
Adjective, etc. In a supertagging task, words have to be substituted with complex lexi-
cal categories defined by grammar formalisms such as Combinatory Categorial Grammar
(CCG) (Steedman, 2000).
To tackle existing NLP problems, we devise novel decipherment techniques. A key
invention is the idea of "searching for minimized models" during unsupervised learning,
which can be applied to several unsupervised tasks. We show that this is an effective
decipherment strategy which yields state-of-the-art results on many fundamental NLP
problems: (1) sequence labeling tasks such as unsupervised POS tagging and supertag-
ging, and (2) the word alignment task (annotating a bilingual text with links connecting
words that have the same meanings).
For these tasks, we introduce novel methods that explicitly search for minimized
models using integer programming. Our empirical findings show that minimizing model
size correlates highly with better solution quality.
4.2 Unsupervised Part-of-Speech Tagging
In recent years, we have seen increased interest in using unsupervised methods for at-
tacking different NLP tasks like part-of-speech (POS) tagging. Here, we describe a novel
method for the task of unsupervised POS tagging with a dictionary (Ravi & Knight,
2009c), one that explicitly searches for the smallest model that explains the data, and
then uses EM to set parameter values. We evaluate our method on a standard test corpus
and show that our approach performs better than existing state-of-the-art systems.
Task: We adopt the problem formulation of Merialdo (1994), in which we are given a
raw word sequence and a dictionary of legal tags for each word type. The goal is to tag
each word token so as to maximize accuracy against a gold tag sequence. To illustrate
the nature of this task, here is an example:

Given a word sequence (for example, the following English sentence):
the dog saw the cat on the mat
and a dictionary of legal POS tags for each word:
DT {N,V} {N,V} DT N P DT N
the goal is to produce a POS tag sequence:
DT N V DT N P DT N
In the above example, the word saw has two tagging possibilities (Noun, Verb), but
given the specific context the correct POS tag is Verb.

In Chapter 2, we remarked that decipherment can be considered as an unsupervised
tagging problem, where the task is to tag every ciphertext token with a corresponding
plaintext token. The unsupervised tagging task that we tackle here also fits into our
decipherment framework. Some of the key characteristics specific to this problem are
shown in the following table:
Key Characteristics                                        Simple sub        Homophonic sub   Unsupervised POS Tagging
Is the key deterministic in the enciphering direction?     Yes               No               No
Is the key deterministic in the deciphering direction?     Yes               Yes              No
Does the key substitute one-for-one (symbol for symbol)
in the enciphering direction?                              Yes               Yes              Yes
Does the key substitute one-for-one in the deciphering
direction?                                                 Yes               Yes              Yes
What linguistic unit is substituted?                       Letter/Syllable   Letter           Word
Does the key involve transposition (re-ordering)?          No                No               No
Other properties of the key that are specific to the
problem                                                    -                 -                1) Key "hints" are provided in the
                                                                                              form of a dictionary
                                                                                              2) No LM data is provided
Data: We use the standard test set from the literature for this task, a 24,115-word subset
of the Penn Treebank, for which a gold tag sequence is available. There are 5,878 word
types in this test set. We use the standard tag dictionary, consisting of 57,388 word/tag
pairs derived from the entire Penn Treebank. 8,910 dictionary entries are relevant to the
5,878 word types in the test set. Per-token ambiguity is about 1.5 tags/token, yielding
approximately 10^6425 possible ways to tag the data. There are 45 distinct grammatical
tags. In this set-up, there are no unknown words.
System                                                                         Tagging accuracy (%) on
                                                                               24,115-word corpus
1. Random baseline (for each word, pick a random tag from the alternatives
   given by the word/tag dictionary)                                          64.6
2. EM with 2-gram tag model                                                   81.7
3. EM with 3-gram tag model                                                   74.5
4a. Bayesian method (Goldwater & Griffiths, 2007)                             83.9
4b. Bayesian method with sparse priors (Goldwater & Griffiths, 2007)          86.8
5. CRF model trained using contrastive estimation (Smith & Eisner, 2005)      88.6
6. EM-HMM tagger provided with good initial conditions (Goldberg et al.)      91.4*
   (*uses linguistic constraints and manual adjustments to the dictionary)

Figure 4.2: Previous results on unsupervised POS tagging using a dictionary (Merialdo,
1994) on the full 45-tag set.
4.2.1 Previous Work
Recently, this task has proven to be a useful testbed for evaluating various unsupervised
algorithms and hence a lot of work has been published in this area.
The classic Expectation Maximization (EM) algorithm has been shown to perform
poorly on POS tagging, when compared to other techniques such as Bayesian methods.
Figure 4.2 shows prior results for this problem. While the methods are quite different,
they all make use of two common model elements. One is a probabilistic n-gram tag model
P(t_i | t_{i-n+1} ... t_{i-1}), which we call the grammar. The other is a probabilistic word-given-tag
model P(w_i | t_i), which we call the dictionary.
The classic approach (Merialdo, 1994) is Expectation Maximization (EM), where we
estimate grammar and dictionary probabilities in order to maximize the probability of
the observed word sequence:

    P(w_1 ... w_n) = Σ_{t_1 ... t_n} P(t_1 ... t_n) · P(w_1 ... w_n | t_1 ... t_n)
                   = Σ_{t_1 ... t_n} Π_{i=1..n} P(t_i | t_{i-2} t_{i-1}) · P(w_i | t_i)
The literature omits one other baseline, which is EM with a 2-gram tag model. Here
we obtain 81.7% accuracy, which is better than the 3-gram model. It seems that EM
with a 3-gram tag model runs amok with its freedom. In our experiments, we will limit
ourselves to a 2-gram tag model.
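For concreteness, here is a minimal sketch (not the thesis code) of the quantity EM maximizes with a 2-gram tag model: the probability of the word sequence summed over all taggings allowed by the tag dictionary, computed with the forward algorithm. The grammar lookup (including a special START tag) and the dictionary lookup are assumed inputs.

def sequence_probability(words, tag_dict, grammar, dictionary):
    """grammar[(t1, t2)] = P(t2 | t1); dictionary[(w, t)] = P(w | t)."""
    # forward[t] = probability of the prefix tagged so far, ending in tag t
    forward = {t: grammar[("START", t)] * dictionary[(words[0], t)]
               for t in tag_dict[words[0]]}
    for w in words[1:]:
        new_forward = {}
        for t in tag_dict[w]:
            new_forward[t] = dictionary[(w, t)] * sum(
                forward[prev] * grammar[(prev, t)] for prev in forward)
        forward = new_forward
    return sum(forward.values())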
4.2.2 Decipherment using IP+EM Approach
We investigate the Viterbi tag sequence generated by EM training and count how many
distinct tag bigrams there are in that sequence. We call this the observed grammar size,
and it is 915. That is, in tagging the 24,115 test tokens, EM uses 915 of the available
45 × 45 = 2025 tag bigrams. (We contrast observed size with the model size for the
grammar, which we define as the number of P(t2|t1) entries in EM's trained tag model
that exceed 0.0001 probability.) The advantage of the observed grammar size is that we
can compare it with the gold tagging's observed grammar size, which is 760. So we can
safely say that EM is learning a grammar that is too big, still abusing its freedom.

Small Models via Integer Programming: Bayesian sparse priors as shown in (Gold-
water & Griffiths, 2007) aim to create small models for the tagging task. We take a
different tack for the same problem and directly ask: What is the smallest model that
explains the text?
training text:        they can fish . I fish
supertags:            START, PRO, AUX, V, N, PUNC
link variables:       L_0 ... L_11 (one binary variable per link in the tagging network)
dictionary variables: d1 PRO-they, d2 AUX-can, d3 V-can, d4 N-fish, d5 V-fish, d6 PUNC-., d7 PRO-I
grammar variables:    g1 PRO-AUX, g2 PRO-V, g3 AUX-N, g4 AUX-V, g5 V-N, g6 V-V, g7 N-PUNC,
                      g8 V-PUNC, g9 PUNC-PRO, g10 PRO-N

Integer Program
Minimize: Σ_{i=1..10} g_i
Constraints:
1. Single left-to-right path (at each node, flow in = flow out), e.g., L_0 = 1, L_1 = L_3 + L_4
2. Path consistency constraints (the chosen path respects the chosen dictionary & grammar),
   e.g., L_0 ≤ d_1, L_1 ≤ g_1

Figure 4.3: Integer Programming formulation for finding the smallest grammar that
explains a given word sequence. Here, we show a sample word sequence and the corre-
sponding IP network generated for that sequence.
Our approach is related to Minimum Description Length (MDL). We
formulate our question precisely by asking which tag sequence (of the 10^6425 available)
has the smallest observed grammar size. The answer is 459. That is, there exists a
tag sequence that contains 459 distinct tag bigrams, and no other tag sequence contains
fewer.
We obtain this answer by formulating the problem in an integer programming (IP)
framework. Figure 4.3 illustrates this with a small sample word sequence. We create a
network of possible taggings, and we assign a binary variable to each link in the network.
We create constraints to ensure that those link variables receiving a value of 1 form a left-
to-right path through the tagging network, and that all other link variables receive a value
of 0. We accomplish this by requiring the sum of the links entering each node to equal
the sum of the links leaving each node. We also create variables for every possible tag
bigram and word/tag dictionary entry. We constrain link variable assignments to respect
those grammar and dictionary variables. For example, we do not allow a link variable to
"activate" unless the corresponding grammar variable is also "activated". Finally, we add
an objective function that minimizes the number of grammar variables that are assigned
a value of 1.

For solving the integer program, we use CPLEX software. Once we create an integer
program for the full test corpus and pass it to CPLEX, the solver returns an objective
function value of 459.
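A minimal sketch of this kind of integer program, written with the open-source PuLP modeling library rather than CPLEX, is shown below. The toy sentence and tag dictionary (those of Figure 4.3) and all variable names are illustrative assumptions, not the thesis implementation.

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

words = ["they", "can", "fish", ".", "i", "fish"]
tag_dict = {"they": ["PRO"], "can": ["AUX", "V"], "fish": ["N", "V"],
            ".": ["PUNC"], "i": ["PRO"]}

prob = LpProblem("smallest_grammar", LpMinimize)

# One binary link variable per (position, tag-for-word-i, tag-for-word-i+1).
link = {}
for i in range(len(words) - 1):
    for t1 in tag_dict[words[i]]:
        for t2 in tag_dict[words[i + 1]]:
            link[i, t1, t2] = LpVariable(f"link_{i}_{t1}_{t2}", cat=LpBinary)

# One binary grammar variable per tag bigram type appearing in any link.
grammar = {(t1, t2): LpVariable(f"g_{t1}_{t2}", cat=LpBinary)
           for (_, t1, t2) in link}

# Objective: minimize the number of distinct tag bigrams used.
prob += lpSum(grammar.values())

# Exactly one link leaves each position (a single left-to-right path) ...
for i in range(len(words) - 1):
    prob += lpSum(link[i, t1, t2] for (j, t1, t2) in link if j == i) == 1
# ... and consecutive links agree on the shared tag (flow in = flow out).
for i in range(1, len(words) - 1):
    for t in tag_dict[words[i]]:
        prob += (lpSum(link[i - 1, t1, t2] for (j, t1, t2) in link
                       if j == i - 1 and t2 == t)
                 == lpSum(link[i, t1, t2] for (j, t1, t2) in link
                          if j == i and t1 == t))

# Path consistency: a link can only activate if its tag bigram is activated.
for (i, t1, t2), v in link.items():
    prob += v <= grammar[t1, t2]

prob.solve()
print("smallest grammar size:", int(sum(v.value() for v in grammar.values())))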
CPLEX also returns a tag sequence via assignments to the link variables. However,
there are actually 10^4378 tag sequences compatible with the 459-sized grammar, and our
IP solver just selects one at random. We find that of all those tag sequences, the worst
gives an accuracy of 50.8%, and the best gives an accuracy of 90.3%.
Combining Small Models with EM (IP+EM Method): Our IP formulation can
find us a small model, but it does not attempt to fit the model to the data. Fortunately,
we can use EM for that. We still give EM the full word/tag dictionary, but now we
constrain its initial grammar model to the 459 tag bigrams identified by IP. Starting with
uniform probabilities, EM finds a tagging that is 84.5% accurate, substantially better than
the 81.7% originally obtained with the fully-connected grammar. So we see a benefit to
our explicit small-model approach. While EM does not find the most accurate sequence
consistent with the IP grammar (90.3%), it finds a relatively good one.
                            in                        on
word/tag dictionary         IN, RP, RB, NN, FW, RBR   IN, RP, RB
observed EM dictionary      FW (358), RB (7)          RP (127)
observed IP+EM dictionary   IN (349), RB (9)          IN (126), RB (8)
observed gold dictionary    IN (355), RB (3)          IN (129), RP (5)

Figure 4.4: Examples of tagging obtained from different systems for prepositions in and
on.
The IP+EM tagging (with 84.5% accuracy) has some interesting properties. First, the
dictionary we observe from the tagging is of higher quality (with fewer spurious tagging
assignments) than the one we observe from the original EM tagging. Figure 4.4 shows
some examples.
Note that we used a very small IP-grammar (containing only 459 tag bigrams) during
EM training. In the process of minimizing the grammar size, IP ends up removing many
good tag bigrams from our grammar set. Next, we proceed to recover some good tag
bigrams and expand the grammar in a restricted fashion by making use of the higher-
quality dictionary produced by the IP+EM method. We now run EM again on the full
grammar (all possible tag bigrams) in combination with this good dictionary (containing
fewer entries than the full dictionary). Unlike the original training with full grammar,
where EM could choose any tag bigram, now the choice of grammar entries is constrained
by the good dictionary model that we provide EM with. This allows EM to recover some
of the good tag pairs, and results in a good grammar-dictionary combination that yields
better tagging performance.
With these improvements in mind, we embark on an alternating scheme to find better
models and taggings. We run EM for multiple passes (for all experiments, EM training is
allowed to run for 40 iterations or until the likelihood ratio between two subsequent
iterations reaches a value of 0.99999, whichever occurs earlier), and in each pass we
alternately constrain either the grammar model or the dictionary model. The procedure is
simple and proceeds as follows (a schematic sketch is shown after the list):
1. Run EM constrained to the last trained dictionary, but provided with a full gram-
mar.
2. Run EM constrained to the last trained grammar, but provided with a full dictio-
nary.
3. Repeat steps 1 and 2.
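The sketch below makes the alternating loop concrete. It is an assumption rather than the thesis code: run_em, viterbi_tagging, observed_grammar and observed_dictionary are hypothetical helper functions standing in for a standard bitag-HMM EM implementation.

def alternating_ip_em(tokens, full_dictionary, full_grammar, ip_grammar,
                      run_em, viterbi_tagging, observed_grammar,
                      observed_dictionary, passes=4):
    grammar, dictionary = ip_grammar, full_dictionary
    tagging = None
    for step in range(passes):
        model = run_em(tokens, grammar=grammar, dictionary=dictionary)
        tagging = viterbi_tagging(model, tokens)
        if step % 2 == 0:
            # Step 1: keep the dictionary observed in this tagging,
            # and open the grammar back up to all tag bigrams.
            dictionary, grammar = observed_dictionary(tagging), full_grammar
        else:
            # Step 2: keep the grammar observed in this tagging,
            # and open the dictionary back up to the full word/tag dictionary.
            grammar, dictionary = observed_grammar(tagging), full_dictionary
    return tagging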
4.2.3 Experiments and Results
We run experiments on the standard test set for this task. The test data consists of 24,115
word tokens from the Penn Treebank and the tagset consists of 45 distinct grammatical
categories.

We notice significant gains in tagging performance when doing model minimization
followed by the alternating EM technique described in the previous section. The tagging
accuracy increases at each step and finally settles at a high of 91.6%, which outperforms
the existing state-of-the-art systems for the 45-tag set. The system achieves a better
accuracy than the 88.6% from Smith and Eisner (2005), and even surpasses the 91.4%
achieved by Goldberg et al. without using any additional linguistic constraints or
manual cleaning of the dictionary. Figure 4.5 shows the tagging performance achieved at
each step. We find that the observed grammar also improves, growing from 459 entries
to 603 entries. The figure also shows the model's internal grammar and dictionary sizes.
Model                                                           Tagging accuracy on    Observed size                  Model size
                                                                24,115-word corpus     grammar (G), dictionary (D)    grammar (G), dictionary (D)
1. EM baseline with full grammar + full dictionary              81.7                   G=915, D=6295                  G=935, D=6430
2. EM constrained with minimized IP-grammar + full dictionary   84.5                   G=459, D=6318                  G=459, D=6414
3. EM constrained with full grammar + dictionary from (2)       91.3                   G=606, D=6245                  G=612, D=6298
4. EM constrained with grammar from (3) + full dictionary       91.5                   G=593, D=6285                  G=600, D=6373
5. EM constrained with full grammar + dictionary from (4)       91.6                   G=603, D=6280                  G=618, D=6337
   + with 100 random restarts for EM                            91.8
   + trained on more data (entire PTB)                          92.3

Figure 4.5: Percentage of word tokens tagged correctly by different models. The observed
sizes and model sizes of grammar (G) and dictionary (D) produced by these models are
shown in the last two columns.
Restarts and More Data: Multiple random restarts for EM, while not often empha-
sized in the literature, are key in this domain. When we extend our alternating EM
scheme and apply 100 random restarts at each step, we improve our tagging accuracy
from 91.6% to 91.8%.
As noted by Toutanova and Johnson (2008), there is no reason to limit the amount
of unlabeled data used for training the models. Their models are trained on the entire
Penn Treebank data (instead of using only the 24,115-token test data), and so are the
tagging models used by Goldberg et al. Previous results from Smith and Eisner (2005)
and Goldwater and Griffiths (2007) show that their models do not benefit from using
more unlabeled training data. Because EM is efficient, we can extend our word-sequence
training data from the 24,115-token set to the entire Penn Treebank (973k tokens). We
run EM training again for Model 5 (the best model from Figure 4.5) but this time using
973k word tokens, and further increase our accuracy to 92.3%. This is our final result on
the 45-tagset, and we note that it is higher than previously reported results. (More details,
including experiments and results when using a coarser 17-tagset, incomplete dictionaries,
etc., can be found in (Ravi & Knight, 2009c).)
Conclusion: To summarize our experiments on unsupervised tagging, the method (look-
ing for small models) that we proposed is simple: once an integer program is produced,
there are solvers available which directly give us the solution. In addition, we do not
require any complex parameter estimation techniques; we train our models using simple
EM, which proves to be efficient for this task, achieving the best performance (92.3%
accuracy) reported so far on this task. While some previous methods introduced for the
same task have achieved big tagging improvements using additional linguistic knowledge
or manual supervision, our models are not provided with any additional information.
4.3 Unsupervised Supertagging
Over the last few decades, researchers have proposed several linguistically motivated
theories and natural language grammar formalisms that seek to associate words with
rich linguistic descriptions (supertags). Some of these include lexicalized grammar for-
malisms such as Tree-Adjoining Grammar (TAG) (Joshi, 1988), Head-driven Phrase
Structure Grammar (HPSG) (Pollard & Sag, 1994), and Combinatory Categorial Gram-
mar (CCG) (Steedman, 2000). These frameworks have been increasingly used and proven
beneficial for many NLP applications ranging from syntactic parsing (Clark & Curran,
2006) to machine translation (Hassan et al., 2009). When using grammar formalisms
such as CCG for these applications, one of the fundamental tasks involved is to learn
supertaggers.
Task: Supertagging is a sequence labeling task like POS tagging, where (instead of POS
tags) the goal is to tag words with lexical categories as defined by a grammar formalism
like CCG. But unlike POS tagging, the tag labels used in supertagging are more complex
and convey detailed syntactic information. Here is an example:

word sequence:  Vinken   will             join               the    board
POS tags:       NNP      MD               VB                 DT     NN
CCG supertags:  N        (S\NP)/(S\NP)    ((S\NP)/PP)/NP     NP/N   N

In the above sentence, the word join has the POS category VB and the CCG category
((S\NP)/PP)/NP. The CCG category here is a complex, structured tag label indicating
that the word requires a noun phrase to its left, another to its right, and a prepositional
phrase to the right of that.
The CCG formalism also specifies a universal set of grammatical rules (such as ap-
plication and composition) that defines how categories may combine with one another to
produce the syntactic derivation for a sentence (see Steedman (2000) for details). For ex-
ample, given the supertag categories for each word in the following sentence, the grammar
allows derivations like:

Ed    might            see          a      cat
NP    (S\NP)/(S\NP)    (S\NP)/NP    NP/N   N

Here, forward composition (>B) combines (S\NP)/(S\NP) and (S\NP)/NP into (S\NP)/NP,
forward application (>) combines NP/N and N into NP and then (S\NP)/NP and NP into
S\NP, and a final application with the subject NP yields S.
We already showed how we can create accurate part-of-speech (POS) taggers using a
tag dictionary and unlabeled data (Section 4.2). This generally involves working with the
standard set of 45 POS tags employed in the Penn Treebank. The most ambiguous word
has 7 different POS tags associated with it. But automatically learning supertaggers
for lexicalized grammar formalisms such as CCG is a much more challenging task. For
example, CCGbank (Hockenmaier & Steedman, 2007) contains 1241 distinct supertags
(lexical categories) and the most ambiguous word has 126 supertags. This provides a
much more challenging starting point for the semi-supervised methods typically applied
to the task. Yet, this is an important task since creating grammars and resources for
CCG parsers for new domains and languages is highly labor- and knowledge-intensive.
From a decipherment perspective, the supertagging task involves substituting (cipher)
words in a sentence with their appropriate (plaintext) categories defined by the CCG
grammar formalism. This problem is identical to the unsupervised POS tagging problem
described in Section 4.2 with respect to the key characteristics. However, applying our
IP+EM method for POS tagging naively to CCG supertagging is intractable due to the
high level of ambiguity and huge search space involved.
In (Ravi et al., 2010a), we combine two complementary ideas for learning supertag-
gers from highly ambiguous lexicons: grammar-informed tag transitions and models min-
imized via integer programming. Each strategy on its own greatly improves performance
over basic Expectation Maximization (EM) training with a bitag Hidden Markov Model.
The strategies provide further error reductions when combined. We demonstrate their
cross-lingual effectiveness on CCGbank (English) and the Italian CCG-TUT corpus. We
also describe a new two-stage integer programming strategy that efficiently deals with
the high degree of ambiguity on these datasets while obtaining the full effect of model
minimization.
Data
CCGbank: CCGbank was created by semi-automatically converting the Penn Tree-
bank to CCG derivations (Hockenmaier & Steedman, 2007). We use the standard splits
of the data used in semi-supervised tagging experiments (e.g. Banko and Moore (2004)):
sections 0-18 for training, 19-21 for development, and 22-24 for test.
CCG-TUT: CCG-TUT was created by semi-automatically converting dependencies
in the Italian Turin University Treebank to CCG derivations (Bos et al., 2009). It is
much smaller than CCGbank, with only 1837 sentences. It is split into three sections:
newspaper texts (NPAPER), civil code texts (CIVIL), and European law texts from
the JRC-Acquis Multilingual Parallel Corpus (JRC). For test sets, we use the first 400
sentences of NPAPER, the first 400 of CIVIL, and all of JRC. This leaves 409 and 498
sentences from NPAPER and CIVIL, respectively, for training (to acquire a lexicon and
run EM). For evaluation, we use two different settings of train/test splits:
TEST 1: Evaluate on the NPAPER section of test using a lexicon extracted only from
the NPAPER section of train.
TEST 2: Evaluate on the entire test using lexicons extracted from (a) NPAPER + CIVIL,
(b) NPAPER, and (c) CIVIL.

Data             Distinct   Max   Type ambig   Tok ambig
CCGbank          1241       126   1.69         18.71
CCG-TUT
  NPAPER+CIVIL   849        64    1.48         11.76
  NPAPER         644        48    1.42         12.17
  CIVIL          486        39    1.52         11.33

Table 4.1: Statistics for the training data used to extract lexicons for CCGbank and
CCG-TUT. Distinct: # of distinct lexical categories; Max: # of categories for the
most ambiguous word; Type ambig: per word type category ambiguity; Tok ambig:
per word token category ambiguity.
Table 4.1 shows statistics for supertag ambiguity in CCGbank and CCG-TUT. As a
comparison, the POS word token ambiguity in CCGbank is 2.2: the corresponding value
of 18.71 for supertags is indicative of the (challenging) fact that supertag ambiguity is
greatest for the most frequent words.
4.3.1 Previous Work
There is very little prior work on this task with one notable exception. Baldridge (2008)
uses grammar-informed initialization for HMM tag transitions based on the universal
combinatory rules of the CCG formalism and obtains 56.1% accuracy on ambiguous
word tokens from the CCGbank test data, a large improvement over the 33.0% accuracy
obtained with uniform initialization for tag transitions.

Next, we describe our approach from (Ravi et al., 2010a) for the task of learning
supertaggers from lexicons that have not been filtered in any way. (See Banko and Moore
(2004) for a description of how many early POS tagging papers in fact used a number of
heuristic cutoffs that greatly simplify the problem.)
4.3.2 Grammar-Informed Initialization for Supertagging
Part-of-speech tags are atomic labels that in and of themselves encode no internal struc-
ture. In contrast, supertags are detailed, structured labels; a universal set of grammatical
rules defines how categories may combine with one another to project syntactic struc-
ture. Because of this, properties of the CCG formalism itself can be used to constrain
learning, prior to considering any particular language, grammar or data set. Baldridge
(2008) uses this observation to create grammar-informed tag transitions for a bitag HMM
supertagger based on two main properties. First, categories differ in their complexity and
less complex categories tend to be used more frequently. For example, two categories for
buy in CCGbank are (S[dcl]\NP)/NP and ((((S[b]\NP)/PP)/PP)/(S[adj]\NP))/NP; the
former occurs 33 times, the latter once. Second, categories indicate the form of cate-
gories found adjacent to them; for example, the category for sentential complement verbs
((S\NP)/S) expects an NP to its left and an S to its right.

Baldridge uses these properties to define tag transition distributions that have higher
likelihood for simpler categories that are able to combine. For example, for the distribu-
tion p(t_i | t_{i-1} = NP), (S\NP)\NP is more likely than ((S\NP)/(N/N))\NP because both
categories may combine with a preceding NP but the former is simpler. In turn, the
latter is more likely than NP: it is more complex but can combine with the preceding NP.
Finally, NP is more likely than (S/NP)/NP since neither can combine, but NP is simpler.

By starting EM with these tag transition distributions and an unfiltered lexicon (word-
to-supertag dictionary), Baldridge obtains a tagging accuracy of 56.1% on ambiguous
words, a large improvement over the accuracy of 33.0% obtained by starting with uniform
transition distributions. We refer to a model learned from basic EM (uniformly initialized)
as EM, and to a model with grammar-informed initialization as EM_GI.
4.3.3 Minimized Models for Supertagging
The idea of searching for minimized models is related to classic Minimum Description
Length (MDL) (Barron et al., 1998), which seeks to select a small model that captures
the most regularity in the observed data. In Section 4.2, we implemented a method
based on this idea which directly minimizes the model using an integer programming
(IP) formulation and produced good results for POS tagging.
However, there are many challenges involved in using IP minimization for supertag-
ging. The 1241 distinct supertags in the tagset result in 1.5 million tag bigram entries in
the model and the dictionary contains almost 3.5 million word/tag pairs that are relevant
to the test data. The set of 45 POS tags for the same data yields 2025 tag bigrams and
8910 dictionary entries. We also wish to scale our methods to larger data settings than
the 24k word tokens in the test data used in the POS tagging task.
Our objective is to find the smallest supertag grammar (of tag bigram types) that
explains the entire text while obeying the lexicon's constraints. However, our original IP
method from (Ravi & Knight, 2009c) is intractable for supertagging, so we propose a new
two-stage method that scales to the larger tagsets and data involved.
IP method for supertagging
Our goal for supertagging is to build a minimized model with the following objective:

IP_original: Find the smallest supertag grammar (i.e., tag bigrams) that can
explain the entire text (the test word token sequence).
Using the full grammar and lexicon to perform model minimization results in a very
large, difficult to solve integer program involving billions of variables and constraints.
This renders the minimization objective IP_original intractable. One way of combating
this is to use a reduced grammar and lexicon as input to the integer program. We do
this without further supervision by using the tag sequence output from the HMM model
trained using basic EM. This produces an observed grammar of distinct tag bigrams (G_obs)
and a lexicon of observed lexical assignments (L_obs). For CCGbank, G_obs and L_obs have
12,363 and 18,869 entries, respectively, far less than the millions of entries in the full
grammar and lexicon.
However, even with the EM-reduced grammar and lexicon input, the IP-minimization
is still very hard to solve. We thus split it into two stages. The first stage (Minimization
1) finds the smallest grammar G_min1 ⊆ G that explains the set of word bigram types
observed in the data rather than the word sequence itself, and the second (Minimization
2) finds the smallest augmentation of G_min1 that explains the full word sequence.
Minimization 1 (Min1): We begin with a simpler minimization problem than the
original one (IP_original), with the following objective:

IP_min1: Find the smallest set of tag bigrams G_min1 ⊆ G, such that there is at least
one tagging assignment possible for every word bigram type observed in the data.

We formulate this as an integer program (see (Ravi et al., 2010a) for details), creating
binary variables gvar_i for every tag bigram g_i = t_j t_k in G. Binary link variables connect
tag bigrams with word bigrams; these are restricted to the set of links that respect the
lexicon L provided as input.
The IP solver (we use the commercial CPLEX solver) solves the integer program and we
extract the set of tag bigrams G_min1 based on the activated grammar variables. For the
CCGbank test data, Min1 yields 2530 tag bigrams. However, a second stage is needed
since there is no guarantee that G_min1 can explain the test data: it contains tags for all
word bigram types, but it cannot necessarily tag the full word sequence. Figure 4.6
illustrates this. Using only tag bigrams from Min1, there is no fully-linked tag path
through the network. There are missing links between words w_2 and w_3 and between
words w_3 and w_4 in the word sequence. The next stage fills in these missing links.
Minimization 2 (Min2): This stage uses the original minimization formulation for
the supertagging problem IP_original, again using an integer programming method similar
to that we proposed in (Ravi & Knight, 2009c). If applied to the observed grammar G_obs,
the resulting integer program is hard to solve (the solver runs for days without returning
a solution). However, by using the partial solution G_min1 obtained in Min1, the IP
optimization speeds up considerably. We implement this by fixing the values of all binary
grammar variables present in G_min1 to 1 before optimization. This reduces the search
space significantly, and CPLEX finishes in just a few hours. For the complete integer
program and other implementation details, refer to (Ravi et al., 2010a).
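This warm start can be expressed in a few lines. The sketch below is an assumption in the spirit of the PuLP example in Section 4.2, not the actual CPLEX setup; grammar_vars and g_min1 are hypothetical names for the grammar variables and the Min1 selection.

def fix_min1_variables(grammar_vars, g_min1):
    """grammar_vars maps tag bigrams to PuLP binary variables; g_min1 is the set
    of tag bigrams selected by Minimization 1. Fixing them to 1 shrinks Min2."""
    for bigram, var in grammar_vars.items():
        if bigram in g_min1:
            var.lowBound = 1   # force var = 1 in the Min2 integer program
            var.upBound = 1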
(Figure: the input grammar G of tag bigrams and the observed word bigrams feed IP
Minimization 1, which selects the tag bigram set G_min1; G_min1 alone does not explain
the word sequence w_1 ... w_5, so IP Minimization 2 augments it with additional tag
bigrams to form G_min2, which covers the full sequence.)

Figure 4.6: Two-stage IP method for selecting minimized models for supertagging.
Figure 4.6 illustrates how Min2 augments the grammar G_min1 with additional tag
bigrams to form a complete tag path through the network. The minimized grammar set
in the final solution G_min2 contains only 2810 entries, significantly fewer than the original
grammar G_obs's 12,363 tag bigrams.
We note that the two-stage minimization procedure proposed here is not guaranteed
to yield the optimal solution to our original objective IP_original. On the simpler task of
unsupervised POS tagging with a dictionary, we compared our method against directly
solving IP_original and found that the minimization (in terms of grammar size) achieved
by our method is close to the optimal solution for the original objective and yields the
same tagging accuracy far more efficiently.
Fitting the minimized model: The IP-minimization procedure gives us a minimal
grammar, but does not fit the model to the data. In order to estimate probabilities for
the HMM model for supertagging, we use the EM algorithm but with certain restrictions.
We build the transition model using only entries from the minimized grammar set G_min2,
and instantiate an emission model using the word/tag pairs seen in L (provided as input
to the minimization procedure). All the parameters in the HMM model are initialized
with uniform probabilities, and we run EM for 40 iterations. The trained model is used
to find the Viterbi tag sequence for the corpus. We refer to this model (where the EM
output (G_obs, L_obs) was provided to the IP-minimization as initial input) as EM+IP.
Bootstrapped minimization: The quality of the observed grammar and lexicon im-
proves considerably at the end of a single EM+IP run. Following Ravi and Knight (2009c),
we iteratively bootstrap a new EM+IP run using as input the observed grammar G_obs
and lexicon L_obs from the last tagging output of the previous iteration. We run this until
the chosen grammar set G_min2 does not change (in our experiments, we run three
bootstrap iterations).
4.3.4 Combining Minimization with Grammar-Informed Initialization
There are two complementary ways to use grammar-informed initialization with the IP-
minimization approach: (1) using EM_GI output as the starting grammar/lexicon and (2)
using the tag transitions directly in the IP objective function. The first takes advantage
of the earlier observation that the quality of the grammar and lexicon provided as initial
input to the minimization procedure can affect the quality of the final supertagging out-
put. For the second, we modify the objective function used in the two IP-minimization
steps to be:

    Minimize: Σ_{∀ g_i ∈ G} w_i · gvar_i                                        (4.1)

where G is the set of tag bigrams provided as input to IP, gvar_i is a binary variable in the
integer program corresponding to tag bigram (t_{i-1}, t_i) ∈ G, and w_i is the negative
logarithm of p_gi(t_i | t_{i-1}) as given by Baldridge (2008). (Other numeric weights associated
with the tag bigrams could be considered, such as 0/1 for uncombinable/combinable
bigrams.) All other parts of the integer program, including the constraints, remain
unchanged, and we acquire a final tagger in the same manner as described in the previous
section. In this way, we combine the minimization and GI strategies into a single objective
function that finds a minimal grammar set while keeping the more likely tag bigrams in
the chosen solution. EM_GI+IP_GI is used to refer to the method that uses GI information
in both ways: EM_GI output as the starting grammar/lexicon and GI weights in the
IP-minimization objective.
4.3.5 Experiments and Results
We compare the four strategies described in Sections 4.3.2, 4.3.3 and 4.3.4, summarized
below:

EM : HMM uniformly initialized, EM training.
EM+IP : IP minimization using initial grammar provided by EM.
EM_GI : HMM with grammar-informed initialization, EM training.
EM_GI+IP_GI : IP minimization using initial grammar/lexicon provided by EM_GI and
additional grammar-informed IP objective.
For EM+IP and EM_GI+IP_GI, the minimization and EM training processes are iterated
until the resulting grammar and lexicon remain unchanged. Forty EM iterations are used
for all cases.

We also include a baseline which randomly chooses a tag from those associated with
each word in the lexicon, averaged over three runs.
Accuracy on ambiguous word tokens: We evaluate the performance in terms of
tagging accuracy with respect to gold tags for ambiguous words in held-out test sets for
English and Italian. We consider results with and without punctuation. (The reason for
this is that the "categories" for punctuation in CCGbank are for the most part not actual
categories; for example, the period "." has the categories "." and "S". As such, these
supertags are outside of the categorial system: their use in derivations requires phrase
structure rules that are not derivable from the CCG combinatory rules.)

Recall that unlike much previous work, we do not collect the lexicon (tag dictionary)
from the test set: this means the model must handle unknown words and the possibility
of having missing lexical entries for covering the test set.
English CCGbank results
Accuracy on ambiguous tokens: Table 4.2 gives performance on the CCGbank test
sections. All models are well above the random baseline, and both of the strategies
individually boost performance over basic EM by a large margin. For the models using
GI, accuracy ignoring punctuation is higher than for all tokens, almost entirely due to the
fact that "." has the supertags "." and S, and the GI gives a preference to S since it can
in fact combine with other categories, unlike ".": the effect is that nearly every sentence-
final period (5.5k tokens) is tagged S rather than ".".
Model           ambig   ambig -punc   all    all -punc
Random          17.9    16.2          27.4   21.9
EM              38.7    35.6          45.6   39.8
EM+IP           52.1    51.0          57.3   53.9
EM_GI           56.3    59.4          61.0   61.7
EM_GI+IP_GI     59.6    62.3          63.8   64.3

Table 4.2: Supertagging accuracy for CCGbank sections 22-24. Accuracies are reported
for four settings: (1) ambiguous word tokens in the test corpus, (2) ambiguous word
tokens, ignoring punctuation, (3) all word tokens, and (4) all word tokens except punc-
tuation.
EM_GI is more effective than EM+IP; however, it should be kept in mind that IP-
minimization is a general technique that can be applied to any sequence prediction task,
whereas grammar-informed initialization may be used only with tasks in which the in-
teractions of adjacent labels may be derived from the labels themselves. Interestingly,
the gap between the two approaches is greater when punctuation is ignored (51.0 vs.
59.4); this is unsurprising because, as noted already, punctuation supertags are not ac-
tual categories, so EM_GI is unable to model their distribution. Most importantly, the
complementary effects of the two approaches can be seen in the improved results for
EM_GI+IP_GI, which obtains about 3% better accuracy than EM_GI.
Accuracy on all tokens: Table 4.2 also gives performance when taking all tokens
into account. The HMM when using full supervision obtains 87.6% accuracy (Baldridge,
2008); a state-of-the-art, fully-supervised maximum entropy tagger (Clark & Curran,
2007), which also uses part-of-speech labels, obtains 91.4% on the same train/test split.
So the accuracy of 63.8% achieved by EM_GI+IP_GI nearly halves the gap between the
supervised model and the 45.6% obtained by the basic EM semi-supervised model.
Model           TEST 1    TEST 2 (using lexicon from:)
                          NPAPER+CIVIL    NPAPER    CIVIL
Random          9.6       9.7             8.4       9.6
EM              26.4      26.8            27.2      29.3
EM+IP           34.8      32.4            34.8      34.6
EM_GI           43.1      43.9            44.0      40.3
EM_GI+IP_GI     45.8      43.6            47.5      40.9

Table 4.3: Comparison of supertagging results for CCG-TUT. Accuracies are for ambigu-
ous word tokens in the test corpus, ignoring punctuation.
Italian CCG-TUT results
To demonstrate that both methods and their combination are language independent, we
apply them to the Italian CCG-TUT corpus. We wanted to evaluate performance out-
of-the-box because bootstrapping a supertagger for a new language is one of the main
use scenarios we envision: in such a scenario, there is no development data for changing
settings and parameters. Thus, we determined a train/test split beforehand and ran the
methods exactly as we had for CCGbank.
The results, given in Table 4.3, demonstrate the same trends as for English: basic
EM is far more accurate than random, EM+IP adds another 8-10% absolute accuracy,
and EM_GI adds an additional 8-10% again. The combination of the methods generally
improves over EM_GI, except when the lexicon is extracted from NPAPER+CIVIL.
Conclusion: We have shown how two complementary strategies, grammar-informed tag
transitions and IP-minimization, for learning supertaggers from highly ambiguous lex-
icons can be straightforwardly integrated. We verify the benefits of both cross-lingually,
on English and Italian data. We also provide a new two-stage integer programming setup
for decipherment that allows model minimization to be tractable for supertagging without
sacrificing the quality of the search for minimal bitag grammars.
4.4 Unsupervised Word / Sub-word Alignment
Unsupervised word alignment is another natural language task which fits into the de-
cipherment framework. Unlike previous problems, the key involves transposition or re-
ordering, which makes the problem harder. The table displayed here shows some of the
common key characteristics for the problem and compares it with previous decipherment
problems.
Key Characteristics                        Simple sub        Homophonic sub   Unsupervised POS Tagging          Word / Sub-word Alignment
Is the key deterministic in the
enciphering direction?                     Yes               No               No                                No
Is the key deterministic in the
deciphering direction?                     Yes               Yes              No                                No
Does the key substitute one-for-one
(symbol for symbol) in the enciphering
direction?                                 Yes               Yes              Yes                               No
Does the key substitute one-for-one in
the deciphering direction?                 Yes               Yes              Yes                               No
What linguistic unit is substituted?       Letter/Syllable   Letter           Word                              Morpheme/Word/Phrase
Does the key involve transposition
(re-ordering)?                             No                No               No                                Yes
Other properties of the key that are
specific to the problem                    -                 -                1) Key "hints" are provided in    1) Parallel data is available as
                                                                              the form of a dictionary          source/target language sentence pairs
                                                                                                                2) No LM data is provided
Here, we describe our work on word alignment from (Bodrumlu et al., 2009), which
shows that explicitly choosing small models during the decipherment process can yield
good results for such tasks. We develop a new objective function for word alignment that
measures the size of the bilingual dictionary induced by an alignment. A word alignment
that results in a small dictionary is preferred over one that results in a large dictionary.
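The objective itself is easy to state in code. Here is a minimal sketch (an assumption, not the code of Bodrumlu et al. (2009)) that counts the distinct bilingual dictionary entries induced by an alignment; the sentence pair and links used in the example are illustrative.

def induced_dictionary_size(sentence_pairs, alignments):
    """alignments[k] is a set of (i, j) links: source word i <-> target word j."""
    dictionary = set()
    for (src, tgt), links in zip(sentence_pairs, alignments):
        for i, j in links:
            dictionary.add((src[i], tgt[j]))
    return len(dictionary)

# An alignment with a smaller induced dictionary is preferred.
pairs = [("garcia and associates".split(), "garcia y asociados".split())]
links = [{(0, 0), (1, 1), (2, 2)}]
print(induced_dictionary_size(pairs, links))   # 3 distinct entries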
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
...
...
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
...
...
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
Word
Alignment
Figure 4.7: Sample sentence pairs from an English-Spanish bilingual corpus: original sentence pairs (left), and after they have been word-aligned (right).
In order to search for the alignment that minimizes this objective, we cast the problem
as one of integer linear programming. We then extend our objective function to align
corpora at the sub-word level, which we demonstrate on a small Turkish-English corpus.
Word Alignment Task: Word alignment is the problem of annotating a bilingual text
with links connecting words that have the same meanings. Figure 4.7 shows sample in-
put/output from a word alignment exercise for English-Spanish bitext. Here, the English
word associates is linked to the Spanish word asociados, and both of them share the
same meaning.
Word alignment has several downstream consumers. One is machine translation (MT),
where programs extract translation rules from word-aligned corpora (Och & Ney, 2004;
Galley et al., 2004; Chiang, 2007; Quirk et al., 2005). Other downstream processes
exploit dictionaries derived by alignment, in order to translate queries in cross-lingual IR
(Schönhofen et al., 2008) or re-score candidate translation outputs (Och et al., 2004).
4.4.1 Previous Work
In the past, researchers have explored various methods for automatic word alignment
using machines. Probabilistic generative models like IBM 1-5 (Brown et al., 1993), HMM
(Vogel et al., 1996), ITG (Wu, 1997), and LEAF (Fraser & Marcu, 2007a) define formulas for P(f | e) or P(e, f), with hidden alignment variables. EM algorithms estimate dictionary and other probabilities in order to maximize those quantities. One can ask for the Viterbi alignment, which maximizes P(alignment | e, f). Discriminative models, e.g., (Taskar et al., 2005), set parameters instead to maximize alignment accuracy against a hand-aligned development set. EMD training (Fraser & Marcu, 2006) combines generative and discriminative elements. Different accuracy metrics have been proposed, e.g., (Och & Ney, 2003; Fraser & Marcu, 2007b; Ayan & Dorr, 2006).
Alignment accuracy is still low for many language pairs, and most practitioners still
use 1990s algorithms to align their data. It stands to reason that we have not yet seen
the last word in alignment models. Another weakness of current systems is that they
only align full words. With few exceptions, e.g. (Snyder & Barzilay, 2008), they do not
align at the sub-word level, making them much less useful for agglutinative languages.
4.4.2 Minimized Models for Word Alignment
We present a new objective function for word alignment: we search for the legal align-
ment that minimizes the size of the induced bilingual dictionary. By dictionary size, we
mean the number of distinct word-pairs linked in the corpus alignment. We can immedi-
ately investigate how dierent alignments stack up, according to this objective function.
Figure 4.8 shows samples from the gold alignment for the English-Spanish corpus from
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
...
...
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
Figure 4.8: Gold alignment samples.
The induced bilingual dictionary has 28
distinct entries, including garcia/garcia,
are/son, are/estan, not/no, the/los, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 2: Gold alignment. The induced bilingual dic-
tionary has 28 distinct entries, including garcia/garcia,
are/son, are/estan, not/no, has/tiene, etc.
!"#$%" "&'("))*$%"+,)(
!"#$%" - ")*$%"'*)
.%)("))*$%"+,)("#,(&*+()+#*&!
)/)(")*$%"'*)(&*()*&(0/,#+,)
$"#1*) !"#$%" .")(+.#,,("))*$%"+,)(
$"#1*) !"#$%" +%,&,(+#,)(")*$%"'*)
!"#$%" .")("($*23"&-("1)*
!"#$%" +"24%,& +%,&,(/&"(,23#,)"
%+)($1%,&+)("#,("&!#-
)/) $1%,&+,) ,)+"& ,&0"'"'*)
+.,("))*$%"+,)("#,("1)*("&!#-
1*)(")*$%"'*)(+"24%,& ,)+"& ,&0"'"'*)
+.,($1%,&+)("&'(+.,("))*$%"+,)("#,(,&,2%,)
1*)($1%,&+,)(-(1*)(")*$%"'*)()*&(,&,2%!*)
+.,($*23"&-(.")(+.#,,(!#*/3)
1"(,23#,)"(+%,&,(+#,)(!#/3*)
%+)(!#*/3)("#,(%&(,/#*3,
)/)(!#/3*)(,)+"& ,&(,/#*3"
+.,(2*',#&(!#*/3)(),11()+#*&!(3."#2"$,/+%$"1)
1*)(!#/3*)(2*',#&*)(5,&',&(2,'%$%&")(0/,#+,)
+.,(!#*/3)('*(&*+(),11(6,&6"&%&,
1*)(!#/3*)(&*(5,&',&(6"&6"&%&"
+.,()2"11(!#*/3)("#,(&*+(2*',#&
1*)(!#/3*)(3,7/,&*) &*()*&(2*',#&*)
Figure 3: IP alignment. The induced bilingual dictionary
has 28 distinct entries.
...
...
Figure 4.9: IP alignment samples. The
induced bilingual dictionary has 28 dis-
tinct entries.
Figure 4.7, which results in 28 distinct bilingual dictionary entries. By contrast, a mono-
tone alignment induces 39 distinct entries, due to less re-use.
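To make the objective concrete, here is a minimal sketch of how the induced dictionary size is computed from an alignment: it is simply the number of distinct linked word pairs across the whole corpus. The two sentence pairs, tokenization, and link positions below are hypothetical toy data, not taken from the corpus in Figure 4.7.

```python
# Hypothetical toy corpus and alignment; dictionary size = number of distinct linked word pairs.
corpus = [
    (["garcia", "y", "asociados"], ["garcia", "and", "associates"]),
    (["sus", "asociados", "no", "son", "fuertes"],
     ["his", "associates", "are", "not", "strong"]),
]
# alignment[i] = list of (foreign_position, english_position) links for sentence pair i
alignment = [
    [(0, 0), (1, 1), (2, 2)],
    [(0, 0), (1, 1), (2, 3), (3, 2), (4, 4)],
]

def dictionary_size(corpus, alignment):
    entries = {(f[j], e[k]) for (f, e), links in zip(corpus, alignment) for (j, k) in links}
    return len(entries)

print(dictionary_size(corpus, alignment))  # 7: asociados/associates is re-used across sentences
```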
Next we look at how to automatically riffle through all legal alignments to find the one with the best score. What is a legal alignment? For now, we consider it to be one where:
Every foreign word is aligned exactly once (Brown et al., 1993).
Every English word has either 0 or 1 alignments (Melamed, 1997).
We formulate our integer program (IP) as follows. We set up two types of binary variables:
Alignment link variables. If link-i-j-k = 1, that means in sentence pair i, the foreign word at position j aligns to the English word at position k.
Bilingual dictionary variables. If dict-f-e = 1, that means word pair (f, e) is "in" the dictionary.
We constrain the values of link variables to satisfy the two alignment conditions listed earlier. We also require that if link-i-j-k = 1 (i.e., we've decided on an alignment link), then dict-f_ij-e_ik should also equal 1 (the linked words are recorded as a dictionary entry).[11] We do not require the converse: just because a word pair is available in the dictionary, the aligner does not have to link every instance of that word pair. For example, if an English sentence has two "the" tokens, and its Spanish translation has two "la" tokens, we should not require that all four links be active; in fact, this would conflict with the 1-1 link constraints and render the integer program unsolvable. The IP reads as follows:
minimize:      sum_{f,e}  dict-f-e
subject to:    for all i,j:    sum_k  link-i-j-k  =  1
               for all i,k:    sum_j  link-i-j-k  <= 1
               for all i,j,k:  link-i-j-k  <=  dict-f_ij-e_ik
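The integer program above maps directly onto an off-the-shelf modeler. The sketch below uses the open-source PuLP package and its bundled CBC solver rather than CPLEX; the function name and the corpus representation (a list of (foreign_words, english_words) token lists) are illustrative assumptions, not the thesis implementation.

```python
import pulp

def min_dictionary_alignment(corpus):
    """Solve the dictionary-minimizing alignment IP for a list of (foreign, english) token lists."""
    prob = pulp.LpProblem("min_dict_alignment", pulp.LpMinimize)
    link, dic = {}, {}
    for i, (f, e) in enumerate(corpus):
        for j, fw in enumerate(f):
            for k, ew in enumerate(e):
                link[i, j, k] = pulp.LpVariable(f"link_{i}_{j}_{k}", cat="Binary")
                if (fw, ew) not in dic:
                    dic[fw, ew] = pulp.LpVariable(f"dict_{fw}_{ew}", cat="Binary")
    prob += pulp.lpSum(dic.values())                       # minimize induced dictionary size
    for i, (f, e) in enumerate(corpus):
        for j in range(len(f)):                            # every foreign word aligned exactly once
            prob += pulp.lpSum(link[i, j, k] for k in range(len(e))) == 1
        for k in range(len(e)):                            # every English word has 0 or 1 alignments
            prob += pulp.lpSum(link[i, j, k] for j in range(len(f))) <= 1
        for j, fw in enumerate(f):                         # a link may only be active if its entry is in the dictionary
            for k, ew in enumerate(e):
                prob += link[i, j, k] <= dic[fw, ew]
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    links = [key for key, var in link.items() if var.value() and var.value() > 0.5]
    return links, int(pulp.value(prob.objective))
```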
Once we generate the integer program, we use the CPLEX software package to solve it.
On our Spanish-English corpus, the CPLEX solver obtains a minimal objective function
value of 28. To get the second-best solution, we add a constraint to our IP requiring the
sum of the n variables active in the previous solution to be less than n, and we re-run
CPLEX. This forces CPLEX to choose different variable settings on the second go-round.
We repeat this procedure to get an ordered list of solutions.
[11] f_ij is the jth foreign word in the ith sentence pair.
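Continuing the PuLP sketch above, the n-best trick described in the text is one added constraint per accepted solution; the helper name below is illustrative.

```python
import pulp

def exclude_previous_solution(prob, active_vars):
    # active_vars: the binary variables that took value 1 in the previous optimum;
    # forcing their sum below n rules out that exact solution on the next solve.
    prob += pulp.lpSum(active_vars) <= len(active_vars) - 1
```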
Method                        Dictionary size   f-score
Gold                          28                100.0
Monotone                      39                 68.9
IBM-1 (Brown et al., 1993)    30                 80.3
IBM-4 (Brown et al., 1993)    29                 86.9
IP                            28                 95.9

Figure 4.10: Comparison of different word alignment systems in terms of dictionary size and alignment accuracy.
Experiments and Results
We find that there are 8 distinct solutions that yield the same objective function value of 28. Figure 4.9 shows samples from one of these. This alignment is not bad, considering that word-order information is not encoded in the IP. We can now compare several alignments in terms of both dictionary size and alignment accuracy. For accuracy, we represent each alignment as a set of tuples <i, j, k>, where i is the sentence pair, j is a foreign index, and k is an English index. We use these tuples to calculate a balanced f-score against the gold alignment tuples.[12] The results are shown in Figure 4.10. The last line in the figure shows an average f-score over the 8 tied IP solutions, which yields the best performance on this task.
Figure 4.11 further investigates the connection between our objective function and alignment accuracy. We sample up to 10 alignments at each of several objective function values v, by first adding a constraint that the dict variables add to exactly v, then iterating the n-best list procedure above. We stop when we have 10 solutions, or when CPLEX fails to find another solution with value v. The figure clearly shows that minimizing the dictionary size leads to better alignments.
[12] P = proportion of proposed links that are in gold, R = proportion of gold links that are proposed, and f-score = 2PR/(P+R).
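A few lines make the footnote concrete; alignments are treated as plain sets of (i, j, k) tuples, and the function name is ours.

```python
def alignment_f_score(proposed, gold):
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    p = correct / len(proposed)          # proportion of proposed links that are in gold
    r = correct / len(gold)              # proportion of gold links that are proposed
    return 2 * p * r / (p + r) if p + r else 0.0
```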
Figure 4.11: Relationship between IP objective (x-axis = size of induced bilingual dic-
tionary) and alignment accuracy (y-axis = f-score). Small alignment dictionaries tend to
produce better results.
Turkish English
yururum i walk
yururler they walk
Figure 4.12: Two Turkish-English sentence pairs.
4.4.3 Minimized Models for Sub-Word Alignment
Turkish is a difficult language for NLP applications using standard techniques. It is an
agglutinative language that can express in a single word (e.g., yurumuyoruz) what might
require many words in another language (e.g., we are not walking). Naively breaking
on whitespace results in a very large vocabulary for Turkish, and it ignores the multi-
morpheme structure inside Turkish words.
Turkish English
onlari gordum i saw them
gidecekler they will go
onu stadyumda gordum i saw him in the stadium
ogretmenlerim tiyatroya yurudu my teachers walked to the theatre
cocuklar yurudu the girls walked
babam restorana gidiyor my father is walk ing to the restaurant
. . . . . .
Figure 4.13: A Turkish-English corpus produced by an English grammar pipelined with
an English-to-Turkish tree-to-string transducer.
Consider the tiny Turkish-English corpus in Figure 4.12. Even a non-Turkish speaker
might plausibly align yurur to walk, um to I, and ler to they. However, none of the popular
machine aligners is able to do this, since they align at the whole-word level. MT system
designers sometimes employ language-specic word breakers before alignment, though
these are hard to build and maintain, and they are usually specific not only to language A,
but to language A when being translated to B. Good unsupervised monolingual morpheme
segmenters are also available (Goldsmith, 2001; Creutz & Lagus, 2005a), though again,
these do not do joint inference of alignment and word segmentation.
We extend our objective function straightforwardly to sub-word alignment. To test
our extension, we construct a Turkish-English corpus of 1616 sentence pairs using an
English regular tree grammar (RTG) combined with an English-to-Turkish tree-to-string
transducer. A fragment of the parallel corpus is shown in Figure 4.13. Because we will
concentrate on finding Turkish sub-words, we manually break off the English sub-word -ing, by rule, as seen in the last line of the figure.
This is a small corpus, but good for demonstrating our concept. By automatically
tracing the internal operation of the tree transducer, we also produce a gold alignment
n    % Turkish types       % Turkish tokens
     with n morphemes      with n morphemes
1    23.1%                 35.5%
2    63.5%                 61.6%
3    13.4%                  2.9%

Figure 4.14: Corpus statistics showing the percentage of Turkish word types and tokens occurring with different morpheme counts.
for the corpus. We use the gold alignment to tabulate the number of morphemes per
Turkish word (shown in Figure 4.14).
Naturally, these statistics imply that standard whole-word aligners will fail. By inspecting the corpus, we find that 26.8 is the maximum f-score available to whole-word alignment methods.
Now we adjust our IP formulation. We broaden the definition of legal alignment to
include breaking any foreign word (token) into one or more sub-word (tokens). Each
resulting sub-word token is aligned to exactly one English word token, and every English
word aligns to 0 or 1 foreign sub-words. Our dict-f-e variables now relate Turkish sub-
words to English words. The first sentence pair in Figure 4.12 would have previously
contributed two dict variables; now it contributes 44, including things like dict-uru-walk.
We consider an alignment to be a set of tuples < i;j1;j2;k >, where j1 and j2 are
start and end indices into the foreign character string. We create align-i-j1-j2-k variables
that connect Turkish character spans with English word indices. Alignment variables
constrain dictionary variables as before, i.e., an alignment link can only \turn on" when
licensed by the dictionary.
We previously constrained every Turkish word to align to something. However, we do
not want every Turkish character span to align; only the spans explicitly chosen in our
you go to his office
onun ofisi- -ne gider- -sin
Gold alignment IP sub-word alignment
you go to his office
onun ofisi- -ne gider- -sin
my teachers ran to their house
ogretmenler- -im onlarin evi- -ne kostu
my teachers ran to their house
ogretmenler- -im onlarin evi- -ne kostu
i saw him
onu gordu- -m
i saw him
onu gordu- -m
we go to the theatre
tiyatro- -ya gider- -iz
we go to the theatre
tiyatro- -ya gider- -iz
they walked to the store
magaza- -ya yurudu- -ler
they walked to the store
magaza- -ya yurudu- -ler
my aunt goes to their house
hala- -m onlarin evi- -ne gider
my aunt goes to their house
hal- -am onlarin evi- -ne gider
1.
2.
3.
5.
6.
4.
Figure 4.15: Sample gold and (initial) IP sub-word alignments on our Turkish-English corpus. Dashes indicate where the IP search has decided to break Turkish words in the process of aligning. For example, the word magazaya has been broken into magaza- and -ya.
word segmentation. For a coherent segmentation, the set of active span variables must
cover all Turkish letter tokens in the corpus, and no pair of spans may overlap each other.
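The blow-up in candidate dictionary variables is easy to see with a small helper. The code below is hypothetical, not the thesis implementation, and the optional max_len cap is our own assumption about how span candidates might be limited: every character span of a Turkish word, paired with every English word in the sentence, becomes a potential dict-f-e variable, and the spans that end up active must tile the word without overlapping.

```python
def candidate_dict_entries(foreign_word, english_words, max_len=None):
    """Enumerate (sub-word, English word) pairs that could become dict variables."""
    n = len(foreign_word)
    cap = max_len or n
    spans = [(j1, j2) for j1 in range(n) for j2 in range(j1 + 1, min(j1 + cap, n) + 1)]
    return [(foreign_word[j1:j2], ew) for (j1, j2) in spans for ew in english_words]

# e.g. candidate_dict_entries("yururum", ["i", "walk"]) contains pairs such as ("uru", "walk"),
# which is where variables like dict-uru-walk come from.
```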
Experiments and Results
With our simple objective function, we obtain an f-score of 61.4 against the gold standard.
Sample gold and IP alignments are shown in Figure 4.15.
Allowing NULL alignments for English words: The last two incorrect alignments in Figure 4.15 are instructive. The system has decided to align English the to the Turkish noun morphemes tiyatro and magaza, and to leave English nouns theatre and store unaligned. This is a tie-break decision. It is equally good for the objective function to leave English the unaligned instead; either way, there are two relevant dictionary entries.
We fix this problem by introducing a special NULL Turkish token, and by modifying
the IP to require every English token to align (either to NULL or something else). This
introduces a cost for failing to align an English token x to Turkish, because a new x/NULL
dictionary entry will have to be created. (The NULL token itself is unconstrained in how
many tokens it may align to.)
Under this scheme, the last two incorrect alignments in Figure 4.15 induce four rele-
vant dictionary entries (the/tiyatro, the/magaza, theatre/NULL, store/NULL) while the
gold alignment induces only three (the/NULL, theatre/tiyatro, store/magaza), because
the/NULL is re-used. The gold alignment is therefore now preferred by the IP optimizer.
There is a rippling effect, causing the system to correct many other decisions as well.
This revision raises the alignment f-score from 61.4 to 83.4.
Figure 4.16 summarizes our alignment results. In the table, "Dictionary" refers to the size of the induced dictionary, and "Sub-words" refers to the number of induced Turkish
sub-word tokens.
Our search for an optimal IP solution is not fast. It takes 1-5 hours to perform sub-
word alignment on the Turkish-English corpus (1616 sentence pairs). Of course, obtaining
optimal alignments under IBM Model 4 is itself NP-complete (Raghavendra &
Maji, 2006), so practical Model 4 systems make search approximations (Brown et al.,
1993).
Method                     Dictionary   Sub-words   f-score
Gold (sub-word)             67          8102        100.0
Monotone (word)            512          4851          5.5
IBM-1 (word)               220          4851         21.6
IBM-4 (word)               230          4851         20.3
IP (word)                  107          4851         20.1
IP (sub-word, initial)      60          7418         61.4
IP (sub-word, revised)      65          8105         83.4

Figure 4.16: Comparison of different systems on the Turkish-English sub-word alignment task in terms of dictionary size, count of Turkish sub-word tokens, and f-score accuracy.
Conclusion: We have presented a novel objective function for alignment based on mini-
mizing dictionary entries, and we have applied it to whole-word and sub-word alignment
problems. We believe there are good future possibilities for this work such as extending
the notion of legal alignment to cover n-to-m and discontinuous alignments, developing a
fast and approximate search algorithm, and testing on large-scale bilingual corpora.
4.5 Occam's Razor and MDL
For the tasks described in Sections 4.2, 4.3 and 4.4, we developed methods which explicitly
try to search for small models. This idea relates to the famous Occam's Razor principle
which states that, to explain any observed phenomenon, choose a hypothesis that makes
as few assumptions as possible, or in other words, choose the simplest explanation.
Our idea also resembles the classic Minimum Description Length (MDL) approach
for model selection (Barron et al., 1998). In MDL, there is a single objective function
to (1) maximize the likelihood of observing the data, and at the same time (2) minimize
the length of the model description (which depends on the model size). However, the
search procedure for MDL is non-trivial and for our unsupervised tasks such as tagging
(Section 4.2), we have not found a direct objective function which we can optimize and
produce good tagging results.
In the past, only a few approaches utilizing MDL have been shown to work for natural
language applications. These approaches employ heuristic search methods with MDL for
the task of unsupervised learning of morphology of natural languages (Goldsmith, 2001;
Creutz & Lagus, 2002; Creutz & Lagus, 2005b). The method proposed here is the first
application of the MDL idea to multiple natural language problems (such as POS tagging,
supertagging and word/sub-word alignment), and the rst to use an integer programming
formulation rather than heuristic search techniques.
4.6 Fast, Approximate Algorithms for Model Minimization
So far, we have seen how several unsupervised tasks in NLP can be tackled with decipherment approaches based on a single underlying idea: "searching for minimized models". All the methods we discussed use integer programming (IP) formulations, which provide us the benefit of exact solutions (in terms of the minimization objective function) and also result in high task accuracies. Unfortunately, using the IP framework for model minimization also comes with a cost: solving an integer program is an NP-complete problem. This can prove to be very slow or even intractable in some cases, especially when dealing with large data sizes. If we can design fast approximation algorithms which achieve the same objective without sacrificing performance, this can be beneficial for scaling our current decipherment methods to bigger problems.
In (Ravi et al., 2010b), we propose a novel two-stage greedy approximation scheme
that optimizes the same minimization objective function and replaces the IP for model
minimization. We use the unsupervised part-of-speech tagging task (from Section 4.2) as
a benchmark to test the effectiveness of our approach and compare it to the original IP
method. Our method runs much faster than IP and can easily scale to large data sizes,
while yielding highly accurate tagging results. We also compare our method against
standard EM training, and show that we consistently obtain better tagging accuracies on
test data of varying sizes for English and Italian.
4.6.1 Model Minimization Formulated as a Path Problem
The complexity of the model minimization step from Section 4.2.2 (Ravi & Knight, 2009c)
and its proposed approximate solution can be best understood if we formulate it as a path
problem in a graph.
Let w = w_0, w_1, ..., w_N, w_{N+1} be a word sequence, where w_1, ..., w_N are the input word tokens. Let T = {T_1, ..., T_K} ∪ {T_0, T_{K+1}} be the fixed set of all possible tags. T_0 and T_{K+1} are special tags that we add for convenience; these are the start and end tags that one typically adds to the HMM lattice. The tag dictionary D contains entries of the form (w_i, T_j) for all the possible tags T_j that word token w_i can have. We add entries (w_0, T_0) and (w_{N+1}, T_{K+1}) to D. Given this input, we now create a directed graph G(V, E). Let C_0, C_1, ..., C_{N+1} be columns of nodes in G, where column C_i corresponds to word token w_i. For all i = 0, ..., N+1 and all j = 0, ..., K+1, we add node C_{i,j} in column C_i if (w_i, T_j) ∈ D. Now, for all i = 0, ..., N, we create directed edges from every node in C_i to every node in C_{i+1}. Each of these edges e = (C_{i,j}, C_{i+1,k}) is given the label (T_j, T_k), which
Figure 4.17: Graph instantiation for the MinTagPath problem.
corresponds to a tag bigram. This creates our directed graph. Let l(e) be the tag bigram label of edge e ∈ E. For every path P from C_{0,0} to C_{N+1,K+1}, we say that P uses an edge label (or tag bigram) (T_j, T_k) if there exists an edge e in P such that l(e) = (T_j, T_k). We can now formulate the optimization problem as: find the smallest set S of tag bigrams such that there exists at least one path from C_{0,0} to C_{N+1,K+1} using only the tag bigrams in S. Let us call this the Minimal Tag Bigram Path (MinTagPath) problem.
Figure 4.17 shows an example graph where the input word sequence is w_1, ..., w_4 and T = {T_1, ..., T_3} is the input tagset. We add the start/end word tokens {w_0, w_5} and the corresponding tags {T_0, T_4}. The edges in the graph are instantiated according to the word/tag dictionary D provided as input. The node and edge labels are also illustrated in the graph. Our goal is to find a path from C_{0,0} to C_{5,4} using the smallest set of tag bigrams.
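The graph construction itself is mechanical. The small sketch below uses a made-up word sequence and tag dictionary (not the Figure 4.17 example) to show how the columns, nodes, and labeled edges come about:

```python
# Toy input: w_0 ... w_{N+1} plus a tag dictionary D; both are hypothetical.
words = ["<s>", "the", "cat", "runs", "</s>"]
tag_dict = {"<s>": {"T0"}, "the": {"DT"}, "cat": {"NN", "VB"},
            "runs": {"NN", "VB"}, "</s>": {"T_end"}}

nodes = [(i, t) for i, w in enumerate(words) for t in tag_dict[w]]   # node C_{i,t} iff (w_i, t) in D
edges = {}                                                           # (from_node, to_node) -> tag bigram label
for i in range(len(words) - 1):
    for t1 in tag_dict[words[i]]:
        for t2 in tag_dict[words[i + 1]]:
            edges[(i, t1), (i + 1, t2)] = (t1, t2)

print(len(nodes), "nodes;", len(edges), "edges;",
      len(set(edges.values())), "candidate tag bigrams")
```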
Problem complexity: In (Ravi et al., 2010b), we prove that the MinTagPath problem can be solved in polynomial time using a simple brute-force approach, but the time complexity incurred is prohibitively large. For the Penn Treebank, K = 45, and the worst-case running time would be about 10^637 N. Evidently, for all practical purposes, this approach is intractable.
4.6.2 Greedy Model Minimization
We do not know of an efficient, exact algorithm to solve the MinTagPath problem. Therefore, we present a simple and fast two-stage greedy approximation scheme. Notice that an optimal path P (or any path) covers all the input words, i.e., every word token w_i has one of its possible taggings in P. Exploiting this property, in the first phase, we set our goal to cover all the word tokens using the least possible number of tag bigrams. This can be cast as a set cover problem (Garey & Johnson, 1979), and we use the set cover greedy approximation algorithm in this stage. The output tag bigrams from this phase might still not allow any path from C_{0,0} to C_{N+1,K+1}. So we carry out a second phase, where we greedily add a few tag bigrams until a path is created.
Phase 1: Greedy Set Cover
In this phase, our goal is to cover all the word tokens using the least number of tag bigrams. The covering problem is exactly that of set cover. Let U = {w_0, ..., w_{N+1}} be the set of elements that need to be covered, in this case, the word tokens. For all tag bigrams (T_i, T_j) ∈ B, we define the corresponding covering set
S_{T_i,T_j} = {w_n : [(w_n, T_i) ∈ D ∧ (C_{n,i}, C_{n+1,j}) ∈ E ∧ l(C_{n,i}, C_{n+1,j}) = (T_i, T_j)] ∨ [(w_n, T_j) ∈ D ∧ (C_{n-1,i}, C_{n,j}) ∈ E ∧ l(C_{n-1,i}, C_{n,j}) = (T_i, T_j)]}.
Algorithm 1 Set Cover : Phase 1
Definitions:
  CAND : set of candidate covering sets in the current iteration
  U_rem : set of elements in U remaining to be covered
  E_{S_{T_i,T_j}} : current effective cost of a set
  Itr : iteration number
Initializations:
  LET CAND = X
  LET CHOSEN = {}
  LET U_rem = U
  LET Itr = 0
  LET E_{S_{T_i,T_j}} = 1 / |S_{T_i,T_j}|, for all S_{T_i,T_j} ∈ CAND
while U_rem ≠ {} do
  Itr = Itr + 1
  Define Ŝ_Itr = argmin_{S_{T_i,T_j} ∈ CAND} E_{S_{T_i,T_j}}
  CHOSEN = CHOSEN ∪ {Ŝ_Itr}
  Remove Ŝ_Itr from CAND
  Remove all the current elements in Ŝ_Itr from U_rem
  Remove all the current elements in Ŝ_Itr from every S_{T_i,T_j} ∈ CAND
  For all S_{T_i,T_j} ∈ CAND, E_{S_{T_i,T_j}} = 1 / |S_{T_i,T_j}|
end while
return CHOSEN
Let the set of covering sets be X. We assign a cost of 1 to each covering set. The goal is to select a subset CHOSEN ⊆ X such that ∪_{S_{T_i,T_j} ∈ CHOSEN} S_{T_i,T_j} = U, minimizing the total cost of CHOSEN. This corresponds to covering all the words with the least possible number of tag bigrams. We now use the greedy approximation algorithm for set cover. The pseudocode is shown in Algorithm 1.
For the graph shown in Figure 4.17, here are a few possible covering sets (S_{T_i,T_j}) and their initial effective costs (E_{S_{T_i,T_j}}):
S_{T_0,T_1} = {w_0, w_1},            E_{S_{T_0,T_1}} = 1/2
S_{T_1,T_2} = {w_1, w_2, w_3, w_4},  E_{S_{T_1,T_2}} = 1/4
S_{T_2,T_2} = {w_2, w_3, w_4},       E_{S_{T_2,T_2}} = 1/3
...
In every iteration Itr of Algorithm 1, we pick a set Ŝ_Itr that is most cost effective. The elements that Ŝ_Itr covers are then removed from all the remaining candidate sets and from U_rem, and the effectiveness of the candidate sets is recalculated for the next iteration. The algorithm stops when all elements of U, i.e., all the word tokens, are covered. Let B_CHOSEN = {(T_i, T_j) : S_{T_i,T_j} ∈ CHOSEN} be the set of tag bigrams that have been chosen by set cover. Now we check, using BFS, whether there exists a path from C_{0,0} to C_{N+1,K+1} using only the tag bigrams in B_CHOSEN. If not, then we have to add tag bigrams to B_CHOSEN to enable a path. To accomplish this, we carry out the second phase of this scheme with another greedy approach.
For the example graph in Figure 4.17, one possible solution is B_CHOSEN = {(T_0, T_1), (T_1, T_2), (T_2, T_4)}.
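A compact sketch of Phase 1 is given below. The covering sets are hypothetical toy data chosen to mirror the Figure 4.17 example, and picking the minimum effective cost 1/|S| is the same as picking the set that covers the most remaining tokens.

```python
covering_sets = {                         # tag bigram -> word-token positions it can cover
    ("T0", "T1"): {0, 1},
    ("T0", "T3"): {0, 1},
    ("T1", "T2"): {1, 2, 3, 4},
    ("T2", "T2"): {2, 3, 4},
    ("T2", "T4"): {4, 5},
    ("T1", "T3"): {1, 2},
}

def greedy_set_cover(sets, universe):
    remaining, chosen = set(universe), []
    candidates = {bigram: set(positions) for bigram, positions in sets.items()}
    while remaining:                      # assumes the candidate sets can cover the universe
        # effective cost of a set is 1 / |elements it still covers|; take the cheapest
        best = max(candidates, key=lambda b: len(candidates[b] & remaining))
        chosen.append(best)
        remaining -= candidates.pop(best)
    return chosen

print(greedy_set_cover(covering_sets, range(6)))
# e.g. [('T1', 'T2'), ('T0', 'T1'), ('T2', 'T4')], covering word tokens w_0 ... w_5
```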
Phase 2: Greedy Path Completion
We define the graph G_CHOSEN(V', E') ⊆ G(V, E) that contains the edges e ∈ E such that l(e) ∈ B_CHOSEN.
Let B_CAND = B \ B_CHOSEN be the current set of candidate tag bigrams that can be added to the final solution to create a path. We would like to know how many holes a particular tag bigram (T_i, T_j) can fill. We define a hole as an edge e such that e ∈ G \ G_CHOSEN and there exist e', e'' ∈ G_CHOSEN such that tail(e') = head(e) ∧ tail(e) = head(e'').
Figure 4.18: Graph constructed with tag bigrams chosen in Phase 1 of the MIN-GREEDY
method.
Figure 4.18 illustrates the graph G_CHOSEN using tag bigrams from the example solution to Phase 1 (Section 4.6.2). The dotted edge (C_{2,2}, C_{3,1}) represents a hole, which has to be filled in the current phase in order to complete a path from C_{0,0} to C_{5,4}.
In Algorithm 2, we define the effectiveness of a candidate tag bigram, H(T_i, T_j), to be the number of holes it covers. In every iteration, we pick the most effective tag bigram, fill the holes, and recalculate the effectiveness of the remaining candidate tag bigrams. Algorithm 2 returns B_FINAL, the final set of chosen tag bigrams. It terminates when a path has been found.
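A rough sketch of Phase 2 under the same toy representation follows: the graph is a dict mapping directed edges (u, v) to their tag-bigram labels. Everything here is illustrative, not the thesis code, and it assumes the full edge set admits at least one source-to-target path.

```python
from collections import deque

def has_path(edges, allowed_bigrams, source, target):
    """BFS restricted to edges whose tag-bigram label is in allowed_bigrams."""
    adj = {}
    for (u, v), label in edges.items():
        if label in allowed_bigrams:
            adj.setdefault(u, []).append(v)
    frontier, seen = deque([source]), {source}
    while frontier:
        node = frontier.popleft()
        if node == target:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def greedy_path_complete(edges, b_chosen, b_cand, source, target):
    b_final, b_cand = set(b_chosen), set(b_cand)
    while not has_path(edges, b_final, source, target):
        chosen_edges = [uv for uv, l in edges.items() if l in b_final]
        dests = {v for (u, v) in chosen_edges}      # nodes some chosen edge points into
        srcs = {u for (u, v) in chosen_edges}       # nodes some chosen edge leaves from
        def holes(bigram):                          # holes this candidate bigram would fill
            return sum(1 for (u, v), l in edges.items()
                       if l == bigram and u in dests and v in srcs)
        best = max(b_cand, key=holes)
        b_final.add(best)
        b_cand.discard(best)
    return b_final
```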
Fitting the Model
Once the greedy algorithm terminates and returns a minimized grammar of tag bigrams, we follow our previous approach from (Ravi & Knight, 2009c) and fit the minimized
Algorithm 2 Greedy Path Complete : Phase 2
Define B_FINAL : final set of tag bigrams selected by the two-phase greedy approach
LET B_FINAL = B_CHOSEN
LET H(T_i, T_j) = |{e : l(e) = (T_i, T_j) and e is a hole}|, for all (T_i, T_j) ∈ B_CAND
while there is no path P from C_{0,0} to C_{N+1,K+1} using only tag bigrams in B_FINAL do
  Define (T̂_i, T̂_j) = argmax_{(T_i, T_j) ∈ B_CAND} H(T_i, T_j)
  B_FINAL = B_FINAL ∪ {(T̂_i, T̂_j)}
  Remove (T̂_i, T̂_j) from B_CAND
  G_CHOSEN = G_CHOSEN ∪ {e : l(e) = (T̂_i, T̂_j)}
  For all (T_i, T_j) ∈ B_CAND, recalculate H(T_i, T_j)
end while
return B_FINAL
model to the data using the alternating EM strategy. The alternating EM iterations are
terminated when the change in the size of the observed grammar (i.e., the number of
unique tag bigrams in the tagging output) is at most 5%. We call our method MIN-GREEDY.
4.6.3 Experiments and Results
English POS Tagging
Data: We use a standard test set (consisting of 24,115 word tokens from the Penn
Treebank) for the POS tagging task (described in Section 6.1). The tagset consists of 45
distinct tag labels and the dictionary contains 57,388 word/tag pairs derived from the
entire Penn Treebank. Per-token ambiguity for the test data is about 1.5 tags/token. In
addition to the standard 24k dataset, we also train and test on larger data sets of 48k,
96k, 193k, and the entire Penn Treebank (973k).
Method        Tagging accuracy (%) when training & testing on:
              24k     48k     96k     193k    PTB (973k)
EM            81.7    81.4    82.8    82.0    82.3
IP            91.6    89.3    89.5    91.6    -
MIN-GREEDY    91.6    88.9    89.4    89.1    87.1

Figure 4.19: Results for unsupervised English POS tagging with a dictionary for different data sizes when using a set of 45 tags. (The IP method does not scale to large data.)
Methods: We perform comparative evaluations for POS tagging using three different methods:
1. EM: Training a bigram HMM model using the EM algorithm.
2. IP: Minimizing grammar size using integer programming, followed by EM training
(Ravi & Knight, 2009c).
3. MIN-GREEDY: Minimizing grammar size using the Greedy method described in
Section 4.6.2, followed by EM training.
Results: Figure 4.19 shows the tagging performance (word token accuracy %) achieved
by the three methods on the standard test (24k tokens) as well as Penn Treebank test
(PTB = 973k tokens). On the 24k test data, the MIN-GREEDY method achieves a high
tagging accuracy comparable to the previous best from the IP method. However, the IP
method does not scale well which makes it infeasible to run this method in a much larger
data setting (the entire Penn Treebank). MIN-GREEDY on the other hand, faces no such
problem and in fact it achieves high tagging accuracies on all four datasets, consistently
beating EM by significant margins. When tagging all the 973k word tokens in the Penn
Test set          Efficiency (running time in secs.)
                  IP       MIN-GREEDY
24k test          312      24
48k test          167      57
96k test          384      145
193k test         2347     331
PTB (973k) test   -        1485

Figure 4.20: Comparison of the IP versus MIN-GREEDY approach in terms of efficiency (running time in seconds) for different data sizes. All the experiments were run on a single machine with a 64-bit, 2.4 GHz AMD Opteron 850 processor.
Treebank data, it produces an accuracy of 87.1% which is much better than EM (82.3%)
run on the same data.
Figure 4.20 compares the running-time efficiency of the IP method versus the MIN-GREEDY method as we scale to larger datasets. The figure shows that the greedy approach can scale comfortably to large data sizes, and a complete run on the entire Penn Treebank data finishes in just 1485 seconds. In contrast, the IP method does not scale well: it takes 312 seconds to finish on the 24k test (versus 24 seconds for MIN-GREEDY), and on the larger PTB test data, the IP solver runs for more than 3 hours without returning a solution.
It is interesting to see that for the 24k dataset, the greedy strategy finds a grammar set containing only 478 tag bigrams. We observe that MIN-GREEDY produces 452 tag bigrams in the first minimization step (phase 1), and phase 2 adds another 26 entries, yielding a total of 478 tag bigrams in the final minimized grammar set. That is almost as good as the optimal solution (459 tag bigrams from IP) for the same problem. But MIN-GREEDY clearly has an advantage since it runs much faster than IP (as shown in Figure 4.20). Figure 4.21 shows a plot with the size of the observed grammar (i.e.,
Figure 4.21: Comparison of observed grammar size (# of tag bigram types) in the final English tagging output from EM, IP and MIN-GREEDY.
number of tag bigram types in the final tagging output) versus the size of the test data for the EM, IP and MIN-GREEDY methods. The figure shows that unlike EM, the other two approaches reduce the grammar size considerably, and we observe the same trend even when scaling to larger data. Minimizing the grammar size helps remove many spurious tag combinations from the grammar set, thereby yielding huge improvements in tagging accuracy over the EM method (Figure 4.19). We observe that for the 193k dataset, the final observed grammar size is greater for IP than MIN-GREEDY. This is because the alternating EM steps following the model minimization step add more tag bigrams to the grammar.
Test set     Speedup   Optimality ratio
24k test     13.0      0.96
48k test     2.9       0.98
96k test     2.7       0.98
193k test    7.1       0.93

Figure 4.22: Speedup versus optimality ratio computed for the model minimization step (when using MIN-GREEDY over IP) on different English datasets.
Method        Tagging accuracy (%)   Number of unique tag bigrams in final tagging output
EM            83.4                   1195
IP            88.0                    875
MIN-GREEDY    88.0                    880

Figure 4.23: Results for unsupervised Italian POS tagging with a dictionary using a set of 90 tags.
We compute the optimality ratio (with respect to grammar size) versus speedup (run-
ning time) achieved in the minimization step for the two approaches. Figure 4.22 illus-
trates that our solution is nearly optimal for all data settings with significant speedup.
Italian POS Tagging
We also compare the three approaches for Italian POS tagging and show results.
Data: We use the Italian CCG-TUT corpus (Bos et al., 2009), which contains 1837
sentences. It has three sections: newspaper texts, civil code texts and European law texts
from the JRC-Acquis Multilingual Parallel Corpus. For our experiments, we use the POS-
tagged data from the CCG-TUT corpus which uses a set of 90 tags. We created a tag
dictionary consisting of 8,733 word/tag pairs derived from the entire corpus (42,100 word
tokens). We then created a test set consisting of 926 sentences (21,878 word tokens) from
the original corpus. The per-token ambiguity for the test data is about 1.6 tags/token.
Results: Figure 4.23 shows the results on Italian POS tagging. We observe that MIN-GREEDY achieves significant improvements in tagging accuracy over the EM method and is comparable to the IP method. This also shows that the idea of model minimization is a general-purpose technique for such applications and provides good tagging accuracies on other languages as well.
4.7 Conclusion
To summarize, we find that keeping model sizes small using prior linguistic knowledge (as in tagging with a dictionary) or problem-specific constraints (as in word alignment) seems to be a promising idea for decipherment. In unsupervised task settings, rather than fully enumerating all possible parameters and searching over the entire model space, it is useful to restrict the decipherment process to instead search over a smaller sub-space. The idea is simple: when faced with uncertainty over a set of possible model choices for explaining some given data, pick one that is small in size and at the same time can model all observations seen in the data. It is encouraging to find that this idea also results in good performance for the various NLP tasks that we have looked at. Here is a brief review of the accuracy improvements achieved on NLP tasks owing to our methods for keeping models small during decipherment:
keeping models small during decipherment:
Letter substitution decipherment = +10-64%
Unsupervised POS tagging with a dictionary = +1-4% improvement over state-
of-the-art systems, achieving a high 92.3% accuracy without using any additional
linguistic knowledge.
Unsupervised supertagging (using CCG) with a dictionary = +3-4 % improvement
over the previous approach (Baldridge, 2008), without using any additional linguis-
tic knowledge.
Word/Sub-word alignment = +9-16% (on word alignment) and +63% (on sub-word
alignment) over existing approaches (GIZA++ implementation) using IBM Models.
Chapter 5
Phonetic Decipherment (Transliteration)
We now discuss the application of decipherment to another important NLP task where
we have to deal with multiple languages. In this chapter, we present our work from
(Ravi & Knight, 2009b)|a decipherment method for translating names and technical
terms across dierent languages, without using parallel resources. We frame the name
translation task in a novel manner by casting it as a phonetic decipherment problem
and show that it is possible to learn cross-language phoneme mapping tables using only
monolingual resources. We compare our method to an earlier approach that uses parallel
data for training (which is the case with most existing methods), and show that we can
achieve good results on a standard name translation task with the use of non-parallel
data alone (which is surprisingly only 26% worse in terms of accuracy compared to the
parallel-trained system).
5.1 Introduction
Following letter and syllable substitution ciphers described in the previous chapter, we
now move on to phonetic substitution ciphers, where the substitution operations during
encipherment/decipherment take place at the level of phonemes (linguistic units of sound),
instead of a letter or a syllable. The table below shows the key characteristics of the
phonetic substitution ciphers (Phonetic sub.) that we describe in this chapter, comparing them to the ciphers discussed previously.
Key Characteristics (columns: Simple sub. / Homophonic sub. / Unsupervised POS Tagging / Word-Sub-word Alignment / Phonetic sub. (Machine Transliteration)):
- Is the key deterministic in the enciphering direction?   Yes / No / No / No / No
- Is the key deterministic in the deciphering direction?   Yes / Yes / No / No / No
- Does the key substitute one-for-one (symbol for symbol) in the enciphering direction?   Yes / Yes / Yes / No / No
- Does the key substitute one-for-one in the deciphering direction?   Yes / Yes / Yes / No / No
- What linguistic unit is substituted?   Letter or Syllable / Letter / Word / Morpheme, Word or Phrase / Phoneme(s)
- Does the key involve transposition (re-ordering)?   No / No / No / Yes / No
Phonetic ciphers have clear applications in speech and natural language processing
involving phonological analysis, and one such task is machine transliteration. Machine
transliteration refers to the transport of names and terms between languages with dif-
ferent writing systems and phoneme inventories. We can frame this task as a phonetic
decipherment problem, where learning a ciphertext-to-plaintext key mapping corresponds
to learning sound translation patterns between languages.
For example, in order to transliterate names from English to Japanese such as the
one shown below:
ABRAHAM <-> エーブラハム
pronounced as: EY B R AH HH AE M <-> e e b u r a h a m u
some of these key mappings learnt during the decipherment might look like this:
English EY -> Japanese {e, e e, ...}
English B -> Japanese {b, b u, ...}
English L -> Japanese {r, r u, ...}
Decipherment under the conditions of transliteration is substantially more difficult than solving letter-substitution ciphers (Knight et al., 2006; Ravi & Knight, 2008; Ravi & Knight, 2009d) or phoneme-substitution ciphers described in (Knight & Yamada, 1999). This is because the target table contains significant non-determinism, and because each
symbol has multiple possible fertilities, which introduces uncertainty about the length of
the target string.
5.2 Previous Work on Machine Transliteration
Recently there has been a large amount of interesting work done in the area of machine
transliteration, and the literature has outgrown being citable in its entirety. Much of this
work focuses on back-transliteration, which tries to restore a name or term that has been
transported into a foreign language. Here, there is often only one correct target spelling; for example, given jyon.kairu (the name of a U.S. Senator transported to Japanese), we must output "Jon Kyl", not "John Kyre" or any other variation.
There are many techniques for transliteration and back-transliteration, and they vary
along a number of dimensions:
phoneme substitution vs. character substitution
heuristic vs. generative vs. discriminative models
manual vs. automatic knowledge acquisition
We explore the third dimension, where we see several techniques in use:
Manually-constructed transliteration models, e.g., (Hermjakob et al., 2008).
Models constructed from bilingual dictionaries of terms and names, e.g., (Knight &
Graehl, 1998; Huang et al., 2004; Li et al., 2004; Yoon et al., 2007; Li et al., 2007;
Karimi et al., 2007; Sherif & Kondrak, 2007b; Goldwasser & Roth, 2008b).
Extraction of parallel examples from bilingual corpora, using bootstrap dictionaries
e.g., (Sherif & Kondrak, 2007a; Goldwasser & Roth, 2008a).
Extraction of parallel examples from comparable corpora, using bootstrap dictio-
naries, and temporal and word co-occurrence, e.g., (Sproat et al., 2006; Klementiev
& Roth, 2008).
[Figure 5.1 diagram: cascade WFSA A -> WFST B -> WFST C -> WFST D]
English word sequence:       ( SPENCER ABRAHAM )
English sound sequence:      ( S P EH N S ER EY B R AH HH AE M )
Japanese sound sequence:     ( S U P E N S A A E E B U R A H A M U )
Japanese katakana sequence:  ( ス ペ ン サ ー ・ エ ー ブ ラ ハ ム )
Figure 5.1: Model used for back-transliteration of Japanese katakana names and terms into English. The model employs a four-stage cascade of weighted finite-state transducers (Knight & Graehl, 1998).
Extraction of parallel examples from web queries, using bootstrap dictionaries, e.g.,
(Nagata et al., 2001; Oh & Isahara, 2006; Kuo et al., 2006; Wu & Chang, 2007).
Comparing terms from different languages in phonetic space, e.g., (Tao et al., 2006;
Goldberg & Elhadad, 2008).
In the next section, we describe our experiment framework, followed by details about
the task that we are trying to attack in Section 5.4. We then investigate methods to
acquire transliteration mappings from non-parallel sources (Ravi & Knight, 2009b).
5.3 Probabilistic Machine Transliteration
We follow (Knight & Graehl, 1998) in tackling back-transliteration of Japanese katakana
expressions into English. Knight and Graehl (1998) developed a four-stage cascade of
finite-state transducers, shown in Figure 5.1.
WFSA A - produces an English word sequencew with probability P(w) (based on
a unigram word model).
WFST B - generates an English phoneme sequence e corresponding to w with
probability P(e | w).
English phoneme e -> Japanese phoneme sequences j, with P(j | e):
AA -> o 0.49, a 0.46, o o 0.02, a a 0.02
AY -> a i 0.84, i 0.09, a 0.03, i y 0.01, a y 0.01, e 0.01
EH -> e 0.94, a 0.03
HH -> h 0.95, w 0.02, h a 0.02
L  -> r 0.62, r u 0.37
OY -> o i 0.89, o e 0.04, o 0.04, i 0.04
SH -> sh y 0.33, sh 0.31, y u 0.17, ssh y 0.12, sh i 0.04, ssh 0.02
V  -> b 0.75, b u 0.17, w 0.03, a 0.02
...
AW -> a u 0.69, a w 0.15, a o 0.06, a 0.04, u u 0.02, o o 0.02, o 0.02
DH -> z 0.87, z u 0.08, a z 0.04
G  -> g 0.66, g u 0.19, gg u 0.1, g y 0.03, gg 0.01, g a 0.01
K  -> k 0.53, k u 0.2, kk u 0.16, kk 0.05, k y 0.02, k i 0.01
OW -> o 0.57, o o 0.39, o u 0.02
S  -> s u 0.43, s 0.37, sh 0.08, u 0.05, ss 0.02, ssh 0.01
UW -> u u 0.67, u 0.29, y u 0.02
ZH -> j y 0.43, j i 0.29, j 0.29

Figure 5.2: Samples from the phonemic substitution table learnt from 3343 parallel English/Japanese phoneme string pairs. English phonemes are in uppercase, Japanese in lowercase. Mappings with P(j | e) > 0.01 are shown.
WFST C - transforms the English phoneme sequence into a Japanese phoneme sequence j according to a model P(j | e).
WFST D - writes out the Japanese phoneme sequence into Japanese katakana characters according to a model P(k | j).
Using the cascade in the reverse (noisy-channel) direction, we can translate new
katakana names and terms into English. The only transducer that requires training
in our experimental framework is WFST C.
Results from Parallel Training: We evaluate the performance of the parallel-trained
system from Knight and Graehl (1998). We re-implement their basic method by in-
stantiating a densely-connected version of WFST C with all 1-to-1 and 1-to-2 phoneme
connections between English and Japanese. We then train the WFST C model on 3343
phoneme string pairs from a bilingual dictionary, using the EM algorithm. Figure 5.2
shows the phonemic substitution table (sample mappings) learnt from parallel training.
We use this parallel-trained WFST C model and apply it to the task of translating 100
U.S. Senator names from Japanese to English. We obtain 40% error, roughly matching
the performance observed in (Knight & Graehl, 1998).
5.4 Transliterating without Parallel Data using Phonetic
Decipherment
Task and Data: Our task is to learn the mappings in Figure 5.2, but without parallel
data, and to test those mappings in end-to-end transliteration. We imagine our problem as one faced by a monolingual English speaker wandering around Japan, reading a multitude of katakana signs, listening to people speak Japanese, and eventually deciphering those signs into English. To quote Warren Weaver in this context:
"When I look at a corpus of Japanese katakana, I say to myself, this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."
Our larger motivation is to move toward easily-built transliteration systems for all
language pairs, regardless of parallel resources. While Japanese/English transliteration
has its own particular features, we believe it is a reasonable starting point.
Our monolingual resources are:
1. 43717 unique Japanese katakana sequences collected from web newspaper data. We
split multi-word katakana phrases on the center-dot ("・") character, and select a
[Figure 5.3 content: four columns of romanized Japanese phoneme sequences from the katakana corpus (entries such as A A K U P U R A Z A and A A T I S U T O), flattened in extraction.]
Figure 5.3: Some Japanese phoneme sequences generated from the monolingual katakana
corpus using WFST D.
final corpus of 9350 unique sequences. We add monolingual Japanese versions of
the 2008 U.S. Senate roster.
2. The CMU pronunciation dictionary of English, with 112,151 entries.
3. The English gigaword corpus. Knight and Graehl (1998) already use frequently-
occurring capitalized words to build the WFSA A component of their four-stage
cascade.
We seek to use our English knowledge (derived from 2 and 3) to decipher the Japanese
katakana corpus (1) into English. Figure 5.3 shows a portion of the Japanese corpus,
which we transform into Japanese phoneme sequences using the monolingual resource of
WFST D. We note that the Japanese phoneme inventory contains 39 unique ("ciphertext") symbols, compared to the 40 English ("plaintext") phonemes.
Our goal is to compare and evaluate the WFST C model learnt under two different scenarios: (a) using parallel data, and (b) using monolingual data. For each experiment, we train only the WFST C model and then apply it to the name transliteration task, decoding 100 U.S. Senator names from Japanese to English using the automata shown in Figure 5.1. For all experiments, we keep the rest of the models in the cascade (WFSA A,
WFST B, and WFST D) unchanged. We evaluate on whole-name error-rate (maximum
of 100/100) as well as normalized word edit distance, which gives partial credit for getting
the first or last name correct.
Acquiring Phoneme Mappings from Non-Parallel Data
Our main data consists of 9350 unique Japanese phoneme sequences, which we can con-
sider as a single long sequence j. As suggested by Knight et al. (2006), we explain
the existence of j as the result of someone initially producing a long English phoneme
sequence e, according to P(e), then transforming it into j, according to P(j|e). The
probability of our observed data P(j) can be written as:
P(j) = Σ_e P(e) · P(j|e)
We take P(e) to be some fixed model of monolingual English phoneme production, represented as a weighted finite-state acceptor (WFSA). P(j|e) is implemented as the initial, uniformly-weighted WFST C described in Section 5.3, with 15320 phonemic connections.
We next maximize P(j) by manipulating the substitution table P(j|e), aiming to produce a result such as the one shown in Figure 5.2. We accomplish this by composing the English phoneme model P(e) WFSA with the P(j|e) transducer. We then use the EM algorithm to train just the P(j|e) parameters (inside the composition that predicts j), and guess the values for the individual phonemic substitutions that maximize the likelihood of the observed data P(j). We allow EM to run until the P(j) likelihood ratio between subsequent training iterations reaches 0.9999, and we terminate early if 200 iterations are reached.
Phonemic Substitution Model | Whole-name error | Norm. edit distance
1  e→j = {1-to-1, 1-to-2} + EM aligned with parallel data | 40 | 25.9
2a e→j = {1-to-1, 1-to-2} + decipherment training with 2-gram English P(e) | 100 | 100.0
2b e→j = {1-to-1, 1-to-2} + decipherment training with 2-gram English P(e) + consonant-parity | 98 | 89.8
2c e→j = {1-to-1, 1-to-2} + decipherment training with 3-gram English P(e) + consonant-parity | 94 | 73.6
2d e→j = {1-to-1, 1-to-2} + decipherment training with a word-based English P(e) + consonant-parity | 77 | 57.2
2e e→j = {1-to-1, 1-to-2} + decipherment training with a word-based English P(e) + consonant-parity + initialize mappings having consonant matches with higher probability weights | 73 | 54.2
2f e→j = {1-to-1, 1-to-2} + decipherment training with a word-based English P(e) trained on more data (98k versus 76k English sequences) + consonant-parity + initialize mappings having consonant matches with higher probability weights | 66 | 49.3
Figure 5.4: Results on name transliteration obtained when using the phonemic substitution model trained under different scenarios: (1) parallel training data, (2a-f) using only monolingual resources.
Finally, we decode our test set of U.S. Senator names. Following Knight et al. (2006),
we stretch out the P(j|e) model probabilities after decipherment training and prior to
decoding our test set, by cubing their values.
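To make the training step concrete, here is a minimal sketch of EM for a memoryless substitution channel with a fixed bigram source model, restricted to 1-to-1 mappings (the actual system also allows 1-to-2 mappings and is implemented with weighted finite-state machinery); the phoneme inventories, bigram values and function names below are illustrative and not taken from the thesis.

from collections import defaultdict

def em_train_channel(cipher_seqs, e_syms, bigram, channel, iterations=20):
    # Re-estimate P(j|e) to maximize P(j) = sum_e P(e) * P(j|e), keeping the
    # English bigram source model fixed (forward-backward over a 1-to-1 channel).
    j_syms = {j for seq in cipher_seqs for j in seq}
    for _ in range(iterations):
        counts = defaultdict(float)                        # expected (e, j) counts
        for seq in cipher_seqs:
            n = len(seq)
            alpha = [defaultdict(float) for _ in range(n)]  # forward probabilities
            for e in e_syms:
                alpha[0][e] = bigram[('<s>', e)] * channel[(e, seq[0])]
            for t in range(1, n):
                for e in e_syms:
                    emit = channel[(e, seq[t])]
                    if emit > 0.0:
                        alpha[t][e] = emit * sum(alpha[t-1][p] * bigram[(p, e)]
                                                 for p in e_syms)
            beta = [defaultdict(float) for _ in range(n)]   # backward probabilities
            for e in e_syms:
                beta[n-1][e] = 1.0
            for t in range(n-2, -1, -1):
                for e in e_syms:
                    beta[t][e] = sum(bigram[(e, q)] * channel[(q, seq[t+1])] *
                                     beta[t+1][q] for q in e_syms)
            total = sum(alpha[n-1][e] for e in e_syms)
            if total == 0.0:
                continue
            for t in range(n):
                for e in e_syms:
                    counts[(e, seq[t])] += alpha[t][e] * beta[t][e] / total
        for e in e_syms:                                    # M-step: renormalize per e
            z = sum(counts[(e, j)] for j in j_syms)
            if z > 0.0:
                for j in j_syms:
                    channel[(e, j)] = counts[(e, j)] / z
    return channel

# toy inventories and data (illustrative only)
e_syms = ['K', 'S', 'AA']
bigram = defaultdict(lambda: 1e-6, {('<s>', 'K'): 0.6, ('<s>', 'S'): 0.4,
                                    ('K', 'AA'): 0.9, ('S', 'AA'): 0.9,
                                    ('AA', 'K'): 0.5, ('AA', 'S'): 0.5})
channel = defaultdict(float)
for e in e_syms:
    for j in ['k', 's', 'a']:
        channel[(e, j)] = 1.0 / 3                           # uniformly-weighted start
trained = em_train_channel([['k', 'a'], ['s', 'a', 'k', 'a']], e_syms, bigram, channel)
stretched = {pair: p ** 3 for pair, p in trained.items()}   # cube before decoding

The trained table plays the role of WFST C; cubing its values mirrors the probability-stretching step described above.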
Baseline P(e) Model
Clearly, we can design P(e) in a number of ways. We might expect that the more the
system knows about English, the better it will be able to decipher the Japanese. Our
baseline P(e) is a 2-gram phoneme model trained on phoneme sequences from the CMU
dictionary. The second row (2a) in Figure 5.4 shows results when we decipher with this
fixed P(e). This approach performs poorly and gets all the Senator names wrong.
Consonant Parity
When training under non-parallel conditions, we find that we would like to keep our WFST C model small, rather than instantiating a fully-connected model. In the supervised case, parallel training allows the trained model to retain only those connections which were observed in the data, and this helps eliminate many bad connections from the model. In the unsupervised case, there is no parallel data available to help us make the right choices.
We therefore use prior knowledge and place a consonant-parity constraint on the WFST C model. Prior to EM training, we throw out any mapping from the P(j|e) substitution model that does not have the same number of English and Japanese consonant phonemes. This is a pattern that we observe across a range of transliteration tasks. Here are examples of mappings where consonant parity is violated:
K => a N => e e EH => s a EY => n
Modifying the WFST C in this way leads to better decipherment tables and better results
for the U.S. Senator task. Normalized edit distance drops from 100 to just under 90 (row
2b in Figure 5.4).
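A minimal sketch of the consonant-parity filter follows; the consonant inventories here are illustrative partial lists, not the full English and Japanese phoneme sets.

EN_CONSONANTS = {'K', 'N', 'S', 'T', 'R', 'M', 'B', 'P', 'L', 'G', 'D', 'HH', 'SH', 'Z'}
JP_CONSONANTS = {'k', 'n', 's', 't', 'r', 'm', 'b', 'p', 'g', 'd', 'sh', 'z', 'j', 'w', 'y'}

def consonant_count(phonemes, consonants):
    return sum(1 for p in phonemes if p in consonants)

def parity_ok(e_phoneme, j_phonemes):
    # keep a candidate mapping e -> j1 [j2] only if both sides contain the
    # same number of consonant phonemes
    return (consonant_count([e_phoneme], EN_CONSONANTS) ==
            consonant_count(j_phonemes, JP_CONSONANTS))

# prune an initially dense mapping table before EM training
candidates = [('K', ['k', 'u']), ('K', ['a']), ('N', ['e', 'e']), ('EH', ['e'])]
pruned = [(e, j) for (e, j) in candidates if parity_ok(e, j)]
# keeps K -> k u and EH -> e; drops the parity-violating K -> a and N -> e e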
Better English Models
We observe that better knowledge about English during training results in the generation
of more English-like decipherments. Moving from a 2-gram to 3-gram English phoneme
model for P(e) produces considerable improvements in accuracy (row 2c in Figure 5.4).
Also, when using phoneme n-gram P(e) models for decipherment, we find that many of the Japanese phoneme test sequences are decoded into English phoneme sequences (such as "IH K R IH N" and "AE G M AH N") that are not valid words. To help the phonemic substitution model automatically learn what constitutes a globally valid English sequence,
we build a word-based P(e) from English phoneme sequences in the CMU dictionary
and use this model for decipherment training (details are described in (Ravi & Knight,
2009b)). Using the word-based English phoneme model produces the best result so far
on the phonemic substitution task with non-parallel data (row 2d in Figure 5.4). It gets
23 out of 100 Senator names exactly right, with a much lower normalized edit distance
(57.2). We have managed to achieve this performance using only monolingual data.
To summarize, the quality of the English phoneme model used in decipherment training has a large effect on the learnt P(j|e) phonemic substitution table (i.e., probabilities for the various phoneme mappings within the WFST C model), which in turn affects the quality of the back-transliterated English output produced when decoding Japanese.
5.5 Experiments and Results
Figure 5.4 shows all our results on the name transliteration task and compares the performance under different training conditions: (1) using parallel data (row 1), and (2) using only monolingual data (rows 2a-2f). Rows 2a-2d show the performance improvement (i.e., decrease in whole-name error as well as normalized edit distance) that we achieve when transliterating without parallel data using the methods described here.
[Figure 5.5 content: a table whose text was not extracted cleanly (and is repeated four times in the extraction). The recoverable structure has columns for the original name, the correct answer, the output of parallel (phonemic) training (Knight & Graehl, 1998), and the outputs of Decipherment Methods 1-3, with rows for Senator names including SPENCER ABRAHAM, DANIEL AKAKA, WAYNE ALLARD, MAX BAUCUS, BOB BENNETT, JOSEPH BIDEN, JEFF BINGAMAN, and FRANK LAUTENBERG.]
Figure 5.5: Results for end-to-end name transliteration. This figure shows the correct answer, the answer obtained by training mappings on parallel data (Knight & Graehl, 1998), and various answers obtained by deciphering non-parallel data. Method 1 uses a 2-gram P(e), Method 2 uses a 3-gram P(e), and Method 3 uses a word-based P(e).
Sample end-to-end transliterations are illustrated in Figure 5.5. The figure shows how
the transliteration results from non-parallel training improve steadily as we use stronger
decipherment techniques. We note that in one case (LAUTENBERG), the decipherment
mapping table leads to a correct answer where the mapping table derived from parallel
data does not. Because parallel data is limited, some of the necessary mappings may
occur with very low probability or may not be seen at all in the data.
Improving Decipherment Results Further
In (Ravi & Knight, 2009b), we show that varying some other factors within the decipher-
ment process can yield additional improvements in transliteration accuracy. The last two
rows in Figure 5.4 summarize a few of these techniques which lead to better end-to-end
transliterations: (a) using better initialization weights for the P(j|e) model prior to EM
training, and (b) using more English data to build the word-based P(e) model. The last
row in Figure 5.4 shows our best transliteration results on the Senator task with non-
parallel data, getting 34 out of 100 Senator names exactly right (66% whole-name errors,
and 49.3 word edit distance error). This also puts us within reach of the parallel-trained
system's performance (40% whole-name errors, and 25.9 word edit distance error) without
using a single English/Japanese pair for training.
5.6 Comparable versus Non-Parallel Corpora
We also present decipherment results when using comparable corpora for training the
WFST C model. We use English and Japanese phoneme sequences derived from a par-
allel corpus containing 2,683 phoneme sequence pairs to construct comparable corpora
(such that for each Japanese phoneme sequence, the correct back-transliterated phoneme
sequence is present somewhere in the English data) and apply the same decipherment
strategy using a word-based English model. The table below compares the translitera-
tion results for the U.S. Senator task, when using comparable versus non-parallel data
for decipherment training. While training on comparable corpora does have benets and
reduces the whole-name error to 59% on the Senator task, it is encouraging to see that
our best decipherment result using only non-parallel data comes close (66% error).
English/Japanese Corpora (# of phoneme sequences) | Whole-name error | Normalized word edit distance
Comparable Corpora (English = 2608, Japanese = 2455) | 59 | 41.8
Non-Parallel Corpora (English = 98000, Japanese = 9350) | 66 | 49.3
5.7 Conclusion
We have presented a method for attacking machine transliteration problems without
parallel data. We developed phonemic substitution tables trained using only monolingual
resources and demonstrated their performance in an end-to-end name transliteration task.
We showed that consistent improvements in transliteration performance are possible with
the use of strong decipherment techniques, and our best system (66% whole-name error,
and 49.3 word edit distance) achieves significant improvements over the baseline system.
Chapter 6
Deciphering Foreign Language: Statistical Language
Translation without Parallel Data
So far, we have progressed from tackling simple decipherment problems, like letter substitution
ciphers, to increasingly complex tasks like unsupervised tagging, word alignment and
machine transliteration. We now turn our attention to a much more complex task|
automatic language translation (or machine translation, MT).
Machine translation is a fundamental problem in NLP, one that has received consider-
able attention from the research community since the early 1950's. Current state-of-the-
art MT systems require huge amounts of (parallel) bilingual data for training translation
systems. But such data does not exist for all language pairs and domains. Using hu-
man annotation to create new bilingual resources is not a scalable solution. If we can
develop unsupervised techniques to decipher MT using non-parallel data, we can make
huge progress.
In this chapter, we tackle the task of machine translation (MT) without parallel
training data (Ravi & Knight, 2011b). We frame the MT problem as a decipherment
task, treating the foreign text as a cipher for English and present novel methods for
training translation models from non-parallel text. It may seem unintuitive to frame this
as a decipherment problem (i.e., treat foreign-language text as a code for English), but
the principles are the same as for letter and phoneme substitution decipherment. This is a
much harder decipherment task than previous ones, and it poses new technical challenges:
(1) scalability due to large corpora sizes and huge translation tables, (2) non-determinism
in translation mappings (a word in English can have multiple translations in Spanish, and
vice-versa), and (3) re-ordering of words or phrases (e.g., "el (THE) año (YEAR) pasado (LAST)" in Spanish translates to "THE LAST YEAR" in English). To tackle these
problems, we design new, efficient decipherment strategies and combine them with some of our previous ideas; for example, using linguistic knowledge to fix some connections a priori is an effective strategy to navigate the huge search space and search for small-sized models.
We perform extensive evaluations on several test corpora comparing the translation
output quality from various MT systems and show that our methods achieve good results
compared to parallel-trained systems. We also empirically study the "worthiness" of parallel versus non-parallel data for MT (in terms of accuracies achieved). Our work is the first successful attempt at "MT without the use of bilingual resources" in the history of statistical language translation. Initial results are encouraging, suggesting that decipherment-based approaches hold promise and could potentially benefit future
MT research in a big way. As successful work develops along this line, we expect more
domains and language pairs to be conquered by MT.
6.1 Introduction
Bilingual corpora are a staple of statistical machine translation (SMT) research. From
these corpora, we estimate translation model parameters: word-to-word translation ta-
bles, fertilities, distortion parameters, phrase tables, syntactic transformations, etc. Start-
ing with the classic IBM work (Brown et al., 1993), training has been viewed as a max-
imization problem involving hidden word alignments (a) that are assumed to underlie
observed sentence pairs (e;f):
arg max_θ ∏_{e,f} P_θ(f|e)    (6.1)

= arg max_θ ∏_{e,f} Σ_a P_θ(f, a|e)    (6.2)

Brown et al. (1993) give various formulas that boil P_θ(f, a|e) down to the specific parameters to be estimated.
Of course, for many language pairs and domains, parallel data is not available. In
this work, we address the problem of learning a full translation model from non-parallel
data, and we use the learned model to translate new foreign strings.
How can we learn a translation model from non-parallel data? Intuitively, we try to
construct translation model tables which, when applied to observed foreign text, consis-
tently yield sensible English. This is essentially the same approach taken by cryptanalysts
and epigraphers when they deal with source texts.
In our case, we observe a large number of foreign strings f, and we apply maximum
likelihood training:
arg max_θ ∏_f P_θ(f)    (6.3)

Following Weaver (1955), we imagine that this corpus of foreign strings "is really written in English, but has been coded in some strange symbols," thus:

arg max_θ ∏_f Σ_e P(e) · P_θ(f|e)    (6.4)
The variable e ranges over all possible English strings, and P (e) is a language model
built from large amounts of English text that is unrelated to the foreign strings. Re-
writing for hidden alignments, we get:
arg max_θ ∏_f Σ_e P(e) Σ_a P_θ(f, a|e)    (6.5)

Note that this formula has the same free P_θ(f, a|e) parameters as expression (6.2).
We seek to manipulate these parameters in order to learn the same full translation model.
We note that for each f, not only is the alignment a still hidden, but now the English
translation e is hidden as well.
A language model P (e) is typically used in SMT decoding (Koehn, 2009), but here
P (e) actually plays a central role in training translation model parameters. To distinguish
the two, we refer to Equation (6.5) as decipherment, rather than decoding. Figure 6.1
compares the framework for training MT systems with and without parallel training data.
[Figure 6.1 content, recovered from the extraction: the left panel ("MT using parallel data") shows an English-to-Spanish translation model P(f|e) trained from aligned sentence pairs such as "LAST YEAR" / "el año pasado", "THIS WEEK" / "esta semana", "THE CURRENT FISCAL YEAR" / "el año fiscal en curso"; training chooses parameters θ to maximize the probability of the observed (e, f) pairs, argmax_θ ∏_{e,f} P_θ(f|e). The right panel ("MT without parallel data") shows only a monolingual Spanish corpus (e.g., "Por cuarto año consecutivo", "el último trimestre", "el fin de semana") plus a separate English corpus for language model training (e.g., "TWO YEARS AGO", "THE LATEST QUARTER", "RECENT WEEKS"); training chooses θ to maximize the probability of the observed foreign text, argmax_θ ∏_f Σ_e P(e) · P_θ(f|e).]
Figure 6.1: MT with/without parallel data
The contributions of this work are:
We give first results for training a full translation model from non-parallel text,
and we apply the model to translate previously-unseen text. This work is thus
distinguished from prior work on extracting or augmenting partial lexicons using
non-parallel corpora (Rapp, 1995; Fung & McKeown, 1997; Koehn & Knight, ;
Haghighi et al., 2008). It also contrasts with self-training (McClosky et al., 2006),
which requires a parallel seed and often does not engage in iterative maximization.
We develop novel methods to deal with large-scale vocabularies inherent in MT
problems.
6.2 Previous Work
MT using comparable or non-parallel corpora: There has been some prior work
on extracting bilingual lexical connections from non-parallel or comparable corpora for
different language pairs (Fung, 1995; Rapp, 1995; Koehn & Knight, ; Koehn & Knight,
2001; Munteanu & Marcu, 2002; Haghighi et al., 2008). There is also some previous
work on mining parallel sentence pairs from comparable corpora for MT training
(Munteanu et al., 2004). In contrast, we tackle a much more challenging task of training
a full translation model from non-parallel text without the use of any bilingual seed
lexicons. We discuss the challenges involved and present novel methods and results on
this task in Section 6.4.
6.3 Word Substitution Decipherment
Before we tackle machine translation without parallel data, we first solve a simpler problem: word substitution decipherment. Here, we do not have to worry about hidden alignments since there is only one alignment.
6.3.1 Problem Formulation
In a word substitution cipher, every word in the natural language (plaintext) sequence is
substituted by a cipher token, according to a substitution key. The key is deterministic: there exists a 1-to-1 mapping between cipher units and the plaintext words they encode.
For example, the following English plaintext sequences:
I SAW THE BOY .
THE BOY RAN .
may be enciphered as:
xyzz fxyy crqq tmnz lxwz
crqq tmnz gdxx lxwz
according to the key:
THE → crqq, SAW → fxyy, RAN → gdxx,
. → lxwz, BOY → tmnz, I → xyzz
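A tiny sketch of the enciphering direction, using the example key above (decipherment works in the opposite direction, with the key unknown):

key = {'THE': 'crqq', 'SAW': 'fxyy', 'RAN': 'gdxx',
       '.': 'lxwz', 'BOY': 'tmnz', 'I': 'xyzz'}

def encipher(plaintext_tokens, key):
    # deterministic, 1-to-1 word substitution
    return [key[w] for w in plaintext_tokens]

print(encipher(['I', 'SAW', 'THE', 'BOY', '.'], key))
# ['xyzz', 'fxyy', 'crqq', 'tmnz', 'lxwz']
# the decipherment task: recover the plaintext given only the cipher tokens,
# an English language model, and no knowledge of `key`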
The goal of word substitution decipherment is to guess the original plaintext from
given cipher data without any knowledge of the substitution key. The table below compares the key characteristics of word substitution ciphers (Word sub) with the ciphers discussed previously.
Key Characteristics | Simple sub | Homophonic sub | Unsupervised POS Tagging | Word / Sub-word Alignment | Phonetic sub (Machine Transliteration) | Word sub
Is the key deterministic in the enciphering direction? | Yes | No | No | No | No | Yes
Is the key deterministic in the deciphering direction? | Yes | Yes | No | No | No | Yes
Does the key substitute one-for-one (symbol for symbol) in the enciphering direction? | Yes | Yes | Yes | No | No | Yes
Does the key substitute one-for-one in the deciphering direction? | Yes | Yes | Yes | No | No | Yes
What linguistic unit is substituted? | Letter / Syllable | Letter | Word | Morpheme / Word / Phrase | Phoneme(s) | Word
Does the key involve transposition (re-ordering)? | No | No | No | Yes | No | No
Other properties of the key that are specific to the problem | | | 1) Key "hints" are provided in the form of a dictionary | 1) Parallel data is available as source/target language sentence pairs | | 1) Large plaintext / ciphertext vocabulary sizes (10^2-10^6 word types); 2) No LM data is provided
Word substitution decipherment is a good test-bed for unsupervised statistical NLP techniques for two reasons: (1) we face large vocabularies and corpora sizes typically seen in large-scale MT problems, so our methods need to scale well, and (2) similar decipherment techniques can be applied to solving NLP problems such as unsupervised part-of-speech tagging. There has been no prior work reported in the literature on solving word substitution ciphers automatically. Here, we show for the first time novel methods for attacking word substitution ciphers on a large scale.
Probabilistic decipherment: Our decipherment method follows a noisy-channel approach. We first model the process by which the ciphertext sequence c = c_1...c_n is generated. The generative story for decipherment is described here:
1. Generate an English plaintext sequence e = e_1...e_n, with probability P(e).
2. Substitute each plaintext word e_i with a ciphertext token c_i, with probability P_θ(c_i|e_i), in order to generate the ciphertext sequence c = c_1...c_n.
We model P(e) using a statistical word n-gram English language model (LM). During decipherment, our goal is to estimate the channel model parameters θ. Re-writing Equations (6.3) and (6.4) for word substitution decipherment, we get:

arg max_θ ∏_c P_θ(c)    (6.6)

= arg max_θ ∏_c Σ_e P(e) · ∏_{i=1}^{n} P_θ(c_i|e_i)    (6.7)
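The objective in Equation (6.7) can be illustrated by brute force on a toy cipher; the vocabulary, language model values and channel initialization below are made up for illustration, and real instances are far too large to enumerate this way.

from itertools import product

plain_vocab = ['THE', 'BOY', 'RAN']
bigram = {('<s>', 'THE'): 0.6, ('THE', 'BOY'): 0.7, ('BOY', 'RAN'): 0.5}  # others ~0

def lm_prob(e):
    p, prev = 1.0, '<s>'
    for w in e:
        p *= bigram.get((prev, w), 1e-4)
        prev = w
    return p

def corpus_likelihood(cipher, channel):
    # P_theta(c) = sum over all plaintext sequences e of P(e) * prod_i P_theta(c_i | e_i)
    total = 0.0
    for e in product(plain_vocab, repeat=len(cipher)):
        p = lm_prob(e)
        for ei, ci in zip(e, cipher):
            p *= channel.get((ci, ei), 0.0)
        total += p
    return total

cipher = ['crqq', 'tmnz', 'gdxx']
channel = {(c, e): 1.0 / len(plain_vocab) for c in set(cipher) for e in plain_vocab}
print(corpus_likelihood(cipher, channel))
# EM (or Bayesian inference) adjusts `channel` to make this quantity as large as possible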
Challenges: Unlike letter substitution ciphers (having only 26 plaintext letters), here
we have to deal with large-scale vocabularies (10k-1M word types) and corpora sizes
(100k cipher tokens). This poses some serious scalability challenges for word substitution
decipherment.
We propose novel methods that can deal with these challenges effectively and solve word substitution ciphers:
1. EM solution: We would like to use the Expectation Maximization (EM) algorithm (Dempster et al., 1977) to estimate θ from Equation (6.7), but EM training is not feasible in our case. First, EM cannot scale to such large vocabulary sizes (running the forward-backward algorithm for each iteration requires O(V^2) time). Secondly, we need to instantiate the entire channel and resulting derivation lattice before we can run EM, and this is too big to be stored in memory. So, we introduce a new training method (Iterative EM) that fixes these problems.
2. Bayesian decipherment: We also propose a novel decipherment approach using
Bayesian inference. Typically, Bayesian inference is very slow when applied to
such large-scale problems. Our method overcomes these challenges and does fast,
efficient inference using (a) a novel strategy for selecting sampling choices, and (b)
a parallelized sampling scheme.
In the next two sections, we describe these methods in detail.
6.3.2 Iterative EM Decipherment
We devise a method which overcomes the memory and running-time efficiency issues faced by EM. Instead of instantiating the entire channel model (with all its parameters), we iteratively train the model in small steps. The training procedure is described here:
1. Identify the top K frequent word types in both the plaintext and ciphertext data. Replace all other word tokens with Unknown. Now, instantiate a small channel with just (K+1)^2 parameters and use the EM algorithm to train this model to maximize the likelihood of the cipher data.
2. Extend the plaintext and ciphertext vocabularies from the previous step by adding the next K most frequent word types (so the new vocabulary size becomes 2K+1). Regenerate the plaintext and ciphertext data.
3. Instantiate a new (2K+1) x (2K+1) channel model. From the previous EM-trained channel, identify all the e→c mappings that were assigned a probability P(c|e) > 0.5. Fix these mappings in the new channel, i.e. set P(c|e) = 1.0. From the new channel, eliminate all other parameters e→c_j associated with the plaintext word type e (where c_j ≠ c). This yields a much smaller channel with size < (2K+1)^2. Retrain the new channel using the EM algorithm.
4. Go to Step 2 and repeat the procedure, extending the channel size iteratively in each stage.
Finally, we decode the given ciphertext c by using the Viterbi algorithm to choose the plaintext decoding e that maximizes P(e) · P_trained(c|e)^3, stretching the channel probabilities (Knight et al., 2006).
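A sketch of the vocabulary-restriction and channel-pruning logic behind this procedure is shown below; `em_train` stands in for an EM trainer supplied by the caller (for example, forward-backward training over the restricted data), and all names are illustrative.

from collections import Counter

def top_k_vocab(tokens, k):
    # Steps 1/2: keep the K most frequent word types; everything else -> UNK
    return {w for w, _ in Counter(tokens).most_common(k)}

def restrict(tokens, vocab, unk='UNK'):
    return [w if w in vocab else unk for w in tokens]

def prune_channel(trained_channel, threshold=0.5):
    # Step 3: fix high-confidence mappings and drop competing parameters.
    # trained_channel maps (e, c) -> P(c|e) from the previous EM stage.
    fixed = {e: c for (e, c), p in trained_channel.items() if p > threshold}
    new_channel = {}
    for (e, c), p in trained_channel.items():
        if e in fixed:
            if c == fixed[e]:
                new_channel[(e, c)] = 1.0      # fix this mapping
            # else: eliminate e -> c_j for c_j != c
        else:
            new_channel[(e, c)] = p            # keep the free parameters
    return new_channel

def iterative_em(plain_tokens, cipher_tokens, em_train, k=100, stages=5):
    # Grow both vocabularies by K types per stage, pruning the channel each time.
    # `em_train(plain, cipher, seed_channel)` is assumed to return a trained
    # {(e, c): P(c|e)} table.
    channel = None
    for stage in range(1, stages + 1):
        pv = top_k_vocab(plain_tokens, stage * k)
        cv = top_k_vocab(cipher_tokens, stage * k)
        channel = em_train(restrict(plain_tokens, pv),
                           restrict(cipher_tokens, cv),
                           prune_channel(channel) if channel else None)
    return channel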
6.3.3 Bayesian Decipherment
Now, we propose a novel decipherment approach using Bayesian learning. Our method holds several other advantages over the EM approach: (1) inference using smart sampling strategies permits efficient training, allowing us to scale to large data/vocabulary sizes, (2) incremental scoring of derivations during sampling allows efficient inference even when we use higher-order n-gram LMs, (3) there are no memory bottlenecks since the full channel model and derivation lattice are never instantiated during training, and (4) prior specification allows us to learn the skewed distributions that are useful here, since word substitution ciphers exhibit a 1-to-1 correspondence between plaintext and cipher types.
We use the same generative story as before for decipherment, except that we use Chinese Restaurant Process (CRP) formulations for the source and channel probabilities. We use an English word bigram LM as the base distribution (P_0) for the source model and specify a uniform P_0 distribution for the channel (Footnote 1). We perform inference using point-wise Gibbs sampling (Geman & Geman, 1984). We define a sampling operator that samples plaintext word choices for every cipher token, one at a time. Using the exchangeability property, we efficiently score the probability of each derivation in an incremental fashion. In addition, we make further improvements to the sampling procedure that make it faster.
Smart sample-choice selection: In the original sampling step, for each cipher token
we have to sample from a list of all possible plaintext choices (10k-1M English words).
Footnote 1: For word substitution decipherment, we want to keep the language model probabilities fixed during training, and hence we set the prior on that model to be high (10^4). We use a sparse Dirichlet prior for the channel (0.01). We use the output from Iterative EM decoding (using a 101 x 101 channel) as the initial sample and run the sampler for 2000 iterations. During sampling, we use a linear annealing schedule decreasing the temperature from 1 → 0.08.
There are 100k cipher tokens in our data, which means we have to perform 10^9 sampling operations to make one entire pass through the data. We then have to repeat this process for 2000 iterations. Instead, we now reduce our choices in each sampling step.
Say that our current plaintext hypothesis contains English words X, Y and Z at positions i-1, i and i+1 respectively. In order to sample at position i, we choose the top K English words Y ranked by P(X Y Z), which can be computed offline from a statistical word bigram LM. If this probability is 0 (i.e., X and Z never co-occurred), we randomly pick K words from the plaintext vocabulary. We set K = 100 in our experiments. This significantly reduces the sampling possibilities (10k-1M reduces to 100) at each step and allows us to scale to large plaintext vocabulary sizes without enumerating all possible choices at every cipher position. (Footnote 2)
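A small sketch of this candidate-selection step, assuming a word bigram LM stored as a dictionary (in practice the top-K lists can be precomputed offline); the table values are made up.

import random
from collections import defaultdict

def topk_candidates(X, Z, bigram, vocab, K=100):
    # restrict the Gibbs choices at position i to the K words Y maximizing
    # P(X Y Z) = P(Y|X) * P(Z|Y); fall back to a random sample of the vocabulary
    # when X and Z never co-occur through any Y
    scored = [(bigram.get((X, Y), 0.0) * bigram.get((Y, Z), 0.0), Y) for Y in vocab]
    scored = [(s, Y) for s, Y in scored if s > 0.0]
    if not scored:
        return random.sample(vocab, min(K, len(vocab)))
    scored.sort(reverse=True)
    return [Y for _, Y in scored[:K]]

bigram = defaultdict(float, {('THE', 'LAST'): 0.2, ('LAST', 'YEAR'): 0.3,
                             ('THE', 'NEXT'): 0.1, ('NEXT', 'YEAR'): 0.4})
vocab = ['LAST', 'NEXT', 'FISCAL', 'YEAR', 'THE']
print(topk_candidates('THE', 'YEAR', bigram, vocab, K=2))   # ['LAST', 'NEXT']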
Parallelized Gibbs sampling: Secondly, we parallelize our sampling step using a Map-Reduce framework. In the past, others have proposed parallelized sampling schemes for topic modeling applications (Newman et al., 2009). In our method, we split the entire corpus into separate chunks and we run the sampling procedure on each chunk in parallel. At the end of each sampling iteration, we combine the samples corresponding to each chunk and collect the counts of all events; this forms our cache for the next sampling iteration. In practice, we observe that the parallelized sampling run converges quickly and runs much faster than conventional point-wise sampling: for example, 3.1 hours (using 10 nodes) versus 11 hours for one of the word substitution experiments. We also notice a higher speedup when scaling to larger vocabularies. (Footnote 3)
Footnote 2: Since we now sample from an approximate distribution, we have to correct this with the Metropolis-Hastings algorithm. But in practice we observe that samples from our proposal distribution are accepted with probability > 0.99, so we skip this step.
Decoding the ciphertext: After the sampling run has finished, we choose the final sample and extract a trained version of the channel model P_θ(c|e) from this sample following our technique described in Chiang et al. (2010). We then use the Viterbi algorithm to choose the English plaintext e that maximizes P(e) · P_trained(c|e)^3.
6.3.4 Experiments and Results
Data: For the word substitution experiments, we use two corpora:
Temporal expression corpus containing short English temporal expressions such as
"THE NEXT MONTH", "THE LAST THREE YEARS", etc. The cipher data
contains 5000 expressions (9619 tokens, 153 word types). We also have access to
a separate English corpus (which is not parallel to the ciphertext) containing 125k
temporal expressions (242k word tokens, 201 word types) for LM training.
Transtac corpus containing full English sentences. The data consists of 10k cipher
sentences (102k tokens, 3397 word types); and a plaintext corpus of 402k English
sentences (2.7M word tokens, 25761 word types) for LM training. We use all the
cipher data for decipherment training but evaluate on the first 1000 cipher sentences.
Footnote 3: Type sampling could be applied on top of our methods to further optimize performance. But more complex problems like MT do not follow the same principles (1-to-1 key mappings) as seen in word substitution ciphers, which makes it difficult to identify type dependencies.
Method (decipherment accuracy, %) | Temporal expr. | Transtac 9k | Transtac 100k
0. EM with 2-gram LM | 87.8 | Intractable | Intractable
1. Iterative EM with 2-gram LM | 87.8 | 70.5 | 71.8
2. Bayesian with 2-gram LM | 88.6 | 60.1 | 80.0
   Bayesian with 3-gram LM | | | 82.5
Figure 6.2: Comparison of word substitution decipherment results using (1) Iterative EM, and (2) the Bayesian method. For the Transtac corpus, decipherment performance is also shown for different training data sizes (9k versus 100k cipher tokens).
The cipher data was originally generated from English text by substituting each En-
glish word with a unique cipher word. We use the plaintext corpus to build an English
word n-gram LM, which is used in the decipherment process.
Evaluation: We compute the accuracy of a particular decipherment as the percentage
of cipher tokens that were correctly deciphered from the whole corpus. We run the
two methods (Iterative EM
4
and Bayesian) and then compare them in terms of word
substitution decipherment accuracies.
Results: Figure 6.2 compares the word substitution results from Iterative EM and
Bayesian decipherment. Both methods achieve high accuracies, decoding 70-90% of the
two word substitution ciphers. Overall, Bayesian decipherment (with sparse priors) per-
forms better than Iterative EM and achieves the best results on this task. We also observe
Footnote 4: For Iterative EM, we start with a channel of size 101 x 101 (K=100) and in every pass we iteratively increase the vocabulary sizes by 50, repeating the training procedure until the channel size becomes 351 x 351.
C: 3894 9411 4357 8446 5433
O: a diploma that's good .
D: a fence that's good .
C: 8593 7932 3627 9166 3671
O: three families living here ?
D: three brothers living here ?
C: 6283 8827 7592 6959 5120 6137 9723 3671
O: okay and what did they tell you ?
D: okay and what did they tell you ?
C: 9723 3601 5834 5838 3805 4887 7961 9723 3174 4518 9067
4488 9551 7538 7239 9166 3671
O: you mean if we come to see you in the afternoon after five
you'll be here ?
D: i mean if we come to see you in the afternoon after thirty
you'll be here ?
...
Figure 6.3: Comparison of the original (O) English plaintext with output from Bayesian
word substitution decipherment (D) for a few sample cipher (C) sentences from the
Transtac corpus.
that both methods benefit from better LMs and more (cipher) training data. Figure 6.3
shows sample outputs from Bayesian decipherment.
6.4 Machine Translation as a Decipherment Task
We now turn to the problem of MT without parallel data. From a decipherment perspec-
tive, machine translation is a much more complex task than word substitution decipher-
ment and poses several technical challenges: (1) scalability due to large corpora sizes and
huge translation tables, (2) non-determinism in translation mappings (a word can have
multiple translations), (3) re-ordering of words or phrases, (4) a single word can translate
into a phrase, and (5) insertion/deletion of words.
Some of the key characteristics associated with the MT decipherment task are shown
here:
Key Characteristics | Simple sub | Homophonic sub | Unsupervised POS Tagging | Word / Sub-word Alignment | Phonetic sub (Machine Transliteration) | Word sub | Machine Translation
Is the key deterministic in the enciphering direction? | Yes | No | No | No | No | Yes | No
Is the key deterministic in the deciphering direction? | Yes | Yes | No | No | No | Yes | No
Does the key substitute one-for-one (symbol for symbol) in the enciphering direction? | Yes | Yes | Yes | No | No | Yes | No
Does the key substitute one-for-one in the deciphering direction? | Yes | Yes | Yes | No | No | Yes | No
What linguistic unit is substituted? | Letter / Syllable | Letter | Word | Morpheme / Word / Phrase | Phoneme(s) | Word | Word / Phrase
Does the key involve transposition (re-ordering)? | No | No | No | Yes | No | No | Yes
Other properties of the key that are specific to the problem | | | 1) Key "hints" are provided in the form of a dictionary | 1) Parallel data is available as source/target language sentence pairs | | 1) Large plaintext / ciphertext vocabulary sizes (10^2-10^6 word types); 2) No LM data is provided | 1) Involves word insertions, deletions, etc.; 2) Large plaintext / ciphertext vocabulary sizes (10^2-10^6 word types)
6.4.1 Problem Formulation
We formulate the MT decipherment problem as follows: given a foreign text f (i.e., foreign word sequences f_1...f_m) and a monolingual English corpus, our goal is to decipher the foreign text and produce an English translation.
Probabilistic decipherment: Unlike parallel training, here we have to estimate the translation model P_θ(f|e) parameters using only monolingual data. During decipherment training, our objective is to estimate the model parameters θ in order to maximize the probability of the foreign corpus f. From Equation (6.4) we have:

arg max_θ ∏_f Σ_e P(e) · P_θ(f|e)
For P (e), we use a word n-gram LM trained on monolingual English data. We then
estimate the parameters of the translation model P_θ(f|e) during training. Next, we present
two novel decipherment approaches for MT training without parallel data.
1. EM Decipherment: We propose a new translation model for MT decipherment
which can be efficiently trained using the EM algorithm.
2. Bayesian Decipherment: We introduce a novel method for estimating IBM
Model 3 parameters without parallel data, using Bayesian learning. Unlike EM,
this method does not face any memory issues and we use sampling to perform
efficient inference during training.
6.4.2 EM Decipherment
For the translation model P_θ(f|e), we would like to use a well-known statistical model such as IBM Model 3 and subsequently train it using the EM algorithm. But without parallel training data, EM training for IBM Model 3 becomes intractable due to (1) scalability and efficiency issues because of large-sized fertility and distortion parameter
tables, and (2) the resulting derivation lattices become too big to be stored in memory.
Instead, we propose a simpler generative story for MT without parallel data. Our
model accounts for (word) substitutions, insertions, deletions and local re-ordering during
the translation process but does not incorporate fertilities or global re-ordering. We
describe the generative process here:
1. Generate an English string e = e_1...e_l, with probability P(e).
2. Insert a NULL word at any position in the English string, with uniform probability.
3. For each English word token e_i (including NULLs), choose a foreign word translation f_i, with probability P_θ(f_i|e_i). The foreign word may be NULL.
4. Swap any pair of adjacent foreign words f_{i-1}, f_i, with probability P_θ(swap). We set this value to 0.1.
5. Output the foreign string f = f_1...f_m, skipping over NULLs.
We use the EM algorithm to estimate all the parameters θ in order to maximize the likelihood of the foreign corpus. Finally, we use the Viterbi algorithm to decode the foreign sentence f and produce an English translation e that maximizes P(e) · P_trained(f|e).
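A forward simulation of this generative story may make the five steps concrete; the toy translation table is illustrative, and `translate` stands in for a sampler that draws from P_θ(f|e) (EM training fits that table in the reverse direction).

import random

def generate_foreign(english, translate, p_swap=0.1):
    # forward simulation of the generative story (a sketch): insert a NULL,
    # translate each token (possibly to NULL), swap adjacent pairs with
    # probability p_swap, then drop NULLs
    e = list(english)
    e.insert(random.randrange(len(e) + 1), 'NULL')          # step 2: insert NULL
    f = [translate(tok) for tok in e]                        # step 3: word-for-word translation
    for i in range(1, len(f)):                               # step 4: local re-ordering
        if random.random() < p_swap:
            f[i - 1], f[i] = f[i], f[i - 1]
    return [w for w in f if w != 'NULL']                     # step 5: drop NULLs

# toy channel for illustration (not learned from data)
toy = {'THE': ['el'], 'LAST': ['pasado'], 'YEAR': ['año'], 'NULL': ['NULL', 'de']}
print(generate_foreign(['THE', 'LAST', 'YEAR'], lambda e: random.choice(toy[e])))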
Linguistic knowledge for decipherment: To help limit translation model size and deal with the data sparsity problem, we use prior linguistic knowledge. We use identity mappings for numeric values (for example, "8" maps to "8"), and we split nouns into morpheme units prior to decipherment training (for example, "YEARS" → "YEAR" "+S").
Whole-segment Language Models: When using word n-gram models of English for
decipherment, we find that some of the foreign sentences are decoded into sequences (such as "THANK YOU TALKING ABOUT ?") that are not good English. This stems from
the fact that n-gram LMs have no global information about what constitutes a valid
English segment. To learn this information automatically, we build a P (e) model that
only recognizes English whole-segments (entire sentences or expressions) observed in the
monolingual training data. We then use this model (in place of word n-gram LMs) for
decipherment training and decoding.
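A minimal sketch of such a whole-segment P(e): probability mass goes only to segments observed verbatim in the monolingual data, proportional to their frequency (the example segments are illustrative).

from collections import Counter

class WholeSegmentLM:
    # only segments seen verbatim in the monolingual corpus receive probability
    def __init__(self, segments):
        self.counts = Counter(tuple(s) for s in segments)
        self.total = sum(self.counts.values())

    def prob(self, e):
        return self.counts[tuple(e)] / self.total    # 0.0 for unseen segments

lm = WholeSegmentLM([['LAST', 'YEAR'], ['THIS', 'WEEK'], ['LAST', 'YEAR']])
print(lm.prob(['LAST', 'YEAR']))                              # 2/3
print(lm.prob(['THANK', 'YOU', 'TALKING', 'ABOUT', '?']))     # 0.0: not a valid whole segment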
6.4.3 Bayesian Method
Brown et al. (1993) provide an efficient algorithm for training the IBM Model 3 translation model when parallel sentence pairs are available. But we wish to perform IBM Model 3
training under non-parallel conditions, which is intractable using EM training. Instead,
we take a Bayesian approach.
Following Equation (6.5), we represent the translation model as P_θ(f, a|e) in terms of hidden alignments a. Recall the generative story for IBM Model 3 translation, which has the following formula:
P_θ(f, a|e) = ∏_{j=1}^{m} t_θ(f_j | e_{a_j}) · ∏_{i=1}^{l} n_θ(φ_i | e_i) · ∏_{j=1, a_j≠0}^{m} d_θ(a_j | i, l, m) · ∏_{i=0}^{l} φ_i! · (1 / φ_0!) · C(m − φ_0, φ_0) · p_1^{φ_0} · p_0^{m − 2φ_0}    (6.8)
The alignment a is represented as a vector; a_j = i implies that the foreign word f_j is produced by the English word e_i during translation.
Bayesian Formulation: Our goal is to learn the probability tables t (translation parameters), n (fertility parameters), d (distortion parameters), and p (English NULL word probabilities) without parallel data. In order to apply Bayesian inference for decipher-
ment, we model each of these tables using a Chinese Restaurant Process (CRP) formula-
tion. For example, to model the translation probabilities, we use the formula:
t_θ(f_j | e_i) = [α · P_0(f_j | e_i) + C_history(e_i, f_j)] / [α + C_history(e_i)]    (6.9)
where P_0 represents the base distribution (which is set to uniform) and C_history represents the count of events occurring in the history (cache). Similarly, we use CRP formulations for the other probabilities (n, d and p). We use sparse Dirichlet priors for all these models (i.e., low values for α) and plug these probabilities into Equation (6.8) to get P_θ(f, a|e).
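A small sketch of the cached probability in Equation (6.9) follows; the prior and base-distribution values are illustrative (the actual priors used are given in the footnotes).

from collections import Counter

class CRPTable:
    # t(f|e) = (alpha * P0(f|e) + count(e, f)) / (alpha + count(e)),
    # with counts taken from events in the current sample history (cache)
    def __init__(self, alpha, p0):
        self.alpha = alpha
        self.p0 = p0                      # base distribution, e.g. uniform
        self.pair = Counter()             # C_history(e, f)
        self.src = Counter()              # C_history(e)

    def prob(self, f, e):
        return (self.alpha * self.p0 + self.pair[(e, f)]) / (self.alpha + self.src[e])

    def observe(self, f, e, delta=1):     # add or remove an event from the cache
        self.pair[(e, f)] += delta
        self.src[e] += delta

t = CRPTable(alpha=0.01, p0=1.0 / 50000)  # sparse prior, uniform base over a 50k vocabulary
t.observe('pasado', 'LAST')
print(t.prob('pasado', 'LAST'))           # boosted by the cached event
print(t.prob('semana', 'LAST'))           # falls back to roughly P0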
Sampling IBM Model 3: We use point-wise Gibbs sampling to estimate the IBM
Model 3 parameters. The sampler is seeded with an initial English sample translation
and a corresponding alignment for every foreign sentence. We define several sampling
operators, which are applied in sequence one after the other to generate English samples
for the entire foreign corpus. Some of the sampling operators are described below:
TranslateWord(j): Sample a new English word translation for foreign word f_j, from all possibilities (including NULL).
SwapSegment(i_1, i_2): Swap the alignment links for English words e_{i_1} and e_{i_2}.
JoinWords(i_1, i_2): Eliminate the English word e_{i_1} and transfer its links to the word e_{i_2}.
During sampling, we apply each of these operators to generate a new derivation (e, a) for the foreign text f and compute its score as P(e) · P_θ(f, a|e). These small-change operators are similar to the heuristic techniques used for greedy decoding by Germann et al. (2001). But unlike the greedy method, which can easily get stuck, our Bayesian approach guarantees that once the sampler converges we will be sampling from the true posterior distribution.
As with Bayesian decipherment for word substitution, we compute the probability of each new derivation incrementally, which makes sampling efficient. We also apply blocked sampling on top of point-wise sampling: we treat all occurrences of a particular foreign sentence as a single block and sample a single derivation for the entire block. We also parallelize the sampling procedure (as described in Section 6.3.3). (Footnote 5)
Choosing the best translation: Once the sampling run finishes, we select the final sample and extract the corresponding English translations for every foreign sentence. This yields the final decipherment output.
6.4.4 MT Experiments and Results
Data: We work with the Spanish/English language pair and use the following corpora
in our MT experiments:
Time corpus: We mined English newswire text on the Web and collected 295k temporal expressions such as "LAST YEAR", "THE FOURTH QUARTER", "IN JAN 1968", etc. We first process the data and normalize numbers and names of months/weekdays; for example, "1968" is replaced with "NNNN", "JANUARY" with "[MONTH]", and so on. We then translate the English temporal phrases into Spanish using automatic translation software (Google Translate) followed by manual annotation to correct mistakes made by the software. We create the following splits out of the resulting parallel corpus:
TRAIN (English): 195k temporal expressions (7588 unique), 382k word tokens,
163 types.
Footnote 5: For Bayesian MT decipherment, we set a high prior value on the language model (10^4) and use sparse priors for the IBM 3 model parameters t, n, d, p (0.01, 0.01, 0.01, 0.01). We use the output from EM decipherment as the initial sample and run the sampler for 2000 iterations, during which we apply annealing with a linear schedule (2 → 0.08).
TEST (Spanish): 100k temporal expressions (2343 unique), 204k word tokens, 269
types.
OPUS movie subtitle corpus: This is a large open source collection of parallel cor-
pora available for multiple language pairs (Tiedemann, 2009). We downloaded the
parallel Spanish/English subtitle corpus which consists of aligned Spanish/English
sentences from a collection of movie subtitles. For our MT experiments, we se-
lect only Spanish/English sentences with frequency > 10 and create the following
train/test splits:
TRAIN (English): 19770 sentences (1128 unique), 62k word tokens, 411 word
types.
TEST (Spanish): 13181 sentences (1127 unique), 39k word tokens, 562 word types.
Both Spanish/English sides of TRAIN are used for parallel MT training, whereas
decipherment uses only monolingual English data for training LMs.
MT Systems: We build and compare different MT systems under two training scenarios:
1. Parallel training using: (a) MOSES, a phrase translation system (Koehn et al.,
2007) widely used in MT literature, and (b) a simpler version of IBM Model
3 (without distortion parameters) which can be trained tractably using the
strategy of Knight and Al-Onaizan (1998).
2. Decipherment without parallel data using: (a) EM method (from Section 6.4.2),
and (b) Bayesian method (from Section 6.4.3).
Method | Time expr. | OPUS subtitle
1a. Parallel training (MOSES), 2-gram LM | 5.6 (85.6) | 26.8 (63.6)
    Parallel training (MOSES), 5-gram LM | 4.7 (88.0) |
1b. Parallel training (IBM 3 without distortion), 2-gram LM | 10.1 (78.9) | 29.9 (59.6)
    Parallel training (IBM 3 without distortion), whole-segment LM | 9.0 (79.2) |
2a. Decipherment (EM), 2-gram LM | 37.6 (44.6) | 67.2 (15.3)
    Decipherment (EM), whole-segment LM | 28.7 (48.7) | 65.1 (19.3)
2b. Decipherment (Bayesian IBM 3), 2-gram LM | 34.0 (30.2) | 66.6 (15.1)
Figure 6.4: Comparison of Spanish/English MT performance on the Time and OPUS test corpora achieved by various MT systems trained under (1) parallel settings, (a) MOSES and (b) IBM 3 without distortion, and (2) decipherment settings, (a) EM and (b) Bayesian. The scores reported here are normalized edit distance values, with BLEU scores shown in parentheses.
Evaluation: All the MT systems are run on the Spanish test data and the quality of the resulting English translations is evaluated using two different measures: (1) normalized edit distance score (Navarro, 2001) (Footnote 6), and (2) BLEU (Papineni et al., 2002), a standard MT evaluation measure.
Results: Figure 6.4 compares the results of various MT systems (using parallel versus
decipherment training) on the two test corpora in terms of edit distance scores (a lower
score indicates a closer match to the gold translation). The figure also shows the corre-
sponding BLEU scores in parentheses for comparison (higher scores indicate better MT
output).
Footnote 6: When computing edit distance, we account for substitutions, insertions, deletions as well as local-swap edit operations required to convert a given English string into the (gold) reference translation.
We observe that even without parallel training data, our decipherment strategies
achieve MT accuracies comparable to parallel-trained systems. On the Time corpus,
the best decipherment (Method 2a in the figure) achieves an edit distance score of 28.7 (versus 4.7 for MOSES). Better LMs yield better MT results for both parallel and decipherment training; for example, using a segment-based English LM instead of a 2-gram
LM yields a 24% reduction in edit distance and a 9% improvement in BLEU score for
EM decipherment.
We also investigate how the performance of different MT systems varies with the size of the training data. Figure 6.5 plots the BLEU scores versus training sizes for different MT
systems on the Time corpus. Clearly, using more training data yields better performance
for all systems. However, higher improvements are observed when using parallel data in
comparison to decipherment training which only uses monolingual data. We also notice
that the scores do not improve much when going beyond 10,000 training instances for
this domain.
It is interesting to quantify the value of parallel versus non-parallel data for any given
MT task. In other words, "how much non-parallel data is worth how much parallel data in order to achieve the same MT accuracy?" Figure 6.5 provides a reasonable answer to this question for the Spanish/English MT task described here. We see that deciphering with 100k monolingual Spanish temporal expressions yields the same performance as training with around 200-500 parallel English/Spanish expressions. This is the first attempt at
such a quantitative comparison for MT and our results are encouraging. We envision that
further developments in unsupervised methods will help reduce this gap further.
Figure 6.5: Comparison of training data size versus MT accuracy in terms of BLEU score under different training conditions: (1) parallel training with (a) MOSES and (b) IBM Model 3 without distortion, and (2) decipherment without parallel data using the EM method (from Section 6.4.2).
6.5 Conclusion
Our work is the first attempt at doing MT without parallel data. We discussed several novel decipherment approaches for achieving this goal. Along the way, we developed efficient training methods that can deal with large-scale vocabularies and data sizes. (Footnote 7)
Footnote 7: This material is based in part upon work supported by the Defense Advanced Research Projects Agency (DARPA). Any opinion, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.
Chapter 7
Conclusions and Future Work
In this thesis, we introduced the decipherment framework for unsupervised learning. The
main idea behind this is that the ambiguities associated with several problems of inter-
est can be characterized by certain basic properties. For a given task, we first identify and model these properties (referred to as key operations) using the decipherment framework and subsequently develop algorithms to solve the task in an unsupervised manner. Our decipherment algorithms include (1) probabilistic methods (e.g., Expectation Maximization and Bayesian inference), (2) integer programming methods, and (3) other approaches including approximate optimization techniques.
Decipherment for Cryptanalysis: We showed the effectiveness of our algorithms on
cryptanalysis tasks such as solving simple letter substitution ciphers (Chapter 2) as well as
complex homophonic ciphers like the Zodiac-408 cipher (Chapter 3). Unlike conventional
cryptanalysis techniques, our methods require only minimal plaintext knowledge and yet
are able to solve these ciphers perfectly. In Chapter 2, we also used some of these tech-
niques to help us quantify the value of non-parallel data and, for the first time, empirically study theoretical decipherment bounds introduced by Claude Shannon in 1949.
Deciphering Natural Language without Labeled Data: We then showed that sim-
ilar decipherment principles can be used to model existing problems in natural language
processing. Depending on the key characteristics (nature of key operations) associated
with these tasks, we developed unsupervised decipherment algorithms which can solve
these tasks without using any labeled data. We applied these decipherment techniques to
tackle several fundamental tasks such as unsupervised part-of-speech tagging (Chapter 4,
Section 4.2), supertagging (Chapter 4, Section 4.3), word/sub-word alignment (Chapter 4,
Section 4.4), machine transliteration (Chapter 5) and machine translation (Chapter 6).
Minimized Models: To deal with the ambiguities inherent in many natural language
tasks, our decipherment algorithms (in Chapter 4) rely on a single underlying principle: "searching for minimized models" during unsupervised learning. This idea has proven to be an effective decipherment strategy which yields state-of-the-art results on many fundamental NLP problems: (1) sequence labeling tasks such as unsupervised part-of-speech tagging and supertagging, and (2) word/sub-word alignment tasks. Our empirical findings show that "minimizing model size" correlates highly with better solution quality. Although it has proven to be a beneficial decipherment strategy, minimizing model size is a difficult optimization problem. In Chapter 4, we presented several strategies to achieve model minimization: exact methods using integer programming as well as approximation
algorithms that can scale better to large datasets and complex problems.
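As a schematic illustration (simplified relative to the formulations in Chapter 4, with illustrative variable names), the model-minimization idea for tagging can be written as an integer program that picks one dictionary tag per word token while paying a unit cost for every distinct tag-bigram type it uses:

\begin{align*}
\min \quad & \sum_{t,t'} y_{t,t'} \\
\text{s.t.} \quad & \sum_{t \in \mathrm{dict}(w_i)} x_{i,t} = 1 \quad \text{for every word token } w_i,\\
& x_{i,t} + x_{i+1,t'} - 1 \le y_{t,t'} \quad \text{for all adjacent positions and tag pairs } (t,t'),\\
& x_{i,t},\; y_{t,t'} \in \{0,1\}.
\end{align*}

Here x_{i,t} = 1 means token w_i receives tag t, and y_{t,t'} = 1 whenever the tag bigram (t,t') is used anywhere in the chosen tagging, so the objective counts, and therefore minimizes, the number of distinct tag bigram types (the "model size").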
7.1 Contributions
To recap, here is a list of the major contributions of this thesis:
1. We presented novel methods for solving letter and syllable substitution ciphers, which provide accuracy improvements ranging from 10% to 64% over existing methods, and we compared our empirical results against Shannon's mathematical theory of decipherment.
2. We introduced the idea of using "small model sizes" for decipherment. Following this notion, we presented novel methods for solving unsupervised problems like part-of-speech (POS) tagging with a dictionary, supertagging with a dictionary, and word alignment. These methods explicitly search for small models during decipherment. For unsupervised POS tagging, the new proposed approach yields a very high 92.3% tagging accuracy for English, which is the best reported result so far on this task. On the supertagging task, we achieve a 3-4% improvement over existing state-of-the-art approaches on multiple languages. We also showed significant accuracy (f-measure) improvements, ranging from 9% to 63% over existing approaches, on word/sub-word alignment tasks.
3. We tackled phonetic ciphers and applied these techniques to an existing NLP task: machine transliteration. In contrast to current transliteration approaches (which use parallel data for training), we provided a decipherment approach that performs machine transliteration using only monolingual resources. The method is not constrained by the availability of parallel resources, and hence can work with any language pair. We showed that this method achieves good performance (26% lower accuracy than a parallel-trained system) on a standard Japanese/English name-transliteration task without using any parallel data.
4. We presented efficient decipherment algorithms that can scale to large vocabulary (and data) sizes and demonstrated their effectiveness on a word substitution decipherment task, achieving over 80% accuracy.
5. Finally, we developed novel decipherment techniques to tackle a more complex NLP task: machine translation without parallel corpora. We presented empirical results for automatic language translation in two different domains and showed that, unlike existing parallel-trained systems, our methods do not require bilingual resources for training but still yield good translations. In addition, we provided empirical studies which show how different factors (such as monolingual training data sizes, parallel versus non-parallel corpora, etc.) affect the decipherment process and the final translation accuracy.
At the beginning of this thesis, we posed several research challenges pertaining to the limitations of relying on labeled resources for training NLP systems. The key conclusion we have reached through this thesis work is that the decipherment framework described here can contribute significantly towards addressing these challenges. Using the models and algorithms presented here, we have shown that it is feasible to build state-of-the-art NLP systems without human annotation. Our decipherment algorithms are unsupervised and language-independent, thereby extending their benefits to multiple domains and languages. In addition, the framework allows one to incorporate prior linguistic knowledge, which may be useful for modeling certain tasks. Such knowledge may be added
as hard constraints (as in the integer programming framework) or as soft constraints (for
example, Bayesian sparse priors help in learning skewed model distributions) and can
yield further improvements in task accuracies.
For the statistical language translation task, we showed for the first time that it is possible to generate good translations without using any parallel data. Our initial results are encouraging, suggesting that decipherment-based approaches hold promise and may significantly impact future MT research. A key potential benefit of this research is to make machine translation cheaper and more widely available for new domains and language pairs.
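Stated schematically (and omitting the model details of Chapter 6; the notation here is generic), the decipherment view of translation chooses channel parameters that maximize the probability of the observed foreign text alone, marginalizing over the unseen English:

\[
\hat{\theta} \;=\; \arg\max_{\theta} \; \prod_{f} \sum_{e} P(e)\, P_{\theta}(f \mid e)
\]

where P(e) is a language model estimated from monolingual English data; no parallel sentence pairs enter the objective.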
To conclude, we believe this thesis work has opened up several interesting avenues for
future research and we list some of these in the following section.
7.2 Future Work
Letter Substitution Decipherment: In Chapters 2 and 3, we devised statistical natural language processing and machine learning techniques to solve cryptanalysis tasks such as letter substitution decipherment for simple as well as complex ciphers.
In the future, we envision interesting synergies between cryptography and natural language processing. It would be interesting to see whether an analytical framework (akin to Shannon's, described in Chapter 2) can be designed to answer questions such as "How much data is needed to get good results on a given natural language task?" Additionally, our decipherment techniques could prove useful to cryptographers and archaeologists, for tackling unsolved ciphers (e.g., the Zodiac-340 cipher) and aiding humans in deciphering lost or unknown languages and scripts (e.g., the famous Voynich manuscript).
Decipherment for Natural Language Tasks: Our work has shown that many unsupervised problems in NLP can be modeled using similar decipherment principles (substitution operations on letters or words); the main difference is that the search space is much larger for NLP tasks. For example, in part-of-speech tagging, decipherment involves substituting (cipher) words in a sentence with their appropriate (plaintext) syntactic categories such as Noun, Verb, Adjective, etc.
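As a schematic illustration of this analogy (the notation below is generic, not tied to a particular chapter), tagging a word sequence w_1 ... w_n under a tag dictionary amounts to the decipherment search

\[
\hat{t}_1 \ldots \hat{t}_n \;=\; \arg\max_{t_1 \ldots t_n} \; P(t_1 \ldots t_n) \prod_{i=1}^{n} P(w_i \mid t_i), \qquad t_i \in \mathrm{dict}(w_i),
\]

where the language model over tag sequences and the word-given-tag channel play the same roles as the letter language model and the substitution key in a classical cipher.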
With additional research, similar decipherment techniques may be used to solve other related problems in NLP (e.g., recognizing named entities within a text, or identifying the particular sense of a word used in a given context) and in other disciplines such as computational biology (e.g., DNA sequencing) and financial domains (e.g., extracting information about companies from financial newspapers). Many of these problems follow similar principles and can be modeled using the same key operations that we have discussed here.
Phonetic Decipherment (Transliteration): Phonetic substitution ciphers involve substitution operations at the level of phonemes (linguistic units of sound) instead of letters or words. We already saw an interesting NLP application for phonetic decipherment: transliteration (translating names and terms across languages).
Besides transliteration, several other problems fit the description of phonetic ciphers, thereby opening up interesting directions for future exploration: (1) in archaeological decipherment, to "make the text speak" without the benefit of a pronunciation dictionary; (2) in speech recognition, to decipher speech to text when we only have access to a speech broadcast and uncorrelated written text; and finally (3) in machine-aided linguistic discovery, for example to discover connections between languages which share similar sound patterns.
Language Translation as a Decipherment Task: Automatic language translation (or machine translation, MT) is one of the most fundamental problems in NLP. In Chapter 6, we successfully deciphered foreign (Spanish) text into English without using any parallel data. This is the first successful attempt at unsupervised MT in the history of statistical language translation that does not rely on any bilingual resources.
A key benefit of this research is to make machine translation cheaper and more widely available. Since language is one of the fundamental modes of human communication, building universal tools for automatic language translation can help bridge the language barrier between people across the globe. We foresee similar unsupervised methods being applied to many more language pairs in the future. Another important research avenue for this work will be domain adaptation: hybrid approaches that combine knowledge from general parallel data with decipherment models trained on domain-specific monolingual data can help improve translation for new domains. There is scope for future research in this area along several directions: more advanced unsupervised decipherment algorithms, syntax-based translation models for unsupervised MT (to transform English syntax trees into foreign strings), and hierarchical Bayesian models for decipherment.
Bibliography
Ayan, Necip F., & Dorr, Bonnie J. (2006). Going beyond AER: An extensive analysis of word alignments and their impact on MT. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL-COLING) (pp. 9-16). Sydney, Australia.
Baldridge, Jason (2008). Weakly supervised supertagging with grammar-informed initialization. Proceedings of the 22nd International Conference on Computational Linguistics (COLING) (pp. 57-64). Manchester, UK.
Banko, Michele, & Moore, Robert C. (2004). Part of speech tagging in context. Proceedings of the International Conference on Computational Linguistics (COLING) (pp. 556-561). Geneva, Switzerland.
Barron, Andrew, Rissanen, Jorma, & Yu, Bin (1998). The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743-2760.
Bauer, Friedrich L. (2006). Decrypted secrets: Methods and maxims of cryptology. Springer-Verlag.
Blunsom, Phil, Cohn, Trevor, Dyer, Chris, & Osborne, Miles (2009). A Gibbs sampler for phrasal synchronous grammar induction. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP) (pp. 782-790). Suntec, Singapore.
Bodrumlu, Tugba, Knight, Kevin, & Ravi, Sujith (2009). A new objective function for word alignment. Proceedings of the NAACL/HLT Workshop on Integer Programming for Natural Language Processing (pp. 28-35). Boulder, Colorado.
Bos, Johan, Bosco, Cristina, & Mazzei, Alessandro (2009). Converting a dependency treebank to a categorial grammar treebank for Italian. Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories (TLT8) (pp. 27-38). Milan, Italy.
Brown, Peter, Della Pietra, Vincent, Della Pietra, Stephen, & Mercer, Robert (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.
Chiang, David (2007). Hierarchical phrase-based translation. Computational Linguistics, 33(2):201-228.
Chiang, David, Graehl, Jonathan, Knight, Kevin, Pauls, Adam, & Ravi, Sujith (2010). Bayesian inference for finite-state transducers. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT) (pp. 447-455). Los Angeles, California.
Clark, Stephen, & Curran, James (2006). Partial training for a lexicalized-grammar parser. Proceedings of the Human Language Technology Conference of the NAACL, Main Conference (pp. 144-151). New York City, USA.
Clark, Stephen, & Curran, James (2007). Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4):493-552.
Creutz, Mathias, & Lagus, Krista (2002). Unsupervised discovery of morphemes. Proceedings of the ACL Workshop on Morphological and Phonological Learning (pp. 21-30). Morristown, NJ, USA.
Creutz, Mathias, & Lagus, Krista (2005a). Inducing the morphological lexicon of a natural language from unannotated text. Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR) (pp. 106-113). Espoo, Finland.
Creutz, Mathias, & Lagus, Krista (2005b). Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Publications in Computer and Information Science, Report A81, Helsinki University of Technology, March.
Dempster, Arthur P., Laird, Nan M., & Rubin, Donald B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.
Finkel, Jenny, Grenager, Trond, & Manning, Christopher (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 363-370). Michigan, USA.
Fraser, Alexander, & Marcu, Daniel (2006). Semi-supervised training for statistical word alignment. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL-COLING) (pp. 769-776). Sydney, Australia.
Fraser, Alexander, & Marcu, Daniel (2007a). Getting the structure right for word alignment: LEAF. Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 51-60). Prague, Czech Republic.
Fraser, Alexander, & Marcu, Daniel (2007b). Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293-303.
Fung, Pascale (1995). Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. Proceedings of the Third Annual Workshop on Very Large Corpora (pp. 173-183). Cambridge, Massachusetts, USA.
Fung, Pascale, & McKeown, Kathleen (1997). Finding terminology translations from non-parallel corpora. Proceedings of the Fifth Annual Workshop on Very Large Corpora (pp. 192-202). Beijing, China.
Galley, Michel, Hopkins, Mark, Knight, Kevin, & Marcu, Daniel (2004). What's in a translation rule? Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL) (pp. 273-280). Boston, Massachusetts, USA.
Ganesan, Ravi, & Sherman, Alan T. (1993). Statistical techniques for language recognition: An introduction and guide for cryptanalysts. Cryptologia, 17(4):321-366.
Garey, Michael R., & Johnson, David S. (1979). Computers and intractability: A guide to the theory of NP-completeness. John Wiley & Sons.
Geman, Stuart, & Geman, Donald (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721-741.
Germann, Ulrich, Jahr, Michael, Knight, Kevin, Marcu, Daniel, & Yamada, Kenji (2001). Fast decoding and optimal decoding for machine translation. Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (pp. 228-235). Toulouse, France.
Goldberg, Yoav, Adler, Meni, & Elhadad, Michael (2008). EM can find pretty good HMM POS-taggers (when given a good start). Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT). Columbus, OH, USA.
Goldberg, Yoav, & Elhadad, Michael (2008). Identification of transliterated foreign words in Hebrew script. Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing) (pp. 466-477). Haifa, Israel.
Goldsmith, John (2001). Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198.
Goldwasser, Dan, & Roth, Dan (2008a). Active sample selection for named entity transliteration. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT) Short Papers (pp. 53-56). Columbus, Ohio.
Goldwasser, Dan, & Roth, Dan (2008b). Transliteration as constrained optimization. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 353-362). Honolulu, Hawaii.
Goldwater, Sharon, & Griffiths, Thomas (2007). A fully Bayesian approach to unsupervised part-of-speech tagging. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 744-751). Prague, Czech Republic.
Graehl, Jonathan (1997). Carmel finite-state toolkit. http://www.isi.edu/licensed-sw/carmel.
Graff, David, & Finch, Rebecca (1994). Multilingual text resources at the Linguistic Data Consortium. Proceedings of the Workshop on Human Language Technology (pp. 18-22). New Jersey, USA.
Haghighi, Aria, Liang, Percy, Berg-Kirkpatrick, Taylor, & Klein, Dan (2008). Learning bilingual lexicons from monolingual corpora. Proceedings of the Annual Meeting of the Association for Computational Linguistics - Human Language Technologies (ACL-HLT) (pp. 771-779). Columbus, Ohio.
Hassan, Hany, Sima'an, Khalil, & Way, Andy (2009). A syntactified direct translation model with linear-time decoding. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1182-1191). Singapore.
Hermjakob, Ulf, Knight, Kevin, & Daumé III, Hal (2008). Name translation in statistical machine translation - learning when to transliterate. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT) (pp. 389-397). Columbus, Ohio.
Hockenmaier, Julia, & Steedman, Mark (2007). CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355-396.
Huang, Fei, Vogel, Stephan, & Waibel, Alex (2004). Improving named entity translation combining phonetic and semantic similarities. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL) (pp. 281-288). Boston, Massachusetts, USA.
Jakobsen, Thomas (1995). A fast method for cryptanalysis of substitution ciphers. Cryptologia, 19(3):265-274.
Joshi, Aravind (1988). Tree Adjoining Grammars. In D. Dowty, L. Karttunen and A. Zwicky (Eds.), Natural language parsing, 206-250. Cambridge: Cambridge University Press.
Karimi, Sarvnaz, Scholer, Falk, & Turpin, Andrew (2007). Collapsed consonant and vowel models: New approaches for English-Persian transliteration and back-transliteration. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL) (pp. 648-655). Prague, Czech Republic.
Klementiev, Alex, & Roth, Dan (2008). Named entity transliteration and discovery in multilingual corpora. In Learning machine translation. MIT Press.
Knight, Kevin, & Al-Onaizan, Yaser (1998). Translation with finite-state devices. In D. Farwell, L. Gerber and E. Hovy (Eds.), Machine translation and the information soup, vol. 1529 of Lecture Notes in Computer Science, 421-437. Springer Berlin / Heidelberg.
Knight, Kevin, & Graehl, Jonathan (1998). Machine transliteration. Computational Linguistics, 24(4):599-612.
Knight, Kevin, Nair, Anish, Rathod, Nishit, & Yamada, Kenji (2006). Unsupervised analysis for decipherment problems. Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL) (pp. 499-506). Sydney, Australia.
Knight, Kevin, & Yamada, Kenji (1999). A computational approach to deciphering unknown scripts. Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing (pp. 37-44). Maryland, USA.
Koehn, Philipp (2009). Statistical machine translation. Cambridge University Press.
Koehn, Philipp, Hoang, Hieu, Birch, Alexandra, Callison-Burch, Chris, Federico, Marcello, Bertoldi, Nicola, Cowan, Brooke, Shen, Wade, Moran, Christine, Zens, Richard, Dyer, Chris, Bojar, Ondřej, Constantin, Alexandra, & Herbst, Evan (2007). Moses: open source toolkit for statistical machine translation. Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (pp. 177-180). Prague, Czech Republic.
Koehn, Philipp, & Knight, Kevin (2000). Estimating word translation probabilities from unrelated monolingual corpora using the EM algorithm. Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence (pp. 711-715). Austin, Texas, USA.
Koehn, Philipp, & Knight, Kevin (2001). Knowledge sources for word-level translation models. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 27-35). Pittsburgh, PA, USA.
Kuo, Jin-Shea, Li, Haizhou, & Yang, Ying-Kuei (2006). Learning transliteration lexicons from the web. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL-COLING) (pp. 1129-1136). Sydney, Australia.
Li, Haizhou, Sim, Khe Chai, Kuo, Jin-Shea, & Dong, Minghui (2007). Semantic transliteration of personal names. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 120-127). Prague, Czech Republic.
Li, Haizhou, Zhang, Min, & Su, Jian (2004). A joint source-channel model for machine transliteration. Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL) (pp. 159-166). Barcelona, Spain.
Liang, Percy, Jordan, Michael I., & Klein, Dan (2010). Type-based MCMC. Proceedings of the Conference on Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 573-581). Los Angeles, California.
McClosky, David, Charniak, Eugene, & Johnson, Mark (2006). Effective self-training for parsing. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 152-159). New York City, USA.
Melamed, I. Dan (1997). A word-to-word model of translational equivalence. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics (pp. 490-497). Madrid, Spain.
Merialdo, Bernard (1994). Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155-171.
Munteanu, Dragos S., Fraser, Alexander, & Marcu, Daniel (2004). Improved machine translation performance via parallel sentence extraction from comparable corpora. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL) (pp. 265-272). Boston, Massachusetts, USA.
Munteanu, Dragos S., & Marcu, Daniel (2002). Processing comparable corpora with bilingual suffix trees. Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 289-295). Philadelphia, PA, USA.
Nagata, Masaaki, Saito, Teruka, & Suzuki, Kenji (2001). Using the web as a bilingual dictionary. Proceedings of the Workshop on Data-driven Methods in Machine Translation (pp. 1-8). Toulouse, France.
Navarro, Gonzalo (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88.
Newman, David, Asuncion, Arthur, Smyth, Padhraic, & Welling, Max (2009). Distributed algorithms for topic models. Journal of Machine Learning Research, 10:1801-1828.
Och, Franz J., Gildea, Daniel, Khudanpur, Sanjeev, Sarkar, Anoop, Yamada, Kenji, Fraser, Alex, Kumar, Shankar, Shen, Libin, Smith, David, Eng, Katherine, Jain, Viren, Jin, Zhen, & Radev, Dragomir (2004). A smorgasbord of features for statistical machine translation. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL) (pp. 161-168). Boston, Massachusetts, USA.
Och, Franz J., & Ney, Hermann (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.
Och, Franz J., & Ney, Hermann (2004). The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417-449.
Oh, Jong-Hoon, & Isahara, Hitoshi (2006). Mining the web for transliteration lexicons: Joint-validation approach. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (pp. 254-261). Washington, DC, USA.
Olson, Edwin (2007). Robust dictionary attack of short simple substitution ciphers. Cryptologia, 31(4):332-342.
Oranchak, David (2008). Evolutionary algorithm for decryption of monoalphabetic homophonic substitution ciphers encoded as constraint satisfaction problems. Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation (pp. 1717-1718). Atlanta, GA, USA.
Papineni, Kishore, Roukos, Salim, Ward, Todd, & Zhu, Wei-Jing (2002). Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (pp. 311-318). Philadelphia, Pennsylvania.
Peleg, Shmuel, & Rosenfeld, Azriel (1979). Breaking substitution ciphers using a relaxation algorithm. Communications of the ACM, 22(11):598-605.
Pollard, Carl, & Sag, Ivan (1994). Head-driven phrase structure grammar. CSLI/Chicago University Press, Chicago.
Quirk, Chris, Menezes, Arul, & Cherry, Colin (2005). Dependency treelet translation: syntactically informed phrasal SMT. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL) (pp. 271-279). Ann Arbor, Michigan.
Raghavendra, Udupa U., & Maji, Hemant K. (2006). Computational complexity of statistical machine translation. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL) (pp. 25-32). Trento, Italy.
Rapp, Reinhard (1995). Identifying word translations in non-parallel texts. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 320-322). Cambridge, Massachusetts, USA.
Ravi, Sujith, Baldridge, Jason, & Knight, Kevin (2010a). Minimized models and grammar-informed initialization for supertagging with highly ambiguous lexicons. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 495-503). Uppsala, Sweden.
Ravi, Sujith, & Knight, Kevin (2008). Attacking decipherment problems optimally with low-order n-gram models. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 812-819). Honolulu, Hawaii.
Ravi, Sujith, & Knight, Kevin (2009a). Attacking letter substitution ciphers with integer programming. Cryptologia, 33(4):321-334.
Ravi, Sujith, & Knight, Kevin (2009b). Learning phoneme mappings for transliteration without parallel data. Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT) (pp. 37-45). Boulder, Colorado.
Ravi, Sujith, & Knight, Kevin (2009c). Minimized models for unsupervised part-of-speech tagging. Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP) (pp. 504-512). Suntec, Singapore.
Ravi, Sujith, & Knight, Kevin (2009d). Probabilistic methods for a Japanese syllable cipher. Proceedings of the International Conference on the Computer Processing of Oriental Languages (ICCPOL) (pp. 270-281). Hong Kong, China.
Ravi, Sujith, & Knight, Kevin (2011a). Bayesian inference for Zodiac and other homophonic ciphers. To appear in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT).
Ravi, Sujith, & Knight, Kevin (2011b). Deciphering foreign language. To appear in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT).
Ravi, Sujith, Vaswani, Ashish, Knight, Kevin, & Chiang, David (2010b). Fast, greedy model minimization for unsupervised tagging. Proceedings of the 23rd International Conference on Computational Linguistics (COLING) (pp. 940-948). Beijing, China.
Schönhofen, Péter, Benczúr, András, Bíró, István, & Csalogány, Károly (2008). Cross-language retrieval with Wikipedia, 72-79. Springer-Verlag.
Schrijver, Alexander (1998). Theory of linear and integer programming. John Wiley & Sons.
Shannon, Claude E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27:379-423 and 623-656.
Shannon, Claude E. (1949). Communication theory of secrecy systems. Bell System Technical Journal, 28:656-715.
Sherif, Tarek, & Kondrak, Grzegorz (2007a). Bootstrapping a stochastic transducer for Arabic-English transliteration extraction. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 864-871). Prague, Czech Republic.
Sherif, Tarek, & Kondrak, Grzegorz (2007b). Substring-based transliteration. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 944-951). Prague, Czech Republic.
Smith, Noah A., & Eisner, Jason (2005). Contrastive estimation: Training log-linear models on unlabeled data. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 354-362). Ann Arbor, Michigan.
Snyder, Benjamin, & Barzilay, Regina (2008). Unsupervised multilingual learning for morphological segmentation. Proceedings of the Annual Meeting of the Association for Computational Linguistics - Human Language Technologies (ACL-HLT) (pp. 737-745). Columbus, Ohio.
Snyder, Benjamin, Barzilay, Regina, & Knight, Kevin (2010). A statistical model for lost language decipherment. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 1048-1057). Uppsala, Sweden.
Sproat, Richard, Tao, Tao, & Zhai, ChengXiang (2006). Named entity transliteration with comparable corpora. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (pp. 73-80). Sydney, Australia.
Steedman, Mark (2000). The syntactic process. Cambridge, MA: MIT Press.
Tao, Tao, Yoon, Su-Youn, Fister, Andrew, Sproat, Richard, & Zhai, ChengXiang (2006). Unsupervised named entity transliteration using temporal and phonetic correlation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 250-257). Sydney, Australia.
Taskar, Ben, Lacoste-Julien, Simon, & Klein, Dan (2005). A discriminative matching approach to word alignment. Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (pp. 73-80). Vancouver, British Columbia, Canada.
Tiedemann, Jörg (2009). News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova and R. Mitkov (Eds.), Recent advances in natural language processing, vol. V, 237-248. John Benjamins, Amsterdam/Philadelphia.
Toutanova, Kristina, & Johnson, Mark (2008). A Bayesian LDA-based model for semi-supervised part-of-speech tagging. Proceedings of the Advances in Neural Information Processing Systems (NIPS) (pp. 1521-1528). Vancouver, British Columbia, Canada.
Vogel, Stephan, Ney, Hermann, & Tillmann, Christoph (1996). HMM-based word alignment in statistical translation. Proceedings of the 16th Conference on Computational Linguistics (COLING) (pp. 836-841). Copenhagen, Denmark.
Weaver, Warren (1955). Translation (1949). Reproduced in W.N. Locke, A.D. Booth (eds.), Machine translation of languages, 15-23. MIT Press.
Wu, Dekai (1997). Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377-403.
Wu, Jian-Cheng, & Chang, Jason S. (2007). Learning to find English to Chinese transliterations on the web. Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 996-1004). Prague, Czech Republic.
Yoon, Su-Youn, Kim, Kyoung-Young, & Sproat, Richard (2007). Multilingual transliteration using feature based phonetic method. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 112-119). Prague, Czech Republic.
Abstract
Most state-of-the-art techniques used in natural language processing (NLP) are supervised and require labeled training data. For example, statistical language translation requires huge amounts of bilingual data for training translation systems. But such data does not exist for all language pairs and domains. Using human annotation to create new bilingual resources is not a scalable solution. This raises a key research challenge: How can we circumvent the problem of limited labeled resources for NLP applications? Interestingly, cryptanalysts and archaeologists have tackled similar challenges in solving "decipherment problems".