ACTIVE DATA ACQUISITION FOR BUILDING LANGUAGE MODELS FOR
SPEECH RECOGNITION
by
Abhinav Sethy
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2007
Copyright 2007 Abhinav Sethy
Epigraph
The three characteristics of a good programmer are impatience, laziness and hubris.
Larry Wall
Dedication
to my parents and brother.
Acknowledgments
I have spent six wonderful, productive years in SAIL and at USC during the course of my
PhD. There are many people to thank for their help, guidance, and support.
First of all, I am grateful to my advisor, Prof. Shrikanth Narayanan, for his guidance
in research and for the countless invaluable discussions we had about our work. His wisdom, diligence, and persistence have been a guiding light for me in many ways during my study here and will continue to inspire me in the future. Without his support and trust, I
could not have finished this thesis. I would also like to thank Prof. Panayiotis Georgiou,
for his encouragement and support to me during the course of my PhD research.
My thesis would not have been finished without the support from the LVCSR group
of IBM T.J. Watson research center. I thank Bhuvana Ramabhadran for believing in
my work and providing me the wonderful opportunity to collaborate on the TC-STAR
project. I am indebted for the support and insightful discussions that I received from
her.
I would like to thank Prof. Shrikanth Narayanan, Prof. Keith Jenkins, Prof. Dani
Byrd, Prof. Kevin Knight and Dr. Bhuvana Ramabhadran for serving as my thesis
committee.
I thank my fellow colleagues, Shankar Ananthakrishnan, Shiva Sundaram, Vivek
Rangarajan, Ozlem Kalinli, Erdem Unal, Jorge Silva, Tom Murray, Viktor Rozgic,
Selina Chu, and Abe Kazemzadeh for making the SAIL lab in RTH a fun place. I especially thank Shankar, Vivek, and Shiva for many insightful discussions on speech recognition, machine learning, and, more importantly, other interesting things in life. I would also like to acknowledge the help I received from SAIL alumni who were in the lab during my initial years. In particular, discussions with Naveen Shrinivasamurthy were very helpful in forming new research ideas.
During my graduate student life I have had the opportunity to meet many other
wonderful people who as friends made life fun and interesting. In particular, Pankaj Mishra, Kalpesh Solanki, Sachin Malhotra, Narayanan Sadagopan, Amol Bakshi, Mitali Singh, and Vaibhav Mathur gave me the best companionship I could hope for. I must also
acknowledge my best friend, Paola Virga, for showing me how colorful and fulfilling life
can be.
I am grateful to Gloria Halfacre in SIPI, and to Diane Demetras and Tim Boston in Electrical Engineering, for their professional help in making all administrative matters run smoothly.
My family has been the one and only reason I have come this far. My parents and my brother selflessly supported my study abroad and provided me comfort and encouragement all along. Without their support I could not have focused on my thesis study.
A.S.
Los Angeles, California
August 2007.
Table of Contents
Epigraph ii
Dedication iii
Acknowledgments iv
List of Tables x
List of Figures xiii
Abbreviations xiv
Abstract xvi
Chapter 1: Introduction 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Brief summary of the proposed approach . . . . . . . . . . . . . . . . 2
1.3 Contributions of this research . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Open Problems and limitations . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Chapter 2: Background Review 7
Chapter 3: Query generation 10
3.1 Computation of relative entropy . . . . . . . . . . . . . . . . . . . . . 12
3.2 Generating queries using Relative Entropy . . . . . . . . . . . . . . . . 14
3.3 Tracking queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.1 Evaluating goodness of a document . . . . . . . . . . . . . . . 16
3.3.2 Tracking URLS . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.3 Tracking queries . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4.1 Effect of initial seed set size . . . . . . . . . . . . . . . . . . . 17
3.4.2 Effect of query size and type . . . . . . . . . . . . . . . . . . . 18
3.5 LM convergence analysis using R.E . . . . . . . . . . . . . . . . . . . 19
3.5.1 Reference language models . . . . . . . . . . . . . . . . . . . 20
3.5.2 Convergence analysis results . . . . . . . . . . . . . . . . . . . 21
3.5.3 Sparseness of resampled LMs . . . . . . . . . . . . . . . . . . 23
3.6 ASR results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6.1 OOV rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6.2 WER results . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6.3 LM complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.7 LM convergence and ASR design . . . . . . . . . . . . . . . . . . . . 26
3.8 Other applications of recursive R.E computation . . . . . . . . . . . . 27
Chapter 4: Data Filtering 28
4.1 Web data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Preliminary data processing . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Data weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Feature Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.1 Document level features . . . . . . . . . . . . . . . . . . . . . 31
4.4.2 Utterance level scores . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Preliminary text cleanup . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.6 Data weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.6.1 Method 1 : Likelihood based weighting . . . . . . . . . . . . . 33
4.6.2 Method 2 : Positive only classification . . . . . . . . . . . . . 34
4.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.7.1 Effect of rejection model . . . . . . . . . . . . . . . . . . . . . 37
4.8 Rank and select methods for text filtering . . . . . . . . . . . . . . . . 38
4.9 Balanced data selection using relative entropy . . . . . . . . . . . . . . 39
4.9.1 The Core Algorithm . . . . . . . . . . . . . . . . . . . . . . . 40
4.9.2 Initialization and parameters . . . . . . . . . . . . . . . . . . . 42
4.9.3 Selection and randomization . . . . . . . . . . . . . . . . . . . 42
4.9.4 Further enhancements . . . . . . . . . . . . . . . . . . . . . . 43
4.10 Generalization to N-gram back off models . . . . . . . . . . . . . . . . 44
4.10.1 Fast Computation of Relative Entropy . . . . . . . . . . . . . . 45
4.10.2 Incremental updates on a n-gram model . . . . . . . . . . . . . 48
4.10.3 Relation to LM pruning using r.e . . . . . . . . . . . . . . . . . 51
4.11 Implementation details and data collection . . . . . . . . . . . . . . . . 52
4.12 Some intuition from simulations . . . . . . . . . . . . . . . . . . . . . 53
4.12.1 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . 55
4.13 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.13.1 Medium vocabulary ASR experiments on Transonics . . . . . . 58
4.13.2 Large vocabulary experiments on TC-STAR . . . . . . . . . . . 61
4.14 Discussion and analysis of results . . . . . . . . . . . . . . . . . . . . 65
Chapter 5: Boosting Coverage 67
5.1 Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Clustering using LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Unsupervised LM adaptation . . . . . . . . . . . . . . . . . . . . . . . 70
5.3.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4 Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.5 Using lattices and references . . . . . . . . . . . . . . . . . . . . . . . 76
5.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.6 TC-STAR open condition evaluation . . . . . . . . . . . . . . . . . . . 77
5.6.1 ASR system architecture . . . . . . . . . . . . . . . . . . . . . 78
5.6.2 Language modeling for the open condition . . . . . . . . . . . 82
Chapter 6: Hierarchical Speech Recognition 85
6.1 Acoustics based pronunciation modeling using syllables . . . . . . . . 87
6.2 Recognizer Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.1 Phone Recognizer . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.2 Syllable Recognizer . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.3 Larger Acoustic Units . . . . . . . . . . . . . . . . . . . . . . 91
6.3 Spoken name recognition systems . . . . . . . . . . . . . . . . . . . . 93
6.4 Training for spoken name recognition: Corpora and Implementation . . 95
6.4.1 Initial TIMIT system . . . . . . . . . . . . . . . . . . . . . . . 95
6.4.2 NAMES corpus . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.4.3 Bootstrapping from TIMIT . . . . . . . . . . . . . . . . . . . . 97
6.5 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.5.1 Preliminary TIMIT based Experiments . . . . . . . . . . . . . 97
6.5.2 Evaluation on the spoken name recognition task . . . . . . . . . 102
6.5.3 Syllables and pronunciation variation in names . . . . . . . . . 105
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Chapter 7: Conclusion and future work 108
7.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.1.1 Query Generation . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.1.2 Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.1.3 Other applications . . . . . . . . . . . . . . . . . . . . . . . . 111
References 112
List of Tables
3.1 Perplexity(PPL) of testdata for different sizes of initial seed data. Both
the models were merged with the baselm using linear interpolation . . . 18
3.2 ASR WER of testdata for different sizes of initial seed data. Both the
models were merged with the baselm using linear interpolation . . . . . 18
3.3 Effect of different number of query keywords and keyphrases on system
performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 R.E across different parts of speech for LM2 corresponding to artificial
training data size of 20M words . . . . . . . . . . . . . . . . . . . . . 23
3.5 R.E, perplexity and number of trigrams across LMs resampled from
increasing generated data for LM2. . . . . . . . . . . . . . . . . . . . 24
3.6 OOV rate with increasing data size. The vocabulary of the original LM
was 4970. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.7 Word error rate with increasing data size. The WER with the original
language model was 18.5%. . . . . . . . . . . . . . . . . . . . . . . . 25
3.8 R.E, perplexity and number of trigrams across LMs resampled from
increasing generated data. . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 Perplexity(PPL) and vocabulary size of the final system with and with-
out the rejection model. . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Perplexity of data selected with P_ind for varying number of selected sentences . . . 55
4.3 Relative entropy of models built from the selected data with the reference P_true distribution for varying number of sentences . . . 56
4.4 Perplexity of testdata with the web adapted model for different number
of initial sentences. Corpus size=150M . . . . . . . . . . . . . . . . . 59
4.5 Percentage of selected sentences for different number of initial in-domain
sentences. Corpus size=150M . . . . . . . . . . . . . . . . . . . . . . 60
4.6 Word Error Rate (WER) with web adapted models for different number
of initial sentences. Corpus size=150M . . . . . . . . . . . . . . . . . 61
4.7 Perplexity of testdata with the web adapted model for different number
of initial sentences. Corpus size=850M . . . . . . . . . . . . . . . . . 61
4.8 Word Error Rate (WER) with web adapted models for different number
of initial sentences. Corpus size=850M . . . . . . . . . . . . . . . . . 62
4.9 Percentage of selected sentences for different number of initial sen-
tences. Corpus size=850M . . . . . . . . . . . . . . . . . . . . . . . . 62
4.10 Number of estimated n-grams with web adapted models for different
number of initial sentences for the case with 40K in-domain sentences.
Corpus size=850M . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.11 Performance comparison of the language models built with different
fractions of data being selected for the Dev06 and Eval06 test sets. The
baseline had 525M words of fisher web data (U.Wash) and 204M words
of Broadcast News(BN) as out-of-domain data. The WER on Eval06
for the baseline was 8.9% and 11% on Dev06. . . . . . . . . . . . . . . 64
5.1 ASR WER of testdata with unsupervised adaptation and data acquisi-
tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 ASR WER of testdata with acoustic features included in data weighting. 77
6.1 Distribution of words and their syllable count for the training part of the
TIMIT corpus. Total number of words in this training corpus was 40000. 98
6.2 Distribution of words and their syllable count for the NAMES corpus.
Total number of words was 10000. . . . . . . . . . . . . . . . . . . . . 98
6.3 Distribution of syllables common to TIMIT and NAMES and their length.
Total number of common syllables is around 1200. This table does not
include single phone syllables. . . . . . . . . . . . . . . . . . . . . . . 98
6.4 TIMIT word recognition accuracy results with syllable and word level
units at different stages of reestimation after CD phone initialization
compared to baseline phone recognizer. . . . . . . . . . . . . . . . . . 100
6.5 TIMIT Word recognition accuracy results for syllable and word level
units with and without CD phone-based initialization after three reesti-
mations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.6 Word recognition accuracy for TIMIT and complexity in number of
states of syllable, word and hybrid lexicon recognizers . . . . . . . . . 101
6.7 Recognition rates for different FSG based spoken name recognition sys-
tems on the 6000 utterance test set . . . . . . . . . . . . . . . . . . . . 103
6.8 Spoken name recognition accuracy for the information retrieval scheme
after FSG rescoring of compacted name list . . . . . . . . . . . . . . . 105
List of Figures
3.1 R.E vs training data size for LM1 . . . . . . . . . . . . . . . . . . . . 21
3.2 R.E vs training data size for LM2 . . . . . . . . . . . . . . . . . . . . 22
3.3 Perplexity vs. training data size for LM1 . . . . . . . . . . . . . . . . 22
3.4 Perplexity vs. training data size for LM2 . . . . . . . . . . . . . . . . 23
4.1 System overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Relative entropy imbalance with number of selected sentences . . . . . 57
5.1 Overall System Architecture . . . . . . . . . . . . . . . . . . . . . . . 81
6.1 Initialization of the 9 state syllable m uw v . . . . . . . . . . . . . . . 91
6.2 Information retrieval scheme for name recognition provides scalability. 95
6.3 Plot of recognizer accuracy vs word list size for phoneme and syllable
recognizers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Abbreviations
ASR Automatic Speech Recognition
BLEU Bilingual Evaluation Understudy
LDA Latent Dirichlet Allocation
LDC Linguistic Data Consortium
LM Language Model
LSA Latent Semantic Analysis
NIST National Institute of Standards and Technology
NLP Natural Language Processing
OOV Out of Vocabulary
PLP Perceptual Linear Prediction
pLSI Probabilistic Latent Semantic Indexing
PPL Perplexity
r.e Relative Entropy
SAT Speaker Adaptive Training
SVD Singular Value Decomposition
TC-STAR Technology and Corpora for Speech to Speech Translation
VTLN Vocal Tract Length Normalization
WWW World Wide Web
Abstract
The ability to build task specific language models, rapidly and with minimal human
effort, is an important factor for fast deployment of natural language processing appli-
cations such as speech recognition in different domains. Although in-domain data is
hard to gather, we can utilize easily accessible large sources of generic text such as the
Internet (WWW) or the GigaWord corpus for building statistical task language models
by appropriate data selection and filtering methods. In this work I show that significant
improvements in language model performance can be achieved by simultaneously
boosting the coverage and relevance of the generic corpus.
The relevance of an adaptation corpus depends on the degree to which its style and
content matches the domain of interest. The mismatch between in-domain data and
generic corpora can be seen as a semi-supervised learning problem. We can model the
generic corpus as a mix of sentences from two classes: in-domain (I) and noise (N)
(or out-of-domain). The labels I and N are latent and unknown for the sentences in the
generic corpus, but we usually have a small number of examples of I from the limited
in-domain data. Selecting the right labels for the unlabeled set is important for ben-
efiting from it. I will show that similar to the question of balance in semi-supervised
learning for classification, there is a need to address the question of distributional sim-
ilarity while selecting the appropriate utterances for building a language model from
noisy data. The subset of sentences from generic corpora which are selected to build the adaptation language model should have a distribution similar to the in-domain data model.
To address the issue of distributional similarity, I will present an incremental algorithm
that compares the distribution of the selected set and the in-domain examples by using
a relative entropy (R.E) criterion. Experimental results are provided which show the
superiority of the proposed algorithm over existing schemes.
The coverage of an adaptation corpus is an indication of the degree to which it
covers the topics or styles implicit in the domain of interest. I will present methods that
use clustering for querying/merging to achieve significant performance improvements.
In some speech recognition applications, such as spoken document retrieval and automated call centers, a lot of untranscribed speech data is available. I will present methods that utilize hypotheses generated from this raw speech data in conjunction with the generic
corpus to build better language models by boosting coverage.
Chapter 1
Introduction
1.1 Significance of the Research
State of the art speech and Natural Language Processing (NLP) systems are data-driven,
relying on learning patterns from large amounts of appropriate data sources. A key
step in creating Automatic Speech Recognition (ASR) systems for different domains
and applications is to identify the appropriate resources for building language models
matched to that application or domain. In some cases text for the target domain might
be available from institutions such as Linguistic Data Consortium (LDC) and National
Institute of Standards and Technology (NIST). However, in most cases such data are not readily available and need to be collected manually. This imposes severe constraints in terms of both the system turnaround time and cost. To limit the effects of data
sparsity, a topic independent language model is often merged with a language model
generated from limited in-domain data to generate a smoothed topic specific language
model. However, this adaptation approach can only be viewed as a procedure to reduce
the effect of data sparsity and will likely give suboptimal results compared to having
good in-domain data.
In a variety of speech recognition tasks such as spoken document retrieval and auto-
mated call center applications, it is possible to acquire a lot of raw speech data even
though text might be difficult to collect. To use this speech data for building language
models we need to first convert it to text, which typically requires human annotators.
The annotation effort is costly and time consuming, thus making rapid and cost effec-
tive deployment of ASR systems difficult.
This research work presents a system for acquiring data from large generic text
resources such as the World Wide Web (WWW) and other large text collections such
as GigaWord. The proposed data acquisition strategy combines information retrieval
and text data filtering to gather a weighted text corpus that can be used to build better
speech recognition models given the initial in-domain text.
1.2 Brief summary of the proposed approach
Data acquisition from the WWW and other large text sources starts with building the
queries that will return the right data set. It is assumed that an initial representative
set of documents for the domain of interest is available. If this seed set is large, then
the data mining can be seen as an update to the already existing language model. This
is of special interest in cases where we need to handle new content such as in broad-
cast news applications[BM98]. However, if the initial set of documents is too small
to build a robust language model, then the acquired data is more critical for building a
good topic specific speech recognition system. In addition to the initial topic model, we
assume the existence of a generic topic independent language model. The two language
models, one topic dependent and the other topic independent (hereby referred to as the
background model) are used to generate search queries using a relative entropy (r.e)
measure [SRN04]. The topic model can be substituted or augmented with an adaptation
language model built from speech recognition transcripts of raw speech. The proposed
r.e method generates queries for which the retrieved text has a larger classifier margin
between the background and the topic models. After an initial set of downloads, the query terms and the URLs which gave the best data are identified and used in the next iteration. The evaluation of the goodness of the data is carried out using a set of features derived from the in-domain data.
Even with the best queries, a large section of the returned documents, or parts of
them, might not actually be relevant for the particular task and tend to be noisy. Thus
it becomes important to selectively weight the data gathered from these sources so as
to maximize the gains in terms of language model perplexity or ASR Word Error rate.
We present a relative entropy (r.e) based method to select relevant subsets of sentences
whose distribution in an n-gram sense matches the domain of interest. Using simula-
tions, we provide an analysis of how the proposed scheme outperforms filtering tech-
niques proposed in recent language modeling literature on mining text from the web. We
will also show that the proposed subset selection scheme leads to performance improve-
ments over state of the art systems in terms of both Word Error Rate (WER) and Per-
plexity (PPL).
An interesting question, in the scenario where potentially large amounts of training data can be obtained for building language models at low cost, is the rate at which the language model converges to the true distribution and how the convergence properties differ across different sets of words. The amount of data available to estimate a good language model for functional words is much larger than the corresponding data for content words. This disparity in convergence properties was highlighted by LM reestimation simulations carried out using the R.E measure as a distributional distance measure. Based on these experiments, it would seem that more complex acoustic models should be used for content words, especially names, since the acoustic models are relatively robust to differences in word distributions. A hierarchical acoustic modeling
scheme for recognition of names and function words is presented in chapter 6.
1.3 Contributions of this research
The research effort presented in this thesis differs from previous work on building language models from large resources for speech recognition in the following ways.
We introduce the use of R.E between language models as a measure for generating query terms that help identify data which is easier to classify as generic or topic specific. Previous research has used either manual query generation [BOS03], term ratio [NOH+05a], or the training text directly [SGG05]. As we will show, the R.E approach includes the term ratio in the query weighting process. Manual query generation can be very limited in the query terms it can generate, and using the training utterances directly is hard to generalize. We also identify good URLs and query terms based on the downloaded data in each iteration. This allows us to get cleaner and more focused data in an iterative fashion.
Performance of statistical n-gram language models depends heavily on the amount
of training text material and the degree to which the training text matches the
domain of interest. The language modeling community is showing a growing
interest in using large collections of text (obtainable, for example, from a diverse
set of resources on the Internet) to supplement sparse in-domain resources. How-
ever, in most cases the style and content of the text harvested from the web differs
significantly from the specific nature of these domains. In this thesis, we present a
relative entropy based method to select subsets of sentences whose n-gram distri-
bution matches the domain of interest. We present results on two speech recogni-
tion tasks: a medium vocabulary medical domain doctor-patient dialog system and
a large vocabulary transcription system for European parliamentary speech. We
show that the proposed subset selection scheme leads to performance improve-
ments over state of the art systems in terms of both Word Error Rate (WER) and
Perplexity (PPL). In addition, we can improve over the models built from the entire collection using just 10% of the data for the medical domain system and 33% of the data
for the transcription system. Improvements in data selection also translate to a
significant reduction in the vocabulary size as well as the number of estimated
parameters in the adapted language model.
We introduce the use of Latent Dirichlet Allocation based unsupervised cluster-
ing for generating query terms and merging data such that we are able to cover
the set of topics implicit in the domain of interest. This helps in ensuring that the
adaptation corpus has a balanced coverage of the domain of interest. In addition, by augmenting raw speech/ASR hypotheses with text from the generic corpus through appropriate querying and data selection methods, we are able to achieve significant performance improvements in both WER and Perplexity.
1.4 Open Problems and limitations
The research work and experiments in this thesis have been restricted to n-gram language models. N-gram language models are probability models that define the probability of seeing a word sequence W in a sentence. They are specified in terms of a hierarchical sequence of n-grams that give the probability of observing a word w given the sequence of n-1 words specified by the history h observed before it. These models are simple to implement in terms of model parameter estimation and calculation of the probability of a word sequence W. One reason for this choice is that they have a relatively simple analytical nature (as opposed to, say, syntax based language models). More importantly, n-gram language models are the most widely used models for speech applications. It should be possible to generalize the above schemes to other kinds of language models that can be computed from untagged text data, but we do not explore this in our work.
The tasks of data filtering and querying are intertwined. If we have a good data weighting procedure, we can use a simpler query mechanism which returns a lot of data, even if it is not all relevant or clean. Similarly, if the query mechanism is really good and the sources are not noisy, no data filtering is required. The use of active learning in conjunction with data acquisition for unannotated speech data raises even more questions: should we acquire data which is more confusable (and consequently more informative) for the learning process, so that we can ask the annotator to label it? In what order should the annotation and data acquisition be carried out? In this thesis we propose a number of mechanisms for query generation, data filtering, and LM adaptation from acoustics, and study their effectiveness in isolation. A detailed analysis of the effect of these components on one another is not presented.
1.5 Thesis outline
The thesis is organized as follows. Chapter 2 provides a brief background review of speech recognition and language modeling. Chapter 3 describes the process of generating keywords and keyphrases using relative entropy (R.E) between language models. In Chapter 4 we describe our data weighting and clustering mechanisms. The use of acoustic data for building language models is covered in Chapter 5. A hierarchical acoustic modeling scheme for recognition of names and function words is presented in Chapter 6. The last chapter concludes with a summary of the results and methods presented in the thesis and directions for future work.
Chapter 2
Background Review
This chapter provides a short introduction to speech recognition and language modeling
as a background to the research work discussed in the next few chapters. We will focus
on speech recognition as a speech-to-text conversion task. Speech can be seen as a set of acoustic observations $X = X_1 X_2 \ldots X_n$. In the probabilistic framework, the goal of speech recognition is to find the corresponding word sequence $W = w_1 w_2 \ldots w_n$ that has the maximum posterior probability $P(W|X)$, as expressed by the Bayesian formulation
$$\hat{W} = \arg\max_{W} \frac{P(X|W)\,P(W)}{P(X)}$$
Since the observation vector $X$ is fixed, the above maximization is equivalent to
$$\hat{W} = \arg\max_{W} P(X|W)\,P(W)$$
The speech recognition problem thus centers on building accurate acoustic models ($P(X|W)$) and language models ($P(W)$) that can truly reflect the spoken language to be recognized. Acoustic models are implemented as sub-word-level Hidden Markov Models, usually at the phone level. Language models, which serve as priors in the Bayesian formulation, are the focus of the research work presented in this proposal. The most common language models are n-gram models. N-gram models are specified in terms of a hierarchical sequence of n-grams that give the probability of observing a word $w$ given the sequence of $n-1$ words specified by the history $h$ observed before it. Since the data to estimate parameters of long n-gram sequences is sparse, language models are tree structured and many of the n-gram densities back off to probabilities of the corresponding $(n-1)$-grams. By describing the set of words and their probabilities, the language model also describes the vocabulary of the speech recognition system. All the hypotheses of the speech recognizer are constrained to the vocabulary of the language model.
To estimate an n-gram language model we use smoothed counting methods on the training data. Consider the trigram $P(z|yx)$: if the word sequence $yx$ occurs $C(yx)$ times in the training data and the sequence $zyx$ occurs $C(zyx)$ times, then the maximum likelihood estimate of $P(z|yx)$ is given by
$$P(z|yx) = \frac{C(zyx)}{C(yx)}$$
The maximum likelihood estimate assigns the highest probability to the training data for any n-gram model. However, its performance on test data is very poor. One particularly severe problem is that if a sequence $zyx$ is not seen in the training data it is assigned a probability of zero, even though it might occur in the test data. Smoothing methods improve upon this by modifying the counts to generate a more uniform probability mass that explains the test data better. One of the simplest smoothing methods is add-delta smoothing, which modifies the MLE estimate by adding a constant $\delta$ to the counts
$$P(z|yx) = \frac{C(zyx) + \delta}{C(yx) + \delta V}$$
where $V$ is the size of the vocabulary. The add-delta smoothing method is not the smoothing method of choice for most applications, but serves well to illustrate the general principles behind smoothing. For a good review of smoothing methods see [CG96].
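As an illustration of these counting and smoothing steps, here is a minimal Python sketch (not from the thesis; the toy corpus, the delta value, and the function names are hypothetical) that estimates add-delta smoothed trigram probabilities from tokenized sentences:

```python
from collections import Counter

def train_add_delta_trigram(sentences, delta=0.5):
    """Estimate add-delta smoothed trigram probabilities from tokenized sentences."""
    tri_counts, hist_counts, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        for i in range(2, len(tokens)):
            history = (tokens[i - 2], tokens[i - 1])
            hist_counts[history] += 1
            tri_counts[history + (tokens[i],)] += 1
    V = len(vocab)

    def prob(z, y, x):
        # (trigram count + delta) / (history count + delta * V)
        return (tri_counts[(y, x, z)] + delta) / (hist_counts[(y, x)] + delta * V)

    return prob, vocab

# Toy in-domain corpus
corpus = [["my", "stomach", "hurts"], ["i", "have", "a", "bad", "headache"]]
prob, vocab = train_add_delta_trigram(corpus)
print(prob("hurts", "my", "stomach"))     # seen trigram: relatively high probability
print(prob("headache", "my", "stomach"))  # unseen trigram: small but non-zero
```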
The acoustic and the language models should be matched to the task at hand. Acous-
tic models are matched to the acoustic environment of the speech, the recording medium,
and to the speaking style. A task specific language model fits the conversational style
and the vocabulary of the task. Both the acoustic model and the language model are built
from existing generic models by various adaptation schemes that transform the models
to fit limited in-domain training data. The most commonly used adaptation scheme for
language models is simple linear interpolation in which two models are merged linearly
$$P(z|xy) = \frac{\lambda P_1(z|xy) + P_2(z|xy)}{1 + \lambda}$$
where $\lambda$ is the interpolation weight.
In this proposal we focus on building language models out of limited in-domain data
available either as raw speech or text using a generic text resource such as the WWW.
The acquired data after filtering and weighting is converted to a language model which
is then merged with preexisting language models using linear interpolation.
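A minimal sketch of this merging step (assuming the two component models are available as callables returning conditional probabilities; the weight value is hypothetical and would normally be tuned on held-out data):

```python
def interpolate(p_in_domain, p_background, lam=0.7):
    """Linear interpolation of two conditional LM estimates with unnormalized weight lam,
    mirroring P(z|xy) = (lam * P1(z|xy) + P2(z|xy)) / (1 + lam)."""
    def prob(z, history):
        return (lam * p_in_domain(z, history) + p_background(z, history)) / (1.0 + lam)
    return prob

# e.g. merged = interpolate(web_lm_prob, background_lm_prob)   # hypothetical callables
```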
The language model can be converted to a graph that represents all the paths or text
sequences that the recognizer can potentially generate. Words can be seen as nodes on
the graph with the edges representing the probability of generating a word given the
current word sequence or word state. Acoustic evidence is converted to a probability
weight that is merged with the language model weight to construct the posterior graph.
This is then pruned to generate a hypothesis lattice. The n most likely paths through
the graph serve as the n-best recognition hypotheses, or the n-best list. The n-best list is
used in various adaptation and score merging algorithms instead of the primary speech
hypothesis (the 1-best) as it contains more information about the speech signal.
Chapter 3
Query generation
Data can be acquired from the web either by using search engines or by crawling the web starting from an initial webpage. Crawling or crawling-like approaches have been used successfully to gather data for minority languages [GJM03], to build domain specific search engines [MNRS99], and also to build parallel corpora of languages [RS03]. In our case, where the goal is to automatically download data related to a particular speech task, crawling is of limited utility. Crawling requires the specification of an initial set of hosts and webpages, which cannot be generated automatically. In addition, links on web pages typically lead to much broader content than the focused results of a good query search engine such as Google. Further, crawling techniques are not applicable to large collections of non-hyperlinked text data such as the GigaWord corpus. In this proposal, we restrict ourselves solely to query based retrieval for the task of generating topic specific language models. The query and URL tracking subsections of our system, which will be described later in this chapter, can be seen as crawling-like schemes.
To generate query strings we compare the topic language model (LM) with the background language model using the relative entropy measure first described in [SRN04]. The motivation for doing this comes from the work on measuring query clarity in information retrieval [CTC02][LC01], which provides a quantification of the ambiguity of a query. In [CTC02] the query clarity is defined as a measure of the unigram entropy between the distribution of the documents containing the query terms and the distribution of the documents in the entire collection. Thus, queries that would retrieve data having the least ambiguity with respect to the in-domain topic data will have the highest relative entropy between the collection model and the model of the documents containing the query term. Our goal is therefore to select query terms $q$ such that the R.E $D(p_q \| b_q)$ is maximal, where $p_q$ and $b_q$ denote the in-domain model and the background model with the term $q$ as root. As we download more data from the web, we can build better models for the in-domain set and the collection set, and thus generate queries with less ambiguity. We extend the query clarity measure described in [CTC02] to generate queries from n-gram language models with $n > 1$.
In general, relative entropy computation between two discrete distributions such as language models requires density comparisons across all the possible symbols in the alphabet. This implies that a direct R.E implementation for n-gram language models would require $V^n$ computations, where $V$ is the vocabulary size. This would make R.E comparisons for even medium sized trigram LMs with 15-20K words computationally prohibitive. However, since real world n-gram language models are tree structured and a majority of the n-gram densities back off to probabilities of the corresponding $(n-1)$-grams, it becomes possible to compute the R.E between two LMs in $O(L)$ computations, where $L$ is the number of language model terms actually present in the two LMs. The computation process described in [SRN04] recursively calculates the relative entropy for an n-gram model using the relative entropy for the $(n-1)$-gram model. During the computation, we are able to get the relative entropy conditioned on word sequence histories $h$, i.e., the R.E between the n-grams represented by $p(x|h)$ and $q(x|h)$, where $h$ is the history on which the probability of seeing the word $x$ is conditioned; $p$ being the base (topic model) LM and $q$ being the LM (background model) being evaluated with respect to $p$.
3.1 Computation of relative entropy
Computing relative entropy between discrete distributions requires density comparisons across all the possible symbols in the alphabet. This implies that a direct R.E implementation for an n-gram language model would require $V^n$ computations, where $V$ is the vocabulary size. This would make R.E comparisons for even medium sized trigram LMs with 15-20K words computationally prohibitive.

However, real world n-gram language models are tree structured and a lot of the n-gram densities back off to probabilities of the corresponding $(n-1)$-grams. Based on this tree-like structure, we provide a scheme which makes it possible to compute the R.E between two LMs in $O(L)$ computations, where $L$ is the number of language model terms actually present in the two LMs. We will express the language models as $p(x|h)$ and $q(x|h)$, where $h$ is the history on which the probability of seeing the word $x$ is conditioned; $p$ being the reference LM and $q$ being the LM being evaluated with respect to $p$. In the case of convergence analysis, $q$ is the language model estimated from artificial data. The other symbols we are going to use are:
$x$: the current word
$h$: the history $w_1 \ldots w_{n-1}$
$h'$: the back-off history $w_2 \ldots w_{n-1}$
$b_h$: the back-off weight of the $p$ distribution for history $h$
$b'_h$: the back-off weight of the $q$ distribution
$W$: the vocabulary of the language model
The R.E at level $n$ is
$$D_n = \sum_{h \in H} p_h \sum_{x \in W} p(x|h) \ln \frac{p(x|h)}{q(x|h)} \qquad (3.1)$$
We can divide the set of histories ($H$) at level $n$ into $H_s$, containing all $h$ which exist as $(n-1)$-grams and have a back-off weight $\neq 1$ in the $p$ or the $q$ distribution. The complement set ($H_{s'}$) will contain histories with a back-off weight of 1; $H_{s'}$ corresponds to histories not seen in either language model. Let
$$D_{xh} = \sum_{x \in W} p(x|h) \ln \frac{p(x|h)}{q(x|h)} \qquad (3.2)$$
Then
$$D_n = \sum_{h \in H_s} p_h D_{xh} + \sum_{h \in H_{s'}} p_h D_{xh} = \sum_{h \in H} p_h D_{xh'} + \sum_{h \in H_s} p_h D_{xh} - \sum_{h \in H_s} p_h D_{xh'}$$
Marginalizing $w_1$,
$$D_n = D_{n-1} + \sum_{h \in H_s} p_h \left( D_{xh} - D_{xh'} \right) \qquad (3.3)$$
$D_{xh}$ can be split into four terms depending on whether $x|h$ is defined in the $p$ or the $q$ distribution:
$$D_{xh} = T_1 + T_2 + T_3 + T_4$$
$T_1$: $p(x|h)$ exists, $q(x|h)$ exists (let $x \in X_1$)
$T_2$: $p(x|h)$ backs off, $q(x|h)$ exists (let $x \in X_2$)
$T_3$: $p(x|h)$ exists, $q(x|h)$ backs off (let $x \in X_3$)
$T_4$: $p(x|h)$ backs off, $q(x|h)$ backs off (let $x \in X_4$)
$$T_1 = \sum_{x \in X_1} p(x|h) \ln \frac{p(x|h)}{q(x|h)}$$
$$T_2 = b_h \ln b_h \sum_{x \in X_2} p(x|h') + b_h \sum_{x \in X_2} p(x|h') \ln \frac{p(x|h')}{q(x|h)}$$
$$T_3 = \sum_{x \in X_3} p(x|h) \ln \frac{p(x|h)}{q(x|h')} - \ln b'_h \sum_{x \in X_3} p(x|h)$$
$$\begin{aligned}
T_4 &= \sum_{x \in X_4} b_h\, p(x|h') \ln \frac{b_h\, p(x|h')}{b'_h\, q(x|h')} \\
    &= b_h \ln \frac{b_h}{b'_h} \sum_{x \in X_4} p(x|h') + b_h \sum_{x \in X_4} p(x|h') \ln \frac{p(x|h')}{q(x|h')} + b_h \sum_{x \in X'_4} p(x|h') \ln \frac{p(x|h')}{q(x|h')} - b_h \sum_{x \in X'_4} p(x|h') \ln \frac{p(x|h')}{q(x|h')} \\
    &= b_h \ln \frac{b_h}{b'_h} \Big( 1 - \sum_{x \in X'_4} p(x|h') \Big) + b_h D_{xh'} - b_h \sum_{x \in X'_4} p(x|h') \ln \frac{p(x|h')}{q(x|h')}
\end{aligned}$$
where $X'_4$ denotes the complement of $X_4$ in the vocabulary $W$. Thus we are able to express $D_{xh}$ in terms of the LM terms actually seen. Using $D_{xh}$ computed in this fashion in (3.3), we get a recursive formulation for the R.E at level $n$ using only LM densities actually seen. $D_{xh}$ can be computed using the base expression (3.2) if $p(x|h)$ or $q(x|h)$ are mostly defined in the two LMs for the history $h$.
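As a sanity check on the four-term split above, the following Python sketch (toy back-off bigram models with made-up probabilities, not from the thesis) computes $D_{xh}$ for a single history both directly over the whole vocabulary and via $T_1 + T_2 + T_3 + T_4$; the two values agree:

```python
import math

VOCAB = ["a", "b", "c", "d"]
p_uni = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}      # p(x|h') for the backed-off history
q_uni = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}  # q(x|h')

def backoff_model(unigrams, explicit):
    """Toy back-off bigram: P(x|h) = explicit[x] if stored, else b_h * unigrams[x],
    with the back-off weight b_h chosen so that the distribution sums to one."""
    b_h = (1.0 - sum(explicit.values())) / sum(unigrams[x] for x in VOCAB if x not in explicit)
    def prob(x):
        return explicit[x] if x in explicit else b_h * unigrams[x]
    return prob, b_h, set(explicit)

p_prob, b_h, p_seen = backoff_model(p_uni, {"a": 0.4, "b": 0.2})   # reference model p
q_prob, bq_h, q_seen = backoff_model(q_uni, {"a": 0.3, "c": 0.3})  # evaluated model q

# Direct computation: D_xh = sum_x p(x|h) ln [ p(x|h) / q(x|h) ]
direct = sum(p_prob(x) * math.log(p_prob(x) / q_prob(x)) for x in VOCAB)

# Split the vocabulary by which model has an explicit entry for this history
X1 = [x for x in VOCAB if x in p_seen and x in q_seen]          # both explicit
X2 = [x for x in VOCAB if x not in p_seen and x in q_seen]      # p backs off, q explicit
X3 = [x for x in VOCAB if x in p_seen and x not in q_seen]      # p explicit, q backs off
X4 = [x for x in VOCAB if x not in p_seen and x not in q_seen]  # both back off

T1 = sum(p_prob(x) * math.log(p_prob(x) / q_prob(x)) for x in X1)
T2 = b_h * math.log(b_h) * sum(p_uni[x] for x in X2) \
     + b_h * sum(p_uni[x] * math.log(p_uni[x] / q_prob(x)) for x in X2)
T3 = sum(p_prob(x) * math.log(p_prob(x) / q_uni[x]) for x in X3) \
     - math.log(bq_h) * sum(p_prob(x) for x in X3)
T4 = sum(b_h * p_uni[x] * math.log(b_h * p_uni[x] / (bq_h * q_uni[x])) for x in X4)

print(direct, T1 + T2 + T3 + T4)  # equal up to floating point error
```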
3.2 Generating queries using Relative Entropy
The histories with large R.E serve as good candidates for keyphrases or keywords since they have good discriminative power. To make sure that they qualify as keywords or keyphrases for the topic, we also need to ensure that $p(h)$ is larger than the corresponding $q(h)$. For the medical domain task, which we describe in our experiments, some of the keyphrases were very relevant ("Stomach Ache", "Feeling Nausea", "Bad Headache"). However, in many cases key phrases contained words without key nouns, such as "Hurts The" or "Place A". We found that even though these query phrases are not very useful on their own, they are effective when combined with keywords, e.g. "Hurts The"+"Stomach". A detailed discussion on the effect of query length on information retrieval accuracy can be found in [GJM03][BKK+03].

A list of query keyphrases and keywords was generated using the techniques described above and merged with a keyword list based on the mutual information between words and class labels using the document classification system described in Section 4.3. Then a random selection of five query words with a couple of randomly chosen keyphrases was used as a search query for the Google SOAP API, which then returned the relevant set of URLs. The mix of keywords and keyphrases to embed in a single query is a function of the task at hand. For example, to model conversational styles it should be advantageous to put more keyphrases than keywords in the query [BOS03]. The set of URLs was then downloaded and converted to text, which was subsequently weighted and filtered as described in the next chapter.
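One plausible way to implement this selection rule (my own illustration with made-up numbers; the thesis only specifies that candidates should have a high conditional R.E and $p(h) > q(h)$) is to rank each candidate history by its contribution $p(h)\,D_{xh}$ to the overall R.E:

```python
# Hypothetical per-history statistics: in-domain probability p(h), background
# probability q(h), and the conditional R.E d_xh of the distributions rooted at h.
candidates = {
    "stomach ache":   {"p_h": 0.012, "q_h": 0.0004, "d_xh": 2.1},
    "feeling nausea": {"p_h": 0.009, "q_h": 0.0002, "d_xh": 1.8},
    "hurts the":      {"p_h": 0.020, "q_h": 0.0050, "d_xh": 0.9},
    "click here":     {"p_h": 0.001, "q_h": 0.0300, "d_xh": 1.5},  # frequent in background only
}

def rank_queries(stats, top_k=3):
    """Keep histories with p(h) > q(h) and rank them by their weighted R.E contribution."""
    keep = {h: s["p_h"] * s["d_xh"] for h, s in stats.items() if s["p_h"] > s["q_h"]}
    return sorted(keep, key=keep.get, reverse=True)[:top_k]

print(rank_queries(candidates))  # ['stomach ache', 'hurts the', 'feeling nausea']
```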
3.3 Tracking queries
After each iteration of downloads, we identify the queries and URLs that provided documents with a high match to the in-domain data. The match is evaluated in terms of the average perplexity of the downloaded documents and their classification in the data weighting stage. In the next iteration of downloads, the top scoring domains (identified from the URLs) and the top scoring queries are preferred in generating search parameters for Google. A relevance feedback score [YRN99] can also be generated for the search engine using the model match measure. The identification of the top URLs also allows us to identify potential sources for recursive web crawling.
3.3.1 Evaluating goodness of a document
The data acquired from the web consists of documents (i.e., web links/pages). From on-disk corpora it is possible to do retrieval at the sentence level. To evaluate the importance of query terms and URLs, we need to evaluate the goodness of the retrieved data. The goodness evaluation is essentially carried out by the data filtering and weighting mechanism described in the next chapter. The evaluation is carried out using a mixture of document, sentence, and ASR based features in a positive-only classification framework.
3.3.2 Tracking URLS
To track URLs we compute the overall goodness score of a document. The computation is carried out at different levels of the URL hierarchy. For example, if we have the URL www.msn.com/arbitrary/random/goodone.html, we evaluate the average goodness of the documents that share the URL prefixes www.msn.com, www.msn.com/arbitrary, and www.msn.com/arbitrary/random. The top scoring URLs are used for searching documents in the next iteration. We restrict downloads to URLs that score highly and for which more than a certain amount of data has been downloaded.
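A minimal sketch of this hierarchical scoring (hypothetical URLs and goodness values; the actual goodness scores come from the weighting mechanism of the next chapter):

```python
from collections import defaultdict
from urllib.parse import urlparse

def url_prefixes(url):
    """Yield every level of the URL hierarchy, from the host down to the full directory path."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p][:-1]  # drop the document name
    yield parsed.netloc
    for i in range(1, len(parts) + 1):
        yield parsed.netloc + "/" + "/".join(parts[:i])

def score_url_levels(docs):
    """docs: list of (url, goodness) pairs; return the average goodness per URL prefix."""
    totals, counts = defaultdict(float), defaultdict(int)
    for url, goodness in docs:
        for prefix in url_prefixes(url):
            totals[prefix] += goodness
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}

docs = [("http://www.msn.com/arbitrary/random/goodone.html", 0.9),
        ("http://www.msn.com/arbitrary/other/noisy.html", 0.2)]
print(score_url_levels(docs))
```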
3.3.3 Tracking queries
The downloaded data in every iteration is indexed using the Lemur search engine. After indexing we find the set of documents corresponding to the top N (around 200) query keywords and keyphrases. For every document $D$, there can be $n$ query terms that can be identified with the document. Since the query to the WWW search is specified as a conjunction of query terms, we can attribute the document as a potential search result of any of the query terms in the document and their combinations. If we restrict ourselves to at most $j$ query terms in a query, we can identify $\binom{n}{j}$ queries with $j$ terms, $\binom{n}{j-1}$ queries with $j-1$ terms, and so on. We compute the average score of queries of term length $j$ down to 1, along with the downloaded data size. This computation is carried out by iterating over the document set, collecting the set of queries possible in every document by generating all the combinations up to length $j$, and calculating the average over the entire set. At every iteration the best queries from the previous iteration are used in conjunction with random concatenations of query terms generated from the R.E measure.
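The per-query averaging can be sketched as follows (hypothetical matched-term sets and goodness scores; the thesis indexes the documents with Lemur rather than holding them in memory like this):

```python
from itertools import combinations
from collections import defaultdict

def score_query_combinations(documents, max_terms=3):
    """documents: list of (matched_query_terms, goodness) pairs for downloaded documents.
    Average the goodness over every combination of up to max_terms query terms,
    treating each combination as a query that could have retrieved the document."""
    totals, counts = defaultdict(float), defaultdict(int)
    for terms, goodness in documents:
        for j in range(1, min(max_terms, len(terms)) + 1):
            for query in combinations(sorted(terms), j):
                totals[query] += goodness
                counts[query] += 1
    return {q: totals[q] / counts[q] for q in totals}

docs = [({"stomach", "ache", "nausea"}, 0.8),
        ({"stomach", "headache"}, 0.6)]
scores = score_query_combinations(docs)
print(max(scores, key=scores.get))  # best scoring query combination so far
```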
3.4 Experiments
Our experiments were conducted on medical domain data as part of the Transonics speech-to-speech translation system [Nea03]. 3K utterances from patient-doctor dialogs were used to build the topic model, 1K utterances were used for testing, and about 500 utterances were used as held-out data for tuning system parameters such as language model weights. For the background model, we chose to use an interpolated model consisting of data from SWB (4M words), WSJ (3M words), and some text from the Gutenberg project (2M words); the Gutenberg project is a collection of e-books of English literature. After pruning [VW03], the background model had a vocabulary of about 40K words. The test set size was fixed at 110K words. The downloaded data size was kept fixed at 10M words to provide meaningful comparisons across different experiments.
3.4.1 Effect of initial seed set size
In our first experiment we evaluated the performance of the system for different sizes of the initial seed set. The goal was to see how important it is to have a large initial set for the query generation subsystem to generate a good set of keywords and keyphrases. The results can be seen in Tables 3.1 and 3.2. They indicate that the query generation process performs well even with small amounts of data, and hence the performance of the final merged models is not critically dependent on the size of the seed set. With decreasing seed set size, the jump in perplexity for the models trained from the seed data alone is very high compared to the final models built after merging with the web data. The background model was also used as the base language model for interpolation with the initial data.

Initial data size (K words)    PPL with initial data    PPL with initial data + Webdata
5                              200                      120
10                             160                      110
15                             117                      92

Table 3.1: Perplexity (PPL) of test data for different sizes of initial seed data. Both models were merged with the base LM using linear interpolation.

Initial data size (K words)    WER with initial data    WER with initial data + Webdata
5                              27                       25
10                             25                       23.9
15                             23                       22.3
30                             20.5                     19.8

Table 3.2: ASR WER of test data for different sizes of initial seed data. Both models were merged with the base LM using linear interpolation.
3.4.2 Effect of query size and type
We studied the effect of different query structures on the system performance. We experimented with different numbers of keywords and keyphrases in the Google query string. The results are presented in Table 3.3. The best results were obtained with a five-keyword, two-keyphrase system. We tried increasing the query string length beyond the five-keyword, two-keyphrase mark, but there was no performance improvement, and Google would frequently return a very short list of URLs, implying that the query string was too specific.

Keyword count    Keyphrase count    Perplexity of merged LM
5                2                  92
4                1                  100
3                0                  111

Table 3.3: Effect of different numbers of query keywords and keyphrases on system performance.
3.5 LM convergence analysis using R.E
A central question in the text data acquisition scenario is how much data we need to converge to the true density. Assuming that we are able to design querying and data filtering strategies such that the downloaded web data is close to true in-domain data, we have access to large amounts of text for training language models. However, even in such a scenario a language model might converge slowly to the true density for certain words. In this section, we present a study of language model convergence using the R.E measure.
In order to specify how close an estimated LM is to the true density, we need to define a distance measure. Perplexity [CBR98][CMU+95] can be a useful tool for this purpose. We can compute the perplexity of different language model estimates and select the one which gives the lowest perplexity on the test set as the closest match to the true density. However, it is difficult to interpret the results since the lower bound of perplexity is not known beforehand. A decrease in perplexity, although indicative of progress, does not measure how far we are from the true density. Another issue is that the test set is a small sample drawn from the true density and does not cover the entire range of n-gram probabilities in the true language model. Also, since it is essentially a random sample from the true density, it does not represent the true density accurately.
If the true density is known, R.E of the LM estimate can address the issues with per-
plexity based analysis. R.E can provide local information in terms of convergence of
different probability estimates as well as a global match between the true density and
the approximating LM.
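To make the link between the two measures explicit (a standard information-theoretic argument, not spelled out in the thesis): for test data $w_1 \ldots w_N$ drawn from the true density $p$, the average log loss of a model $q$ converges to the cross entropy of $p$ and $q$,
$$\lim_{N \to \infty} -\frac{1}{N} \sum_{i=1}^{N} \log q(w_i|h_i) = H(p) + D(p \| q),$$
and hence $\log \mathrm{PPL}(q) - \log \mathrm{PPL}(p) \to D(p \| q)$. This is why the R.E can be read as the excess (log-domain) perplexity that would be measured on an infinitely large test set, an interpretation used later in Section 3.6.2.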
To study convergence of LM estimates, we substitute the hidden true language model
with a representative language model. Next, we generate training data from this lan-
guage model by a process that can be seen as a random walk through a graphical model
that encapsulates the LM. The generated data represents a sample drawn from the speci-
fied language model. Language models can then be estimated from this ‘artificial’ data,
and R.E comparisons with the initial representative language model can be carried out.
This scheme can be seen as a distribution resampling approach. The size of the ‘artifi-
cial’ training data can be increased to see how the R.E converges to zero. An advantage
of this scheme is that it is possible to generate a training corpus of arbitrary size without
affecting the underlying true data density. In conventional schemes the data available
for training is limited, and to study convergence we can only decrease the training set
size. R.E comparisons across different classes of the LM such as proper nouns, func-
tional words and domain specific terms can be carried out separately to estimate the
confidence that can be placed in the LM estimates for these groups.
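The "random walk" generation step can be sketched as follows (a hypothetical toy bigram model and sampler; the thesis carries out the actual resampling experiments with the SRI toolkit, Sec. 3.5.1):

```python
import random

# Hypothetical bigram model: next-word distributions P(. | w), including sentence markers.
bigram = {
    "<s>":     {"my": 0.6, "i": 0.4},
    "my":      {"stomach": 0.7, "head": 0.3},
    "i":       {"feel": 1.0},
    "stomach": {"hurts": 1.0},
    "head":    {"hurts": 1.0},
    "feel":    {"nausea": 1.0},
    "hurts":   {"</s>": 1.0},
    "nausea":  {"</s>": 1.0},
}

def sample_sentence(model, max_len=20):
    """Draw one sentence by walking the model from <s> until </s> is generated."""
    word, sent = "<s>", []
    for _ in range(max_len):
        nxt = random.choices(list(model[word]), weights=list(model[word].values()))[0]
        if nxt == "</s>":
            break
        sent.append(nxt)
        word = nxt
    return sent

# Generate an 'artificial' corpus of arbitrary size from the reference model.
corpus = [sample_sentence(bigram) for _ in range(1000)]
```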
3.5.1 Reference language models
Our experiments were carried out using the SRI toolkit [Sto02]. We applied the resampling technique (Sec. 3.5) to two different language models: an interpolated trigram LM for LVCSR based on conversational speech (LM1), and a task adapted medical domain trigram language model (LM2). LM1 [RJP03] was built from 180 hours of manually transcribed conversational speech (1.7M words), interpolated with language models built from the Broadcast News and Switchboard corpora (158M and 3.4M words, respectively). LM2 has a vocabulary of around 20K words with 700K bigrams and 8M trigrams. This LM is used in the English part of the Transonics speech-to-speech translation system [Nea03]. It was built from a conversational trigram LVCSR LM adapted on semi-structured patient-doctor dialog data from a variety of sources.
3.5.2 Convergence analysis results
We generated 'artificial' data from the two LMs and built language models on the generated data using Kneser-Ney smoothing. The estimated LM and the reference LM were compared using the recursive R.E measure (Sec. 3.1). We also carried out LM estimation on multiple data sets of the same size to measure the variance of the R.E. The variance of the R.E was around 1%, which implies that the R.E is consistent across different runs of the resampling process. The plots (see Figs. 3.1 and 3.2) compare well with the results of [CG96].
[Figure 3.1: R.E vs. training data size for LM1 (R.E plotted against training data size from 10^5 to 10^7 words)]
[Figure 3.2: R.E vs. training data size for LM2 (R.E plotted against training data size from 10^5 to 10^7 words)]
For the two LMs, perplexity was also measured for the resampled language models on held-out data. As expected, the perplexity plots are similar to the R.E plots (see Figs. 3.3 and 3.4).
[Figure 3.3: Perplexity vs. training data size for LM1 (perplexity plotted against training data size from 10^5 to 10^7 words)]
[Figure 3.4: Perplexity vs. training data size for LM2 (perplexity plotted against training data size from 10^5 to 10^7 words)]

To observe how the language model convergence rate compares across different word categories, we estimated the in-class R.E across different LM subsets grouped according to their most common part-of-speech tag. As can be seen in Table 3.4, the convergence of the LM to the reference varies widely across the different word categories. Convergence for nouns (specifically proper nouns) was very poor compared to other word categories such as verbs and adjectives. This would imply that in weighted classifier merging a low LM weight should be chosen for nouns and a higher weight for other words. In addition, class-based models could be advantageous for these words in terms of estimation accuracy.

Parts of speech    Average R.E
Nouns              0.8
Proper nouns       3.4
Verbs              0.4
Adjectives         0.6
Determiners        0.3

Table 3.4: R.E across different parts of speech for LM2, corresponding to an artificial training data size of 20M words.
3.5.3 Sparseness of resampled LMs
As can be seen in Table 3.5, the resampled LMs were much sparser than the reference LM (LM2) even though the R.E and the corresponding perplexity figures were very low. The reference LM has a perplexity of 11.39 with 8M trigrams. LM pruning is one example application where LM reestimation and recursive R.E computation can prove useful; we discuss this in more detail in Section 3.8.

Generated data (K words)    R.E     Perplexity    Number of trigrams
500                         0.31    18            80K
1000                        0.18    13.6          118K
2000                        0.11    12            188K
5000                        0.06    11.5          284K
10000                       0.04    11.47         464K
20000                       0.02    11.44         774K

Table 3.5: R.E, perplexity and number of trigrams across LMs resampled from increasing generated data for LM2.
3.6 ASR results
We now present results comparing ASR performance with increasing data size. For
these experiments we used the Transonics LM generated from 200K words of in-domain
text. This language model has a vocabulary of 4970 words. The ASR test set consists of
520 utterances with 3600 words.
3.6.1 OOV rate
Not all words in the vocabulary are represented in the limited training data. Table 3.6 compares the OOV rate of the sampled data with increasing sample size. The original language model was built from 200K words. At the same sample size we have a very high OOV rate (20%). The OOV rate goes down to zero at 25,000K words, which is 125 times the original data size. This experiment shows the necessity of augmenting the ASR vocabulary externally, either from a mixture LM or from a list of in-domain words.
Generated data (K words) OOV count
250 1480 (24%)
500 954 (20%)
1000 440 (9%)
2500 63 (1.2%)
10000 20 (0.4%)
25000 0 (0%)
50000 0 (0%)
Table 3.6: OOV rate with increasing data size. The vocabulary of the original LM was
4970.
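As a concrete illustration of how OOV counts such as those in Table 3.6 can be computed, the following minimal Python sketch counts the test-set tokens that fall outside the vocabulary induced by a sampled training corpus. The file names and whitespace tokenization are illustrative assumptions, not the exact pipeline used in our experiments.

```python
# Minimal sketch (assumed file names and whitespace tokenization):
# count test-set tokens not covered by the vocabulary of a sampled corpus.

def read_tokens(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield from line.split()

def oov_rate(sample_path, test_path):
    vocab = set(read_tokens(sample_path))          # vocabulary induced by the sample
    test_tokens = list(read_tokens(test_path))
    oov = sum(1 for w in test_tokens if w not in vocab)
    return oov, oov / max(len(test_tokens), 1)

if __name__ == "__main__":
    count, rate = oov_rate("sampled_250k.txt", "asr_test_set.txt")
    print(f"OOV tokens: {count} ({100 * rate:.1f}%)")
```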
Generated data (K words)    R.E      PPL      WER
250                         0.2      74       21.7
500                         0.15     72.8     21.2
1000                        0.11     72.5     20.4
2500                        0.08     71.3     19.2
10000                       0.05     67       19.5
25000                       0.045    66.92    18.8
50000                       0.03     66       18.9
Table 3.7: Word error rate with increasing data size. The WER with the original language model was 18.5%.
3.6.2 WER results
WER results are shown in Table 3.7. With increasing data size the WER tends to
decrease; however, the decrease is not monotonic. The WER increases from 19.2% to
19.5% as the generated data grows from 2500K to 10000K words, and there is a further
0.1% increase from 25000K to 50000K words. The R.E between the original language
model and the estimated language model, however, decreases monotonically with increasing
data. The perplexity calculated on a set of 1000 held-out utterances shows the same trend.
R.E has a correlation of 0.96 with the WER, and perplexity has a correlation of 0.84. As
discussed in previous sections, R.E can be interpreted as perplexity evaluated on infinite data. Hence
Generated data (K words)    Number of trigrams
250                         20K
500                         38K
1000                        155K
2500                        490K
10000                       1069K
20000                       1.8M
50000                       3.2M
Table 3.8: Number of trigrams across LMs resampled from increasing generated data.
it is able to explain the behaviour of the language model better. The departure from
monotonic behaviour in the WER remains a point of investigation; it can be attributed to
sampling effects and overtraining.
3.6.3 LM complexity
Table 3.8 shows the LM complexity (in terms of number of trigrams) with increasing
data size. The model complexity increases monotonically with increasing data size. The
original model has 25K trigrams. With more data, the reestimated models get substan-
tially more complex than the original model.
3.7 LM convergence and ASR design
We have seen from the simulations in the previous section that even with large amounts
of data, LM convergence is disparate and slow for many words, specifically nouns.
Based on this we believe that the acoustic system of the speech recognizer can play a
more important role in recognizing such words since the acoustic models for the speech
recognizer are designed at the phone level. The use of sub-word units makes the acoustic
models less severely affected by the lack of acoustic training data for certain words.
The phone-level representation of nouns can be covered by the training data for other
function words.
More complex acoustic models can be used in cases where the LM convergence is
expected to be poor. We have explored the use of longer-length acoustic units such as
syllables to improve ASR performance for names and proper nouns [SNP05, SNR03].
3.8 Other applications of recursive R.E computation
The recursive R.E computation scheme described in section 3.1 can be used to compare
any two nested n-gram models, which can even be of different order. To address the issue
of vocabulary mismatch between two models, words not present in one of the language
models can be mapped to the unknown word (<unk>). Class-based models can also be
compared with minor modifications to the recursive formulation given in section 3.1.
A simplified recursive R.E scheme for the case when an n-gram is being approximated
by an (n-1)-gram was used in [Sto98a] to prune language models. Using the
recursive R.E scheme, this idea can be extended to allow adjustments of the lower-order
probability estimates to provide a better fit for the n-gram. We also observed (Sec. 3.5)
that resampling a LM using the generation and estimation process generates language
models that are substantially sparser than the reference LM. In an information theoretic
sense, this can be seen as a quantization process where the R.E measures the bits we are
losing with respect to the reference LM.
Chapter 4
Data Filtering
Even with well crafted queries, the data acquired from generic resources may contain
large amounts of out-of-domain text. For a source like the WWW, the acquired data
tends to be noisy and contains a lot of spurious text corresponding to embedded web
advertisements, links, etc. It is necessary to clean and pre-process the acquired data
before using it to build language models. The data filtering task is carried out in two
stages. In the first stage, we remove spurious content using a rejection model which is
updated at every iteration. In the second stage, we weight the data at the utterance level
in terms of its match to the in-domain data. The match is evaluated using a feature
space comprising document- and utterance-level features. Data from the second-stage
filtering that receives low weights is pruned out and collected for building the first-stage
rejection model.
4.1 Web data preprocessing
Data from the web needs to be converted to text before it can be utilized for building
language models. Most of the downloaded files are in HTML format. Converting the
HTML files to text involves removing the HTML and other extraneous tags in the down-
loaded data. Some of the downloaded files are in binary or non-text formats such as
PDF and DOC. These files need to be parsed appropriately and converted to text using
format-specific tools.
The converted data, specifically from HTML, typically does not have well defined
sentence boundaries. We pipe the text through a maximum entropy based sentence
boundary detector [Rat96] to insert better sentence boundary marks. We also experimented
with text normalization using the NSW tools [Spr01]. Using the sentence
boundary detector was found to be of great help, both in terms of improving system
performance and in designing the subsequent stages of text processing: some of the
processing assumes that the text can be chunked into units (sentences), and in the absence
of such units the processing requirements can become very high. The performance
improvement from the NSW tools was not substantial, although the vocabulary contained
fewer spurious items.
Document OOV rate is computed after initial conversion with respect to both the in-domain
data and a global word list, which has a much larger coverage of English words.
The global word list that we used has more than 500K entries. An OOV rate threshold
is fixed for both the topic and global data, with the global data threshold being lower.
Documents that cross the thresholds are rejected from further processing.
4.2 Preliminary data processing
After conversion to text and OOV filtering, we do document level filtering using word
based text classification. The documents corresponding to the topic, background,
and rejected data are used to train a document classification system based on the
TFIDF/Naive Bayes measure using the CMU BOW toolkit [McC96] at the end of each
iteration. For the next iteration, we use this classifier to assign scores to the documents
downloaded, and reject the documents that have a high score with the rejection class.
These documents are used to update the rejection document model.
The downloaded documents are then scored with the in-domain, background, and
rejection data model to generate per utterance and per document perplexity scores. Doc-
uments and utterances with perplexity below a certain threshold with the rejection lan-
guage model are included in the rejection data set used to build the rejection model.
Documents and utterances with perplexity higher than a threshold for both the topic
model and the background model are also pooled into the rejected data.
The rejection model built in this fashion was found to be very useful in removing
spurious content from web-data. This helps to build a more robust and computationally
efficient data weighting procedure.
4.3 Data weighting
Data from a generic source such as the web is not sampled from the same distribution
as the in-domain text. Even with well formed queries the acquired data does not match
the domain and style of interest completely. In order to build a model for the in-domain
data from such a source, we need to use soft-clustering-like approaches where we attach
a confidence weight to every data point. For text, we choose to use the utterance as a
data point even though data might be acquired at the document level. We next describe
the feature space representation we use to build our statistical models for weighting the
data.
4.4 Feature Space
We use a combination of document and utterance level features to do the text filtering.
4.4.1 Document level features
The documents corresponding to the topic, background, and rejected data are used to
train a document classification system based on the TFIDF/Naive Bayes measure using
the CMU BOW toolkit [McC96]. The document classification system was used to assign
document weights to the downloaded text for the three classes: background, topic, and
rejected data.
Document-level perplexity scores are also generated using the in-domain and background
language models. The document OOV rate is included as an additional feature.
4.4.2 Utterance level scores
The perplexity of every utterance is evaluated with the in-domain model and the
background and rejection models. Language models are generally conceived as probability
models that define the probability of seeing a word sequence W in a sentence. The most
common language models are specified in terms of a hierarchical sequence of n-grams
that specify the probability of observing a word w given the sequence of n-1 words
specified by the history h observed before it. These models are simple to implement in
terms of model parameter estimation and calculation of the probability of a word
sequence W.
Utterance BLEU score is also calculated in the fashion described in [SGG05]. The
BLEU [PRWZ01] score is used in machine translation as a fully automatic evaluation
metric that forms a viable alternative to expensive and time-consuming human judgment
of translation quality. To compute the BLEU score we first take the geometric average of
the modified n-gram precisions, $p_n$, using n-grams up to length $N$ and positive weights
$w_n$ summing to one. Next, let $c$ be the length of the candidate translation and $r$ be the
reference translation length. Then, the brevity penalty $BP$ is computed as
$$ BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} $$
Then,
$$ \text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) $$
For our purpose we consider the in-domain sentences as possible references and calculate,
for each utterance in the acquired data, its best BLEU score against any in-domain utterance.
The number of utterances in the acquired set can be very large, potentially running
into hundreds of millions of utterances. Thus, even with a small reference set of, say, 300
utterances, computing the BLEU score can be a computationally formidable task. To
make the task computationally tractable, we use an indexing engine [OC02] to identify
utterances that have at least one word in common with a given in-domain utterance. All
other utterances will have zero BLEU score.
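As an illustration of the sentence-level BLEU computation described above, the sketch below scores a candidate utterance against a small set of in-domain references using modified n-gram precisions up to N = 4 and the brevity penalty. It is a simplified stand-in for the indexing-based pipeline; the epsilon smoothing of zero precisions is an assumption made purely for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4, eps=1e-9):
    """Sentence-level BLEU of `candidate` against one `reference` (token lists)."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        log_prec_sum += (1.0 / max_n) * math.log(max(overlap / total, eps))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1.0 - r / max(c, 1))        # brevity penalty
    return bp * math.exp(log_prec_sum)

# Best BLEU of an acquired utterance against a toy in-domain reference set.
references = [u.split() for u in ["do you have any pain", "please stand up"]]
utterance = "do you have any chest pain".split()
print(max(sentence_bleu(utterance, ref) for ref in references))
```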
The BLEU metric can be very restrictive on the structure and vocabulary of the
utterances that get a good score. Utterance-level perplexity allows for much better
coverage than BLEU and can help induce more variability into the acquired data;
BLEU is more useful for identifying very specific utterances.
Other features at the utterance level include the OOV rate and some acoustic-based
features that are described in Chapter 5. Just like the document-level processing, the
utterance-level training set is enhanced with each iteration as downloaded documents are included.
4.5 Preliminary text cleanup
Documents and utterances with low score with respect to both the topic and background
are pooled at every iteration to form the rejected text set that is used to build a rejection
document classifier and a rejection language model. At every iteration, data that has
a good match with the rejection model is cleaned out. This data typically represents
spurious text, web advertisements, etc. In addition, documents/utterances that have a
high OOV rate are also rejected. The rejection model implements the threshold-based
cleaning described in [BOS03, NOH+05a]. The use of the BLEU score for threshold-based
cleaning of text was suggested in [SGG05]. We use the BLEU score for weighting
the data instead of filtering, since BLEU-based thresholding rejects a lot of
utterances, especially if the initial seed set is small.
4.6 Data weighting
We used two methods to assign weights to the downloaded utterances.
4.6.1 Method 1 : Likelihood based weighting
In this weighting scheme, the weight assigned to each utterance is taken to be propor-
tional to the log-probability of the topic language model generating that utterance. This
approach is similar to EM training used for a variety of statistical models. The constant
of proportionality is decided using the held-out set. We have experimented with other
monotonic weighting functions such as the sigmoid and the exponential, but the difference in
achieved performance is not significant compared to simple linear weighting. We have
also experimented with using a linear combination of model-match features, i.e., document
perplexity with both the topic and background models and utterance perplexity with
both models. The weight $W_i$ on utterance $i$ is
$$ W_i = \lambda_1 P_{topic}(Doc(i)) + \lambda_2 P_{background}(Doc(i)) + \lambda_3 P_{topic}(Utt(i)) + \lambda_4 P_{background}(Utt(i)) $$
The utterance weights are used to generate fractional counts and build a language
model using Witten-Bell smoothing. We achieved a significant performance improvement
by using linear weighting over threshold-based data selection, in both WER and
perplexity.
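A minimal sketch of this likelihood-based weighting follows, assuming a hypothetical `topic_logprob(utterance)` scorer from a topic LM, a proportionality constant tuned on held-out data, and weighted n-gram counts accumulated for a later smoothing step; the offset and clipping at zero are illustrative choices, not part of the method description above.

```python
from collections import defaultdict

def weight_utterances(utterances, topic_logprob, scale, offset=0.0):
    """Weight each utterance in proportion to its topic-LM log-probability.

    topic_logprob: callable returning log P_topic(utterance) (hypothetical scorer)
    scale, offset: constants tuned on a held-out set; clipping at zero keeps the
                   fractional counts non-negative (an illustrative choice).
    """
    return {utt: max(scale * topic_logprob(utt) + offset, 0.0) for utt in utterances}

def fractional_counts(weighted_utts, order=3):
    """Accumulate weighted n-gram counts for a smoothing routine (e.g. Witten-Bell)."""
    counts = defaultdict(float)
    for utt, w in weighted_utts.items():
        toks = ["<s>"] + utt.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += w
    return counts
```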
4.6.2 Method 2 : Positive only classification
The data weighting process can be viewed as a case of positive-only learning [LDL+03].
In positive-only learning, we are given a set of examples S for the class we are interested
in, along with unlabeled data U. The unlabeled set is typically much larger than the
labeled set. Positive-only learning comprises two steps:
1. We start by constructing an initial set RN from U of data points which are very
likely to be in the set S', the complement of S in U. Various schemes
are suggested in the literature for building the set of RN examples. The basic
idea behind these schemes is that the entire set of unlabeled data U can be taken
as RN initially and a fast, simple classifier such as Naive Bayes can be trained to
classify the set U into positive and negative. If the set U is large, a random subset
of examples can be taken as the initial set RN.
2. The initial set RN and the positive set S are used to build a classifier. The classifier
can then be used to select a new, updated set RN and also more positive examples
from U to add to S. The assignment of data points can be done in a weighted EM-like
fashion or based on a threshold. This process is then repeated.
In our case, we first create a bootstrap set of negative samples for the classifier from
utterances that receive a very low score with the in-domain language model. Next, we
generate the feature-space representation of the training data by building language models
and computing BLEU scores from sampled subsets of the training material; this
comprises the positive set. We also augment the positive set with utterances from the
unlabeled set that have a high score with the topic model or a high BLEU score. If
ASR lattices are available, we also include utterances that score high with the ASR-based
measure. This helps in selecting data that improves ASR performance in addition
to improving the language model. We train an AdaBoost [SS00] classifier using the set
described above and classify all the remaining utterances in the acquired data. From the
classified utterances, we can select an updated negative set of utterances and repeat the above
process in an iterative fashion. The threshold for choosing the negative utterances is
selected using a set of spy utterances included in the unlabeled set. The spy utterances
are held-out utterances from the training set and should be assigned a positive label by
the classifier. We select the negative document threshold such that a very high percentage
of the spy set is given a positive label. The negative set can be made even stricter by
allowing a margin for the spy utterance set. Using SVMs [Joa99] instead of AdaBoost
as the classifier provided some improvement in performance. These utterances are then
assigned a weight proportional to the classifier's confidence in assigning a positive label
to the utterance, and a language model is built from the weighted utterances.
The language model built after data weighting is merged with the initial topic-based
language model and the background model using a linear combination with weights
determined on a held-out set.
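A compact, illustrative sketch of such a positive-only weighting loop with spy-based thresholding is given below. scikit-learn's LogisticRegression stands in for the AdaBoost/SVM classifiers used in our experiments (an assumed substitution), feature extraction is left as pre-computed feature matrices, and the bootstrap here takes the whole unlabeled pool as the initial negative set, one of the variants mentioned above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lpu_weights(pos_feats, unlabeled_feats, spy_frac=0.1, keep_spies=0.95, iters=3):
    """Positive-only learning sketch: returns a confidence weight in [0, 1] for each
    unlabeled utterance (rows of `unlabeled_feats`), given positive examples."""
    rng = np.random.default_rng(0)
    n_pos = len(pos_feats)
    spy_idx = rng.choice(n_pos, max(1, int(spy_frac * n_pos)), replace=False)
    spies = pos_feats[spy_idx]                      # spies hidden inside the unlabeled set
    pos = np.delete(pos_feats, spy_idx, axis=0)
    unl = np.vstack([unlabeled_feats, spies])

    neg = unl                                       # bootstrap: treat the pool as noisy negatives
    for _ in range(iters):
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        clf = LogisticRegression(max_iter=1000).fit(X, y)

        # Threshold chosen so that `keep_spies` of the spy utterances stay positive.
        spy_scores = clf.predict_proba(spies)[:, 1]
        thresh = np.quantile(spy_scores, 1.0 - keep_spies)
        unl_scores = clf.predict_proba(unl)[:, 1]
        neg = unl[unl_scores < thresh]              # refined reliable-negative set
        if len(neg) == 0:
            break

    return clf.predict_proba(unlabeled_feats)[:, 1]  # weights for LM fractional counts
```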
Figure 4.1 presents an overview of the entire system. The termination condition
for the iteration loop can be set in terms of a number of parameters like size of the
downloaded data, perplexity improvements, keyword coverage, etc.
Figure 4.1: System overview. The iterative loop generates queries using R.E and TFIDF; queries, downloads and converts data to text with preliminary filtering; rejects low-scoring documents and utterances; classifies documents; assigns weights and updates model data; builds LMs from the weighted data; and merges and updates the topic and background LMs. T: topic model (documents + LM); B: background model (documents + LM); R: reject model (documents + LM).
                            PPL     Vocabulary
With rejection model        92      45K words
Without rejection model     105     60K words
Table 4.1: Perplexity (PPL) and vocabulary size of the final system with and without the rejection model.
4.7 Experiments
4.7.1 Effect of rejection model
To study the effectiveness of the rejection model, we report results on final system perplexity
and vocabulary size with and without the rejection model. As can be seen in
Table 4.1, the rejection model plays an important role in filtering out noisy web data,
thus reducing the system perplexity and keeping spurious words (web links, advertisements,
etc.) out of the system vocabulary. The rejection model would typically remove
8% of the downloaded documents and around 6% of the utterances from the remaining
document set in its first iteration. In subsequent iterations, the rejected data size increases
and, on average, 10% of the documents and 13% of the utterances are rejected by the end of
the download process.
One striking point in recent results on semi-supervised learning for classifica-
tion ([Zhu05] presents a good survey) is the importance of balancing the unlabeled
data [Zhu05] [Joa03]. We believe that similar to the question of balance in semi-
supervised learning for classification, we need to address the question of distributional
similarity while selecting the appropriate sentences for building a language model from
noisy data. Rank-and-select filtering schemes select individual sentences on the merit of
their match to the in-domain model. As a result, even though individual sentences might
be good in-domain examples, the overall distribution of the selected set is imbalanced
with a bias towards the high probability regions of the distribution.
In this chapter we build on our work in [SGN06] and present an improved incremental
selection algorithm that compares the distribution of the selected set and the
in-domain examples by using a relative entropy (r.e) criterion at each step. Several
ranking schemes, which provide baselines for performance comparison, are reviewed
in section 4.8. The proposed algorithm is described in sections 4.9 and 4.10. A brief
description of the setup used to build the large corpus used in our experiments and
other implementation details is given in section 4.11. To validate our approach, we
present and compare the performance gains achieved by the proposed approach on two
Automatic Speech Recognition (ASR) systems. The first system is a medium-vocabulary
system for doctor-patient conversations in English [Nea03]. The second system is a
large-vocabulary transcription system for European parliamentary speeches [RSM+06].
Experimental results are provided in section 4.13. We conclude with a summary of this
work and directions for future research.
4.8 Rank and select methods for text filtering
In recent literature, the central idea behind text cleanup schemes for using generic corpora
to build language models has been to use a scoring function that measures the similarity
of each observed sentence in the corpus to the in-domain set and assigns an appropriate
score. The subsequent step is to set a threshold in terms of either the minimum
score or the number of top-scoring sentences. The threshold can usually be fixed using
a held-out set. In-domain model perplexity [NOH+05a, MK06] and variants involving
comparison to a generic language model [SGN05, WSY06] have been the dominant
choice of ranking functions. A modified version of the BLEU metric, which measures
sentence similarity in machine translation, has been proposed by Sarikaya [SGG05] as
a scoring function. Instead of explicit ranking and thresholding, it is also possible to
design a classifier to learn from positive and unlabeled examples (LPU) [LDL+03]. In
this system, a subset of the unlabeled set is selected as the negative or noise set. A binary
classifier is then trained using the in-domain set and the negative set. The classifier is
then used to label sentences in the corpus. The classifier can then be iteratively refined
by using a better and larger subset of the sentences selected in each iteration.
Ranking based selection has some inherent shortcomings. Rank ordering schemes
select sentences on individual merit. Since the merit is evaluated in terms of the match
to in-domain data, there is a natural bias towards selecting sentences that already have a
high probability in the in-domain text. Adapting models on such data has the tendency to
skew the distribution towards the center. For example, in our doctor-patient interaction
task, short sentences containing the word ‘okay’ such as ‘okay’,‘yes okay’, ‘okay okay’
were very frequent in the in-domain data. Perplexity and other similarity measures
assign a high score to all such examples, boosting the probability of these words even
further. In contrast other pertinent sentences seen rarely in the in-domain data such
as ‘Can you stand up please?’ receive a low rank and are more likely to be rejected.
Simulation results provided in [SGN06] show the skew towards high probability regions
clearly.
4.9 Balanced data selection using relative entropy
Motivated by recent results in semi supervised learning [Zhu05] [Joa03] that show
the importance of balanced selection, we proposed an iterative selection algo-
rithm [SGN06]. The essential idea behind the algorithm is to select a sentence if adding
it to the already selected set of sentences reduces the relative entropy with respect to
the in-domain data distribution. Based on experimental analysis of the performance of
this selection algorithm, we came up with some critical modifications. In this section
we present the new data selection algorithm and comment on how it compares with our
basic scheme.
4.9.1 The Core Algorithm
Let us denote the language model built from in-domain data by $P$. (The in-domain
model $P$ is usually represented by a linear interpolation of LMs built from the different
in-domain text corpora available for the task.) Let $W(i)$ denote the count in the selected
set of every word $i$ which exists in the vocabulary $V$ of the model $P$. Our selection
algorithm considers every sentence in the corpus sequentially. Suppose we are at the
$j$-th sentence $s_j$. We denote the count of word $i$ in $s_j$ by $m_{ij}$. Let
$n_j = \sum_i m_{ij}$ be the number of words in the sentence and $N = \sum_i W(i)$ be the
total number of words already selected. The skew divergence of the maximum likelihood
estimate of the language model of the selected sentences to the initial model $P$ is given by
$$ D(j) = \sum_{i \in V} P(i) \ln \frac{P(i)}{(1-\alpha)\, P(i) + \alpha\, W(i)/N} $$
The skew divergence [Lee99] is a smoothed version of the Kullback-Leibler (KL)
distance, with the $\alpha$ parameter denoting the smoothing influence of the $P$ model on
our current Maximum Likelihood (ML) model. It is equivalent to the KL distance for
$\alpha = 1$. Using the alpha skew divergence in place of the KL distance was useful in
improving the data selection, especially in the initial iterations where the counts $W(i)$
are low and the ML estimate $W(i)/N$ changes rapidly. For notational simplicity, we
denote $\beta = 1 - \alpha$.
The model parameters and the divergence remain unchanged if sentence $s_j$ is not selected.
If we select $s_j$, the updated divergence is given by
$$ D^{+}(j) = \sum_{i \in V} P(i) \ln \frac{P(i)}{\beta P(i) + \alpha (W(i) + m_{ij})/(N + n_j)} \qquad (4.1) $$
Direct computation of the divergence using the above expressions for every sentence in
a large corpus will have a very high computational cost, since $O(V)$ computations per
sentence are required. The number of sentences can be very large, easily on the order of
$10^8$ to $10^9$, so the total computation cost for even moderate vocabularies (around $10^5$)
would be large.
However, given the fact that $m_{ij}$ is sparse, we can split the summation $D^{+}(j)$ into
$$ D^{+}(j) = \sum_{i \in V} P(i) \ln P(i) - \sum_{i \in V} P(i) \ln\left(\beta P(i) + \frac{\alpha (W(i) + m_{ij})}{N + n_j}\right) $$
$$ = D(j) + \underbrace{\ln \frac{N + n_j}{N}}_{T1} - \underbrace{\sum_{i \in V,\, m_{ij} \neq 0} P(i) \ln \frac{\beta P(i)(N+n_j) + \alpha (W(i) + m_{ij})}{\beta P(i) N + \alpha W(i)}}_{T2} - \underbrace{\sum_{i \in V,\, m_{ij} = 0} P(i) \ln \frac{\alpha W(i) + \beta P(i)(N+n_j)}{\alpha W(i) + \beta P(i) N}}_{\approx 0} \qquad (4.2) $$
The approximation in the above equation is valid if the total number of words
selected is significantly larger than the number of words expected to be seen in a single
sentence. This issue can be addressed with proper initialization; as the data selection
process selects more data, $N$ will increase and the approximation will improve.
Intuitively, the term $T1$ measures the decrease in probability mass because of the addition
of $n_j$ words to the corpus, and the term $T2$ measures the increase in probability,
weighted by the in-domain distribution $P$, for words with non-zero $m_{ij}$. Using expression 4.2
makes it tractable to compute stepwise changes in divergence by reducing the required
computations to the number of words in the sentence instead of the vocabulary size (Equation
4.1).
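The following Python sketch illustrates the incremental selection criterion of Equations 4.1-4.2 for the unigram case: it maintains the running counts W(i) and N and accepts a sentence only when the T2 gain exceeds the T1 penalty. The variable names, the simple whitespace tokenization, and the min-count-of-1 initialization are illustrative assumptions, and the accumulation and randomization refinements described in the following subsections are omitted.

```python
import math
from collections import Counter, defaultdict

def select_sentences(corpus, p_in, alpha=0.99):
    """Greedy r.e.-based selection (unigram case, Equations 4.1-4.2).

    corpus: iterable of sentences (strings)
    p_in:   dict mapping word -> in-domain unigram probability P(i)
    """
    beta = 1.0 - alpha
    W = defaultdict(lambda: 1)          # min-count-of-1 initialization (Sec. 4.9.2)
    N = len(p_in)
    selected = []

    for sent in corpus:
        m = Counter(w for w in sent.split() if w in p_in)
        n_j = sum(m.values())
        if n_j == 0:
            continue
        t1 = math.log((N + n_j) / N)    # probability-mass penalty for adding n_j words
        t2 = sum(p_in[w] * math.log(
                 (beta * p_in[w] * (N + n_j) + alpha * (W[w] + c)) /
                 (beta * p_in[w] * N + alpha * W[w]))
                 for w, c in m.items())  # P-weighted gain on the words in the sentence
        if t2 > t1:                      # divergence decreases: keep the sentence
            selected.append(sent)
            for w, c in m.items():
                W[w] += c
            N += n_j
        # otherwise the counts stay unchanged (accumulation step omitted here)
    return selected
```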
4.9.2 Initialization and parameters
We experimented with two strategies for initializing the counts $W(i)$. In the first method,
we initialized $W(i)$ to 1 for all words in the vocabulary; for this initialization, the initial
word count $N$ is equal to the vocabulary size $V$. The second method was to initialize the
counts from a random subset (without replacement) of the adaptation data, with the size of
the random subset taken to be the same as the size of the in-domain set. In this case,
$N$ is the total number of words in the random subset and $W(i)$ is the count for word $i$
in the random subset. The strategy that worked best was to initialize $W(i)$ to 1 and then
add counts from a random subset, thus ensuring a min-count of 1.
The $\alpha$ parameter in equation 4.2 controls the smoothing influence of the in-domain
language model. The motivation for smoothing was to make the relative entropy
function behave smoothly during the initial part of data selection. For this purpose, a
high value of $\alpha$ in the range 0.95-1 was found to give good results. Choosing a
low value of $\alpha$ will in general tend to reduce the number of sentences selected (in
the extreme case $\alpha = 0$, no sentence will be selected).
4.9.3 Selection and randomization
The proposed algorithm is sequential and greedy in nature and can benefit from randomization
of the order in which it scans the corpus. We generate random permutations of
the sentence sequence and take the union of the sets of sentences selected in each permutation.
Sentences that have already been included in more than two permutations are
skipped during the selection process, thus forcing the selection of different sets of sentences.
After each permutation and data selection iteration, we build a language model
from the union of the data selected and compute perplexity on a held-out set. If the perplexity
is higher than the perplexity at the end of the previous iteration, no further permutations
are carried out.
A sentence is selected if its inclusion decreases the divergence ($T2 > T1$). If a sentence
is not selected, we push it into a separate set of accumulated rejected sentences and
add its number of words to an accumulation counter $n_{rej}$. We then
consider the inclusion of the entire accumulated set of rejected sentences. $T1$ for the
accumulated sentences can be calculated simply by using the above expression (substituting
$n_j$ with $n_{rej}$). To avoid calculating $T2$ of the entire accumulated set every time
we add a new sentence, we note that by Jensen's inequality $T2$ for the accumulated set
is upper bounded by the sum of the individual $T2$ values for the rejected sentences. If this upper
bound exceeds $T1$, we calculate the occurrence count $m_i$ for every word in the accumulated
set and evaluate its inclusion exactly. In our previous version of the algorithm no accumulation
was carried out [SGN06]. Accumulation helps the algorithm look at longer chunks of data
instead of individual sentences: by accumulating, we are able to include groups of sentences
that would not have been selected individually because the individual r.e changes were
negative. A sketch of this accumulation step is given below.
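A minimal sketch of the accumulation step, under the same assumptions as the previous snippet: rejected sentences are pooled, the sum of their individual T2 values serves as the Jensen upper bound, and the exact T2 of the pool is evaluated only when that bound exceeds T1 for the pooled word count.

```python
# Sketch of the accumulation step (same notation as the previous snippet):
# pool rejected sentences and test their joint inclusion only when the Jensen
# upper bound (sum of individual T2 values) exceeds T1 for the pooled counts.
import math
from collections import Counter

def try_accumulated(rejected, t2_sum, p_in, W, N, alpha=0.99):
    beta = 1.0 - alpha
    m_acc = Counter()
    for sent in rejected:
        m_acc.update(w for w in sent.split() if w in p_in)
    n_rej = sum(m_acc.values())
    if n_rej == 0:
        return False
    t1 = math.log((N + n_rej) / N)
    if t2_sum <= t1:                    # Jensen bound cannot beat T1: skip exact check
        return False
    t2 = sum(p_in[w] * math.log(
             (beta * p_in[w] * (N + n_rej) + alpha * (W[w] + c)) /
             (beta * p_in[w] * N + alpha * W[w]))
             for w, c in m_acc.items())
    return t2 > t1                      # include the whole accumulated set if True
```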
4.9.4 Further enhancements
Smoothing [CG96] can be applied after a fixed number of selected sentences to modify the
counts $W(i)$ of the selected-text model. We found experimentally that Good-Turing
smoothing after selection of every 500K words is sufficient for this task. The
impact of smoothing was not significant enough to warrant further exploration.
The expected r.e gain is higher at the beginning of the incremental selection process.
As the algorithm selects relevant data while traversing the adaptation corpus, the
improvements in r.e become smaller. A useful tactic that boosts the performance of
the algorithm is to complete one pass of selection and then do a second pass in which we
consider the set selected in the first pass as the adaptation corpus. For this second pass,
the sentences are traversed in the reverse order of their first-pass selection sequence, and
the set of sentences selected at the end of the second pass is taken as the final selection.
Order reversal is useful since initially the $W(i)$ are low, which implies
that the ratio $(W(i)+m_{ij})/W(i)$ would be higher. This is also one of the motivations for
moving to the skew divergence instead of the KL distance, where the counts ratio is smoothed
by the $P$ model.
Our development of the algorithm in this section focused on unigram models. In the
next section we discuss the extension to n-gram models, especially back-off n-gram models.
4.10 Generalization to N-gram back off models
If the probabilities of all possible n-gram sequences $w_1 \ldots w_n$ are estimated from data
based solely on occurrence counts of the sequences, then the algorithm presented in
section 4.9 can be modified to consider a sequence of $n$ words as a single entity. However,
such models do not exist in practice.
For any real-world language modeling task, the probabilities of all possible higher-order
(than unigram) n-grams can rarely be robustly estimated. If we are interested in a
complete n-gram model, we need to estimate $V^n$ parameters (where $V$ is the size of
the vocabulary). Since the training data required for robust estimation of this many
parameters is not available, the estimation is restricted to a smaller list of n-grams, and the
probability of other n-grams is estimated from the probability of the corresponding
(n-1)-grams using back-off weights [Sto98a].
Let $x_n \in S$ be the set of words for which we have sufficient counts for estimating
the n-gram $p(x_n|x_{n-1..1})$. Consider the n-gram probabilities $p(x_n|x_{n-1..1})$ which lie in
the complement set $S^c$. The probability of these n-grams can be computed in terms of
the back-off n-gram probability
$$ p(x_n|x_{n-1..1}) = b(x_{n-1..1})\, p(x_n|x_{n-1..2}) $$
where $b$ is called the back-off weight. Given that $\sum_{V} p(x_n|x_{n-1..1}) = 1$, it is easy to see
that
$$ b(x_{n-1..1}) = \frac{1 - \sum_{x \in S} p(x_n|x_{n-1..1})}{1 - \sum_{x \in S} p(x_n|x_{n-1..2})} $$
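As a small illustration of the back-off relation above, the sketch below computes the back-off weight for one history from explicitly stored n-gram probabilities. The dictionary-of-dictionaries LM layout and the toy probabilities are assumed purely for illustration, not the format used in our experiments.

```python
def backoff_weight(history, lm_higher, lm_lower):
    """Back-off weight b(history) for an n-gram model stored as nested dicts.

    lm_higher[history] maps word -> p(word | history)      (explicit n-grams)
    lm_lower[history'] maps word -> p(word | history')     (history' drops the oldest word)
    """
    explicit = lm_higher.get(history, {})
    shorter = history[1:]                        # drop w_1 to get the back-off history
    num = 1.0 - sum(explicit.values())           # mass left for unseen continuations
    den = 1.0 - sum(lm_lower.get(shorter, {}).get(w, 0.0) for w in explicit)
    return num / den if den > 0 else 0.0

# Toy bigram/unigram example (probabilities are illustrative only):
unigrams = {(): {"pain": 0.2, "chest": 0.1, "please": 0.1}}
bigrams = {("chest",): {"pain": 0.7}}
print(backoff_weight(("chest",), bigrams, unigrams))   # (1 - 0.7) / (1 - 0.2)
```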
We first describe a scheme for fast computation of r.e between two language mod-
els which have a back-off structure. We use the generalized derivation from [SRN04]
adapted to the case where the two language models have the same back-off structure. To
keep the presentation of the algorithm simple, we will use the entropy model described
in [SGN06]. This can be changed to the skew divergence model described for the uni-
gram case in section 4.9 by adjusting the counts to include in-domain model probability.
4.10.1 Fast Computation of Relative Entropy
We define the following symbols for the purpose of describing the r.e computation:
$x$ : the current word
$h$ : the history $w_1 \ldots w_{n-1}$
$h'$ : the back-off history $w_2 \ldots w_{n-1}$
$b^p_h$ : the back-off weight of the $p$ distribution for history $h$
$b^q_h$ : the back-off weight of the $q$ distribution for history $h$
$V$ : the vocabulary of the language model
A language model is a discrete distribution specifying the distribution of words in the
current state given the last $n-1$ words. The information-theoretic measure of relative
entropy rate can be used to compare language models. Given two language models
$p(x|h)$ and $q(x|h)$, the r.e (rate) at level $n$ is defined as
$$ D_n = \sum_{h \in H} p_h \sum_{x \in V} p(x|h) \ln \frac{p(x|h)}{q(x|h)} \qquad (4.3) $$
In the rest of this chapter, we will refer to the relative entropy rate as just relative entropy.
We now divide the set of histories ($H$) at level $n$ into $H_s$, containing all $h$ which exist as
$(n-1)$-grams and have a back-off weight $\neq 1$ in the $p$ or the $q$ distribution. The complement set
($H_s^c$) will contain histories with a back-off weight of 1; $H_s^c$ corresponds to histories not seen in
either language model. We define
$$ D_{xh} = \sum_{x \in V} p(x|h) \ln \frac{p(x|h)}{q(x|h)} \qquad (4.4) $$
Then the r.e at level $n$, $D_n$, can be expressed as
$$ D_n = \sum_{h \in H_s} p_h D_{xh} + \sum_{h \in H_s^c} p_h D_{xh} = \sum_{h \in H} p_h D_{xh'} + \sum_{h \in H_s} p_h D_{xh} - \sum_{h \in H_s} p_h D_{xh'} $$
(using the fact that $D_{xh} = D_{xh'}$ for $h \in H_s^c$, since such histories back off entirely in both LMs).
Since $h = w_1 . h'$, we can marginalize with respect to $w_1$:
$$ D_n = \sum_{w_1} \sum_{h'} p_h D_{xh'} + \sum_{h \in H_s} p_h D_{xh} - \sum_{h \in H_s} p_h D_{xh'} $$
$$ D_n = \sum_{h'} D_{xh'} \sum_{w_1} p_h + \sum_{h \in H_s} p_h \left( D_{xh} - D_{xh'} \right) $$
$$ D_n = \sum_{h'} D_{xh'}\, p_{h'} + \sum_{h \in H_s} p_h \left( D_{xh} - D_{xh'} \right) $$
$$ D_n = D_{n-1} + \sum_{h \in H_s} p_h \left( D_{xh} - D_{xh'} \right) $$
$D_{xh}$ can be split into four terms depending on whether $x|h$ is defined in the $p$ or the
$q$ distribution. We will assume that $p$ and $q$ have the same structure and denote by $S$ the set
of $x|h$ defined in the LMs. We use $S^c$ to denote the complement of the set $S$, i.e., those terms
for which $x|h$ is not defined in the LMs. When the two LMs have the same back-off
structure, we are left with two terms to consider. We call these terms $T_1$ and $T_4$, to use
the same notation as the derivation in [SRN04], which considers the general case where
the two LMs have different back-off structures.
$T_1$ : $p(x|h)$ exists, $q(x|h)$ exists ($x|h \in S$)
$T_4$ : $p(x|h)$ backs off, $q(x|h)$ backs off ($x|h \in S^c$)
$$ D_{xh} = T_1 + T_4 \qquad (4.5) $$
$$ T_1 = \sum_{x \in S} p(x|h) \ln \frac{p(x|h)}{q(x|h)} $$
$$ T_4 = \sum_{x \in S^c} b^p_h\, p(x|h') \ln \frac{b^p_h\, p(x|h')}{b^q_h\, q(x|h')} = b^p_h \ln\frac{b^p_h}{b^q_h}\left(1 - \sum_{x \in S} p(x|h')\right) + b^p_h D_{xh'} - b^p_h \sum_{x \in S} p(x|h') \ln \frac{p(x|h')}{q(x|h')} \qquad (4.6) $$
Thus we are able to express $D_{xh}$ in terms of the LM terms actually seen. Using $D_{xh}$
computed in this fashion in (4.5), we get a recursive formulation for the r.e at level $n$ using
only the LM densities actually present. We have used the tree-based representation of back-off
n-gram models to derive the efficient computation scheme described above. An alternative
approach for deriving the same relative entropy expressions would be to
consider n-gram back-off language models as a special case of probabilistic finite state
grammars (PFSGs). Carrasco [Car97] describes an approximate r.e computation scheme
for PFSGs which computes an improved estimate of the r.e in each iteration. The derivation
presented by Carrasco can be modified for the special case of n-gram models to give the
exact r.e in one iteration; the resulting update equations have the same structure as
the equations presented above.
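To make the recursion concrete, the sketch below computes the exact relative entropy between two back-off bigram models that share the same structure, following Equations 4.3-4.6 and the recursion $D_n = D_{n-1} + \sum_{h \in H_s} p_h (D_{xh} - D_{xh'})$. The dictionary-based LM representation and the use of the unigram probability as the history weight $p_h$ are simplifying assumptions for illustration, not the data structures used in our implementation.

```python
import math

def relative_entropy_bigram(p_uni, q_uni, p_bi, q_bi, bp, bq):
    """Exact r.e. D_2 between two back-off bigram LMs with identical structure.

    p_uni, q_uni: dict word -> unigram probability
    p_bi,  q_bi : dict (history, word) -> explicit bigram probability
    bp, bq      : dict history -> back-off weight in p and q
    """
    # Level-1 relative entropy (also D_{xh'} for every bigram history, since the
    # back-off context of any single-word history is the empty context).
    d1 = sum(p * math.log(p / q_uni[w]) for w, p in p_uni.items())

    # Group explicit bigrams by history (the set H_s of seen histories).
    by_hist = {}
    for (h, w) in p_bi:
        by_hist.setdefault(h, []).append(w)

    d2 = d1
    for h, words in by_hist.items():
        t1 = sum(p_bi[(h, w)] * math.log(p_bi[(h, w)] / q_bi[(h, w)]) for w in words)
        mass = sum(p_uni[w] for w in words)                    # sum_{x in S} p(x|h')
        cross = sum(p_uni[w] * math.log(p_uni[w] / q_uni[w]) for w in words)
        t4 = (bp[h] * math.log(bp[h] / bq[h]) * (1.0 - mass)
              + bp[h] * d1 - bp[h] * cross)                    # Equation 4.6
        d_xh = t1 + t4                                         # Equation 4.5
        d2 += p_uni[h] * (d_xh - d1)                           # recursion over H_s
    return d2
```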
4.10.2 Incremental updates on an n-gram model
We now consider the incremental change in r.e between an in-domain n-gram back-off
model $p$ and an ML model $q$ built from the data selected so far. We are interested in an
efficient way to compute the change in r.e when a sentence is added to the selected data
set, thus changing the model $q$. Let us define $w_{xh}$ as the count of the word $x$ seen with
context $h$ and $w_h$ as the count for context $h$ (so the ML estimate is $q(x|h) = w_{xh}/w_h$). We use
$m_{xh}$ and $m_h$ to denote the corresponding counts in the current sentence. We assume that the model $q$
has the same back-off structure as the model $p$; thus we can divide $D_{xh}$ into just $T_1$ and
$T_4$, depending on whether $x$ is seen with context $h$ in the model. We define $S$ as the set
of words $x$ seen with the context $h$ in the language model.
Constraining the update language model to have the same back-off structure as the in-domain
model, we get from equation 4.5 that $D_{xh} = T_1 + T_4$. Following the entropy model of
[SGN06] mentioned above, we track here the cross-entropy contribution of each term; the
$\sum_x p \ln p$ parts do not depend on $q$ and are unchanged when data is added, so they are
dropped. For $T_1$ we have
$$ T_1 = -\sum_{x \in S} p(x|h) \ln q(x|h) = -\sum_{x \in S} p(x|h) \ln \frac{w_{xh}}{w_h} $$
After the addition of a sentence we have, for $T_1$,
$$ T_1^{+} = -\sum_{x \in S} p(x|h) \ln \frac{w_{xh}+m_{xh}}{w_h+m_h} = T_1 + \ln\frac{w_h+m_h}{w_h} \sum_{x \in S} p(x|h) - \sum_{x \in S,\, m_{xh} \neq 0} p(x|h) \ln \frac{w_{xh}+m_{xh}}{w_{xh}} $$
$$ \Delta T_1 = \ln\frac{w_h+m_h}{w_h} \sum_{x \in S} p(x|h) - \sum_{x \in S,\, m_{xh} \neq 0} p(x|h) \ln \frac{w_{xh}+m_{xh}}{w_{xh}} $$
The term $\sum_{x \in S} p(x|h)$ can be precomputed, since it is not a function of the word
counts in the selected set.
We now consider $T_4$, which we further split into two parts:
$$ T_4 = -\sum_{x \in S^c} b^p_h\, p(x|h') \ln \left[ b^q_h\, q(x|h') \right] = \underbrace{-\, b^p_h \ln b^q_h \left( 1 - \sum_{x \in S} p(x|h') \right)}_{T_{4A}} \; + \; b^p_h \underbrace{\left( -\sum_{x \in S^c} p(x|h') \ln q(x|h') \right)}_{T_{4B}} $$
For $T_{4B}$, the change due to the addition of a sentence can be expressed as
$$ T_{4B} = -\sum_{x \in S^c} p(x|h') \ln q(x|h') = D_{xh'} + \sum_{x \in S} p(x|h') \ln q(x|h') = D_{xh'} + \sum_{x \in S} p(x|h') \ln \frac{w_{xh'}}{w_{h'}} $$
$$ T_{4B}^{+} = D_{xh'}^{+} + \sum_{x \in S} p(x|h') \ln \frac{w_{xh'}+m_{xh'}}{w_{h'}+m_{h'}} $$
$$ \Delta T_{4B} = \Delta D_{xh'} + \ln\frac{w_{h'}}{w_{h'}+m_{h'}} \sum_{x \in S} p(x|h') + \sum_{x \in S,\, m_{xh'} \neq 0} p(x|h') \ln \frac{w_{xh'}+m_{xh'}}{w_{xh'}} $$
$T_{4A}$ requires computation of the change in $b^q_h$:
$$ \Delta T_{4A} = -\, b^p_h \left( 1 - \sum_{x \in S} p(x|h') \right) \ln \frac{b^{q+}_h}{b^q_h} $$
where $b^{q+}_h$ denotes the back-off weight of $q$ after the sentence has been added.
As for the $T_1$ case, $\sum_{x \in S} p(x|h')$ can be precomputed. The expression for $b^q_h$ is given by
$$ b^q_h = \frac{1 - \sum_{x \in S} \frac{w_{xh}}{w_h}}{1 - \sum_{x \in S} \frac{w_{xh'}}{w_{h'}}} = \frac{w_{h'}}{w_h} \cdot \frac{w_h - \sum_{x \in S} w_{xh}}{w_{h'} - \sum_{x \in S} w_{xh'}} $$
If $m_{h'} = 0$ then $m_h = 0$, i.e., $\Delta T_{4B} = 0$; also $\Delta b^q_h = 0$, i.e., $\Delta T_{4A} = 0$; and in
addition $\Delta T_1 = 0$. Thus the incremental changes need to be computed only for terms with
$m_{h'} \neq 0$.
The number of computations required in the non-back-off version of the algorithm for
each sentence is linear in the number of words in the sentence. For back-off n-grams,
the total number of computations depends on how the sets $S$ and $S^c$ are split, which
depends both on the sentence and on the structure of the LM. As a rough estimate, the
number of computations is of the order $nwords \times n$, where $n$ is the order of the LM and
$nwords$ is the number of words in the sentence.
4.10.3 Relation to LM pruning using r.e
Stolcke [Sto98b] introduced relative entropy as a measure for pruning n-grams. The
central idea is to consider whether the r.e of the distribution with history $h$ is significantly
different from the back-off distribution with history $h'$. This can be done efficiently in a
manner similar to the fast relative entropy computation described in section 4.10.1. In terms of
the computation algorithm, the incremental r.e change with data addition for an n-gram back-off
model differs significantly from the computation required for determining the r.e loss
in pruning n-grams. Data selection based on r.e can be seen as serving a complementary
goal to n-gram model pruning: the data selection algorithm aims at finding a good
subset of data for building language models, while the goal of r.e pruning is to find a
sparser n-gram model that matches the initial model as closely as possible in terms of
distribution.
In the next section we describe the text corpus used in our experiments and provide
some implementation details of the proposed algorithm.
4.11 Implementation details and data collection
We crawled the web to build the large text corpora used in our experiments. Queries
for downloading relevant data from the web were generated using a technique similar
to [NOH+05b, SGN05]. An in-domain language model was first generated using the
training material and compared to a generic background model of English text [SGN05]
to identify the terms which would be useful for querying the web. For every term $h$ in the
language model, we calculated the weighted ratio $p(h) \ln \frac{p(h)}{q(h)}$, where $p$ is the in-domain
model and $q$ is the background model. The top-scoring unigrams, bigrams, and trigrams
were selected as query terms. Starting from queries containing just trigrams, we move
to queries containing bigrams and then just unigrams. The set of URLs returned by our
search is downloaded and non-text files are deleted. HTML files are converted to text
by stripping off tags. The converted text typically does not have well defined sentence
boundaries, so we piped the text through a maximum entropy based sentence boundary
detector [Rat96] to insert better sentence boundary marks. Sentences and documents
with high OOV rates were rejected as noise to keep the converted text clean. As a
pre-filtering step, we computed the perplexity of the downloaded documents with the
in-domain model and rejected text that had a very high perplexity [SGN05]. The goal
of the pre-filtering step is to remove artifacts such as advertisements or spurious text.
Most of these artifacts show up clearly as a very high perplexity cluster compared to the
rest of the data; thus, by using a perplexity histogram we could easily fix a perplexity
threshold for pre-filtering. Data was mined separately for the two ASR tasks, which we
evaluate in the next section. In both cases, the initial size of the data downloaded from
the web was around 750M words. After filtering and normalization the downloaded data
amounted to about 500M words.
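A small sketch of the query-term scoring described above, ranking n-grams by the weighted ratio $p(h)\ln\frac{p(h)}{q(h)}$ between in-domain and background n-gram relative frequencies. The count-based probability estimates, the floor for unseen background n-grams, and the top-k cutoff are simplifying assumptions for illustration.

```python
import math
from collections import Counter

def ngram_probs(sentences, n):
    counts = Counter(tuple(toks[i:i + n])
                     for s in sentences
                     for toks in [s.split()]
                     for i in range(len(toks) - n + 1))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.items()}

def query_terms(in_domain, background, n, k=20, floor=1e-7):
    """Top-k n-grams by p(h) * ln(p(h) / q(h)), p = in-domain, q = background."""
    p, q = ngram_probs(in_domain, n), ngram_probs(background, n)
    score = {g: pg * math.log(pg / q.get(g, floor)) for g, pg in p.items()}
    return sorted(score, key=score.get, reverse=True)[:k]

# Usage: issue trigram queries first, then bigrams, then unigrams.
# trigram_queries = query_terms(in_domain_sents, background_sents, 3)
```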
For the data selection algorithm we need an in-domain language model against
which we iteratively compare the r.e of the selected set (Section 4.10). The generalized
n-gram back-off algorithm used with the in-domain model is significantly slower
than the unigram implementation because of the need to update lower-order back-off
weights. Even though this would be impractical for most other language modeling tasks, both
simulation experiments [SGN06] and experiments on web data indicate that bigram and
unigram language models perform well for data selection using the r.e minimization
algorithm; no performance gains were observed when using trigram models for selection.
For this reason, the experimental results presented here use bigram models for selection.
Note that the order of the LM used for data selection does not put any restrictions on the
order of the language models used for generating query terms or on the adapted language
model we build from the selected data.
4.12 Some intuition from simulations
A measure of the relevance of the selected adaptation data is the Kullback-Leibler distance
between the estimated n-gram model and the true n-gram distribution for that
domain. (We restrict ourselves to the n-gram approach for modeling the distribution of
word sequences.) By comparing the two distributions we can identify the convergence
properties of the data selection methods and also explain how the selection affects the estimated
probability mass distribution compared to the true distribution.
To compare the n-gram language models, we developed a fast r.e computation
scheme for tree-based n-gram models; a description of this computation scheme is
given in Appendix A. The fast scheme makes it possible to compute the r.e between two
LMs in $O(L)$ computations, where $L$ is the number of language model terms actually
present in the two LMs, compared to the $V^n$ computations required in a direct implementation.
This reduces the computation effort by a factor of $10^4$ or more.
We cannot use the KL distance to judge relevance with real world data, since the true
distribution for real world data is unknown. However, for analysis purposes, we can
substitute the true distribution with a known distribution and then sample from it to get
examples of in-domain text (with reference to the known distribution). We can then mix
the known distribution with a noise model to simulate a generic text corpus, such as text
acquired from the web.
For simulation purposes, the language model used to generate the equivalent of in-domain
text becomes the true distribution $P_{true}$. To complete the analogy with our data
selection problem, text samples $D_{ind}$ generated from $P_{true}$ serve as the equivalent of
in-domain data. The equivalent of the large generic corpus, $D_{generic}$, can be generated
from the noisy model. The simulation equivalent of the in-domain language model is
the language model $P_{ind}$ estimated from the clean data $D_{ind}$.
We use a $P_{true}$ with a vocabulary of 3K words, estimated from a real-world medical
dialog task. The noise model is a language model with a vocabulary of 20K words, estimated
from 1M words collected from webpages identified by Google using medical-domain
queries. We used a $D_{ind}$ set (generated from $P_{true}$) of 200K words and a generic set
$D_{generic}$ of size 20M words, or 3.8M sentences. The vocabulary sizes were kept small so
that text samples could be generated efficiently from the language models. The goal of the simulations
Sentences selected    Rand    PPLSel    ItSel
200K                  32.2    9.1       16.1
400K                  34.2    13.3      24.3
800K                  31.1    22        27.3
1200K                 33.7    28        29.5
2400K                 32.9    31        31
Table 4.2: Perplexity of the selected data with $P_{ind}$ for varying numbers of selected sentences.
is solely to gain insight into the differences between iterative selection and rank-and-select
schemes. Results on real-world data are presented in the next section, where the
vocabulary size is more realistic.
4.12.1 Simulation results
The first question that we address is whether it is useful to select all data from the generic
corpus $D_{generic}$ that scores high in perplexity terms with the in-domain model $P_{ind}$. In
Table 4.2, we compare the perplexity of the data selected by different methods: Rand
selects $n$ sentences randomly, PPLSel selects the top $n$ sentences ranked by perplexity,
and ItSel is the proposed iterative method. We build language models with the selected
data and merge them with $P_{ind}$ using weights determined from the held-out set. Table 4.3
shows the relative entropy of the adapted models with respect to the true distribution. Our goal is
to select the adaptation data cleverly so as to reduce the r.e between the adapted model and the
true distribution $P_{true}$. It can be seen that selecting the data with lowest perplexity does not
lead to a better language model. The perplexity of the data selected by ItSel, which is the
most beneficial in improving the language model, lies between the perplexity of random
selection and PPLSel. It should be noted that, by design, as the number of selected
sentences approaches the size of the generic corpus, the selected data for all methods
will be similar (identical if all the data is selected). Thus all methods give essentially the
same performance when selecting high percentages of data.
Sentences selected    Rand    PPLSel    ItSel
200K                  12.1    15.2      9.2
400K                  11.3    13.2      8.3
800K                  10.5    11.1      10.1
1200K                 9.7     9.5       9.4
2400K                 9.3     8.9       8.9
Table 4.3: Relative entropy of models built from the selected data with respect to the reference $P_{true}$ distribution, for varying numbers of sentences.
Next we verify our hypothesis that the use of ranking methods skews the distribution by
focusing solely on the high-probability regions. We selected the 10% of words which
have the highest probability in $P_{ind}$ and the 10% of words with the smallest probability. Then
we computed the partial sums
$$ H^{high}_{bias} = \sum_{w \in top} P_{true}(w) \ln \frac{P_{true}(w)}{P_{sel}(w)} \qquad H^{low}_{bias} = \sum_{w \in bottom} P_{true}(w) \ln \frac{P_{true}(w)}{P_{sel}(w)} $$
for the language models $P_{sel}$ estimated from the selected data. Note that the summation
involves the true density $P_{true}$. If the selected data is imbalanced with respect to the
true distribution $P_{true}$ and unnecessarily biased towards the high-probability regions of
$P_{ind}$, the separation of the partial sums
$$ H_{imbalance} = H^{high}_{bias} - H^{low}_{bias} $$
will be large. However, if the bias towards the high-probability regions of $P_{ind}$ is justified,
$H_{imbalance}$ will be low.
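A short sketch of this imbalance diagnostic, computing the two partial KL sums over the top and bottom 10% of words ranked by their probability under the in-domain model; the dictionary-based distributions and the probability floor for words unseen in the selected-data model are illustrative assumptions.

```python
import math

def h_imbalance(p_true, p_sel, p_ind, frac=0.10, floor=1e-12):
    """H_imbalance = H_bias_high - H_bias_low over the top/bottom `frac` of words
    ranked by their probability under the in-domain model p_ind."""
    ranked = sorted(p_ind, key=p_ind.get, reverse=True)
    k = max(1, int(frac * len(ranked)))
    top, bottom = ranked[:k], ranked[-k:]

    def partial_kl(words):
        return sum(p_true[w] * math.log(p_true[w] / max(p_sel.get(w, 0.0), floor))
                   for w in words if w in p_true)

    return partial_kl(top) - partial_kl(bottom)
```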
In Figure 4.2, we plot $H_{imbalance}$ against the number of selected sentences for
PPLSel and ItSel. The high $H_{imbalance}$ for PPLSel, especially when the number of sentences
selected is low, confirms our hypothesis that selection using perplexity ranking skews
the distribution by focusing solely on the high-probability regions.
Figure 4.2: Relative entropy imbalance with the number of selected sentences (x-axis: number of sentences, 500K to 2.5M; y-axis: relative entropy imbalance; curves: ItSel and PplSel).
The simulation results provide an easy and intuitive way to understand how the proposed
algorithm scores over rank-and-select schemes. Perplexity and WER results on
real-world tasks are hard to interpret because the underlying distributions are unknown;
for example, our claim of skew towards high-probability regions is hard to justify by
looking at the perplexity of test sets. However, we do need to ensure that our algorithm
indeed improves over the baseline methods on real-world applications. We proceed to
do so in the next section.
4.13 Experiments
To provide a more general picture of the performance of our data selection algorithm
we provide experimental results on two systems that differ significantly in their system
design and the nature of the ASR task that they address.
The first set of experiments was conducted on the English ASR component of the
Transonics [Nea03] English-Persian speech-to-speech translation system for medical-domain
doctor-patient dialogs. The other ASR system that we use for performance evaluation
is the 2006 IBM TC-STAR speech recognition system. The TC-STAR (Technology
and Corpora for Speech to Speech Translation) project, financed by the European Commission
within the Sixth Framework Programme (Project No. FP6-506738), is a long-term effort to
advance research in speech-to-speech translation technologies. The 2006 evaluation was open to
external participants as well as the TC-STAR partner sites [eld].
We first present results on the Transonics task. The Transonics system was also used
to provide comparisons against a large class of rank-and-select schemes described in
section 4.8. We will then provide results on TC-STAR. As stated in section 4.11, bigram
models from in-domain data were used for data selection. All language models used for
decoding and perplexity measurements are trigram models estimated using Kneser-Ney
smoothing.
4.13.1 Medium vocabulary ASR experiments on Transonics
The English ASR component of the Transonics speech-to-speech translation system is a
medium-vocabulary speech recognizer built around the SONIC [Pel01] engine. We
had 50K in-domain sentences (about 200K words) for this task to train the language model.
A generic conversational-speech language model was built from the WSJ, Fisher, and
SWB corpora. All language models built from the web collection and in-domain data
were interpolated with this language model, with the interpolation weight determined
on the held-out set. The test set for perplexity evaluations consists of 5000 sentences
(35K words) and the held-out set had 2000 sentences (12K words). The test set for
word error rate evaluation consists of 520 utterances. We show results with varying
amounts of in-domain training material, ranging from 10K to 40K sentences.
Data selection was carried out using the baseline methods and the proposed r.e based
method for the different training set sizes. The language models used for perplexity
ranking and r.e based data selection were also built separately for each training set size.
             10K     20K     40K
No Web       60.0    49.6    39.7
AllWeb       57.1    48.1    38.2
PPL          56.1    48.1    38.2
BLEU         56.3    48.2    38.3
LPU          56.3    48.2    38.3
Proposed     53.7    46.6    38.0
Table 4.4: Perplexity of test data with the web-adapted model for different numbers of initial in-domain sentences. Corpus size = 150M words.
We first compare our proposed algorithm against baselines based on perplexity
(PPL), BLEU, and LPU classification (Section 4.8) in terms of test set perplexity. The
thresholds for data selection using the ranking-based baselines were fixed using the held-out
set perplexity. LPU and BLEU based rank-and-select schemes are computationally
intensive; our results comparing against these two systems are thus on a smaller 150M-word
web collection. As the comparison shows, the proposed algorithm outperforms
the rank-and-select schemes with just 10% of the data. Table 4.4 shows the test set perplexity
with different amounts of initial in-domain data, and Table 4.5 shows the number
of sentences selected for the best perplexity on the held-out set by the above schemes.
No Web refers to the language model built solely from in-domain data; AllWeb refers
to the case where the entire 150M-word web collection was used. To get a more complete
picture of the relationship between performance and the amount of data selected, we have
also conducted experiments using both simulations [SGN06] and web data where we
restricted the number of sentences selected by the baselines to be the same as the number
of sentences selected by the proposed method. In all such experiments, the performance
of the baselines was significantly below the proposed method; in fact, for many
cases selecting a random subset of data was found to be more efficient than the baseline
methods [SGN06].
The WER results are shown in Table 4.6. The average reduction in WER is close
to 3% (relative). It can be seen that adding data from the web without proper filtering
can actually harm the performance of the speech recognition system when the initial
in-domain data size increases. This can be attributed to the large increase in vocabulary
size which increases the acoustic decoder perplexity.
To test how the performance of our algorithm scales with increasing data, we con-
ducted experiments on a larger data set of 850M words that consisted of the medical
domain collection of 320M words collected from the web and a 525M word collection
published by the University of Washington for the Fisher corpus [BOS03, OS05]. We
provide comparisons with only the perplexity based rank-and-select system, as LPU and
the BLEU based system are hard to scale to large text collections. Also, our results on
the 150M set suggest that the performance of these systems is comparable to perplexity
based selection. The Fisher adaptation data was included to reflect the case where we
attempt to benefit from a large corpus from some other task.
The results on PPL and WER (Table 4.7, Table 4.8) follow the same trend as in the
150M data set. The importance of proper data selection is highlighted by the fact that
there was little to no improvement in the unfiltered case (AllWeb) by adding the extra
data, whereas there were consistent improvements when the proposed iterative selection
algorithm was used. Perplexity reduction in relative terms was 7%, 5% and 4% for the
10K, 20K and 40K in-domain set, respectively. Corresponding WER improvements in
             10K    20K    40K
PPL          93     92     91
BLEU         91     90     89
LPU          90     88     87
Proposed     11     10     11
Table 4.5: Percentage of selected sentences for different numbers of initial in-domain sentences. Corpus size = 150M words.
             10K     20K     40K
No Web       19.8    18.9    17.9
AllWeb       19.5    19.1    17.9
PPL          19.2    18.8    17.9
BLEU         19.3    18.8    17.9
LPU          19.2    18.8    17.8
Proposed     18.1    17.9    17.1
Table 4.6: Word Error Rate (WER) with web-adapted models for different numbers of initial sentences. Corpus size = 150M words.
             10K     20K     40K
No Web       60.0    49.6    39.7
AllWeb       56.9    47.7    38.2
PPL          55.8    47.4    38.2
Proposed     52.1    45.2    36.8
Table 4.7: Perplexity of test data with the web-adapted model for different numbers of initial sentences. Corpus size = 850M words.
relative terms were 6%, 4% and 4%. It is interesting to note that for our data selection
scheme the perplexity improvements correlate surprisingly well with the WER improvements.
This is in contrast to our TC-STAR LVCSR results, where WER did not correlate
well with perplexity. We plan to investigate this observation further and see whether it holds
for limited-domain tasks in general or is peculiar to this task.
Table 4.9 shows the percentage of data selected using the proposed scheme and PPL
based rank-and-select. We are able to achieve around a factor of 9 reduction in the
selected data size. This translates to (Table 4.10) a factor of 7 reduction in the number
of estimated language model parameters (bigram+trigram) and a 30% reduction in the
vocabulary size.
4.13.2 Large vocabulary experiments on TC-STAR
To contrast with the medium-vocabulary, single-decoder Transonics system, we present
results on the IBM LVCSR system used for transcription of European Parliamentary
             10K     20K     40K
No Web       19.8    18.9    17.9
AllWeb       19.3    19.1    17.9
PPL          19.1    18.7    17.9
Proposed     17.8    17.6    17.0
Table 4.8: Word Error Rate (WER) with web-adapted models for different numbers of initial sentences. Corpus size = 850M words.
Plenary Speech (EPPS) [RSM+06] as part of the TC-STAR project. The IBM TC-STAR speech recognition system is organized around an architecture that combines multiple systems through cross-adaptation across different segmentation schemes and through ROVER combination of the outputs from an ensemble of ASR systems. Training of acoustic models used EPPS material only. Each ASR system has approximately 6000 tied states and 150K Gaussians. The acoustic front-end employs 40-dimensional perceptual linear prediction (PLP) features obtained from an LDA projection that are mean and variance normalized on a per-utterance basis. All systems employ Vocal Tract Length Normalization (VTLN), Speaker Adaptive Training (SAT) using features in a linearly transformed feature space resulting from applying fMLLR transforms, and are discriminatively trained on features obtained from a feature-space minimum phone error (fMPE) transformation (MPE models). A detailed description is provided in [RSM+06].
We describe here the baseline system used for this task focusing on the language
modeling components relevant to our experiments. All decoding passes use a 4-gram
modified Kneser-Ney model that was built using the SRI LM toolkit.
            10K     20K     40K
PPL        88.5%   87.8%   87.3%
Proposed    9.3%   10.0%    8.7%
Table 4.9: Percentage of selected sentences for different numbers of initial sentences. Corpus size = 850M.
           unigram  bigram  trigram
AllWeb      105K    25.3M   36.2M
PPL          99K    22.1M   32.4M
Proposed     70K     3.2M    8.2M
Table 4.10: Number of estimated n-grams with web-adapted models for the case with 40K initial in-domain sentences. Corpus size = 850M.
One model was trained on the training transcripts (LM1) and another on the text corpus based on the Final Text Editions (LM2). A perplexity-minimizing mixing factor was computed using the Dev06 reference text. The final interpolated language model used in the construction of the static decoding graph contains 5.5M n-grams.
LM3, containing 80M n-grams, was trained on 525M words of web data released by the University of Washington, and LM4, containing 39M n-grams, was built on 204M words of Broadcast News. The interpolation weights assigned to the out-of-domain language models LM3 and LM4 are relatively low, 0.12 and 0.13, compared to 0.21 and 0.54 for LM1 and LM2. The final interpolated LM contains 130M n-grams. The 59K recognition lexicon was obtained by taking all words occurring at least twice in the text corpus and at least once in the acoustic training transcripts. The OOV rate on the Dev06 test set was slightly under 0.4%.
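The perplexity-minimizing mixing factors mentioned above can be found with a simple EM procedure over held-out text. The following Python sketch is illustrative only (it assumes the per-word probabilities assigned by each component LM to the held-out text have already been computed, e.g., with the SRILM tools); it is not the exact procedure used to build the evaluation systems.

import math

def em_interpolation_weights(component_probs, iters=50):
    """Estimate linear interpolation weights that minimize perplexity on
    held-out text. component_probs[m][t] is the probability that component
    LM m assigns to the t-th held-out word (assumed precomputed)."""
    M, T = len(component_probs), len(component_probs[0])
    weights = [1.0 / M] * M
    for _ in range(iters):
        expected = [0.0] * M
        for t in range(T):
            mix = sum(weights[m] * component_probs[m][t] for m in range(M))
            for m in range(M):
                expected[m] += weights[m] * component_probs[m][t] / mix
        weights = [e / T for e in expected]
    log_ppl = -sum(math.log(sum(weights[m] * component_probs[m][t]
                                for m in range(M))) for t in range(T)) / T
    return weights, math.exp(log_ppl)

# Toy usage: two LMs scoring three held-out words.
w, ppl = em_interpolation_weights([[0.010, 0.002, 0.030],
                                   [0.004, 0.010, 0.020]])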
In the architecture described in [RSM+06], the best baseline system was obtained by rescoring the lattices produced after MLLR (speaker adaptation) with an out-of-domain language model (public condition). This is the only step that uses non-EPPS training material, i.e., UW web data and BN data.
The WER on the Dev06 and Eval06 test sets after LM rescoring with the out-of-domain LM, using a single system prior to ROVER, was 11.0% and 8.9% respectively. The best performance after ROVER across multiple systems was 10.4% and 8.3%.
We present results on the TC-STAR Dev06 and Eval06 test sets. The 2006 development set (Dev06), on which the acoustic and language models were optimized,
Fraction of data selected (words)    All (525M)   1/11 (45M)   1/7 (71M)   1/3 (170M)
Perplexity (Dev)                         94.5         94.5         91.3        88.7
Interpolation weight                     0.32         0.29         0.45        0.49
WER (Eval) %                              8.4          8.6          8.5         8.5
WER (Dev) %                              10.7         10.9         10.8        10.6
Table 4.11: Performance comparison of the language models built with different fractions of data being selected, for the Dev06 and Eval06 test sets. The baseline had 525M words of Fisher web data (U. Washington) and 204M words of Broadcast News (BN) as out-of-domain data. The baseline WER was 8.9% on Eval06 and 11% on Dev06.
consists of approximately 3 hours of data from 42 speakers (mostly non-native speakers). The 2006 English evaluation set (Eval06) comprises 3 hours of data from 41 speakers. The Dev06 and Eval06 sets cover parliamentary sessions between June and Sept. 2005. Both sets contain approximately 30K words. The text for Dev06 was used to fix the LM weights for linear interpolation.
For the baseline system the in-domain language model was built from the EPPS acoustic and final text transcriptions and was interpolated with out-of-domain LMs comprising the U. Washington 525M word Fisher web corpus and 204M words from broadcast news (Section 4.13.1). The WER was 8.9% on Eval06 and 11% on Dev06. We provide performance comparisons against this baseline by replacing the two baseline out-of-domain LMs with LMs built from increasing fractions of the text selected by our data selection method. As can be seen from Table 4.11, incorporating the 525M words mined by our crawling scheme boosted the system performance to 8.4% (6% relative) over the baseline. The effectiveness of the data selection scheme is demonstrated by the fact that we get almost the same WER gain (8.5 vs 8.4) and slightly better perplexity by using 1/7th of the data, i.e., 70M words. With 1/3rd of the data, we equal the WER of the full web collection and outperform it significantly in perplexity. Combining the LM built from the complete data with
broadcast news decreased the WER to 10.6% and 8.3% on Dev06 and Eval06 respectively. In comparison, the LM built from 1/3rd of the data, when interpolated with broadcast news, gave 10.3% and 8.3% on Dev06 and Eval06 respectively, thus outperforming the LM built from the entire data and BN. The WER after ROVER was 9.8% on Dev06 (10.4% ROVER baseline) and 7.9% on Eval06 (8.3% ROVER baseline). The WER improvement with 1/3rd of the data selected is statistically significant compared to using all of the data.
4.14 Discussion and analysis of results
It is interesting to compare the data selection results between the Transonics and TC-STAR experiments. For Transonics, we used a web corpus of 320M words (excluding Fisher data). The data selection algorithm was able to achieve better performance than the out-of-domain LM built from the entire 320M word corpus, while selecting just 1/10th of the data. In contrast, the IBM TC-STAR system requires considerably more data. However, if we consider the ratio of the selected data size to the in-domain training data size, the results are much more comparable. This is expected, since with good in-domain training data the dependency on out-of-domain data is lower. In addition, the Transonics ASR system has a higher baseline WER than the TC-STAR system. Another interesting observation is that for Transonics the perplexity improvements translated consistently into WER improvements. This was not true for TC-STAR.
More insights into these results can be gained by comparison with the performance of the ROVER-based TC-STAR system. Firstly, the 525M word web collection generated using the scheme presented here gave an improvement of 0.5% compared to the baseline, which used two out-of-domain LMs (over 700M words). The data selection method is able to achieve the same improvement with just 70M words. Secondly, in the
final stage of the TC-STAR decoding architecture, the baseline ASR output is combined with the output of three other systems using ROVER to achieve a reduction in WER from 8.9% to 8.3% and from 11.0% to 10.4% on the Eval06 and Dev06 test sets respectively. We are able to achieve close to that performance with just one system. Thirdly, the LM built from 1/3rd of the data interpolated with the BN-based LM (LM4 in the baseline) gives the same performance as the combination of three ASR systems under ROVER.
Chapter 5
Boosting Coverage
A typical speech recognition application can be split into various domains or topics
which differ in their style as well as content. For example, a speech recognition system for broadcast news has to address sports, movies, music, current events and other domains. In this chapter we will look into techniques that help in identifying data for all the domains and topics that are implicitly covered by the application. The first step in this process is to identify the topics in the in-domain data. In the general case where the in-domain data is not manually labeled with topic information, we start by clustering the in-domain training data into topics using an unsupervised text clustering scheme.
Traditional text clustering methods based on Latent Semantic Analysis (LSA) reduce a document to a vector in a feature space based on weighted word counts and then use SVD to identify the prominent vectors, which serve as topic centers. The central assumption is that a Euclidean distance metric will cluster similar documents together. However, a more natural model for documents is a multinomial distribution. Recently proposed methods such as Latent Dirichlet Allocation (LDA) represent documents in terms of generative multinomial factors and have been shown in various studies to outperform Euclidean distance based measures.
5.1 Latent Dirichlet Allocation
A generative model for documents is based on simple probabilistic sampling rules that
describe how words in documents might be generated on the basis of latent (random)
variables. When fitting a generative model, the goal is to find the best set of latent
variables that can explain the observed data (i.e., observed words in documents), assum-
ing that the model actually generated the data. Different documents can be produced by
picking words from a topic depending on the weight given to the topic. The way that this
generative topic model is defined, there is no notion of mutual exclusivity that restricts
words to be part of one topic only. This allows topic models to capture polysemy, where
the same word has multiple meanings. For example, both the money and river topic can
give high probability to the word BANK, which is sensible given the polysemous nature
of the word.
The most commonly used generative process is the bag-of-words model, which does not make any assumptions about the order of words as they appear in documents. The only information relevant to the model is the number of times each word is produced. This is known as the bag-of-words assumption and is common to many statistical models of language, including LSA. Of course, word-order information might contain important cues to the content of a document, and this information is not utilized by the model. Griffiths et al. [GSBT] present an extension of the topic model that is sensitive to word order and automatically learns the syntactic as well as semantic factors that guide word choice.
The problem of statistical inference for a topic-based model is that, given the observed words in a set of documents, we would like to know what topic model is most likely to have generated the data. This involves inferring the probability distribution over words associated with each topic, the distribution over topics for each document, and, often, the topic responsible for generating each word.
A variety of probabilistic topic models have been used to analyze the content of documents and the meaning of words [DMB]. These models all use the same fundamental idea that a document is a mixture of topics, but make slightly different statistical assumptions. To introduce notation, we will write P(z) for the distribution over topics z
in a particular document and P(w \mid z) for the probability distribution over words w given topic z. Each word w_i in a document (where the index refers to the i-th word token) is generated by first sampling a topic from the topic distribution, then choosing a word from the topic-word distribution. We write P(z_i = j) for the probability that the j-th topic was sampled for the i-th word token and P(w_i \mid z_i = j) for the probability of word w_i under topic j. The model specifies the following distribution over words within a document:

P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j) P(z_i = j)

where T is the number of topics. To simplify notation, let \phi^{(j)} = P(w \mid z = j) refer to the multinomial distribution over words for topic j and \theta^{(d)} = P(z) refer to the multinomial distribution over topics for document d. Furthermore, assume that the text collection consists of D documents and each document d consists of N_d word tokens. Let N be the total number of word tokens (i.e., N = \sum_d N_d). The parameters \phi and \theta indicate which words are important for which topic and which topics are important for a particular document, respectively.
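As a small worked example of the mixture above, the snippet below uses made-up values of \phi and \theta for a two-topic model (the numbers are purely hypothetical) to compute P(w) for the ambiguous word BANK and to sample a short document from the generative process.

import random

# Hypothetical topic-word distributions phi^(j) and a document-topic
# distribution theta^(d) for a T=2 topic model over a tiny vocabulary.
phi = [
    {"bank": 0.4, "money": 0.4, "river": 0.1, "stream": 0.1},  # "money" topic
    {"bank": 0.3, "money": 0.1, "river": 0.3, "stream": 0.3},  # "river" topic
]
theta = [0.7, 0.3]  # P(z = j) for this document

# P(w) = sum_j P(w | z = j) P(z = j); "bank" gets mass from both topics.
p_bank = sum(theta[j] * phi[j]["bank"] for j in range(len(theta)))

def sample_document(n_words):
    """Generate words by first sampling a topic, then a word from that topic."""
    words = []
    for _ in range(n_words):
        j = random.choices(range(len(theta)), weights=theta)[0]
        w = random.choices(list(phi[j]), weights=list(phi[j].values()))[0]
        words.append(w)
    return words

print(p_bank, sample_document(5))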
Hofmann [Hofa, Hofb] introduced the probabilistic topic approach to document modeling in his Probabilistic Latent Semantic Indexing method (pLSI; also known as the aspect model). The pLSI model does not make any assumptions about how the mixture weights are generated, making it difficult to test the generalizability of the model to new documents. Blei et al. [DMB] extended this model by introducing a Dirichlet prior on \theta^{(d)}, calling the resulting generative model Latent Dirichlet Allocation (LDA). As a conjugate prior for the multinomial, the Dirichlet distribution is a convenient choice, simplifying the problem of statistical inference. More details can be found in [DMB].
5.2 Clustering using LDA
The LDA algorithm was primarily designed to cluster text documents. The first step in clustering in-domain data is to organize it into meaningful chunks which can be treated as virtual documents. A natural unit for clustering is to treat each speaker turn as a document. Clustering at the sentence level is harder because of sparsity issues. In the absence of speaker turns we can use other meta-information available in the training data or use a fixed number of sentences as a unit. In the second step we cluster these virtual documents in the training data using LDA. Each cluster can then be treated as a separate topic for which we can query and acquire data. The acquired data can then be re-clustered using LDA, or the original clusters can be retained. Subsequently we build a language model for each cluster. The merge weights for the cluster language models can be fixed using a held-out set. The number of clusters can either be chosen by perplexity minimization on a held-out set or, in some cases, from an intuition of the number of topics inherent in the domain.
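A minimal sketch of this clustering step is given below. It uses the gensim toolkit purely for illustration (the experiments reported here did not necessarily use it), treats each tokenized speaker turn as a virtual document, and hard-assigns each document to its most probable topic.

from gensim import corpora, models

def cluster_turns(turns, num_topics=5):
    """turns: list of tokenized speaker turns (each a list of words).
    Returns a topic id per turn plus the trained LDA model."""
    dictionary = corpora.Dictionary(turns)
    bow = [dictionary.doc2bow(t) for t in turns]
    lda = models.LdaModel(bow, id2word=dictionary,
                          num_topics=num_topics, passes=10)
    # Hard-assign each virtual document to its most probable topic.
    clusters = [max(lda.get_document_topics(b), key=lambda p: p[1])[0]
                for b in bow]
    return clusters, lda

# Each cluster's turns would then be pooled to train a per-topic LM,
# with merge weights tuned on a held-out set.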
5.3 Unsupervised LM adaptation
Developing a speech recognition system for a new domain is costly, primarily due to the collection and preparation of the data required to train the system. Generally speaking, fairly large amounts of manually annotated data (tens of hours of data for a large vocabulary system) are needed, which are very labor intensive to obtain.
Language model (LM) and acoustic model (AM) adaptation attempt to obtain models for a new domain with little training data by leveraging existing ("out-of-domain") models. AM adaptation in particular has been studied extensively, both for application at test time and for application on in-domain data other than the test set. In contrast to AM adaptation, LM adaptation has received much less attention. The most widespread approaches
to supervised LM adaptation in a large vocabulary setting are model interpolation and
count mixing. The basic idea behind unsupervised LM adaptation is to decode the acoustic signal using an initial seed LM and then use the decoded lattice to generate adaptation text for the language model. Typically, n-best lists containing weighted candidate hypotheses for the utterance in question are generated from the lattice. Fractional counts from these hypotheses are then used to adapt the initial language model [BR03]. The adaptation can be carried out in an iterative fashion. The models can be adapted using count merging or model interpolation, both of which can be viewed as maximum a posteriori (MAP) adaptation strategies with different parametrizations of the prior distribution.
The model parameters \theta are assumed to be a random vector in the space of possible parameters. Given an observation sample x, the MAP estimate is obtained as the mode of the posterior distribution of \theta, denoted by g(\theta \mid x):

\theta_{MAP} = \arg\max_{\theta} g(\theta \mid x) = \arg\max_{\theta} f(x \mid \theta)\, g(\theta)
The case of LM adaptation is very similar to MAP estimation of the mixture weights
of a mixture distribution. In this case, the objective is to estimate probabilities for a
discrete distribution across words, entirely analogous to the distribution across mixture
components within a mixture density. A practical candidate for the prior distribution of
the weights w_1, w_2, \ldots, w_K is the Dirichlet density

g(w_1, w_2, \ldots, w_K \mid v_1, v_2, \ldots, v_K) \propto \prod_{i=1}^{K} w_i^{v_i - 1}
where v_i > 0 are the parameters of the Dirichlet distribution. If the expected count for the i-th component is denoted as c_i, the mode of the posterior distribution is obtained as

\hat{w}_i = \frac{(v_i - 1) + c_i}{\sum_{k=1}^{K} (v_k - 1) + \sum_{k=1}^{K} c_k}, \qquad 1 \le i \le K
For a word w_i in n-gram history h, let the expected adaptation counts, in this application either from supervised transcripts or from ASR transcripts, be denoted as c(hw_i). Let the expected count for an n-gram history h be c(h) = \sum_i c(hw_i). Let the corresponding expected counts from the out-of-domain data sample be denoted as \tilde{c}(hw_i) and \tilde{c}(h). Let \tilde{c}_d(hw_i) and c_d(hw_i) denote the discounted counts for the out-of-domain and in-domain samples respectively. Let \tilde{P}(w_i \mid h) and P(w_i \mid h) denote the probability of w_i in history h as estimated from the out-of-domain and in-domain samples, respectively.
Then a count merging approach with mixture parameters \alpha and \beta is obtained by choosing the parameters of the prior distribution for history h as v_i = (\alpha/\beta)\,\tilde{c}(h)\tilde{P}(w_i \mid h) + 1, since in that case

\hat{P}(w_i \mid h) = \frac{(\alpha/\beta)\,\tilde{c}(h)\tilde{P}(w_i \mid h) + c_d(hw_i)}{\sum_{k=1}^{K} (\alpha/\beta)\,\tilde{c}(h)\tilde{P}(w_k \mid h) + c(h)} = \frac{\alpha\,\tilde{c}_d(hw_i) + \beta\,c_d(hw_i)}{\alpha\,\tilde{c}(h) + \beta\,c(h)}
On the other hand, if the parameters of the prior distribution for history h are chosen as v_i = \frac{\lambda}{1-\lambda}\,c(h)\tilde{P}(w_i \mid h) + 1, we obtain

\hat{P}(w_i \mid h) = \frac{\frac{\lambda}{1-\lambda}\,c(h)\tilde{P}(w_i \mid h) + c_d(hw_i)}{\sum_{k=1}^{K} \frac{\lambda}{1-\lambda}\,c(h)\tilde{P}(w_k \mid h) + c(h)} = \frac{\frac{\lambda}{1-\lambda}\,\tilde{P}(w_i \mid h) + P(w_i \mid h)}{\frac{\lambda}{1-\lambda} + 1} = \lambda\,\tilde{P}(w_i \mid h) + (1-\lambda)\,P(w_i \mid h)

which is simply linear interpolation of the out-of-domain and in-domain models with weight \lambda.
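In practice the two variants reduce to the estimators sketched below. The code is a simplified illustration that works directly from (discounted) counts for a fixed history h; it is not the adaptation code used in our experiments.

def count_merge(c_in, c_out, alpha, beta):
    """Count merging: P(w|h) = (alpha*c_out(hw) + beta*c_in(hw)) /
    (alpha*c_out(h) + beta*c_in(h)). c_in/c_out map words to counts for
    a fixed history h."""
    vocab = set(c_in) | set(c_out)
    denom = alpha * sum(c_out.values()) + beta * sum(c_in.values())
    return {w: (alpha * c_out.get(w, 0) + beta * c_in.get(w, 0)) / denom
            for w in vocab}

def interpolate(p_in, p_out, lam):
    """Model interpolation: P(w|h) = lam*P_out(w|h) + (1-lam)*P_in(w|h)."""
    vocab = set(p_in) | set(p_out)
    return {w: lam * p_out.get(w, 0.0) + (1 - lam) * p_in.get(w, 0.0)
            for w in vocab}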
Unsupervised LM adaptation is greatly limited by the initial ASR accuracy and also by the vocabulary of the seed LM. In cases where the ASR accuracy is very poor, using unsupervised LM adaptation can even harm ASR performance, since it reinforces the incorrect hypotheses. Also, the adaptation language model has the same vocabulary as the original language model. Thus adaptation in this fashion does not help in
inducing words that were OOV for the initial language model. By combining unsupervised LM adaptation with data acquisition we can improve the gains from adaptation substantially [ZEV04]. For this purpose we adapt our data acquisition strategy, described so far for in-domain text, to work from acoustic lattices. Instead of using in-domain text, we use the initial decoded n-best lists to generate the topic-based language model from which queries are generated, and we use the seed LM that decoded the lattices as the background LM. By comparing the topic-based model and the background model, the query generation process can pick up keywords/keyphrases introduced by the acoustic evidence, and these can be used to identify candidate text for updating the models. Even though the initial language model might not fully contain the vocabulary of the new domain, it frequently happens that the data retrieved using the keywords/keyphrases extends the system vocabulary to cover the new words in the domain of interest. For example, in our medical domain task the generic language model did not contain words like 'SUTURES' and 'X-RAY', but it did contain words like 'MEDICINE' and 'DOCTOR', which were used as keywords, and the data retrieved for these keywords contained the additional vocabulary items. Also, since we adapt using both the ASR output and the generic source, the adaptation process is more robust to ASR errors in the initial iteration. Like unsupervised LM adaptation, this process can be carried out iteratively. The data weighting in this case was done using the adaptation model built from the lattices and the background language model.
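A rough sketch of this query-generation step follows. It scores terms by how over-represented they are in the n-best adaptation text relative to the background (seed) LM; the unigram scoring, thresholds and helper names are illustrative assumptions rather than the exact implementation.

from collections import Counter
import math

def keyword_queries(nbest_texts, background_unigram, top_k=20, floor=1e-7):
    """nbest_texts: list of decoded hypothesis strings (weighted lists could
    be handled with fractional counts). background_unigram: dict word -> P(w)
    under the seed LM. Returns the top_k terms most over-represented in the
    acoustic evidence relative to the background model."""
    counts = Counter(w for text in nbest_texts for w in text.lower().split())
    total = sum(counts.values())
    scored = []
    for w, c in counts.items():
        p_topic = c / total
        p_bg = background_unigram.get(w, floor)
        scored.append((math.log(p_topic / p_bg), w))
    scored.sort(reverse=True)
    return [w for _, w in scored[:top_k]]

# The returned keywords/keyphrases seed web queries whose hits are then
# filtered and weighted against the adaptation and background LMs.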
5.3.1 Results
Using the baseline language model gave a poor 36% WER. One of the reasons for the poor performance of the baseline model is the high OOV rate. Unsupervised adaptation improves the WER by 2% absolute, but the OOV rate is unchanged. With web data acquisition the WER drops substantially, along with the OOV rate.
Model                                                         WER    OOV count
Baseline model                                                36     32
Unsupervised adaptation                                       34     32
Unsupervised adaptation + data acquisition (I iteration)      30.1    5
Unsupervised adaptation + data acquisition (III iteration)    29      4
Table 5.1: ASR WER of test data with unsupervised adaptation and data acquisition.
Two more iterations of unsupervised adaptation and data acquisition reduce
the error rate further from 30.1 to 29.
5.4 Active learning
Active learning aims at reducing the number of training examples to be labeled by inspecting the unlabeled examples and selectively sampling the most informative ones with respect to some given cost function. The goal of the active learning algorithm is to select the examples that, when included for training, will yield the largest performance improvement. Inspired by certainty-based active learning methods [TSHT03, THT03] for reducing the transcription effort, we select the examples that we predict the speech recognizer has misrecognized and give them to human labelers for transcription. These transcribed utterances can be used for training the acoustic and language models, leaving out the ones that we predict the recognizer has recognized correctly or that are otherwise not informative. The first step of the algorithm is the training of initial language and acoustic models using a small set of transcribed data, S_t. Using these models, we compute confidence scores for the speech utterances and predict which candidate utterances are recognized incorrectly. The utterances are ranked according to their estimated correctness. We then add the transcribed utterances to S_t
and exclude them from S_u. This step is iterated as long as there are additional untranscribed utterances, and the algorithm is halted when the word accuracy on the development set has converged.
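The selection loop can be summarized by the sketch below. The confidence scorer, retraining routine and transcription oracle are passed in as placeholders (they depend on the particular ASR toolkit and annotation setup); only the loop structure follows the description above.

def active_learning(transcribed, untranscribed, confidence, transcribe,
                    retrain, dev_accuracy, batch_size=100, tol=1e-3):
    """transcribed: list of (utterance, reference) pairs (S_t).
    untranscribed: list of utterances (S_u).
    confidence(models, utt) -> estimated correctness in [0, 1].
    transcribe(utt) -> reference text (stands in for the human labeler).
    retrain(pairs) -> updated AM/LM; dev_accuracy(models) -> float."""
    models = retrain(transcribed)
    prev_acc = dev_accuracy(models)
    while untranscribed:
        # Rank remaining utterances by estimated correctness (least first).
        untranscribed.sort(key=lambda u: confidence(models, u))
        batch, untranscribed = untranscribed[:batch_size], untranscribed[batch_size:]
        transcribed += [(u, transcribe(u)) for u in batch]
        models = retrain(transcribed)
        acc = dev_accuracy(models)
        if acc - prev_acc < tol:   # stop once dev accuracy has converged
            break
        prev_acc = acc
    return models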
We can combine active learning and unsupervised LM adaptation with web data acquisition to reduce the transcription requirements substantially and to get more information out of the decoded lattices. We start with a seed language model and an initial set of n-best transcripts for the unlabeled set S_u. We can then carry out unsupervised adaptation, active learning and data acquisition in different orders, based on our confidence in the ASR performance and also the cost of the transcription process. If the ASR performance is poor, we should first select some utterances to be labeled by the transcriber, which can be used for LM/AM adaptation and also to identify keywords for web data download. Unsupervised adaptation can then be carried out after decoding with this updated model. In the alternate case of the transcription cost being very high, we can carry out multiple rounds of unsupervised adaptation and data acquisition before identifying the utterances for transcription.
In this setup the data acquisition process gives extra weight to the utterances labeled by the human annotator as compared to the transcripts from the ASR, for both keyword generation and data weighting. The human labeler can also be queried for a core set of keywords and topics for the web crawling and retrieval part. The labeled data can also aid in better unsupervised LM adaptation by serving as held-out data for determining the mixture weight between the out-of-domain language model and the adaptation language model.
5.5 Using lattices and references
We experimented with the use of decoded ASR lattices for data acquisition in a discriminative framework when the reference transcriptions are also provided. Based on the hypothesis that acquiring more data for confusable words in the language model will benefit ASR performance, we build a language model from the decoded n-best lists and then compare it with the reference language model using the Relative Entropy (R.E.) measure. The goal is to identify words and word n-grams that have the highest decoding error. The R.E. comparison can be used to identify both positive and negative query terms (the latter corresponding to words that are commonly inserted). These terms can be used to augment the queries generated from the in-domain text/transcripts.
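A simplified sketch of this comparison is given below: unigram distributions are estimated from the reference text and from the decoded n-best text, and the terms contributing most to the (smoothed) relative entropy between them are taken as query terms. The real system operated on n-gram language models; the unigram version here only illustrates the idea.

from collections import Counter
import math

def confusable_terms(reference_texts, nbest_texts, top_k=20, eps=1e-8):
    """Return terms with the largest pointwise contribution to
    D(P_ref || P_dec): high-contribution terms flag words the decoder tends
    to miss (positive query terms); terms far more frequent in the decoded
    text than in the reference behave like insertions (negative terms)."""
    def dist(texts):
        c = Counter(w for t in texts for w in t.lower().split())
        n = sum(c.values())
        return {w: cnt / n for w, cnt in c.items()}
    p_ref, p_dec = dist(reference_texts), dist(nbest_texts)
    contrib = {w: p_ref[w] * math.log(p_ref[w] / p_dec.get(w, eps))
               for w in p_ref}
    positive = sorted(contrib, key=contrib.get, reverse=True)[:top_k]
    negative = sorted(p_dec, key=lambda w: p_dec[w] - p_ref.get(w, 0.0),
                      reverse=True)[:top_k]
    return positive, negative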
We also experimented with using the posterior language model (AntiLM [Sea00]) built from the n-best lists for weighting the data during the filtering stage. The posterior language model captures the acoustic confusability in the n-best transcripts. Utterances in the acquired data that have a poor score with the AntiLM are included in the positive set of utterances used in building the classifier for the weighting process. This helps in increasing the weight of utterances in the training data that might have a poor match to the in-domain language model but can improve the ASR performance.
5.5.1 Results
We used a set of 200 utterances with reference transcriptions to generate the AntiLM for the Transonics task. The AntiLM score was used as a feature in scoring utterances for data weighting as described in the previous chapter. The SCLITE ASR hypothesis scoring tool was used to generate a list of confusable words. The percentage of confusable words in an utterance was included as an additional acoustic-driven feature.
Feature                          PPL   WER
No acoustic features             87    23.3
AntiLM score                     87    23.1
AntiLM + confusability score     88    23.1
Table 5.2: ASR WER of test data with acoustic features included in data weighting.
The AntiLM score helps improve the ASR performance marginally. The AntiLM is generated from a very small set of utterances and does not effectively measure acoustic confusability. Using acoustic-model based confusability measures such as SAWER [PO00] can potentially help in better data weighting.
5.6 TC-STAR open condition evaluation
The TC-STAR (Technology and Corpora for Speech to Speech Translation) project, financed by the European Commission within the Sixth Framework Programme (Project No. FP6-506738), is a long-term effort to advance research in speech-to-speech translation technologies. The primary goal of the TC-STAR project is to produce an end-to-end system in English and Spanish that accepts parliamentary speeches in one language, then transcribes, translates and synthesizes them into the other language, while significantly reducing the gap between the performance of a human (interpreter) and a machine. To support this goal, the performance of each component technology, namely speech recognition (ASR), machine translation (MT) and text-to-speech (TTS), is optimized to produce the best output at its respective stage. The 2007 evaluation was open to external participants as well as the TC-STAR partner sites [eld].
As part of the TC-STAR 2007 ASR evaluation, three training conditions were defined:
- a restricted training condition, for which systems must be trained only on data collected within the TC-STAR project and listed in the next section;
- a public data condition, for which systems can be trained on any publicly available data;
- an open condition, where the only constraint concerns the cut-off date of the training data.
The TC-STAR open-condition ASR evaluation served as a testbed to validate the ideas presented in this thesis on a highly competitive LVCSR task. In collaboration with the IBM ASR team we built language models from web data using the techniques presented in this thesis. We first provide a brief description of the architecture of the IBM LVCSR system used in the evaluation. Subsequently we describe the data acquisition and language modeling strategy we used to benefit from the algorithms we have developed.
5.6.1 ASR system architecture
The key design characteristics for all evaluation conditions include:
- An architecture that uses two different speaker segmentation and clustering schemes and uses the output of the system under one scheme to cross-adapt the same models to the second scheme;
- System combination via ROVER of multiple ASR systems built using a randomized decision-tree growing procedure [SRK05] and cross-adapted across two speaker segmentation schemes (this is not part of the 2006 Spanish ASR system);
- A basic set of models that use VTLN and SAT training followed by fMPE+MPE training [PKM+05] and speaker adaptation using MLLR;
- Rescoring of the lattices produced after MLLR with an in-domain language model (restricted condition) or an out-of-domain language model (public condition); this is the only step that uses non-EPPS training material;
- A static decoding graph with quinphone context;
- Training of acoustic models using EPPS material only; and
- Automatic punctuation of the final output with periods and commas in a post-processing step.
The first step is speech segmentation and speaker clustering, where the speaker clusters corresponding to the two schemes S1 and S2 are determined. This is followed by decoding steps (a) through (f) described below and represented by a single block (labeled "Baseline") in Figure 5.1. The SAT model (Model A) was used to decode the test data in 6 passes using segmentation scheme S1.
a) The SI pass uses the SI model and the LDA projected PLP features.
b) Using the transcript from a) as supervision, warp factors are estimated for each
cluster using the voicing model and a new transcript is obtained by decoding using
the VTLN model and VTLN warped features.
c) Using the transcript from b) as supervision, fMLLR transforms are estimated for
each speaker cluster using the SAT model. A new transcript is obtained by decod-
ing using the SAT model and the fMLLR transformed VTLN features.
d) The VTLN features after applying the fMLLR transforms are subjected to the
fMPE transform and a new transcript is obtained by decoding using the MPE
model and the fMPE features.
e) Using the transcript from d) as supervision, MLLR transforms are estimated for
each cluster using the MPE model.
f) The lattices resulting from e) are rescored using the 4-way interpolated language model described earlier. The one-best output at this step will be referred to as CTM.
A larger SAT model (Model B) was used to decode the test data using segmentation scheme S2 from step (a) through step (c), i.e., to obtain the VTLN warp factors and the fMLLR transforms corresponding to the speaker clusters in S2. For cross-segmentation adaptation, CTM (from step (f) above) is now used as the reference transcript to compute MLLR transforms for Model B and process steps (e) and (f) using S2. The one-best from this stage will be referred to as CTM'. The above steps (a) through (f) are applied to the three different models built using randomized decision trees (R1, R2 and R3) using segmentation scheme S1. These three decodes from segmentation S1 were subsequently used to cross-adapt the models R1, R2 and R3 and re-decode the test data using segmentation scheme S2. Finally, three different decoded outputs CTM-R1', CTM-R2' and CTM-R3' were obtained. Last, CTM', CTM-R1', CTM-R2' and CTM-R3' were ROVERed together to produce the final output.
The first step in the recognition system is a segmentation of each audio file into
speech and non-speech segments, followed by deletion of the non-speech segments. We
use an HMM-based segmentation system that models speech and non-speech segments
Figure 5.1: Overall System Architecture
with five-state, left-to-right HMMs with no skip states. The output distributions in each
HMM are tied across all states in the HMM, and are modeled with a mixture of diagonal-
covariance Gaussian densities. The speech and non-speech models are obtained by
applying a likelihood-based, bottom-up clustering procedure to the speaker indepen-
dent acoustic model. This is followed by a procedure that groups the segments into clusters that can then be used for speaker adaptation. All homogeneous speech segments are clustered into a pre-specified number of clusters using K-means. The K-means procedure operates on single Gaussian densities estimated on each homogeneous segment, using a Mahalanobis distance measure.
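The clustering step can be pictured with the sketch below, which fits a single diagonal Gaussian per homogeneous segment and runs K-means on the segment means under a Mahalanobis-style distance with a pooled diagonal variance. This is an illustrative approximation, not the exact IBM implementation.

import numpy as np

def cluster_segments(segments, k, iters=20, seed=0):
    """segments: list of (frames x dim) feature arrays, one per homogeneous
    speech segment. Returns a cluster id per segment."""
    rng = np.random.default_rng(seed)
    means = np.stack([s.mean(axis=0) for s in segments])
    pooled_var = np.concatenate(segments).var(axis=0) + 1e-6
    centers = means[rng.choice(len(means), size=k, replace=False)]
    assign = np.zeros(len(means), dtype=int)
    for _ in range(iters):
        # Mahalanobis-style distance with a shared diagonal covariance.
        d = ((means[:, None, :] - centers[None, :, :]) ** 2 / pooled_var).sum(-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = means[assign == j].mean(axis=0)
    return assign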
After the speaker clusters are determined, the final system output is obtained in 4
passes using Model A:
a) The SI pass uses the SI model and the LDA projected PLP features.
b) Using the transcript from a) as supervision, warp factors are estimated for each cluster using the voicing model and a new transcript is obtained by decoding using the VTLN model and VTLN warped features.
c) Using the transcript from b) as supervision, fMLLR transforms are estimated for each cluster using the SAT model. A new transcript (A-fsa-ctm) is obtained by decoding using the SAT model and the CMA transformed VTLN features. The LM built for the restricted condition is used for decoding.
d) The VTLN features
after applying the CMA transforms are subjected to the fMPE transform. MLLR transforms are computed using A-fsa-ctm as supervision, and a new transcript (A-ctm) is obtained by decoding using the MPE+MLLR model and the fMPE features. The LM built for the open condition is used at this step for decoding.
The A-fsa-ctm transcript is used as the reference transcript by Models R1 through R4 to obtain the VTLN warp factors and the fMLLR transforms corresponding to the automatically identified speaker clusters. Step (d) is then applied to these models built using randomized decision trees (R1, R2, R3 and R4). Cross adaptation across systems in step (d) is achieved in the following manner: A-ctm is used for MPE+MLLR decoding with Model R1 to produce R1-ctm; R1-ctm is used for MPE+MLLR decoding with Model R2 to produce R2-ctm; R2-ctm is used for MPE+MLLR decoding with Model R3 to produce R3-ctm; R3-ctm is used for MPE+MLLR decoding with Model R4 to produce R4-ctm; and R4-ctm is used for MPE+MLLR decoding with Model A to produce R5-ctm. Finally, A-ctm, R1-ctm, R2-ctm, R3-ctm, R4-ctm and R5-ctm were ROVERed together to produce the final output.
5.6.2 Language modeling for the open condition
All decoding passes up to step (d) used a 4-gram Katz back-off model with Good-Turing discounting to reserve probability mass for unseen events, built using the SRI LM toolkit [8]. One model was trained on the training transcripts (approx. 75 hours of data) and another on the text corpus based on the Full-text Editions from the EU parliament website, released by RWTH for this project. A perplexity-minimizing mixing factor was computed using the Dev06 reference text. The interpolated language model, comprising 59K unigrams, 2.4M 2-grams, 2.9M 3-grams and 3.5M 4-grams, was subsequently pruned using an entropy-based criterion [9] to yield a mixed language
model comprising 59K unigrams, 2.2M 2-grams, 2.1M 3-grams and 1.2M 4-grams. The interpolation weights used were 0.360026 and 0.639974 respectively. This is the LM used under the restricted condition for the second TC-STAR ASR evaluation (2006).
We collected text data from the Internet in an iterative fashion using the architecture described in [SGN05]. All domains under europa.eu were blocked in our web-crawling setup. For building the evaluation system, we used the May 2006 snapshot. To ensure that the development data was not accidentally included in our crawl, we removed all web pages that had more than two 6-grams in common with the development text (~0.08% of the web pages crawled). To increase our coverage of web content, we split the training data (and the decodes from the 2005 portion of the unsupervised training data) into 5 topics using Latent Dirichlet Allocation (Section 5.2) and gathered data for each topic (~800M words). Queries from the entire set of unsupervised decodes were used to crawl an additional 1.5G words of data. We also crawled around 6G words from queries generated from the entire training set (EPPS FTE transcripts from RWTH). This 6G set was split into 5 topics and LMs were built for each topic. These individual models were then merged using weights optimized on the Dev06 test set. Perplexity filtering, OOV rate thresholds [SGN05] and subset selection (Chapter 4) were used to filter the data. In all, a total of 12G filtered words were used in the LM build. Perplexity on the Dev06 test set reduced from 120 (with the restricted condition language models) to 63 with the inclusion of web data. This data was merged with the 204M words of Broadcast News. This resulted in an LM containing 150M n-grams after pruning that was used in decoding step (d). For the Spanish system, additional language model data for the open condition was obtained in the following manner. The queries were generated by comparing the top n trigrams between the in-domain LM text and the background Spanish web LM, which was generated by querying for the unigrams with the highest probability in the in-domain data. The Spanish system did not employ subset data selection using the relative entropy criterion or the LDA-based topic splits.
For the English ASR evaluation, the open-domain LM gave a WER of 7.4% compared to the reference IBM system WER of 9.6%. The second best system in the evaluation had a WER of 9.5%. Experiments conducted after the evaluation indicate that 0.3% of the gain came from using unsupervised data for querying, 1% from the topic splits (querying and merging) and 0.4% from the incremental relative-entropy data selection.
Chapter 6
Hierarchical Speech Recognition
In this chapter we focus on building acoustic models for names and infrequent func-
tion words for which the LM estimation is poor (Section 3). Spoken name recognition
is in itself a key component of many speech recognition applications. Speech based
information retrieval relies heavily on accurate spotting of keywords or names. Another
common application area is in the call center domain for tasks such as directory assis-
tance, name dialing systems, city name recognition as part of a travel system, caller
name identification for banking, etc. For most applications, the list of names tends to be
on the order of several thousand, making spoken name recognition an inherently high
complexity problem. In addition, the large variability in name pronunciation, both at
the segmental and suprasegmental levels significantly decreases recognition accuracy.
Names have multiple valid pronunciations that evolve as a product of various socio-
linguistic phenomena. Specifically in a country like the USA, with a broad cultural
base, there is considerable variability in the linguistic origins of names. A large num-
ber of names have foreign origin and depending on the speaker’s linguistic background,
they are pronounced differently. As an example, the name Abhinav, which has an Indian
(Sanskrit) origin, is typically pronounced as ‘ae b hh ih n ae v’ by a native speaker of
American English whereas a native speaker from India pronounces it as ‘aa b hh ih n eh
v’.
Automatic pronunciation generation techniques such as those based on neural nets [DNHP97], decision trees [RBF+00], finite state transducers [GA02], and acoustic
decoding [RDI99, BSWW03] have been proposed previously. However, these tech-
niques require a large set of words and their different pronunciations for training.
In addition, performance of such data-driven schemes tends to be limited for names,
especially those with foreign origins, since generating valid pronunciations requires an
understanding of the original language and its phonology. Embedding this knowledge
into an automatic pronunciation generation system is not easy, and thus, we often require
manual augmentation of the names pronunciation dictionary. Inclusion of multiple pro-
nunciations can also increase the complexity of the recognition task.
We propose the use of syllable-based reverse lookup schemes to reduce the high
complexity of spoken name recognition and show that the larger length of the unit com-
bined with the implicit phonotactic constraints that the syllable model imposes can help
boost the accuracy of the reverse lookup scheme substantially. We believe that longer
length units such as syllables can help in reducing the dependency on dictionary accu-
racy by modeling larger contexts in the acoustics itself. This belief stems from the
effectiveness of longer (word) length units in tasks such as digit recognition where
word length units remove the need of having a phonetic dictionary with variants. A
low insertion, deletion, and substitution [Gre98] rate implies that a syllable dictionary
based decoder can be more robust against base-form variants. Also these units can help
in exploiting longer acoustic correlations beyond the phone. Prosody and stress varia-
tions in ‘non-native’ pronunciation that are very difficult to represent in a dictionary can
also be embedded in the acoustic modeling. More details about this line of modeling are
given in Section 6.1.
The number of different acoustic units required for a given recognition task is a
function of the vocabulary size and the nature of the underlying acoustic units. For
phones, the number of basic models (without context modeling) is fixed for a given
language, but if we decide to use syllable or word size units, the number generally
increases with the vocabulary size. Many of these units corresponding to infrequently
occurring words will have poor coverage in the training data. The sparsity of training
data has been the main stumbling block in using larger units for tasks such as large
vocabulary continuous speech recognition (LVCSR) or spoken name recognition. For
small vocabulary tasks such as alphabet or digit recognition, larger units (typically word
level units) are used frequently and with success. To address some of the challenges due
to data sparsity, in this chapter, we present techniques for initialization of larger units
using context-dependent (CD) phones, which ensure that the system performs well even
with minimal or no acoustic data for training the larger units. We also present different
criteria to split the lexicon such that we use the appropriate units for representing the
vocabulary words with a given training corpus. Proper unit selection is important for
reducing system complexity and ensuring robust training.
The chapter is organized as follows. The next section will discuss syllable based
acoustic modeling. Section 6.2 describes the general design issues in syllable-based
speech recognition. Section 6.3 describes the design of our spoken name recognition
system. The speech corpora used for the experiments in this chapter and the training
methodology are described in section 6.4. Experiments and results are discussed in
section 6.5. In the concluding section, we provide a brief summary of our work, the
major findings and an outline for future research.
6.1 Acoustics based pronunciation modeling using syl-
lables
Differences in speaking styles arising from accent and other factors such as age lead to
pronunciation variations that are systematic in nature. Pronunciation variations in spon-
taneous speech can also occur because of factors such as emotion and coarticulation.
It is difficult to represent these effects by pure surface form variants. Longer length
acoustic units such as syllables should automatically capture many of these variations
occurring at the phone level, thus reducing the dependence on surface form variations.
This strategy has worked very effectively for limited domain tasks such as digit recog-
nition where whole word level models are used and hence a dictionary mapping is not
required. However, it is not possible to build whole word models for LVCSR systems
because of training data sparsity and also because in most cases the test vocabulary is
not a subset of the training vocabulary.
Another motivation for using an acoustic unit with a longer duration is that it facilitates exploitation of temporal and spectral variations simultaneously. Parameter trajectory models and multi-path HMMs [GN96, Kor97] are examples of techniques that can exploit the longer duration acoustic context, and yet have had marginal impact on phone-based systems. Units of syllabic duration or longer have the potential to model cross-phone spectral and temporal dependencies better than traditional phone-based methods. We are further motivated by results from psychoacoustics research [Gre97, Mas72, Lip96] that indicate that syllable length durations play a central role in human perception of speech, especially under noisy conditions. Recent research on syllable-based recognition [GHO+01, Kir96, WKMG98, SO03] also demonstrates that syllables can help in improving the performance of ASR.
As explained earlier, the major challenge in using syllables and word level units for recognition is the training data sparsity problem. In [GHO+01] this problem is partially resolved by using only those syllables that have good coverage in the acoustic data. At a context-independent level the syllable, being a larger unit, requires more training data than phone sized units, and hence proper training of all syllable models using flat initialization strategies, as described in [GHO+01], is difficult. We describe a simple scheme for initializing syllable models from context-dependent (CD) phones
that helps in robust estimation of syllable models. In addition, we need to estimate the
advantage, in terms of recognition accuracy, that we can gain by moving from a phone
representation to a syllable (or a whole word) representation. In general the achievable
improvement depends heavily on the training data, since it determines how well longer
acoustic units can be trained. Thus depending on the acoustic training data available
to us we need to determine the proper representation for each word in the lexicon. We
describe two methods of creating a lexicon representation for variable-length acoustic
units. The next section will describe the proposed longer-unit initialization and lexicon
splitting strategies in more detail.
6.2 Recognizer Design
In the first stage, we design three separate recognizers corresponding to the different
acoustic units of interest, i.e., phone, syllable, and word.
6.2.1 Phone Recognizer
The design of the phone-based recognizer follows the standard flat start (i.e., uniform
segmentation) Baum-Welch reestimation strategy with decision tree based within-word
triphone creation and clustering [OOW+95]. The phone-based recognizer serves as a
baseline for our results and is also the only context-dependent system that we built.
Instead of flat start initialization we can use prebuilt models from an existing corpus to
initialize the segmentations for the corpus of interest.
6.2.2 Syllable Recognizer
The first stage in designing a syllable lexicon is to syllabify the training vocabulary and
generate the list of syllables present in the training data. To do this we need to convert the
phone level pronunciation in the dictionary to the corresponding syllabic representation. This process of syllable boundary detection (syllabification) is described in [Kah76] as a set of rules which define permitted syllable-initial consonant clusters, syllable-final consonant clusters, and prohibited onsets. Syllabification software available from NIST [Fis] implements these rules and, given a phone sequence, comes up with a set of alternative possible syllabifications that can be used to generate the syllabic pronunciation. In our system, we represent syllables in terms of the underlying phone sequence. Thus, given a phonetic transcription of the speech in a standardized format like Worldbet or IPA, we can write a syllable representation by coming up with a set of syllable symbols from the phones comprising each syllable. For example, 'Junior', with the phonetic transcription 'jh uw n y er', can be represented in syllabic terms as 'jh uw n' 'y er'. Homophones were given the same lexical representation.
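For illustration, the toy syllabifier below applies a maximal-onset heuristic over a small set of permitted onset clusters. It is a highly simplified stand-in for the NIST tool; the onset inventory is an assumption and far from complete.

VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw", "x"}
# Tiny, illustrative set of legal syllable onsets (not the full English set).
ONSETS = {(), ("jh",), ("n",), ("y",), ("s", "w"), ("t",), ("b", "l")}

def syllabify(phones):
    """Greedy maximal-onset syllabification of a phone list."""
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    syllables, start = [], 0
    for n, v in enumerate(nuclei):
        if n + 1 < len(nuclei):
            nxt = nuclei[n + 1]
            # Give the next syllable the longest permitted onset.
            boundary = nxt
            while boundary > v + 1 and tuple(phones[boundary - 1:nxt]) in ONSETS:
                boundary -= 1
            syllables.append(phones[start:boundary])
            start = boundary
        else:
            syllables.append(phones[start:])
    return syllables

print(syllabify("jh uw n y er".split()))   # -> [['jh', 'uw', 'n'], ['y', 'er']]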
It is possible for a given phone sequence to have multiple valid syllabifications due
to consonants at syllable boundaries that belong, in part, to both the preceding and the
following syllable. These consonants are referred to as ambisyllabic. In our case, for
simplicity, we chose the most commonly occurring syllabification, ignoring ambisyl-
labicity. Using multiple syllabifications to represent a single word in the dictionary
might potentially decrease the training data per unit substantially. It is also possible that
other syllabifications provide better fit to the speaking style and rate of particular speak-
ers. The effect of ambisyllabic representations on recognition performance is a research
issue, which was not addressed in this chapter.
Phone-level HMM models typically share the same topology with the same number
of states for all phones. However, syllable models require a different number of states
depending on their size. A syllable comprising four phones such as ‘s w eh l’ requires
more states than single phones or other shorter syllables such as ‘t eh n’. To account for
this, the number of states was chosen to be three times the number of phones comprising
the syllable.
To initialize the models for the syllable recognizer we used context-dependent (CD) phone models trained on the same data. Context-dependent phone states were concatenated to generate the states of the corresponding syllable models (Figure 6.1). The context for each phone model was restricted to the surrounding phones inside the syllable. The same principle can be extended to generate word level models. It should be noted that since the syllable models are copied independently of the words in which they occur, it is not possible to use context information beyond the syllable boundary. For example, even though the left context of 'ey' for the syllable 'b l iy' in 'ably' (ey b l iy) would be a more appropriate choice while copying the model for the phone b (ey-b+l instead of b+l), we restrict ourselves to the model with only syllable-internal context (b+l), since the same syllable might occur in some other word with a different context.
Figure 6.1: Initialization of the 9-state syllable 'm uw v' from the context-dependent phone models m+uw, m-uw+v and uw-v.
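The initialization can be sketched as follows: for each phone in the syllable, the phone model with the richest available within-syllable context is looked up and its states are concatenated, following the three-states-per-phone rule described earlier. The model dictionary and its 'l-p+r' naming convention are assumptions made for illustration.

def syllable_states(syllable_phones, cd_models, states_per_phone=3):
    """Build the state sequence of a syllable HMM by concatenating the
    states of context-dependent phone models, where the context is
    restricted to the phones inside the syllable.
    cd_models: dict mapping names like 'm-uw+v', 'm+uw', 'uw-v', 'm'
    to lists of HMM states (assumed to exist in the seed system)."""
    states = []
    for i, p in enumerate(syllable_phones):
        left = syllable_phones[i - 1] if i > 0 else None
        right = syllable_phones[i + 1] if i + 1 < len(syllable_phones) else None
        # Try the most specific within-syllable context first.
        for name in (f"{left}-{p}+{right}" if left and right else None,
                     f"{p}+{right}" if right else None,
                     f"{left}-{p}" if left else None,
                     p):
            if name and name in cd_models:
                states.extend(cd_models[name][:states_per_phone])
                break
    return states

# e.g. syllable_states(["m", "uw", "v"], cd_models) concatenates the states
# of m+uw, m-uw+v and uw-v into a 9-state syllable model (cf. Figure 6.1).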
6.2.3 Larger Acoustic Units
Longer length units such as whole word models can be built in the same fashion as the
syllable models. For monosyllabic words the syllable and word models are equivalent.
We did not build separate word level models for monosyllabic words as suggested in [GHO+01].
Initialization from phone level models in this manner ensures that the syllable and
word level models perform similarly to the corresponding phone recognizer even with-
out further acoustic training. The increase in recognition accuracy that can be achieved
by preferring a syllabic unit to the phone representation depends on the coverage of the
unit in the training data. This brings out the need to identify the proper lexical repre-
sentation or choice of units to represent the words in the lexicon. We tried two different
strategies for addressing this problem. The first uses a simple threshold based on the
number of training units available in the acoustic data. The second uses the difference
in recognition accuracy achieved on the training data by using syllable, phone, or word
level units. Based on this unit selection process, we split the word lexical entries appro-
priately. Certain commonly occurring words such as ‘the’ will automatically end up
having word level models. Other words will have either a pure syllable representation
or a hybrid syllable and phone representation. As an illustrative example, consider the
phone level representation ‘ae k t x r z’ for ‘Actors’. The pure syllabic representation
would be ae k t x r z. However if the syllable t x r z is not included in the list of acous-
tic units based on the selection criteria mentioned above, we split the lexicon entry as
‘ae k t+x t-x+r x-r+z r-z’ with ‘-’ referring to left context and ’+’ to the right context.
Our results (Section 6.5) indicate that the second approach of using training data recog-
nition accuracy as a unit selection criterion achieves better performance than the counts
based scheme.
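The count-based variant of this unit selection step can be sketched as below; the threshold and dictionary format are illustrative, and the accuracy-based variant would simply replace the count test with a comparison of per-unit recognition accuracy on the training data.

from collections import Counter

def split_lexicon(word_sylls, train_words, min_count=30):
    """word_sylls: dict word -> list of syllables (each a list of phones).
    train_words: iterable of words in the acoustic training data.
    Keeps a syllable unit only if it is seen at least min_count times;
    rarer syllables fall back to CD phone units within the syllable."""
    syll_counts = Counter()
    for w in train_words:
        for syl in word_sylls.get(w, []):
            syll_counts[" ".join(syl)] += 1

    def phones_as_cd_units(phones):
        units = []
        for i, p in enumerate(phones):
            left = phones[i - 1] + "-" if i > 0 else ""
            right = "+" + phones[i + 1] if i + 1 < len(phones) else ""
            units.append(left + p + right)
        return units

    lexicon = {}
    for w, sylls in word_sylls.items():
        units = []
        for syl in sylls:
            if syll_counts[" ".join(syl)] >= min_count:
                units.append(" ".join(syl))            # keep syllable unit
            else:
                units.extend(phones_as_cd_units(syl))  # back off to CD phones
        lexicon[w] = units
    return lexicon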
6.3 Spoken name recognition systems
The intended application for the aforementioned modeling in the chapter is spoken name
recognition. In this section we will describe our decoding strategies with reference to
the spoken names recognition task. The standard approach for spoken name recogni-
tion is to use a Finite State Grammar (FSG) based recognition network, in which all the
required names along with their possible pronunciations are taken as arcs or alternate
paths for evaluation. The recognizer matches the input utterance against all possible
names and their variations and selects the name that matches best. One could use uni-
gram or bigram name (word) level statistics to weight the different paths. However, it is difficult to obtain these statistics from real usage data, e.g., directory applications where all entries are deemed equally likely [ABD+98, BCR98]. Considering this, we
gave equal weight to all the names although it is possible to improve the recognition
performance by giving a higher weight to common English names such as James, Mary,
John and Patricia. As is evident from the design, the perplexity of FSG recognition net-
works can be prohibitive for very large name lists, which can be in the order of 100K or
more words for many directory applications.
For spoken name recognition tasks, we observed that as the perplexity of the names
FSG recognition network grows, the phone level recognition accuracy drops along with
the WER. The phone recognition accuracy achieved with the lexical constraints imposed
by the FSG network comes close to the accuracy of a simple phonotactic model based
recognizer. Based on this observation, a promising approach (Figure 6.2) for cases with
large word lists is to use inverse dictionary lookup techniques to recognize the name.
That is, we identify the underlying N best phone sequences based on an n-gram (phone)
language model and then use statistical string matching to find the best candidates from
the name list using the dictionary. The statistical string matching process compares
the phone sequence with the pronunciations for the different names in the name list
and selects the names which have similar pronunciations. Knowledge of frequent phone insertion, deletion, and substitution errors can be incorporated in the statistical search stage to make it more accurate. The current implementation, however, is limited to the Levenshtein (i.e., string edit) distance.
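A minimal version of the reverse lookup is sketched below: the decoded phone (or syllable) string is matched against every dictionary pronunciation with the Levenshtein distance, and the closest entries form the candidate list passed on to FSG rescoring. The brute-force search and uniform edit costs are simplifying assumptions.

def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def reverse_lookup(decoded, pronunciations, n_candidates=10):
    """decoded: recognized phone/syllable sequence (list of symbols).
    pronunciations: dict name -> list of pronunciations (symbol lists).
    Returns the n_candidates names whose best pronunciation is closest."""
    scored = []
    for name, prons in pronunciations.items():
        best = min(edit_distance(decoded, p) for p in prons)
        scored.append((best, name))
    scored.sort()
    return [name for _, name in scored[:n_candidates]]

# e.g. reverse_lookup("jh aa n".split(), {"John": [["jh", "aa", "n"]],
#                                         "Juan": [["w", "aa", "n"]]})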
The N-best list of name candidates (or equivalent lattice) generated by the inverse
lookup stage can be rescored in a more constrained way. For example, an FSG recogni-
tion network can be generated for rescoring. This can be seen as an information-retrieval
problem. The advantage of such a scheme is that it makes possible a substantial reduc-
tion in computational complexity with a small trade-off in accuracy. Performance of
such techniques depends on the accuracy of the n-gram based recognizer and the pronun-
ciation variations covered in the dictionary for name retrieval. The phone is not a good
unit for information-retrieval-based name recognition schemes because of factors such
as non-native pronunciation variations and the high rate of phone insertion and deletion
in natural speech. In addition the limited context information that can be embedded
in phone level units reduces the accuracy of the recognizer. The syllable is a better
unit for reverse lookup based schemes. The syllable sequence recognition accuracy is
higher than the phone sequence recognition accuracy because the syllable constrains the
decoding process. The reverse lookup process becomes equivalent to the FSG decoder
if we use word length units. The syllable provides a compromise between the fast but
very inaccurate phone sequence recognizer and the prohibitively slow word sequence
recognizer (FSG network), which provides the upper bound on the accuracy possible in
this decoding scenario. An additional advantage of the reverse lookup scheme is that
by reducing the name candidate list it provides the flexibility of using more complex
algorithms, which would have been computationally prohibitive for a large FSG net-
work. The reverse lookup approach is similar to the retrieval schemes used in spoken
document retrieval [TGL+00], and OOV and named entity detection [GFW98]. However, by
decoding with syllables we implicitly add syllable length phonotactic constraints which
coupled with the larger length of the unit (covering more acoustic context) helps in
boosting the recognition accuracy.
We will next describe the spoken names corpus that we used in our experiments and
the training setup.
Figure 6.2: Information retrieval scheme for name recognition provides scalability: a bigram-based phoneme recognizer produces a phone string (e.g., 'jh aa n'), statistical string matching against a dictionary of ~100K names yields a short candidate list (e.g., John, Juan), and FSG rescoring of the candidates selects the final output.
6.4 Training for spoken name recognition: Corpora and
Implementation
6.4.1 Initial TIMIT system
As the first step we built three independent recognition systems (phone, syllable and
hybrid) using the TIMIT speech corpus. For the TIMIT corpus, we had 3137 syllables
with about 70% of the words being either monosyllabic or bisyllabic (Table 6.1).
The speech data from TIMIT was down-sampled to 8 kHz. 26 mel-frequency cepstral coefficients were extracted at a frame rate of 10 ms using a 16 ms Hamming window. First- and second-order differentials plus an energy component were used. For
the baseline phone-based recognizer, 46 three-state left-to-right phone models from the CMU phoneset were initialized and trained on hand-labeled data provided in the TIMIT corpus. These were then cloned to yield triphone-level models, which underwent reestimation. Tree-based clustering was used for state tying to ensure proper training of the
models. Output distributions were approximated by eight Gaussians per state. Subse-
quently the syllable and the hybrid system were initialized as described in Section 6.2
and were trained on the acoustic data. We used the TIMIT models to seed the models
for the NAMES corpus as described in the next section.
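As an illustration of the front end described above, the sketch below computes cepstral features with comparable settings (8 kHz audio, 10 ms frame rate, 16 ms Hamming window, first- and second-order deltas plus log energy) using librosa. The library, the file name, and the choice of 26 mel filters are assumptions made for illustration; they are not the toolkit or exact configuration used to train these models.

import numpy as np
import librosa

# Load and (if necessary) resample to 8 kHz; "utt.wav" is a placeholder path.
y, sr = librosa.load("utt.wav", sr=8000)

hop = int(0.010 * sr)   # 10 ms frame rate    -> 80 samples
win = int(0.016 * sr)   # 16 ms analysis window -> 128 samples

# 26 cepstral coefficients per frame, Hamming-windowed frames.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=26, n_fft=256,
                            win_length=win, hop_length=hop,
                            window="hamming", n_mels=26)

# First- and second-order differentials.
d1 = librosa.feature.delta(mfcc, order=1)
d2 = librosa.feature.delta(mfcc, order=2)

# Per-frame log energy as the extra component.
frames = librosa.util.frame(y, frame_length=win, hop_length=hop)
log_energy = np.log(np.sum(frames ** 2, axis=0) + 1e-10)

n = frames.shape[1]
features = np.vstack([mfcc[:, :n], d1[:, :n], d2[:, :n],
                      log_energy[np.newaxis, :]])
print(features.shape)  # (79, number_of_frames)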
6.4.2 NAMES corpus
The primary speech corpus of interest to us for spoken name recognition is the OGI
NAMES corpus [SNP02]. The NAMES corpus is a collection of name utterances, cov-
ering first, last and full names, collected from several thousand different speakers over
the telephone. The name pronunciation is fairly natural since the speakers were not
reading the names off a list. Word level transcriptions are provided for all name utter-
ances, and some of the utterances are also labeled phonetically. The phonetically labeled
files were used to make a names dictionary, which was augmented with some additional
name entries from public domain dictionaries such as Cambridge University's BEEP dictionary and the CMU dictionary. The NAMES corpus is sampled at 8 kHz and has about 6.3 hours of speech data. There are about 10000 unique names in the corpus with a rich phonetic coverage that amounts to 40% of the possible phone pairs. Tables 6.1 and 6.2 describe the occurrence frequency for words of different syllabic count in the TIMIT and NAMES corpora. As can be seen, most names are bi- or trisyllabic, unlike in TIMIT, which has a higher monosyllabic content mainly due to function words such as 'and' and 'the'. Also, words with smaller syllable counts are used more frequently in generic sentences of the nature found in TIMIT. Syllable count distributions for a conversational speech corpus such as Switchboard can be seen in [GHO+01].
6.4.3 Bootstrapping from TIMIT
We used the models trained on TIMIT to bootstrap the models for the NAMES database.
Both the TIMIT and the NAMES dictionaries were merged to yield a single phonetic
dictionary, which was then converted to a syllabic dictionary. For the phone level rec-
ognizer we used the context-independent phone models from TIMIT as initial proto-
types to build a NAMES CD phone system. For the syllable and the hybrid unit system
we used the final TIMIT models as prototypes for the syllables common between the
two databases, which are around 1200 in number. Table 6.3 shows the distribution of
the syllables common between TIMIT and the NAMES database for various syllable
lengths. We can see that a large number of the shorter syllables (2-3 phones in length)
can be initialized from TIMIT. Single phone syllables such as “ax” (as in “about” ax
b aw t) are not shown in Table 6.3 as they can be initialized directly from phone mod-
els. The remaining syllables in the NAMES lexicon were initialized using the techniques
described in Section 6.2 from the NAMES CD phone models.
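The bootstrapping step above can be summarized in a short sketch: convert the merged phonetic dictionary to a syllabic one and decide, for each syllable in the NAMES lexicon, whether it can be seeded from an existing TIMIT syllable model or must be initialized from the NAMES CD phone models. The syllabify() routine stands in for an external syllabifier (e.g., the NIST tool cited as [Fis]) and is a placeholder, as are the dictionary structures; this is an illustration of the bookkeeping, not the actual training code.

def syllabify(phones):
    """Placeholder for an external syllabifier (e.g., NIST tsylb [Fis]):
    maps a phone sequence to a list of syllables, each a tuple of phones."""
    raise NotImplementedError

def to_syllabic_dictionary(phonetic_dict):
    """phonetic_dict: word -> list of phone sequences.
    Returns word -> list of syllable sequences."""
    return {word: [syllabify(pron) for pron in prons]
            for word, prons in phonetic_dict.items()}

def syllable_inventory(syllabic_dict):
    """All distinct syllables appearing in a syllabic dictionary."""
    return {syl for prons in syllabic_dict.values()
                for pron in prons for syl in pron}

def partition_for_seeding(timit_syllables, names_syllables):
    """Syllables shared with TIMIT are seeded from the trained TIMIT models;
    the rest are initialized by concatenating NAMES CD phone models."""
    seed_from_timit = names_syllables & timit_syllables
    init_from_cd_phones = names_syllables - timit_syllables
    return seed_from_timit, init_from_cd_phones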
6.5 Results and discussion
After training, the first set of recognition experiments was conducted on the TIMIT test
set to verify our ideas. Subsequently we focused on evaluation of our system on the
NAMES recognition task.
6.5.1 Preliminary TIMIT based Experiments
We developed a hybrid unit speech recognition system, which was trained and evaluated
on the TIMIT corpus using the train and test sets provided. The TIMIT corpus contains
a total of 6300 sentences with 1344 sentences in the test subset. A basic word bigram language model trained on TIMIT was used in our experiments.
Number of syllables    1      2      3      4     5
%Words                 23%    47%    23%    5%    0.8%

Table 6.1: Distribution of words and their syllable count for the training part of the TIMIT corpus. Total number of words in this training corpus was 40000.

Number of syllables    1      2      3      4     5
%Words                 13%    50%    30%    6%    0.3%

Table 6.2: Distribution of words and their syllable count for the NAMES corpus. Total number of words was 10000.

Syllable Length                   2      3      4      5
Percentage of common syllables    30%    53%    15%    2%

Table 6.3: Distribution of syllables common to TIMIT and NAMES and their length. Total number of common syllables is around 1200. This table does not include single phone syllables.
We built two sets of models: the first used CD phone initialization of syllables as described in Section 6.4. For performance comparison purposes, we also built syllable and word level systems using the standard flat start and embedded training of the acoustic models [GHO+01].
Performance improvements through larger acoustic units
We compared the performance of recognizers using the syllable and word level units
with the phone-based system. Using the CD phone system for initialization guarantees
that the syllable and word level systems perform similarly to the baseline phone case even without reestimation. In Table 6.4 we show the recognition accuracy obtained with three sets of acoustic models: context-independent syllable, context-independent word, and
context-dependent phone models at different stages of parameter reestimation. The ini-
tial accuracy is identical to the CD phone system for the word case and is slightly lower
for syllable models, which can be attributed to the lack of context modeling across sylla-
ble boundaries. The recognition accuracy subsequently improves with reestimation for
units that have significant coverage in the training data. The context-dependent phone
models were not reestimated as no performance gains could be observed by additional
rounds of reestimation.
We compared the word recognition accuracy achieved with syllable and word sys-
tems trained using the standard flat start strategy. As can be seen in Table 6.5, the choice
of initialization strategy makes a significant difference in performance. Assuming
that a typical syllable or word level model will have three or more phones, the number
of parameters to be estimated for the model is around 3 times that for the phone models.
Thus a large number of units in the flat start method are poorly trained. An analysis
of the recognition errors confirms that the performance difference between the flat start
method and CD phone-based initialization can be attributed to units that do not occur
frequently in the training data.
Hybrid lexical unit recognizer
An analysis of recognition performance on the training data was used to decide which representation of a word is best. In our present implementation we choose from pure syllabic, word, or phone representations only, i.e., no mixing of syllable and phone units within a word. We also tried an alternative, simpler scheme that uses instance counts to do the model selection: the word representation was preferred over the syllable representation if the word occurred more than a hundred times in the training data, and the syllable was preferred over the phone if the occurrence count was more than 75. The performance of the counts-based scheme was found to be lower than that of the first scheme, and it involved experimentation with the count cut-off thresholds. Thus we discarded this approach and used the first scheme in our experiments.
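For reference, the count-based fallback described above reduces to a simple per-word rule. The sketch below is a minimal illustration of one way it could be written, using the 100 and 75 thresholds quoted in the text; the helper inputs (per-word and per-syllable training counts, a syllabified pronunciation) and the example counts are hypothetical.

def choose_representation(word, word_counts, syllable_counts,
                          syllabic_pron, phonetic_pron,
                          word_thresh=100, syl_thresh=75):
    """Count-based unit selection: use a word unit if the word is frequent
    enough, otherwise syllables if its syllables are frequent enough,
    otherwise fall back to phones."""
    if word_counts.get(word, 0) > word_thresh:
        return ("word", [word])
    if all(syllable_counts.get(s, 0) > syl_thresh for s in syllabic_pron):
        return ("syllable", list(syllabic_pron))
    return ("phone", list(phonetic_pron))

# Illustrative usage with made-up counts and pronunciations.
word_counts = {"the": 2500, "john": 40}
syllable_counts = {"jh_aa_n": 80}
print(choose_representation("john", word_counts, syllable_counts,
                            syllabic_pron=["jh_aa_n"],
                            phonetic_pron=["jh", "aa", "n"]))
# -> ('syllable', ['jh_aa_n'])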
Recognizer Type                First Reestimation    Third Reestimation
Context-Independent Syllable   72%                   85%
Context-Independent Word       74%                   87%
Context-Dependent Phone        74%                   74%

Table 6.4: TIMIT word recognition accuracy results with syllable and word level units at different stages of reestimation after CD phone initialization, compared to the baseline phone recognizer.
Recognizer Type                Flat initialization    CD phone-based initialization
Context-Independent Syllable   80%                    85%
Context-Independent Word       81%                    87%

Table 6.5: TIMIT word recognition accuracy results for syllable and word level units with and without CD phone-based initialization after three reestimations.
The complexity of the recognizers was evaluated in terms of the total number of
states the models required (see Table 6.6). The total number of states is linked to the
memory requirements and speed of the Viterbi decoder. As expected, the hybrid unit
recognizer has a substantially lower complexity as compared to the syllable or word
recognizers. Interestingly, the hybrid unit recognizers also had a slightly higher accu-
racy. We can attribute this to the unit selection process, which ensures that the hybrid
recognizers include only those units that are robustly trained. It is important to note
that the complexity in terms of the physical model states is much lower for the context-dependent phone recognizer. The CD phone recognizer had around 4000 distinct physical states and about 18000 logical states. No form of state tying was used for word and syllable models. It should be noted that for linear decoding systems such as HTK in FSG mode, which do not minimize the decoding graph at the acoustic unit level, the decoding complexity is independent of the number of states and depends just on the number of words. For these recognizers the number of states should be taken as a measure of the acoustic model memory requirements only.

Recognizer Type                Accuracy    Number of model states
Context-Independent Word       87%         43380
Context-Independent Syllable   85%         24460
Hybrid Unit Recognizer         90%         13450

Table 6.6: Word recognition accuracy for TIMIT and complexity (in number of states) of syllable, word and hybrid lexicon recognizers.
The TIMIT corpus is designed primarily for phone recognition experiments and
the results should be interpreted with the proverbial pinch of salt. The performance improvements on TIMIT that we could achieve using syllables over the context-dependent phonetic system are much higher than what we could achieve in LVCSR tasks. Consider, for example, an LVCSR system where syllable-based models were investigated more recently. For the MALACH automatic speech transcription system designed for spontaneous interview speech from Holocaust survivors [SNR03], the WER improvement achieved by using syllables was only about 2% (in relative terms) when compared to the corresponding phonetic system. However, we were still able to achieve a 15% improvement in recognizing keywords and names, a task which is similar in nature to the spoken name recognition task we have addressed here.
6.5.2 Evaluation on the spoken name recognition task
Comparative performance evaluation between the syllable-based and phone-based sys-
tems is presented for both the FSG network based spoken name recognition and the
information retrieval schemes. As discussed in Section 6.2, the syllable recognizer can
be initialized in two different ways. The first scheme, which provides full coverage of
the syllables in the lexicon, will be referred to as the syllable recognizer. The alternate
design strategy, in which we restrict syllable units to those that have adequate coverage in the training data, will be referred to as the hybrid recognizer. The initialization
and training of all models were done in the manner described in Section 6.4 using both
the TIMIT and NAMES databases.
Scheme I: FSG Networks
As a first step, we performed comparative evaluation of the phone-based recognizer
with the syllable and the hybrid recognizer for a FSG based recognition task. In the
case of the hybrid recognizer, the syllable was preferred over the phone if the occur-
rence count was more than 50. The training vocabulary completely covered the list of
names for recognition. We randomly selected a section of the NAMES database com-
prising 6000 utterances for evaluating the recognition performance, and the remaining
4000 utterances in the NAMES database were used for training. We compared the per-
formance of the three recognition systems on this set. The results are given in Table 6.7. The results for phone models compare well with previous results on similar size name lists [ABD+98].
As can be seen from these results for the FSG recognition task, the performance of
the two syllable-based recognizers is significantly better than that of the phone-based
recognizers.
Recognizer Type                           Word recognition accuracy (%)
Context-Independent Phone Recognizer      45
Context-Dependent Phone Recognizer        63
Context-Independent Hybrid Recognizer     75
Context-Independent Syllable Recognizer   80

Table 6.7: Recognition rates for different FSG based spoken name recognition systems on the 6000 utterance test set.
We next compared how the performance of the phone-based and the (pure) syllable-
based recognizer scales with increasing word list size. We trained both the systems
on 1K names and increased the size of the test name list from 1K to 10K. Figure 6.3
shows the recognition accuracy with increasing vocabulary size. As can be seen from
the figure, the (rate of) performance drop for the syllable-based system is less than the
drop for the phone-based system.
Scheme II : Information Retrieval
In the case of very large name lists, we first used phone and syllable N-gram decod-
ing graphs to identify the underlying unit sequence (phones or syllables) and then used
reverse dictionary lookup based on statistical string matching to identify the name (see
Section 6.3). In the first stage for reverse lookup, we identified the two best unit
sequence candidates. We used bigram phone and syllable language models estimated
on the NAMES corpus to do phone and syllable sequence recognition. In the next stage
a name candidate list was selected using reverse lookup in the dictionary (Section 6.3).
For the syllable recognizer we split the recognizer output sequence into the corresponding phone sequence in order to calculate the Levenshtein distance. All names whose
distance was less than a threshold were selected for creating the FSG network for the
second stage. The FSG network assigns equal weight to all the candidate names. The
threshold for selection was taken to be a function of the recognized sequence length and
the recognizer type. The recognition speed of the reverse lookup scheme depends on the
computational complexity of the first stage and the number of name candidates gener-
ated for the second stage FSG rescoring. The simplicity of the phonotactic decoding
network leads to a very fast operation for the first stage. However the first stage output
has low accuracy and consequently a large number of candidate names need to be con-
sidered in the second stage FSG rescoring to maintain accuracy. The syllable system
has relatively high complexity for the first stage, but the second stage FSG size can be
much smaller because of the higher accuracy of syllable sequence recognition (Section
6.3). We chose the distance threshold such that the two recognizers had about the same
decoding speed. As can be seen from our results in Table 6.8 the syllable recognizer has
a significantly higher accuracy at the same decoding speed.
[Figure 6.3: recognition accuracy (65-90%) plotted against word list size (1000-10000) for the phoneme and syllable recognizers.]
Figure 6.3: Plot of recognizer accuracy vs word list size for phoneme and syllable recognizers.
Acoustic unit                  Word recognition accuracy after FSG rescoring (%)
Context-Independent Phone      45
Context-Dependent Phone        61
Context-Independent Syllable   73

Table 6.8: Spoken name recognition accuracy for the information retrieval scheme after FSG rescoring of the compacted name list.
6.5.3 Syllables and pronunciation variation in names
We did a preliminary analysis of the nature of improvements achieved by using syllable
models instead of phoneme models for spoken name recognition. Our analysis indicates
that the syllable system was much better at recognizing non-native names. However, the OGI corpus is woefully lacking in non-native content, making it difficult to conclusively state that syllables help in modeling the pronunciation variation found in non-native names. To study this further, we are in the process of collecting a spoken names corpus at USC, which will have a large fraction of names of foreign origin spoken by both native and non-native speakers of English drawn from the student population at USC. The observation that syllables help in modeling pronunciation variation for names is also supported by our results on the MALACH corpus [SNR03], which has a large collection of foreign names. Compared to the overall 2% WER improvement, the WER improvement for names was about 15% with the syllable-based models over context-dependent phones.
6.6 Conclusion
In this chapter we presented a case for using syllables for spoken name recognition.
We also implemented a system trained and tested on TIMIT, which uses syllable and
word level units in conjunction with CD phone units. To address the problem of training
data sparsity for word and syllable units, we used CD phone-based initialization, which
guarantees that the higher level units will give performance equivalent to the CD phone recognizer even in the absence of any acoustic training data.
Our results, which compare the performance of the CD phone-based system with syllable and word level systems on the TIMIT and NAMES recognition tasks, show that substantial performance gains can be achieved by using longer acoustic units. However, given the limited nature of the vocabulary and language models used, the TIMIT experiments we conducted are not really indicative of a full-scale LVCSR system.
We implemented a similar approach to syllable design in [SNR03] for the MALACH task and found that, in general, WER gains from using syllables were much less pronounced than the gains observed specifically in spotting keywords and names.
Using information retrieval techniques for spoken name recognition allows for a
substantial reduction in the recognizer complexity with a small trade-off in accuracy.
As our results indicate, the syllable is a very promising unit for such schemes. This
can be attributed to the low insertion and deletion rate of syllables. We are inves-
tigating techniques that will help us incorporate knowledge of frequent phone inser-
tion/deletion/substitution in the statistical search stage. This will help in improving the
name candidate list search for the FSG rescoring stage.
We believe that using larger units can also help in solving the spoken name pronunciation generation problem. Names have varied pronunciations, and in tasks such as directory
assistance the name lists may have more than 100K names, making it impossible to
have manual generation or verification of the pronunciation dictionaries. Extending the
current phone-based techniques for pronunciation generation such that they use variable
length units (demisyllables/syllables/words) would open the possibility of compensating for phonemic representation ambiguity in the acoustic model itself.
We are in the process of collecting a spoken names corpus containing names from
different linguistic origins spoken by native and non-native speakers of English. We
believe that this corpus will help in better comparison of the performance of syllables
with phonetic models in dealing with the pronunciation variation found in non-native
speech. Depending on their linguistic origin, names differ widely in their phonetic coverage and average length. For example, Chinese names are usually shorter than American
names. The effect of these factors on spoken name recognition would be an interesting
study for the future.
Chapter 7
Conclusion and future work
This thesis presents my research work directed towards automating the process of build-
ing topic specific language models for speech recognition applications given a generic
text resource and a small set of in-domain utterances either as text or as raw speech data.
In most cases, only a small fraction of the data in generic text resources such as
the web or large collections such as GigaWord is suited to building limited domain
applications. Given the vast size and diversity of these resources, access to the data
from these sources is usually restricted through a search engine which, given a set of queries, returns documents containing those query terms. The first step in acquiring data for building a topic-based model is to generate the right query terms. I have presented a Relative Entropy based scheme for generating queries from an initial in-domain language model and a generic background language model; the scheme identifies keywords and keyphrases, which helps in selecting data most relevant to a task. The retrieved data is then evaluated for fitness to the task, and in the next iteration we use the queries and URLs that returned the data most suited to the task.
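One way such a keyword selection step could be realized is sketched below: rank vocabulary terms by their pointwise contribution to the KL divergence between the in-domain unigram distribution and the background distribution, and take the top-scoring terms as queries. This is a minimal, hedged illustration of the idea rather than the exact scoring or keyphrase extraction used in the thesis; the smoothing floor, the cutoff, and the toy corpora are assumptions.

import math
from collections import Counter

def query_terms(in_domain_tokens, background_tokens, top_k=20, eps=1e-9):
    """Score each term w by p_in(w) * log(p_in(w) / p_bg(w)), i.e. its
    contribution to KL(in-domain || background), and return the top terms."""
    in_counts = Counter(in_domain_tokens)
    bg_counts = Counter(background_tokens)
    n_in = sum(in_counts.values())
    n_bg = sum(bg_counts.values())
    scores = {}
    for w, c in in_counts.items():
        p_in = c / n_in
        p_bg = bg_counts.get(w, 0) / n_bg + eps  # floor unseen background terms
        scores[w] = p_in * math.log(p_in / p_bg)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Illustrative usage with toy token lists.
in_dom = "flight booking airline fare booking".split()
backgr = "the a of flight news sports weather the a".split()
print(query_terms(in_dom, backgr, top_k=3))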
Even with well-crafted queries, the data acquired from generic resources may contain large amounts of out-of-domain text. For a source like the WWW, the acquired data tends to be noisy and contains a lot of spurious text corresponding to embedded web advertisements, links, etc. It is necessary to clean the acquired data and preprocess it before using it to build language models. The data filtering task is carried out in two stages. In the first stage, we remove spurious content using a rejection model that is
updated at every iteration. A relative entropy (RE) based criterion is used to incrementally grow a subset of sentences such that their distribution matches the domain of interest. The data selection scheme makes no assumptions about how the data was collected or about the use of specific web crawling and querying techniques. The methods we have developed can be seen to supplement the research effort by the machine translation community on identifying web resources [RS03, FHV05] or using web counts [KLO02] for language modeling. We also believe that this work can augment topic-based LM adaptation techniques. Topic-based LM adaptation schemes typically use LSA [Bel00] or variants [BJ06] to automatically split the available training text across multiple topics. This allows for better modeling of each individual topic in the in-domain collection. The trade-off is that since the available text is split across topics, each individual model is trained on less data. We believe that this problem can be addressed by selecting data for each topic from a large generic corpus using the proposed data selection algorithm.
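The incremental relative entropy criterion can be illustrated with a much-simplified, unigram sketch: keep a sentence only if adding it to the selected pool moves the pool's distribution closer (in KL divergence) to the in-domain distribution. This is not the algorithm as implemented in the thesis (which operates on n-gram language models over randomized passes); the smoothing, single fixed pass, and toy data below are assumptions.

import math
from collections import Counter

def kl(p_counts, q_counts, vocab, alpha=0.1):
    """KL(P || Q) over a fixed vocabulary with add-alpha smoothing."""
    n_p, n_q = sum(p_counts.values()), sum(q_counts.values())
    V = len(vocab)
    total = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / (n_p + alpha * V)
        q = (q_counts[w] + alpha) / (n_q + alpha * V)
        total += p * math.log(p / q)
    return total

def select_sentences(candidates, in_domain_counts, vocab):
    """Greedy single pass: keep a sentence only if adding its words to the
    selected pool reduces KL(in-domain || selected pool)."""
    pool = Counter()
    selected = []
    best = kl(in_domain_counts, pool, vocab)
    for sent in candidates:
        trial = pool + Counter(sent)
        d = kl(in_domain_counts, trial, vocab)
        if d < best:          # relative entropy gain: accept the sentence
            pool, best = trial, d
            selected.append(sent)
    return selected

# Toy usage: the in-domain text is about flights; candidates are mixed.
in_dom = Counter("book a flight to boston flight fare".split())
cands = [s.split() for s in ["cheap flight fare to boston",
                             "local sports scores tonight"]]
vocab = set(in_dom) | {w for s in cands for w in s}
print(select_sentences(cands, in_dom, vocab))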
We introduce the use of Latent Dirichlet Allocation based unsupervised clustering
for generating query terms and merging data such that we are able to cover the set
of topics implicit in the domain of interest. This helps in ensuring that the adaptation
corpus has a balanced coverage of the domain of interest. In addition, by augmenting raw speech or ASR hypotheses with text selected from the generic corpus through appropriate querying and data selection methods, we are able to achieve significant performance improvements in both WER and perplexity.
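As a sketch of how such unsupervised topic clustering might be set up, the snippet below trains a small LDA model with gensim and reads off the top words of each latent topic as candidate query terms. gensim, the toy documents, and the topic and term counts are assumptions for illustration; they are not the toolkit or settings used for the experiments in this thesis.

from gensim import corpora, models

# Toy in-domain documents (placeholders for tokenized in-domain text).
docs = [["flight", "fare", "booking", "airline"],
        ["hotel", "booking", "reservation", "room"],
        ["airline", "schedule", "departure", "flight"]]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]

# A small LDA model; the number of topics is an assumption.
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

# Top words of each latent topic become candidate query terms, so that the
# queries cover the different topics implicit in the domain.
for k in range(lda.num_topics):
    terms = [w for w, _ in lda.show_topic(k, topn=3)]
    print("topic", k, "->", terms)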
An interesting question in this scenario, where potentially large amounts of training data can be obtained for building language models at low cost, is the rate at which the language model converges to the true distribution and how the WER correlates with the language model. To study this, I ran simulations in which increasing amounts of data were sampled from a language model that resembles the task model. A language model was estimated on this simulated data, and a relative entropy measure was used to measure how close the estimated language model was to the seed distribution.
WER results are also presented.
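A much-simplified, unigram version of that simulation is sketched below: sample increasing amounts of data from a known "seed" distribution, estimate a model from the samples, and track the relative entropy of the estimate with respect to the seed. The vocabulary, seed probabilities, smoothing, and sample sizes are arbitrary illustrative choices, not those used in the thesis experiments.

import math
import random
from collections import Counter

random.seed(0)
vocab = ["the", "flight", "fare", "to", "boston"]
seed_probs = [0.35, 0.25, 0.15, 0.15, 0.10]   # "true" task distribution

def kl_to_seed(sample, alpha=0.5):
    """KL(seed || estimate), with add-alpha smoothing of the estimate."""
    counts = Counter(sample)
    n = len(sample)
    total = 0.0
    for w, p in zip(vocab, seed_probs):
        q = (counts[w] + alpha) / (n + alpha * len(vocab))
        total += p * math.log(p / q)
    return total

for n in [100, 1000, 10000, 100000]:
    sample = random.choices(vocab, weights=seed_probs, k=n)
    print(n, round(kl_to_seed(sample), 5))   # divergence generally shrinks as n grows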
7.1 Future work
7.1.1 Query Generation
The query generation process can benefit from the use of better query feedback mechanisms. In the current implementation, a naive version of query feedback was implemented by selecting the top URLs for recursive crawling and selecting the top query terms for refining queries in future iterations. This was partially motivated by the need to use public
search interfaces of web search engines. Query refinement schemes that assign frac-
tional weights to query terms can be used to augment search results based on the data
selection process such that query terms from documents that are more relevant to the
task are assigned a higher weight.
7.1.2 Data Selection
The effect of varying data granularity for data selection has not been studied in this
work. We have used sentence level selection, but the selection process can also be naturally extended to groups of sentences, a fixed number of words, paragraphs, or even entire documents. Selection of data in smaller chunks has the potential to select data better suited
to the task but may result in over-fitting to the existing in-domain distribution. In such a
case the adaptation model will provide little extra information to the existing model.
The proposed method can be combined with rank-and-select schemes described in
Section 4.8. A direction of research would be the use of ranking to reorder the data such
that the sequential selection process gives better results with a smaller number of randomized searches.
The current framework relies on multiple traversals of data in random sequences to
identify the relevant subset. An online single-pass version of the algorithm would be of interest in cases where the text data is available as a continuous stream (one such source is RSS feeds from blogs and news sites). If updates from the stream sources are frequent, iterating through the entire text collection is not feasible. A promising idea for making the selection process single-pass is to use multiple instances of the algorithm with different initial in-domain models generated by bagging. Voting across these multiple instances can then be used to select data. Another idea worth investigating is to select sentences with a probability proportional to the relative entropy gain instead of the threshold-based approach currently used.
Another interesting research problem is to extend the algorithm to work directly on
collections of n-gram counts. One motivation for research in this direction is that Google
has released aggregate unigram to 5-gram counts for their entire web snapshot [BF06].
7.1.3 Other applications
Other applications in the NLP domain can benefit from selective data acquisition. In particular, the application of the ideas presented in this thesis to text classification should yield interesting results. In the text classification context, the data acquisition can be based on mutual information or TF-IDF to generate the initial query set. Transductive learning and co-training can be used to include downloaded data appropriately into the training set. Active learning can also be used to reduce the set of labeled documents required after data acquisition.
References
[ABD+98] A. Abella, B. Buntschuh, G. DiFabbrizio, C. Kamm, M. Mohri, S. Narayanan, S. Marcus, and R. D. Sharp. VPQ: A spoken language interface to large scale directory information. In Proceedings of ICSLP, 1998.
[BCR98] R. Billi, F. Canavesio, and C. Rullent. Automation of Telecom Italia directory assistance service: field trial results. In IEEE Workshop on Interactive Voice Technology for Telecommunication Applications (IVTTA), 1998.
[Bel00] J. R. Bellegarda. Large vocabulary speech recognition with multispan statistical language models. IEEE Transactions on Speech and Audio Processing, 8, 2000.
[BF06] Thorsten Brants and Alex Franz. Web 1T 5-gram Version 1. LDC Catalog ID: LDC2006T13, 2006.
[BJ06] B. J. Hsu and J. Glass. Style and topic language model adaptation using HMM-LDA. In Proceedings of EMNLP, 2006.
[BKK+03] N. J. Belkin, D. Kelly, G. Kim, J.-Y. Kim, H.-J. Lee, G. Muresan, M.-C. Tang, X.-J. Yuan, and C. Cool. Query length in interactive information retrieval. In Proceedings of ACM SIGIR, 2003.
[BM98] A. Berger and R. Miller. Just-in-time language modeling. In Proceedings
of ICASSP, 1998.
[BOS03] Ivan Bulyko, Mari Ostendorf, and Andreas Stolcke. Getting more mileage
from web text sources for conversational speech language modeling using
class-dependent mixtures. In Proceedings of HLT, 2003.
[BR03] Michiel Bacchiani and Brian Roark. Unsupervised language model adap-
tation. In Proceedings of ICASSP, 2003.
[BSWW03] Francoise Beaufays, Ananth Sankar, Shaun Williams, and Mitch Wein-
traub. Learning linguistically valid pronunciations from acoustic data. In
Proceedings of Eurospeech, 2003.
[Car97] R. C. Carrasco. Accurate computation of the relative entropy between
stochastic regular grammars. RAIRO (Theoretical Informatics and Appli-
cations), 31(5):437–444, 1997.
[CBR98] Stanley Chen, Douglas Beeferman, and Ronald Rosenfeld. Evaluation
metrics for language models. In Proceedings of DARPA Broadcast News
Transcription and Understanding Workshop, 1998.
[CG96] Stanley Chen and Joshua T. Goodman. An empirical study of smoothing
techniques for language modeling. In Proceedings of the 34th Annual
meeting of the ACL, 1996.
[CMU+95] Ronald A. Cole, Joseph Mariani, Hans Uszkoreit, Annie Zaenen, and Victor Zue. Survey of the state of the art in human language technology, 1995.
[CTC02] Steve Cronen-Townsend and W. Bruce Croft. Quantifying query ambiguity. In Proceedings of HLT, 2002.
[DMB] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[DNHP97] Neeraj Deshmukh, Julie Ngan, Jonathan Hamaker, and Joseph Picone. An
advanced system to generate pronunciations of proper nouns. In Proceed-
ings of ICASSP, 1997.
[eld] TC-STAR: Technology and corpora for speech to speech translation.
http://www.tc-star.org.
[FHV05] Fei Huang, Ying Zhang, and Stephan Vogel. Mining key phrase translations from web corpora. In Proceedings of EMNLP, 2005.
[Fis] M. Fisher. Syllabification Software. The Spoken Natural Language Pro-
cessing Group, National Institute of Standards and Technology, Gaithers-
burg, Maryland, U.S.A.
[GA02] Lucian Galescu and James Allen. Pronunciation of proper names with a
joint n-gram model for bi-directional grapheme to phoneme conversion.
In Proceedings of ICSLP, 2002.
[GFW98] Petra Geutner, Michael Finke, and Alex Waibel. Phonetic-distance-based hypothesis driven lexical adaptation for transcribing multilingual broadcast news. In Proceedings of ICSLP, 1998.
[GHO+01] A. Ganapathiraju, J. Hamaker, M. Ordowski, G. Doddington, and J. Picone. Syllable-based large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, May 2001.
[GJM03] Rayid Ghani, Rosie Jones, and Dunja Mladenic. Building minority lan-
guage corpora by learning to generate web search queries. Journal of
Knowledge and Information Systems (KAIS), 2003.
[GN96] H. Gish and K. Ng. Parameter trajectory models for speech recognition.
In Proceedings of ICSLP, 1996.
[Gre97] S. Greenberg. On the origins of speech intelligibility in the real world.
In Proceedings of the ESCA workshop on robust speech recognition for
unknown channels, Apr 1997.
[Gre98] S. Greenberg. Speaking in shorthand - a syllable-centric perspective for
understanding pronunciation variation. In Proceedings of the ESCA Work-
shop on Modeling Pronunciation Variation for Automatic Speech Recog-
nition, 1998.
[GSBT] T. L. Griffiths, M. Steyvers, D. Blei, and J. B. Tenenbaum. Integrating topics and syntax. In Advances in Neural Information Processing Systems.
[Hofa] Thomas Hofmann.
[Hofb] Thomas Hofmann. Probabilistic latent semantic analysis.
[Joa99] T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning, 1999.
[Joa03] T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[Kah76] D. Kahn. Syllable-Based Generalizations in English Phonology. PhD
thesis, Indiana University Linguistics Club, Bloomington, Indiana, USA,
1976.
[Kir96] K. Kirchhoff. Syllable-level desynchronisation of phonetic features for
speech recognition. In Proceedings of ICSLP, 1996.
[KLO02] Frank Keller, Maria Lapata, and Olga Ourioupina. Using the web to over-
come data sparseness. In Proceedings of EMNLP, 2002.
[Kor97] F. Korkmazskiy. Generalized mixture of HMM’s for continuous speech
recognition. In Proceedings of ICASSP, 1997.
[LC01] Victor Lavrenko and W. Bruce Croft. Relevance-based language models. In Proceedings of SIGIR, 2001.
[LDL+03] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of ICDM, 2003.
[Lee99] Lillian Lee. Measures of distributional similarity. In 37th Annual Meeting
of the Association for Computational Linguistics, pages 25–32, 1999.
[Lip96] R. Lippmann. Speech perception by humans and machines. In Proceed-
ings of the workshop on the auditory basis of speech perception, Jul 1996.
[Mas72] D. W. Massaro. Perceptual images, processing time and perceptual units in auditory perception. Psychological Review, 79:124-145, 1972.
[McC96] Andrew Kachites McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
[MK06] Teruhisa Misu and Tatsuya Kawahara. A bootstrapping approach for
developing language model of new spoken dialogue systems by selecting
web texts. In Proceedings of ICSLP, 2006.
[MNRS99] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore.
Building domain-specific search engines with machine learning tech-
niques. In AAAI Spring Symposium on Intelligent Agents in Cyberspace
1999, 1999.
[Nea03] S. Narayanan, P. G. Georgiou, et al. Transonics: A speech to speech system for English-Persian interactions. In Proceedings of IEEE ASRU, 2003.
[NOH+05a] Tim Ng, Mari Ostendorf, Mei-Yuh Hwang, Ivan Bulyko, Manhung Siu, and Xin Lei. Web-data augmented language model for Mandarin speech recognition. In Proceedings of ICASSP, 2005.
[NOH+05b] Tim Ng, Mari Ostendorf, Mei-Yuh Hwang, Manhung Siu, Ivan Bulyko, and Xin Lei. Web-data augmented language model for Mandarin speech recognition. In Proceedings of ICASSP, 2005.
[OC02] P. Ogilvie and J. Callan. Experiments using the Lemur toolkit. In Proceedings of the 2001 Text REtrieval Conference (TREC 2001), 2002.
[OOW+95] J. Odell, D. Ollason, P. Woodland, S. Young, and J. Jansen. The HTK Book for HTK V2.0. Cambridge University Press, Cambridge, UK, 1995.
[OS05] O. Cetin and Andreas Stolcke. Language modeling in the ICSI-SRI spring 2005 meeting speech recognition evaluation. ICSI Technical Report TR-05-006, 2005.
[Pel01] Bryan Pellom. SONIC: The University of Colorado continuous speech recognizer, 2001.
[PKM+05] Daniel Povey, Brian Kingsbury, Lidia Mangu, George Saon, Hagen Soltau, and Geoff Zweig. fMPE: Discriminatively trained features for speech recognition. In Proceedings of ICASSP, 2005.
[PO00] H. Printz and P. Olsen. Theory and practice of acoustic confusability. In Proceedings of ASRU, 2000.
[PRWZ01] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research report, 2001.
[Rat96] A. Ratnaparkhi. A maximum entropy part-of-speech tagger. In Proceed-
ings of EMNLP, 1996.
[RBF+00] M. Riley, W. Byrne, M. Finke, S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Saraclar, C. Wooters, and G. Zavaliagkos. Stochastic pronunciation modeling from hand-labelled phonetic corpora. Speech Communication, 29:229-224, 2000.
[RDI99] Bhuvana Ramabhadran, Sabine Deligne, and Abraham Ittycheriah.
Acoustics-based baseform generation with pronunciation and/or phono-
tactic models. In Proceedings of Eurospeech, 1999.
[RJP03] Bhuvana Ramabhadran, Jing Juang, and Michael Picheny. Towards automatic transcription of large spoken archives - English ASR for the MALACH project. In Proceedings of ICASSP, 2003.
[RS03] Philip Resnik and Noah A. Smith. The web as a parallel corpus. Computational Linguistics, pages 349-380, 2003.
[RSM+06] B. Ramabhadran, O. Siohan, L. Mangu, G. Zweig, M. Westphal, H. Schulz, and A. Soneiro. The IBM 2006 speech transcription system for European parliamentary speeches. In Proceedings of ICSLP, 2006.
[Sea00] Andreas Stolcke et al. The SRI March 2000 Hub-5 conversational speech transcription system. In Proceedings of NIST Speech Transcription Workshop, 2000.
[SGG05] Ruhi Sarikaya, Agustin Gravano, and Yuqing Gao. Rapid language model
development using external resources for new spoken dialog domains. In
Proceedings of ICASSP, 2005.
[SGN05] Abhinav Sethy, Panayiotis Georgiou, and Shrikanth Narayanan. Building
topic specific language models from web-data using competitive models.
In Proceedings of Eurospeech, 2005.
[SGN06] Abhinav Sethy, Panayiotis G. Georgiou, and Shrikanth Narayanan. Text
data acquisition for domain-specific language models. In Proceedings of
EMNLP, 2006.
[SNP02] Abhinav Sethy, Shrikanth Narayanan, and S. Parthasarathy. A syllable based approach for improved recognition of spoken names. In Proceedings of the ISCA Pronunciation Modeling Workshop, 2002.
[SNP05] Abhinav Sethy, Shrikanth Narayanan, and S. Parthasarathy. Hierarchical
speech recognition using syllable and word-level acoustic units with appli-
cation to spoken name recognition. 2005.
[SNR03] Abhinav Sethy, Shrikanth Narayanan, and Bhuvana Ramabhadran.
Improvements in English ASR for the MALACH project using syllable-
centric models. In Proceedings of IEEE ASRU, 2003.
[SO03] Izhak Shafran and Mari Ostendorf. Acoustic model clustering based on
syllable structure. Computer Speech and Language, 17(4):311–328, 2003.
[Spr01] R. Sproat. Normalization of non-standard words. Computer Speech and
Language, 2001.
[SRK05] Olivier Siohan, Bhuvana Ramabhadran, and Brian Kingsbury. Constructing ensembles of ASR systems using randomized decision trees. In Proceedings of ICASSP, 2005.
[SRN04] Abhinav Sethy, Bhuvana Ramabhadran, and Shrikanth Narayanan. Mea-
suring convergence in language model estimation using relative entropy.
In Proceedings of ICSLP, 2004.
[SS00] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based sys-
tem for text categorization. Machine Learning, 2000.
[Sto98a] Andreas Stolcke. Entropy-based pruning of backoff language models. In
Proceedings of DARPA Broadcast News Transcription and Understanding
Workshop, 1998.
[Sto98b] Andreas Stolcke. Entropy-based pruning of backoff language models.
In DARPA Broadcast News Transcription and Understanding Workshop,
1998.
[Sto02] Andreas Stolcke. SRILM - an extensible language modeling toolkit. In
Proceedings of ICSLP, 2002.
[TGL+00] J. Van Thong, D. Goddeau, A. Litvinova, B. Logan, P. Moreno, and M. Swain. SpeechBot: A speech recognition based audio indexing system for the web, 2000.
[THT03] Gokhan Tur and Dilek Hakkani-Tur. Exploiting unlabeled utterances for
spoken language understanding. In Proceedings of Eurospeech, 2003.
[TSHT03] G. Tur, R. Schapire, and D. Hakkani-Tur. Active learning for spoken lan-
guage understanding. In Proceedings of ICASSP, 2003.
[VW03] A. Venkataraman and W. Wang. Techniques for effective vocabulary selec-
tion. In Proceedings of Eurospeech, 2003.
[WKMG98] Su-Lin Wu, Brian Kingsbury, Nelson Morgan, and Steven Greenberg.
Incorporating information from syllable-length time scales into automatic
speech recognition. In Proceedings of ICASSP, 1998.
[WSY06] Karl Weilhammer, Matthew N. Stuttle, and Steve Young. Bootstrapping language models for dialogue systems. In Proceedings of ICSLP, 2006.
[YRN99] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[ZEV04] Bing Zhao, Matthias Eck, and Stephan Vogel. Language model adaptation for statistical machine translation via structured query models. In Proceedings of Coling, 2004.
[Zhu05] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005. http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf.
Abstract
The ability to build task-specific language models, rapidly and with minimal human effort, is an important factor for fast deployment of natural language processing applications such as speech recognition in different domains. Although in-domain data is hard to gather, we can utilize easily accessible large sources of generic text such as the Internet (WWW) or the GigaWord corpus for building statistical task language models by appropriate data selection and filtering methods. We propose a query generation and data weighting strategy which iteratively acquires data from such sources using a set of adaptive models to greatly improve the performance achieved from models built from limited in-domain data.