Weighted Factor Automata: A Finite-State Framework for Spoken Content Retrieval by Do˘ gan Can A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Science) August 2018 Copyright 2018 Do˘ gan Can To my incredible wife Esin. ii Acknowledgements I would like to thank my advisor Dr. Shrikanth Narayanan for guiding me through this long journey and giving me the opportunity to explore various research ideas even if they did not fit our short term research goals at the time. I would like to thank my qual and dissertation committee members Dr. Panayiotis Georgiou, Dr. Kevin Knight, Dr. Daniel Marcu and Dr. Prem Natarajan for accepting to be on my committees and for all the feedback they provided. A special thank you goes to Dr. Georgiou for always being there when I needed someone to talk to. I would like to thank all of my friends and colleagues at SAIL. You are too numerous to name here but I thank you for being the best friends and colleagues someone could hope for. I would like to thank my parents and my sister for bearing with me when I took on this long journey and supporting me in every way they could. Finally, my dear wife Esin, this work would not have been possible if it were not for your unwavering support. You have sacrificed so much for me to do this PhD. I am forever in your debt. You are my hero. iii This work uses IARPA-babel105b-v0.4 Turkish full language pack from the IARPA Babel Program language collection and was partially supported by the Intelligence Advanced Research Projects Activ- ity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0012, DARPA, NSF and NIH. The U.S. Government is authorized to reproduce and dis- tribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as neces- sarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government. iv Table of Contents Dedication ii Acknowledgements iii List Of Tables vii List Of Figures viii Abstract ix Chapter 1: Introduction 1 1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Chapter 2: Preliminaries 8 2.1 Semirings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Weighted Finite-State Transducers and Automata . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Factor Automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.1 N-gram Mapping Transducer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.2 Alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Chapter 3: Lattice Indexing for Keyword Search 15 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2 Timed Factor Transducer of Weighted Automata . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.2 Construction of the Timed Factor Transducer . . . . . . . . . . . . . . . . . . . . 19 3.2.3 Factor Selection . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . 24 3.2.4 Search over the Timed Factor Transducer . . . . . . . . . . . . . . . . . . . . . . 25 3.2.5 Comparison with the Factor Transducer and the Modified Factor Transducer . . . . 25 3.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3.1 Turkish Broadcast News (TBN) KWS System . . . . . . . . . . . . . . . . . . . . 27 3.3.2 English Broadcast News (EBN) KWS System . . . . . . . . . . . . . . . . . . . . 28 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.4.1 Index Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.4.2 Search Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 v Chapter 4: ComputingN-gram Posteriors from Lattices 35 4.1 Computation of N-gram Posteriors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2 Experiments and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Chapter 5: Computing Information Retrieval Statistics from Spoken Corpora 49 5.1 Term Frequency Factor Automaton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.2 Document Posterior Factor Automaton . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.3 Document Frequency Factor Automaton . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.4 TF-IDF Factor Automaton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.5 Experiments and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Chapter 6: Open Vocabulary Keyword Search 57 6.1 Retrieval Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6.1.1 Query Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.1.2 Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 6.1.3 Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 6.2 Weighted Sequence Alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 6.4.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Chapter 7: Conclusion 82 Reference List 84 vi List Of Tables 2.1 Common semirings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.1 Breakdown of TBN Database (in hours) . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2 R-IV Query Set Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.3 DRYRUN06-IV Query Set Decomposition (w.r.t. Phonetic Length) . . . . . . . . . . . . . 28 3.4 Number of Factors and Index Size v.s. 
Total Lattice Size (STDDEV06 Data Set, Bold column indicates beam width 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.5 Per query average search times (in ms) (Beam Width 4) . . . . . . . . . . . . . . . . . . . 31 3.6 Per query average search times (in ms) w.r.t. query length (TBN-R Data Set, R-IV Query Set, Beam Width 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.7 Per result average search times (in ms) w.r.t. query length (TBN-R Data Set, R-IV Query Set, Beam Width 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.1 Runtime Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2 Factor Automata Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.1 Runtime Results: DF Factor Automata vs. Baseline . . . . . . . . . . . . . . . . . . . . . 54 5.2 Factor Automata Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.1 KWS results for the development keywords. . . . . . . . . . . . . . . . . . . . . . . . . . 77 6.2 KWS results for the evaluation keywords. . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.3 ATWV/MTWV results for the evaluation keywords for different n-gram orders. . . . . . . 79 vii List Of Figures 3.1 Weighted automata (a)A 1 and (b)A 2 over the real semiringR along with the state timing listst 1 = [0; 1; 2; 3] andt 2 = [0; 1; 2; 3]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 (a)B 1 and (b)B 2 over the real semiring obtained by applying the preprocessing algorithm to the automataA 1 andA 2 in Figure 3.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 Construction ofT 1 from the weighted automatonB 1 in Figure 3.2(a) and the state timing listt 1 = [0; 1; 2; 3]: after factor generation. . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.4 Construction ofT 1 from the weighted automatonB 1 in Figure 3.2(a) and the state timing listt 1 = [0; 1; 2; 3]: (a) after factor merging overRTT 0 , (b) after factor disambigua- tion, (c) after optimization overTTT . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.5 (a) TFTT overRTT 0 , (b) FT overR and (c) MFT overR obtained from the weighted automata and the state timing lists in Figure 3.1. Output labels on the non-final arcs of the MFT represent the associated time intervals, i.e. “a:0-2” means there is an “a” from time 0 to 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6 Index size vs. total lattice size (TBN-R Data Set) . . . . . . . . . . . . . . . . . . . . . . 31 3.7 Per result average search times vs. phonetic query length (Phonetic STDDEV06 Data Set, Phonetic DRYRUN06-IV Query Set, Beam Width 4) . . . . . . . . . . . . . . . . . . . . 32 4.1 Runtime comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2 Memory use comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 6.1 Edit transducerE used for expanding keywords. It does not allow any insertions or dele- tions at keyword boundaries. At most two consecutive insertions are allowed. Deletions after insertions are not allowed. is a special consuming symbol matching all symbols in the alphabet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 6.2 Edit transducer E 0 used for aligning lattice arcs with keyword arcs. It does not allow deletions at the beginning of the keyword. 
is a special symbol matching any symbol in the alphabet. is a special symbol matching any symbol in the alphabet if there are no other matches at a state. is a regular symbol matching symbols on the lattice arcs. . . . 67 6.3 Weighted sequence alignment runtime vs lattice size. . . . . . . . . . . . . . . . . . . . . 76 viii Abstract Spoken Content Retrieval (SCR) integrates Automatic Speech Recognition (ASR) and Information Re- trieval (IR) to provide access to large multimedia archives based on their contents. There are several tasks of varying difficulty that fall under the SCR umbrella. Among them, Keyword Search (KWS) is one of the harder tasks, where the goal is to locate exact matches to an open vocabulary query term in a large heterogenous speech corpus. The retrieval operation is required to be fast, so the data must be indexed ahead of time for fast search. Since ASR transcripts are often highly erroneous in real world scenarios due to model weaknesses, especially in languages and domains where supervised resources are limited, all of these requirements must be met with imperfect information about which words occur where in the corpus. We present an efficient, flexible and theoretically-sound framework for SCR based on weighted finite- state transducers. While we mainly focus on the challenging KWS task, the algorithms and representations we propose are applicable in a wide variety of scenarios where the inputs can be represented as lattices, i.e. acyclic weighted finite-state automata. Our contributions include i) novel techniques for indexing and searching a collection of ASR lattices for KWS, ii) a new algorithm for computing and indexing exact posterior probabilities for all substrings in a lattice, iii) a recipe for computing and indexing probabilistic generalizations of statistics widely used in IR, such as term frequency (TF), inverse document frequency (IDF) and TF-IDF, for all substrings in a collection of lattices, iv) a new algorithm for computing and in- dexing posterior weighted alignments between substrings in a time aligned reference string and substrings in an ASR lattice, and v) a novel approach for performing open vocabulary KWS by explicitly modeling ASR errors and redistributing lattice-based posterior estimates based on sub-word level confusions. ix Chapter 1 Introduction The ever-increasing availability of vast multimedia archives calls for solutions to efficiently index and search them. Spoken content retrieval (SCR) is a key information technology integrating automatic speech recognition (ASR) and information retrieval (IR) to provide large scale access to spoken content. In an ideal world, the ASR component would accurately convert speech to text and text retrieval methods would be applied on the recognition output. Unfortunately, state-of-the-art ASR systems are far from being reliable when it comes to transcribing unconstrained speech audio recorded in uncontrolled environments. This is especially true for heterogenous spoken archives and domains with limited supervised resources. Hence, relying entirely on ASR transcripts is not an option for most real world SCR tasks. In a realistic SCR scenario, the end-user should be able to perform open-vocabulary search over a large collection of spoken documents in a matter of seconds. Therefore, the speech corpus must be indexed prior to search without the advance knowledge of the query terms. This is a challenging task. 
In text retrieval, the corpus is reliable in the sense that it is known whether a particular word occupies a particular position in a document. In SCR, however, the corpus is the output of the ASR component which is inherently unreliable. The ASR output is not only rife with errors (insertions, deletions and substitutions), but also is limited to the ASR system vocabulary which does not necessarily cover all possible query words. As a consequence, we have to somehow address these shortcomings while indexing spoken documents. 1 Lattices are acyclic graphs that store weighted hypotheses in a compact form. Indexing ASR lattices, instead of the best ASR hypothesis, is a widely used SCR method [72, 4, 75] for dealing with low quality ASR output. In this method, spoken documents are divided into short segments called utterances. For each utterance an ASR lattice, representing ASR hypotheses, is generated. Then a probability is assigned to whether a word sequence occupies a particular position, i.e a time interval, in an utterance given the ASR lattice produced for that segment. These probabilities are stored in a soft (probabilistic) inverted index. During search, soft entries allow multiple word sequences to occupy the same position in a document thereby significantly increasing the recall rates for in-vocabulary (IV) query terms. Open vocabulary search is a demanding criterion that complicates the SCR problem altogether. Since the query terms may well be beyond the coverage of the ASR vocabulary, word level retrieval methods are usually not satisfactory. Sub-word indexing, another well-studied SCR method [72, 75, 47, 18], enables the retrieval of out-of-vocabulary (OOV) query terms by performing the search at the sub-word level. In this approach, sub-word level ASR output is stored in an inverted index and retrieval is performed by converting the query to its sub-word representation and searching the index for matching sequences. Sub- word indexing is typically combined with lattice indexing to get even better results. Most SCR systems tackle the problem of producing sub-word level representations for OOV query terms at the retrieval stage by employing pronunciation models. However, it is not an easy task to generate proper pronunciations for queries that are not covered by the ASR vocabulary. For that reason, most SCR systems utilize inexact matching methods or confusion models to handle the discrepancies between the actual and hypothesized pronunciations. As with any IR application, keeping a balance between recall and precision is essential in SCR tasks. All above mentioned methods have the common goal of increasing recall rates in SCR applications. For that purpose, they relax the conditions under which an index entry is accepted as a match to the query. In this relaxed setup, higher recall rates usually translate into lower precision figures which call for effective methods to discriminate actual hits from false alarms. On that account, ranking or scoring initial retrieval results and filtering out potential false positives constitutes an indispensable ingredient of any SCR system. 2 While these retrieval techniques are crucial for high-performing SCR systems, they also bring in sig- nificant processing and storage overhead, and hence necessitate space and search-time efficient implemen- tations. 
Since processing requirements are the actual factors that determine whether an SCR technique is applicable to real world problems, the applicability of these methods is contingent on space and search- time efficient implementations that provide flexible retrieval mechanisms which can adapt to the query at hand, e.g. different querying schemes for IV and OOV words. ASR lattices store a large amount of connectivity information which is hard to capture with off-the- shelf text retrieval systems. In [72], the authors propose an exact approach that constructs an inverted index from ASR lattices while storing the full connectivity information. They outline a method for exact calculation ofn-gram expected counts from the information contained in the index. In [4], the computation and indexing of exactn-gram expected counts is formulated as a sequence of transformations on the input lattices which significantly reduces the amount of calculation that should be performed at retrieval time. Other researchers argue that the information contained in ASR lattices is redundant for SCR applications [19, 86, 47], and discard most of the connectivity information present in ASR lattices while indexing and resort to approximate lattice representations, such as confusion networks (CNs) [48] and position specific posterior lattices (PSPLs) [19]. These structures project the complex network of connections in the original lattice to a strictly linear representation that is easier to index and search. The inverted indices derived from these approximate structures are typically significantly smaller than those derived from lattices since it is sufficient to index individual word hypotheses to answer word sequence queries. In tasks like spoken document retrieval (SDR), where both the query and the documents contain a large amount of redundancy for retrieval, it is hard to argue against the value of approximate structures. In fact, single best ASR output is deemed satisfactory for most SDR applications as long as the ASR accuracy is relatively high. In other SCR tasks, however, keeping the connectivity information may play a crucial role. Spoken utterance retrieval (SUR) and keyword search (KWS), a.k.a. spoken term detection, are two such tasks that aim to find respectively the utterances and the time intervals in those utterances which contain the exact sequence of words given as the query. In these tasks, the optimal measures for detection are 3 respectively the utterance and occurrence posteriors. The posteriors computed from approximate lattice structures underestimate the exact posteriors computed from ASR lattices since some of the probability mass is redistributed to word sequences that are not in the original lattices. In some cases, this flattening of posteriors can be beneficial for retrieval since word sequences not hypothesized by the ASR decoder, hence not contained in ASR lattices, can be retrieved. However in other cases this disruption of posteriors is detrimental to the retrieval performance since the implicit factorization of word sequence posteriors into individual word posteriors can significantly underestimate the probability of word sequences that were actually hypothesized by the ASR system. Using weighted finite-state automata (WFSA) [57] to represent ASR lattices and approximate struc- tures derived from ASR lattices has much appeal due to the general-purpose search, optimization and combination algorithms supplied by the WFSA framework. 
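To make the distinction drawn above between expected counts and occurrence/utterance posteriors concrete, the following self-contained Python sketch enumerates the paths of a toy lattice and computes, for a query term, both the utterance posterior (the probability that the term occurs at least once) and the expected count. This is illustrative only: the toy lattice is given directly as a list of weighted paths rather than as a WFSA, the numbers and names are invented, and explicit path enumeration is exactly what the constructions discussed in later chapters avoid.

```python
# Toy "lattice": (label sequence, posterior probability of that path).
# Illustrative data only; real lattices are acyclic WFSA and are never
# enumerated path by path.
paths = [
    (("a", "b", "a"), 0.5),
    (("a", "b", "b"), 0.3),
    (("b", "b", "a"), 0.2),
]

def occurrences(seq, term):
    """Number of (possibly overlapping) occurrences of `term` in `seq`."""
    n, m = len(seq), len(term)
    return sum(1 for i in range(n - m + 1) if tuple(seq[i:i + m]) == tuple(term))

def utterance_posterior(paths, term):
    """P(term occurs at least once | lattice): sum of posteriors of the paths
    that contain the term."""
    return sum(p for seq, p in paths if occurrences(seq, term) > 0)

def expected_count(paths, term):
    """E[count of term | lattice]: each occurrence contributes the posterior
    of the path it lies on."""
    return sum(p * occurrences(seq, term) for seq, p in paths)

term = ("a", "b")
print(utterance_posterior(paths, term), expected_count(paths, term))   # 0.8 0.8
# The two measures diverge as soon as a path can contain the term twice:
print(utterance_posterior(paths, ("a",)), expected_count(paths, ("a",)))  # 1.0 1.5
```

The point of the indexing algorithms in the following chapters is to obtain such quantities for every factor of every lattice without enumerating either the paths or the factors.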
The problem of indexing and searching a set of acyclic WFSA can be posed as an extension to the extensively studied problem of searching for patterns in a collection of text documents [35, 27]. An efficient solution [12] to the latter problem makes use of a structure known as the factor automaton [54]. A factor automaton (FA) is an efficient data structure for representing all factors (substrings) of a set of strings (e.g. a finite-state automaton). It is a very efficient sequence index and is suitable for SCR applications like KWS and SUR, where exact sequence matches are desired. In general, it is desirable to associate a weight with each path in a factor automaton to store auxiliary information about each factor. Conceptually, this is a simple association of weights to paths and can be easily achieved by enumerating the paths and assigning the desired weight to each path. However this ap- proach is not feasible even for a relatively small factor automaton since the number of paths is exponential in the size of the automaton and the weight computation to be done for each path is typically non-trivial. The fundamental challenge in associating desired weights with factors is achieving this mapping without enumerating the paths in the input automaton or the output factor automaton. The factor automata construction algorithm described in [4] associates each factor with the expected count of that factor given the probability distribution defined by the ASR lattice generated for an input 4 utterance. These expected counts are then used to rank utterances with respect to how likely they are to include a query term in SUR. The algorithm exploits the fact that expected counts can be computed as a sum over individual occurrences of each factor and utilizes general weighted automata determinization, minimization and epsilon removal algorithms [53] to efficiently transform the input lattice (represented as a WFSA) to a weighted factor automaton. Expected counts are not the only retrieval relevant statistic that can be derived from ASR lattices. When the input automata are probabilistic, there are several meaningful statistics that can be associated with factors. In SUR, ideally we would like to use posterior probabilities rather than expected counts computed from ASR lattices for ranking utterances with respect to how likely they are to include a query term. In keyword search (KWS), we would like each factor to be associated with a set of (occurrence- posterior, start time, end time) triplets. In spoken document retrieval (SDR), we would like factors to be associated with their term frequency and TF-IDF values. While computing these items for a single factor is straightforward, computing them for all factors of a WFSA within a reasonable amount of time and space is non-trivial. 1.1 Contributions In this dissertation, we present an efficient, flexible and theoretically-sound framework for SCR based on weighted finite-state transducers. While the focus is on the KWS problem, the algorithms and representations presented here are appli- cable to a wide range of content-based retrieval problems where the content to be indexed and searched is sequential and the output of the frontend recognition component can be represented as a weighted lattice of hypothesized sequences. These include not only other SCR problems, such as SUR and SDR, but also problems from other domains, such as indexing and retrieving the output of an optical character recognition system or a DNA sequencer. 
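The quantity that the construction of [4] attaches to each factor, its expected count, can be checked by brute force on a tiny input. The sketch below (illustrative data and names, not code from this work) enumerates every factor of every path of a hand-made two-path lattice; the weighted factor automaton encodes the same mapping compactly and is built without this enumeration, which is infeasible in general because the number of paths can be exponential in the size of the automaton.

```python
from collections import defaultdict

# Toy lattice as explicit weighted paths (path posteriors sum to one).
paths = [
    (("a", "b", "a"), 0.6),
    (("b", "a", "b"), 0.4),
]

def factors(seq):
    """All non-empty factors (substrings) of a label sequence, with multiplicity."""
    for i in range(len(seq)):
        for j in range(i + 1, len(seq) + 1):
            yield tuple(seq[i:j])

# Brute-force reference for what the weighted factor automaton of [4] stores:
# each occurrence of a factor contributes the posterior of its path.
expected = defaultdict(float)
for seq, posterior in paths:
    for f in factors(seq):
        expected[f] += posterior

for f, c in sorted(expected.items()):
    print(" ".join(f), round(c, 3))
# e.g. ("a",)      -> 0.6 * 2 + 0.4 * 1 = 1.6
#      ("a", "b")  -> 0.6     + 0.4     = 1.0
```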
The main contributions of this dissertation are: 5 1. Novel techniques for indexing and searching a collection of ASR lattices for KWS (previously pub- lished as [13]). 2. A new algorithm for computing and indexing exact posterior probabilities for all substrings in a lattice (previously published as [17]). 3. A recipe for computing and indexing probabilistic generalizations of statistics widely used in IR, such as term frequency (TF), inverse document frequency (IDF) and TF-IDF, for all substrings in a collection of lattices (previously published as [16]). 4. A new algorithm for computing and indexing posterior-weighted alignments between substrings in a time aligned reference string and substrings in a corresponding ASR lattice. 5. A novel approach for performing open vocabulary KWS by explicitly modeling ASR errors and redistributing lattice-based posterior estimates based on sub-word level confusions. 1.2 Organization This dissertation is organized as follows. Chapter 2 gives the definitions and notation related to semirings and weighted finite-state transducers that is used throughout the rest of this work. Chapter 3 describes the algorithms and data structures we developed for indexing and searching ASR lattices to perform the KWS task, and provides experimental results showing the advantages of our approach. Chapter 4 describes the algorithm we developed for computing and indexing exact posterior probabilities for all substrings in a lattice, and provides experimental results showing the advantages of our approach compared to previous state-of-the-art. In Chapter 5 we provide a recipe for computing and indexing probabilistic generaliza- tions of statistics widely used in IR, such as term frequency (TF), inverse document frequency (IDF) and TF-IDF, for all substrings in a collection of lattices by extending the weighted factor automaton concept to collections. In Chapter 6 we describe a novel approach for performing open vocabulary KWS by ex- plicitly modeling ASR errors and redistributing lattice-based posterior estimates based on sub-word level 6 confusions. We train our models on posterior-weighted sub-word sequence alignments derived from time aligned reference texts and corresponding ASR lattices. We also describe a new algorithm for computing and indexing these alignments. In Chapter 7 we provide our conclusions and discuss future directions. 7 Chapter 2 Preliminaries In this section, we first introduce the general algebraic notion of a semiring along with semirings widely used in text and speech processing [43, 5, 53]. Then we recap the relevant string and finite-state automata definitions and terminology used throughout the paper. Please refer to [53] for an in depth introduction to weighted finite state automata. 2.1 Semirings Definition 1 A monoid is a triple (K; ; 1) where is a closed associative binary operator on the setK, and 1 is the identity element for . A monoid is commutative if is commutative. Definition 2 A semiring is a 5-tuple (K;; ; 0; 1) where (K;; 0) is a commutative monoid, (K; ; 1) is a monoid, distributes over and 0 is an annihilator for . A semiring is idempotent if8a2K;aa = a. If is also commutative, we say that the semiring is commutative. Lemma 1 If (K;; ; 0; 1) is an idempotent semiring, then the relation defined by 8a;b2K : (ab) () (ab =a) 8 Table 2.1: Common semirings. 
SEMIRING SET 0 1 Boolean f0; 1g _ ^ 0 1 Real R + [f+1g + 0 1 Max-times R + [f+1g max 0 1 Log R[f1; +1g log + +1 0 Tropical R[f1; +1g min + +1 0 Arctic R[f1; +1g max + 1 0 a log b = log(e a +e b ) is a partial order overK, called the natural order [51] overK. It is a total order if and only if the semiring has the path property:8a;b2K; ab =a orab =b. Table 2.1 lists common semirings. In speech and language processing, two semirings are of particular importance. The log semiringL is isomorphic to the familiar real or probability semiring via the negative- log morphism and can be used to combine probabilities in the log domain. The tropical semiringT , which is isomorphic to the max-times semiring, provides the algebraic structure necessary for most shortest-path algorithms and can be derived from the log semiring using the Viterbi approximation. Note thatT is idempotent and the natural order overT 8a;b2R[f1; +1g : (minfa;bg =a) () (ab) is a total order, the usual order of real numbers [51]. Alternatively one can use the max-convention to obtain another idempotent semiringT 0 , which we will call the arctic semiring (as opposed to tropical), over which the natural order is another total order, this time the reverse order of real numbers. Definition 3 The product semiring of two partially-ordered semiringsA = (A; A ; A ; 0 A ; 1 A ) andB = (B; B ; B ; 0 B ; 1 B ) is defined as AB = (AB; ; ; 0 A 0 B ; 1 A 1 B ) 9 where and are component-wise operators, e.g. 8a 1 ;a 2 2A;b 1 ;b 2 2B : (a 1 ;b 1 ) (a 2 ;b 2 ) = (a 1 A a 2 ;b 1 B b 2 ): The natural order overAB, given by ((a 1 ;b 1 ) (a 2 ;b 2 )) () (a 1 A a 2 =a 1 ;b 1 B b 2 =b 1 ); defines a partial order, known as the product order, even ifA andB are totally-ordered. Definition 4 The lexicographic semiring of two partially-ordered semiringsA andB is defined as AB = (AB; ; ; 0 A 0 B ; 1 A 1 B ) where is a component-wise multiplication operator and is a lexicographic priority operator,8a 1 ;a 2 2 A,b 1 ;b 2 2B: (a 1 ;b 1 ) (a 2 ;b 2 ) = 8 > > > > > > < > > > > > > : (a 1 ;b 1 B b 2 ) a 1 =a 2 (a 1 ;b 1 ) a 1 =a 1 A a 2 6=a 2 (a 2 ;b 2 ) a 1 6=a 1 A a 2 =a 2 UnlikeAB, the natural order overAB, given by ((a 1 ;b 1 ) (a 2 ;b 2 )) () (a 1 =a 1 A a 2 6=a 2 ) or (a 1 =a 2 andb 1 =b 1 B b 2 ); defines a total order, known as the lexicographic order, whenA andB are totally-ordered. 10 More generally, one can define the product and lexicographic orders (or semirings) on the Cartesian product ofn ordered sets. SupposefA 1 ;A 2 ; ;A n g is ann-tuple of sets, with respective total orderings f 1 ; 2 ; ; n g. The product order offA 1 ;A 2 ; ;A n g is defined as (a 1 ;a 2 ;:::;a n ) (b 1 ;b 2 ;:::;b n ) () a i i b i 8in: Similarly, the lexicographic order offA 1 ;A 2 ; ;A n g is defined as (a 1 ;a 2 ;:::;a n ) (b 1 ;b 2 ;:::;b n ) () 9m> 0; a i =b i 8i<m; a m m b m : That is, for one of the termsa m m b m and all the preceding terms are equal. We should also note that product (or lexicographic) semiring onfA 1 ;A 2 ; ;A n g can be recursively defined using the associativity of (or) operator: A 1 A 2 A n = (( (A 1 A 2 ) )A n ): 2.2 Weighted Finite-State Transducers and Automata Definition 5 A weighted finite-state transducer T over a semiring (K;; ; 0; 1) is an 8-tuple T = (; ;Q;I;F;E;;) where: , are respectively the finite input and output alphabets;Q is a finite set of states;I;FQ are respectively the set of initial and final states;EQ([f"g)([f"g)KQ is a finite set of arcs; :I!K, :F!K are respectively the initial and final weight functions. 
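The following minimal Python sketch, not part of the thesis, implements two of the semirings of Table 2.1 in the cost (negative-log) domain and uses them to compute the ⊕-sum of accepting-path weights of a small acyclic automaton written in the arc notation of Definition 5. The automaton, the state numbering and the function names are assumptions made purely for illustration.

```python
import math

class LogSemiring:
    """Log semiring in the -log domain: oplus is stable log-add, otimes is +."""
    zero, one = math.inf, 0.0
    @staticmethod
    def oplus(a, b):                     # -log(e^-a + e^-b)
        if a == math.inf: return b
        if b == math.inf: return a
        return min(a, b) - math.log1p(math.exp(-abs(a - b)))
    @staticmethod
    def otimes(a, b):
        return a + b

class TropicalSemiring:
    """Viterbi approximation of the log semiring: oplus is min, otimes is +."""
    zero, one = math.inf, 0.0
    oplus = staticmethod(min)
    @staticmethod
    def otimes(a, b):
        return a + b

# A tiny acyclic weighted automaton in the style of Definition 5:
# arcs are (source, label, weight, target); state 0 is initial, state 3 is final.
arcs = [(0, "a", 0.7, 1), (0, "b", 1.2, 1), (1, "a", 0.4, 2), (2, "b", 0.1, 3)]
initial, final, num_states = 0, 3, 4

def shortest_distance(K):
    """oplus-sum of path weights from the initial state (forward distances),
    assuming the states are already numbered in topological order."""
    d = [K.zero] * num_states
    d[initial] = K.one
    for src, _, w, dst in arcs:          # arcs visited in source-state order
        d[dst] = K.oplus(d[dst], K.otimes(d[src], w))
    return d[final]

print(shortest_distance(TropicalSemiring))  # best-path cost: 0.7 + 0.4 + 0.1 = 1.2
print(shortest_distance(LogSemiring))       # -log of the summed path probabilities
```

The same recursion serves both purposes: with the log semiring it accumulates total probability mass, and with the tropical semiring it becomes a Viterbi best-path computation, which is the isomorphism noted above.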
Given an arce2E, we denote byi[e] its input label,o[e] its output label,w[e] its weight,s[e] (orp[e]) its source or previous state andt[e] (orn[e]) its target or next state. A path =e 1 e k is an element of E with consecutive arcs satisfyingt[e i1 ] = s[e i ]; i = 2;:::;k: We extendt ands to paths by setting t[] = s[e k ] andt[] = s[e 1 ]. The labeling and the weight functions can also be extended to paths by definingi[] =i[e 1 ]:::i[e k ],o[] =o[e 1 ]:::o[e k ] andw[] =w[e 1 ] ::: w[e k ]. We denote by (q;q 0 ) 11 the set of paths fromq toq 0 and by (q;x;y;q 0 ) the set of paths fromq toq 0 with input labelx2 and output labely2 . These definitions can be extended to subsetsS;S 0 Q, e.g. (S;x;y;S 0 ) = [ q2S;q 0 2S 0 (q;x;y;q 0 ): An accepting or successful path in a transducerT is a path in (I;F ). A stringx is accepted byT if there exists an accepting path labeled withx on the input side. T is unambiguous if for any stringx2 there is at most one accepting path labeled withx on the input side.T is deterministic if it has at most one initial state and at any state no two outgoing transitions share the same input label. The weight associated by a transducerT to any input-output string pair (x;y)2 is given by JTK(x;y) = M 2(I;x;y;F) (s[]) w[] (t[]) and JTK(x;y) is defined to be 0 when (I;x;y;F ) =;. We denote byd[A] the-sum of the weights of all accepting paths ofA when it is defined and inK. d[A] can be viewed as the shortest-distance from the initial states to the final states. The size of a transducerT is defined asjTj =jQj +jEj. A weighted finite-state automatonA can be defined as a weighted finite-state transducer with identical input and output labels. The weight associated by A to (x;x) is denoted by JAK(x). Similarly, in the graphical representation of weighted automata, output labels are omitted. A weighted automatonA defined over the probability semiring (R + ; +;; 0; 1) is said to be proba- bilistic if for any stateq2Q, 2(q;q) w[], the sum of the weights of all cycles atq, is well-defined and inR + and P x2 JAK(x) = 1. A probabilistic automatonA is said to be stochastic if at each state the weights of outgoing arcs and the final weight sum to one. 12 2.3 Factor Automata Definition 6 Given two stringsx;y2 ,x is a factor (substring) ofy ify = uxv for someu;v2 . More generally, x is a factor of a language L if x is a factor of some string y2 L. The factor automatonS(y) of a stringy is the minimal deterministic finite-state automaton recognizing exactly the set of factors ofy. The factor automatonS(A) of an automatonA is the minimal deterministic finite-state automaton recognizing exactly the set of factors ofA, that is the set of factors of the strings accepted by A. Factor automaton [54] is an efficient and compact data structure for representing a full index of a set of strings, i.e. an automaton. It is a great fit for retrieval applications where exact sequence matches are de- sired and there are no limits on query length and content. The search operation is as simple as intersecting [53] an automaton compiled from the queryq with the indexS. Since the index is deterministic, search complexity is linear in the length of query stringO(jqj). Further, any finite state relation, e.g. a regular expression, can be compiled into a query automaton and retrieved from the index. Factor automatonS(y) of a stringy can be built in linear time and its size is linear in the size of the input stringjyj [11, 26]. 
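As an illustration of a linear-size factor index for a single string, the sketch below uses the classic online suffix-automaton (DAWG) construction: treating every state as accepting yields a deterministic acceptor of exactly the factors of y whose size is linear in |y|, although it is not necessarily the minimal automaton S(y) and is not the particular construction of [11, 26]. Class and method names are mine.

```python
class FactorIndex:
    """Online suffix-automaton (DAWG) construction. With every state treated
    as accepting, the automaton accepts exactly the factors of the input
    string; its number of states and arcs is linear in the input length."""

    def __init__(self, y):
        self.next = [{}]        # outgoing transitions per state
        self.link = [-1]        # suffix links
        self.maxlen = [0]       # length of the longest string reaching each state
        self.last = 0
        for c in y:
            self._extend(c)

    def _new_state(self, trans, link, maxlen):
        self.next.append(trans); self.link.append(link); self.maxlen.append(maxlen)
        return len(self.next) - 1

    def _extend(self, c):
        cur = self._new_state({}, 0, self.maxlen[self.last] + 1)
        p = self.last
        while p != -1 and c not in self.next[p]:
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][c]
            if self.maxlen[p] + 1 == self.maxlen[q]:
                self.link[cur] = q
            else:               # split q by cloning it
                clone = self._new_state(dict(self.next[q]), self.link[q],
                                        self.maxlen[p] + 1)
                while p != -1 and self.next[p].get(c) == q:
                    self.next[p][c] = clone
                    p = self.link[p]
                self.link[q] = self.link[cur] = clone
        self.last = cur

    def is_factor(self, x):
        """Deterministic lookup: O(|x|) regardless of the indexed string length."""
        s = 0
        for c in x:
            if c not in self.next[s]:
                return False
            s = self.next[s][c]
        return True

idx = FactorIndex("abaab")
print(idx.is_factor("baa"), idx.is_factor("aab"), idx.is_factor("bb"))  # True True False
```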
The size of the factor automatonS(A) of an automatonA is also linear in the size of the input [54, 55]. There exist algorithms for construction ofS(A) in linear time whenA is a prefix tree [55]. In general, however, it is not known whether a linear time algorithm exists for constructing factor automata of arbitrary automata [55]. The best known algorithm for constructing factor automata of arbitrary automata [4, 56] uses automata determinization, minimization and epsilon removal algorithms [53] and therefore is worst case exponential in the size of the input automaton. Even though, it is worst case exponential, this algorithm works well in practice and experimental results indicate that its runtime is approximately linear in the size of the input when the input is acyclic. 13 2.3.1 N-gram Mapping Transducer We denote by n then-gram mapping transducer [10, 28] of ordern. This transducer maps label sequences ton-gram sequences of ordern. n is similar in form to the weighted finite-state transducer representation of a backoff n-gram language model [3]. We denote by A n the n-gram lattice of order n obtained by composing latticeA with n , projecting the resulting transducer onto its output labels, i.e. n-grams, to obtain an automaton, removing "-transitions, determinizing and minimizing [53]. A n is a compact lattice ofn-gram sequences of ordern consistent with the labels and scores of latticeA. A n typically has more states thanA due to the association of distinctn-gram histories with states. 2.3.2 Alignments Let and be two finite alphabets, and let be defined by = [f"g [f"gf(";")g. An element! of the free monoid can be viewed as one of via the concatenation: ! = (a 1 ;b 1 ) (a n ;b n )! (a 1 a n ;b 1 b n ): We denote byh the corresponding morphism from to and writeh(!) = (a 1 a n ;b 1 b n ). Definition 7 An alignment! of two stringsx over the alphabet andy over the alphabet is an element of such thath(!) = (x;y). As an example, (a;")(b;")(a;b)(";b) is an alignment ofaba andbb: x = aba" y = ""bb We denote byA(x;y) the set of all alignments between the stringsx andy. 14 Chapter 3 Lattice Indexing for Keyword Search In this chapter, we consider the problem of constructing an exact inverted index for ASR lattices with time information, i.e. we index all substrings seen in the lattices along with their time alignments and posterior probabilities. Since the number of substrings is exponential in data size, in general it is infeasible to maintain an exact index with constant search complexity, e.g. a simple hash table keyed on substrings. In text retrieval, this problem is avoided by keeping a word or n-gram index which results in a search complexity linear in the query length. However, unlike a string, an ASR lattice is not a linear structure which can be factored into words. Keeping a single word or n-gram index, on the other hand, is suboptimal since (i) such an index can not answer whether a substring actually matches a partial path in a lattice and (ii) substring occurrence probabilities have to be approximated using the probabilities assigned to individual words or n-grams. Then, the challenge is to come up with an exact index structure which will efficiently store all substring occurrences while keeping the search complexity linear in the query length. In the following sections, we generalize the index structure of [4] to accommodate the timing informa- tion and employ it in the Keyword Search (KWS) task. 
The proposed structure is a general deterministic sequence index which retains auxiliary weight information about lattice nodes, e.g. node timings in the case of KWS. We provide retrieval experiments with IV query sets and show that the proposed method is effective for both word-based and phonetic indexing. 15 Section 3.1 gives a brief review of relevant works that lead up to the current study. Section 3.2 describes the construction of the proposed inverted index structure from raw ASR lattices. Section 3.3 details our ASR architecture and the data used in experiments. Section 3.4 provides the KWS experiments evaluating the performance of the proposed index structure over a large data set. Finally in Section 3.5 we summarize the advantages of the proposed structure and discuss future directions. 3.1 Related Work In [4], the factor transducer (FT) structure [54] is extended to indexing weighted finite-state automata and employed in the SUR task. In this context, the index stores soft entries in the form of (utterance ID, expected count) pairs. Each successful index path encodes a factor appearance. Input labels of each such path carry a factor, output labels carry the utterance ID and path weight gives the expected count of the factor in the corresponding utterance. Expected term counts, mere generalizations of the traditional term frequencies, provide a good relevance metric for the SUR task. Being a deterministic automaton (except final transitions), this index offers a search complexity linear in the sum of the query length and the number of utterances in which the query term appears. Keyword Search (KWS), a.k.a Spoken Term Detection (STD), is the task of finding all of the exact occurrences of each given keyword, a sequence of words, in a large corpus of speech material. Since it is required to find the exact locations in time, ideally an inverted index for KWS should provide soft-hits as (utterance ID, start time, end time, relevance score) quadruplets. In [61], the factor transducer structure is utilized in a two-stage STD system which performs utterance retrieval followed by time alignment over the audio segments. This two-stage KWS strategy is problematic: (i) it requires a costly alignment operation which is performed online and (ii) the index stores within utterance expected term counts which are not direct relevance measures for KWS since a query term may appear more than once in an utterance. In [15], we presented a method to obtain posterior probabilities over time intervals instead of expected counts over utterances along with a modified factor transducer (MFT) structure which stores (utterance ID, start time, 16 0 1 a 2 b b 3 a (a) 0 1 b/1 2 a/2 a/1 3 b/1 (b) Figure 3.1: Weighted automata (a)A 1 and (b)A 2 over the real semiringR along with the state timing lists t 1 = [0; 1; 2; 3] andt 2 = [0; 1; 2; 3]. end time, posterior probability) quadruplets. The MFT stores the timing information on the output labels and allows to perform the KWS task in a single step greatly reducing the time spent for online retrieval. Furthermore, posterior probabilities provide a direct relevance metric for KWS solving the second issue of the two-stage strategy. On the flip side, the MFT has its own deficiencies: (i) search complexity is suboptimal since the index is non-deterministic — timing information is on the arc labels — and (ii) timing information has to be quantized to control the level of non-determinism. 
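Before turning to the construction, the following minimal sketch (illustrative names and data, not part of this work) shows the kind of record a single-pass KWS index has to return, namely the (utterance ID, start time, end time, posterior) quadruplets discussed above, and how a detection threshold over the posterior yields different operating points.

```python
from typing import NamedTuple, List

class Hit(NamedTuple):
    """One soft hit: where a keyword is hypothesized to occur and how likely
    that occurrence is. Field names are illustrative."""
    utterance_id: str
    start_time: float
    end_time: float
    posterior: float

def detect(hits: List[Hit], threshold: float) -> List[Hit]:
    """Keep candidate occurrences above a detection threshold, ranked by
    posterior; sweeping the threshold traces out the operating curve."""
    kept = [h for h in hits if h.posterior >= threshold]
    return sorted(kept, key=lambda h: h.posterior, reverse=True)

candidates = [
    Hit("utt_017", 12.34, 12.81, 0.92),
    Hit("utt_017", 44.10, 44.58, 0.08),
    Hit("utt_102",  3.05,  3.61, 0.55),
]
for h in detect(candidates, threshold=0.2):
    print(h)
```

The timed factor transducer described next returns exactly such quadruplets from a single index lookup, whereas the two-stage FT approach has to recover the timing by a separate alignment pass at query time.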
3.2 Timed Factor Transducer of Weighted Automata This section presents an algorithm for the construction of an efficient timed index for a large set of speech utterances. We propose a new factor transducer structure, timed factor transducer (TFT), which stores the timing information on the arc weights, thereby solving the issues associated with the non-deterministic factor transducer of [15]. For easy comparison, we follow the development in [4]. We assume that for each speech utterance u i of the dataset in considerationfu i ji = 1;:::;ng, a weighted automatonA i over the log semiring with alphabet (e.g. phone or word lattice output by ASR), and a listt i of state timings are given. Figure 3.1 gives examples of automata over the real semiringR. The problem consists of creating a timed index that can be used for the direct search of any factor of any string accepted by these automata. Note that this problem crucially differs from the classical text indexing problems in that the input data is uncertain. 17 Our index construction algorithm is based on general weighted automata and transducer algorithms. The main idea is that the timed index can be represented by a weighted finite-state transducerT mapping each factor x to (i) the set of automata in whichx appears, (ii) start-end times of the intervals wherex appears in each automaton and (iii) the posterior probabilities ofx actually occurring in each automaton in the corresponding time interval. We start with preprocessing each input automaton to obtain a posterior lattice in which non-overlapping arc clusters are separately labeled. Then from each processed input automaton we construct an intermediate factor transducer which recognizes exactly the set of factors of the input. We convert these intermediate structures into deterministic transducers by augmenting each factor with a disambiguation symbol and then applying weighted automata optimization. Finally, we take the union of these deterministic transducers and further optimize the result to obtain a deterministic inverted index of the entire dataset. Following sections detail the consecutive stages of the algorithm. 3.2.1 Preprocessing When the automata A i are word/phone lattices output by an ASR system, the path weights correspond to scores assigned by the language and acoustic models. We can apply to A i a general weight-pushing algorithm in the log semiring [50] which converts these weights into the desired ( log) posterior prob- abilities given the lattice. Since each input automatonA i is acyclic, i.e. a lattice, the complexity of the weight-pushing algorithm is linear in the size of the input (O(jA i j)). The algorithm given by [4] generates a single index entry for all the occurrences of a factor in an utterance. This is the desired behavior for the SUR problem. In the case of KWS, we would like to keep separate index entries for non-overlapping occurrences in an utterance since we no longer search for the utterances but the exact time intervals containing the query term. This separation can be achieved by clustering the arcs with the same input label and overlapping time-spans. The clustering algorithm is as follows. For each input label: (i) sort the collected (start time, end time) pairs with respect to end times, (ii) identify the largest set of non-overlapping (start time, end time) pairs and assign them as cluster heads, (iii) classify the rest of the arcs according to maximal overlap. 
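A minimal sketch of the clustering step just described, assuming the (start time, end time) spans of all arcs sharing an input label have already been collected; the function names and example spans are illustrative, and the thesis applies this per input label inside each lattice.

```python
def cluster_arcs(occurrences):
    """Cluster the (start, end) time spans of arcs that share an input label:
      (i)   sort spans by end time,
      (ii)  greedily keep non-overlapping spans as cluster heads,
      (iii) assign every remaining span to the head it overlaps most.
    Returns a list of cluster indices parallel to `occurrences`."""
    order = sorted(range(len(occurrences)), key=lambda i: occurrences[i][1])
    heads, last_end = [], float("-inf")
    for i in order:                          # (i) + (ii): greedy interval scheduling
        start, end = occurrences[i]
        if start >= last_end:
            heads.append(i)
            last_end = end

    def overlap(a, b):
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    labels = [None] * len(occurrences)
    for h_idx, i in enumerate(heads):
        labels[i] = h_idx
    for i, span in enumerate(occurrences):   # (iii): maximal-overlap assignment
        if labels[i] is None:
            labels[i] = max(range(len(heads)),
                            key=lambda h: overlap(span, occurrences[heads[h]]))
    return labels

# Two clearly separated occurrences of the same word, each hypothesized with
# slightly different time spans on different lattice arcs:
spans = [(0.9, 1.5), (1.0, 1.6), (7.2, 7.9), (7.3, 8.0)]
print(cluster_arcs(spans))   # [0, 0, 1, 1]
```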
We effectively convert the input automaton 18 0 1 a:1/0.5 2 b:1/0.5 b:1/1 3 a:2/1 (a) 0 1 b:1/0.333 2 a:1/0.667 a:1/1 3 b:2/1 (b) Figure 3.2: (a)B 1 and (b)B 2 over the real semiring obtained by applying the preprocessing algorithm to the automataA 1 andA 2 in Figure 3.1. to a transducer where each arc carries a cluster identifier on the output label. In other words, each (input label, output label) pair of the transducer designates an arc cluster. Note that the clustering operation does not introduce additional paths, i.e. it simply assigns each arc to an arc cluster. Figure 3.2 illustrates the application of the preprocessing algorithm to the automata of Figure 3.1. 3.2.2 Construction of the Timed Factor Transducer Let B i = (; ;Q i ;I i ;F i ;E i ; i ; i ) denote an "-free transducer over the log semiringL obtained by applying the weight pushing and clustering algorithms (of the previous section) to the automatonA i . The output string associated by B i to each input string it accepts gives the string of cluster identifiers. The weight associated byB i to each input-output string pair can be interpreted as the posterior probability of that pair for the utteranceu i given the models used to generate the automata. More generally,B i defines an occurrence probabilityP i (x;y) for each string pair (x;y)2 whereP i (x;y) is the probability of factor (x;y) givenB i . P i (x;y) is simply the sum of the probabilities of all successful paths inB i that contain (x;y) as a factor. For each stateq2Q i , we denote by i [q] the shortest distance from the initial 19 statesI i toq (= log of the forward probability) and by i [q] the shortest distance fromq to the final statesF i (= log of the backward probability): i [q] = log M 2(Ii;q) ( i (p[]) +w[]) (3.1) i [q] = log M 2(q;Fi) (w[] + i (n[])) (3.2) The shortest distances i [q] and i [q] can be computed for all statesq2Q i in linear time (O(jB i j)) since B i is acyclic [51]. Let i denote the set of all paths (including partials) inB i . Then,P i (x;y) is given by: logP i (x;y) = log M i[]=x;o[]=y 2i i [p[]] +w[] + i [n[]]: (3.3) Note that since there is a unique occurrence of the factor pair (x;y) in the utteranceu i ,P i (x;y) is a proper posterior probability even when there are multiple occurrences ofx inu i . Without the output symbolsy, Equation 3.3 would yield the expected count ofx inu i . Lett i [q] denote the timing of stateq2 Q i ,t s=e i (x;y) denote the start/end time of the factor (x;y) in B i . Then, t s i (x;y) = min i[]=x;o[]=y 2i t i [p[]]; (3.4) t e i (x;y) = max i[]=x;o[]=y 2i t i [n[]]: (3.5) Equations 3.3, 3.4 and 3.5 define the quantities we need to store for each factor. Now, we want to construct an index transducer which will map each factor (x;y) to a ( logP i (x;y);t s i (x;y);t e i (x;y)) triplet. We do this by first constructing a transducer which indexes each factor occurrence separately, i.e. each path corresponds to a factor occurrence and the weight of this path gives the corresponding (posterior prob- ability, start time, end time) triplet. To obtain the mapping we are after, we optimize this transducer on 20 0 1 ε:ε/1,0,0 3 ε:ε/1,2,0 2 ε:ε/0.5,1,0 4 ε:ε/1,3,0 5 ε:1/1,0,0 b:1/0.5,0,0 a:1/0.5,0,0 ε:1/1,0,2 a:2/1,0,0 ε:1/1,0,1 b:1/1,0,0 ε:1/1,0,3 Figure 3.3: Construction ofT 1 from the weighted automatonB 1 in Figure 3.2(a) and the state timing list t 1 = [0; 1; 2; 3]: after factor generation. 
theLT T 0 semiring so that overlapping factor occurrences are merged by adding their posterior probabilities inL with log operation, start times inT with min operation and end times inT 0 with max operation. After overlapping factors are merged, we no longer need to work overLTT 0 , so we switch to the more familiarTTT semiring (equivalent of tropical semiring onR 3 ) which allows pruning and shortest path operations. From the weighted transducer B i overL, and the state timing list t i , one can derive a timed factor transducerT i overTTT in four steps: 1. Factor Generation. In the general case we index all of the factors as follows: • Map each arc weight:w2L! (w; 1; 1)2LTT 0 ; • Create a unique initial stateq I 62Q i ; • Create a unique final stateq F 62Q i ; •8q2Q i , create two new arcs: – an initial arc: (q I ;";"; ( i [q];t i [q]; 1);q), – and a final arc: (q;";i; ( i [q]; 1;t i [q]);q F ). 2. Factor Merging. We merge the paths carrying the same factor-pair by viewing the result of factor generation as an acceptor, i.e. encoding input-output labels as a single label, and applying weighted 21 1 2 b:1/1,0,1 4 ε:1/1,0,0 3 a:2/1,0,1 ε:1/1,0,0 ε:1/1,0,0 0 a:1/0.5,0,1 b:1/1,0,2 a:2/1,2,3 (a) 1 2 b:ε/1,0,1 4 1:1/1,0,0 3 a:ε/1,0,1 2:1/1,0,0 3:1/1,0,0 0 a:ε/0.5,0,1 b:ε/1,0,2 a:ε/1,2,3 (b) 1 2 b:ε/0.5,0,1 4 1:1/0.5,0,0 3:1/1,2,2 3 a:ε/1,0,1 2:1/1,0,0 3:1/1,0,0 0 a:ε/1,0,1 b:ε/1,0,2 (c) Figure 3.4: Construction of T 1 from the weighted automaton B 1 in Figure 3.2(a) and the state timing listt 1 = [0; 1; 2; 3]: (a) after factor merging overRT T 0 , (b) after factor disambiguation, (c) after optimization overTTT . "-removal, determinization and minimization over theLTT 0 semiring. After the overlapping factors are merged, we map the arc weights: (w 1 ;w 2 ;w 3 )2LTT 0 ! (w 1 ;w 2 ;w 3 )2TTT: 3. Factor Disambiguation. We remove the cluster identifiers on the non-final arcs and insert disam- biguation symbols into the final arcs. For each edgee2E i : • Ifn[e]62F i , then assigno[e] = "; • Ifn[e]2F i , then assigni[e] =[e]. 22 4. Optimization. The result of factor disambiguation can be optimized by viewing it as an acceptor and applying weighted determinization and minimization over theTTT semiring. Figures 3.3 and 3.4 illustrate the TFT construction from a weighted automaton and a list of state timings. Factor generation step creates an intermediate factor transducer which maps exactly the set of factors of B i to the utterance IDi (possibly many times with different weights). During factor merging, overlapping factor occurrences are reduced to a single path. Let ~ T i denote the result of factor merging. It is clear from Equations 3.3, 3.4, 3.5 that for any factor (x;y)2 : J ~ T i K(x;yi) = ( logP i (x;y);t s i (x;y);t e i (x;y)) (3.6) whereyi denotes the concatenation of the string of cluster identifiersy and the automaton identifieri. The intermediate transducer ~ T i (Figure 3.4a) is not deterministic over the input labels. To make it so, we remove the cluster identifiers and augment each path with a disambiguation symbol[e] by modifying the final transitionse2E i ;n[e]2F i . It is convenient to use the previous statep[e] (more precisely a symbol derived from it) as this auxiliary disambiguation symbol in a practical implementation, i.e. [e] = p[e]. After this operation (Figure 3.4b) each final transition carries the symbol of its origin state as an input label. These symbols make sure that non-overlapping factors labeled with the same input string are kept separate during optimization. 
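The component-wise operations used during factor merging can be sketched in plain Python as follows, assuming weights are (−log posterior, start time, end time) triplets; in the author's OpenFst-based tools these are presumably realized as composite weight types, so the code below is only an illustration of the semantics. Posteriors combine with the log-semiring ⊕, start times with min (tropical ⊕), end times with max (arctic ⊕), while ⊗ is addition in every component, so the times injected on the initial and final arcs are simply carried along the path.

```python
import math

def log_add(a, b):
    """oplus of the log semiring in the -log domain: -log(e^-a + e^-b)."""
    if math.isinf(a): return b
    if math.isinf(b): return a
    return min(a, b) - math.log1p(math.exp(-abs(a - b)))

def oplus(w1, w2):
    """Merge two occurrences of the same factor: posteriors add, the start
    time takes the min, the end time takes the max."""
    return (log_add(w1[0], w2[0]), min(w1[1], w2[1]), max(w1[2], w2[2]))

def otimes(w1, w2):
    """Extend a path: all three components multiply by addition, so the time
    values injected on the initial and final arcs accumulate unchanged
    (ordinary arcs carry the multiplicative identity 0 in the time slots)."""
    return (w1[0] + w2[0], w1[1] + w2[1], w1[2] + w2[2])

# Two overlapping occurrences of one factor, with posteriors 0.5 and 0.25:
w1 = (-math.log(0.5),  0.0, 1.0)
w2 = (-math.log(0.25), 1.0, 2.0)
merged = oplus(w1, w2)
print(math.exp(-merged[0]), merged[1], merged[2])   # 0.75 0.0 2.0
```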
The resulting transducerT i (Figure 3.4c) is deterministic over the input labels and includes the augmented paths, i.e. each path inT i corresponds to an input-output factor pair in ~ T i . The timed factor transducerT (Figure 3.5a) of the entire data-set is constructed by • taking the unionU of individual TFTs: U = [ i T i ; i = 1;:::;n; • encoding the input-output labels of U as a single label and applying weighted "-removal, deter- minization, and minimization over theTTT semiring; 23 • and finally definingT as the transducer obtained after decoding the labels ofU and modifying the final arcs carrying disambiguation symbols and utterance IDs by encoding both labels as a single output label and setting the input label to ". The final optimization step merges only the partial paths since there is no successful path shared between T i ; i = 1;:::;n — each successful path has the unique automaton identifier on its final output label. The natural order ofT T T , being a total order, allows pruning before or during the final optimization if needed. Figure 3.5a illustrates the fully optimized TFT of the entire data-set. Even though this picture suggests that the TFT is nothing more than a prefix tree, this is not true in general 1 . As a matter of fact, this is exactly why this structure is a feasible index for a large collection of automata. Unlike a prefix tree, by allowing multiple incoming arcs, we are able to index exponentially many factors in a structure linear in the size of the input lattices (see Section 3.4.1). 3.2.3 Factor Selection Instead of the above given method of indexing each and every factor along with its time span and posterior probability, we can utilize factor selection filters in WFST form to restrict, transform or re-weight the index entries. [4] introduces various filters that are applied at various stages of the algorithm. Each filter is composed with some automaton, obtained in the course of the algorithm, to achieve a specific filtering operation. One such filter is a pronunciation dictionary which maps words to phone sequences. This filter is applied to the word lattices to obtain phonetic lattices. In our case, applying such a filter warrants an update of the state timings accordingly. Another example is a simple grammar which restricts the factors. This filter is applied after the factor generation step and removes the factors that are not accepted by the grammar. We utilize such a grammar to reject the silence symbol, i.e. factors including the silence symbol are not indexed. 1 Consider replacingB 2 with a simple transducer which has two states f0;1g, a single arc(0; a;1;1;1) and the state timing list [0;1]. The resulting TFT (before removing the disambiguation symbols) would be the same as the transducer in Figure 3.4c, except for an additional arc(1;1;2;(1;0;0);4). 24 3.2.4 Search over the Timed Factor Transducer The user query is typically an unweighted string, but it can also be given as an arbitrary weighted au- tomatonX. This covers the case of Boolean queries or regular expressions which can be compiled into automata. The responseR to a queryX is another automaton obtained by • composingX withT on the input side [49] and projecting the resulting transducer onto its output labels; • removing the " transitions and finally sorting with the shortest-path algorithm. R is a simple acceptor. 
Each successful path in R is a single arc (from the initial state to one of the final states) which carries an encoded automaton identifier on its label i[], and a ( log posterior probability, start time, end time) triplet on its weightw[]2T T T . A simple traversal overR in the arc order gives results sorted with the natural order ofT T T . Note thatR can be pruned before traversal to retain only the most likely responses. The pruning threshold may be varied to achieve different operating points. The full inverted index T is search-time optimal since it is a deterministic transducer except for the final " transitions which have encoded automaton identifiers on the output. Assuming we can access any arc ofT (that originates from a given state and matches a given input label) in constant time, the search complexity for a string queryx is given asO(jxj +r) wherer is the number of results, i.e. the number of arcs inR. 3.2.5 Comparison with the Factor Transducer and the Modified Factor Transducer For easy comparison, Figures 3.5b and 3.5c give the FT [4] and the MFT [15] obtained from the automata of Figure 3.1. Structurally, the FT is very similar to the TFT. The major difference is that the FT does not store any timing information. The MFT, on the other hand, is quite different from both the FT and the TFT. Timing information is encoded in the output labels, i.e. each output label on a non-final arc represents a time interval. In Section 3.1 we pointed out the issues related to both structures. The proposed method 25 1 3 b:ε/1,0,1 7 ε:1/1,2,2 ε:1/0.5,0,0 ε:2/1,0,1 2 4 a:ε/1,0,1 ε:1/1,0,1 ε:2/1,2,2 ε:2/0.333,0,0 5 a:ε/0.5,0,1 ε:1/0.5,0,0 ε:2/1,0,1 6 b:ε/0.333,0,1 ε:1/1,0,1 ε:2/0.333,0,0 ε:1/1,0,0 ε:2/1,0,0 0 a:ε/1,0,1 b:ε/1,0,1 (a) 1 3 b:ε/1 7 ε:1/1.5 ε:2/1 2 4 a:ε/1 ε:1/1 ε:2/1.333 5 a:ε/0.5 ε:2/0.1 ε:1/0.5 6 b:ε/0.333 ε:1/1 ε:2/0.333 ε:1/1 ε:2/1 0 a:ε/1 b:ε/1 (b) 5 7 ε:1/1 6 a:2-3/1 3 2 a:1-2/1 ε:2/1 1 b:2-3/1 ε:2/1 4 b:1-2/1 ε:1/1 ε:2/1 ε:1/1 0 b:1-2/0.5 b:0-2/0.5 b:0-1/0.333 a:0-2/0.667 a:1-2/0.333 a:0-1/0.5 b:2-3/1 a:2-3/1 (c) Figure 3.5: (a) TFTT overRTT 0 , (b) FT overR and (c) MFT overR obtained from the weighted automata and the state timing lists in Figure 3.1. Output labels on the non-final arcs of the MFT represent the associated time intervals, i.e. “a:0-2” means there is an “a” from time 0 to 2. 26 alleviates the issues of the FT by indexing the timing information and keeping separate entries for non- overlapping factors — note the extra final transitions of the TFT. Issues of the MFT, on the other hand, are resolved by embedding the timing information into the arc weights. Once the cluster identifiers are removed, the final TFT can be made fully deterministic except for the final transitions. Also note that we no longer have the quantization problem which was a by-product of keeping timing labels. 3.3 Experimental Setup In this study, we present results on two different KWS systems, one of them in Turkish and the other in English. Both systems utilize IBM’s Attila Speech Recognition Toolkit [76] and our OpenFst [5] based KWS tools. Following sections detail the ASR training and KWS experimentation data used in each system. 3.3.1 Turkish Broadcast News (TBN) KWS System Bo˘ gazic ¸i University Speech Processing Group has been collecting a large database of Turkish Broadcast News since 2006. Currently, TBN database includes 350 hours of manually transcribed speech data col- lected from one radio (V oA) and four TV channels (CNN T¨ urk, NTV , TRT1, TRT2). 
In this study we used non-overlapping subsets of the TBN database (given in Table 3.1) for building ASR systems and performing KWS experiments. Table 3.1: Breakdown of TBN Database (in hours) T (Training) H (Held-out) R (Retrieval) All 184.0 3.1 163.3 350.4 Our KWS system utilizes the T, H and R subsets of the TBN database for ASR training, ASR opti- mization and KWS experiments respectively. This system is meant to mimic a realistic scenario where a 27 large database of spoken documents is indexed and searched. R subset, which includes 1.2 M words over 163 hours of speech, constitutes a fairly large evaluation set for speech retrieval experiments. The ASR engine was built with the IBM Attila toolkit using the T subset. It is a word-based system with a vocabulary of 200 K words. The language model (200 K word-based model presented in [6]) was derived from the manual transcripts of the T subset and a large text corpus of size 184 M words [70]. The WER of the ASR system on the H and R subsets are 25.9% and 29.9% respectively. In KWS experiments, we used the R-IV query set which was selected from the reference transcriptions of the R subset. R-IV query terms are confined to the ASR vocabulary. Table 3.2 gives the decomposition of R-IV query set with respect to query length. Table 3.2: R-IV Query Set Decomposition 1-word 2-word 3-word 4-word Total 2312 1725 256 115 4408 3.3.2 English Broadcast News (EBN) KWS System This system uses standard data sets from NIST’s 2006 Spoken Term Detection Evaluation [59]. Exper- iments utilize the Broadcast News subset of the STDDEV06 data set, which includes 25 K words over 3 hours of speech, and the IV (in-vocabulary) subset of the DRYRUN06 query set which includes 1058 terms. Table 3.3 gives the decomposition of the DRYRUN06-IV query set with respect to the phonetic query length. The ASR engine is the one used by IBM during the 2006 NIST STD evaluations [47]. Ar- chitectural details of the IBM research prototype ASR system can be found in [76]. The WER of the ASR system on STDDEV06 data set is 12.7%. Table 3.3: DRYRUN06-IV Query Set Decomposition (w.r.t. Phonetic Length) 1 2 3 4 5 6 7 8 9 10 11+ 3 36 107 125 115 110 79 83 83 51 266 28 3.4 Experiments In this section, we provide experiments comparing three KWS schemes: Two-Stage Retrieval with FT[15], Retrieval with MFT[15] and Retrieval with TFT. Our comparisons are in terms of index size and average search time. We also analyze the change of average search time w.r.t. query length (only for the last two schemes). For the experiments of this section, we first extracted a fairly large word lattice (five back-pointers per word trace) for each utterance. Then, we pruned the raw lattices with different logarithmic beam widths and conducted the same set of experiments for each beam width. We observed that for all KWS schemes in consideration, a beam width of 4 is ideal for actual system operation, i.e. minimal index size and search time, without incurring a significant loss in retrieval performance. We should note that there is no significant difference between the three KWS schemes as far as the term detection performance is concerned (Actual Term Weighted Value [59] 0.81 for the word-based TBN system and 0.80 for the word-based/phonetic EBN KWS systems). NIST STD 2006 Evaluation Plan requires the results to contain no more than 0.5 s gap between the adjacent words of a query term. We exploit this requirement by indexing only the factors that do not contain such gaps. 
We process input lattices to identify long gaps (> 0.5 s) and use a silence symbol to mark them. Then after the factor generation step, we employ a simple restriction grammar to filter out the factors including the silence symbol. Shorter gaps are mapped to " symbols and removed prior to index construction. 3.4.1 Index Size In retrieval applications, index size is an important application concern. Preferably, it should be as small as possible but maybe even more importantly it should not grow exponentially as the data size increases. In our case, the data size depends on the total amount of speech and the beam width of ASR lattices used 29 in index construction. Table 3.4 demonstrates how fast the number of factors increases as we increase the beam width even with a small data set like STDDEV06. Table 3.4: Number of Factors and Index Size v.s. Total Lattice Size (STDDEV06 Data Set, Bold column indicates beam width 4) log 10 (Total Lattice Size) 5.64 6.22 6.61 6.85 7.01 log 10 (TFT Size) 5.58 6.40 7.16 7.66 7.87 log 10 (Number of Factors) 12.67 20.06 25.24 29.34 32.38 Figure 3.6 plots the increase of index size w.r.t. the size of input lattices — lattice beam increases from 1 to 10. The size of factor automata are expected to be linear in the size of input lattices. Figure 3.6 demonstrates this linear dependence for all structures in consideration. Note that MFT has the smallest size at all beam widths (For our data set it is even smaller than the total size of input lattices!). While this may appear rather unexpected — since FT carries much less information compared to MFT —, we should not forget that MFT is not a deterministic machine. Before the final optimization step of the index construction algorithm, there are much less common paths to merge in the case of MFT due to the time labels. This leads to a smaller final index since the lattice structure of the index is largely preserved after determinization. The size difference between FT and TFT, on the other hand, can be attributed to two key features of TFT: (i) storage of the time alignment information on the path weights and (ii) separation of non-overlapping factors via clustering. 3.4.2 Search Time The most important concern in a retrieval application is the search time since search is usually performed online unlike index construction. Table 3.5 gives the per query average search times for a lattice beam of 4. Due to the costly second stage, employing the FT in the KWS task results in two to three orders of magnitude slower search. The MFT and the TFT give similar search time performances as far as the 30 1 2 3 4 5 6 7 8 9 0 0.5 1 1.5 2 2.5 3 3.5 x 10 8 x10 7 Total Lattice Size ( ! i |B i |) Index Size (|T |) Timed Factor Transducer Modified Factor Transducer Factor Transducer Figure 3.6: Index size vs. total lattice size (TBN-R Data Set) Table 3.5: Per query average search times (in ms) (Beam Width 4) TBN System EBN System FT (2-Stage) 202:71 54:67 Modified FT 4:01 0:24 Timed FT 4:53 0:22 whole query sets are concerned. Since most of the query terms comprise of a single word, this behavior is not surprising. Even though the MFT is not a deterministic machine, it has a much smaller memory footprint compared to the TFT, and hence we obtain comparable search times. While non-determinism does not really matter when the query is a single word, it becomes more and more important as the queries get longer. Recall that the TFT has a search time complexity linear in the sum of the query length and the number of results. 
The MFT, on the other hand, has an average search time complexity linear in the product of the query length and the number of results — worst case complexity is exponential. Table 3.6 gives per query average search times over the TBN-R data set w.r.t. query length. 31 Table 3.6: Per query average search times (in ms) w.r.t. query length (TBN-R Data Set, R-IV Query Set, Beam Width 4) 1-word 2-word 3-word 4-word All Modified FT 6.45 1.34 1.21 1.22 4.01 Timed FT 8.16 0.58 0.27 0.24 4.53 0 1 2 3 4 5 6 7 8 9 10 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 Beam Width = 4 Query Length Average Search Time (in ms) Timed Factor Transducer Modified Factor Transducer Figure 3.7: Per result average search times vs. phonetic query length (Phonetic STDDEV06 Data Set, Phonetic DRYRUN06-IV Query Set, Beam Width 4) First thing to notice about Table 3.6 is that the MFT is faster than the TFT only when the query is a single word. As the query gets longer, the TFT becomes much faster due to its deterministic structure. In the MFT, every distinct time alignment of a word leads to a new path to be traversed. Thus, each path matching the beginning of the query has to be traversed until a mismatch is found to determine the successful paths matching the whole query. On the other hand, in the TFT, only one arc — hence only one path — needs to be traversed at each state until the single partial path matching the query is found. After that, results are read from the final transitions leaving the single destination state of that partial path. Table 3.7 gives per result average search times w.r.t. query length over the TBN-R data set. Our search time analysis in Section 3.2.4 assumed that we could access any arc (given a state and a label) at 32 Table 3.7: Per result average search times (in ms) w.r.t. query length (TBN-R Data Set, R-IV Query Set, Beam Width 4) 1-word 2-word 3-word 4-word Modified FT 0.02 0.34 0.66 0.87 Timed FT 0.02 0.09 0.15 0.19 constant time. However, our OpenFst based implementation keeps a list of arcs for each state rather then a hash map, i.e. access time isO(logD) whereD represents the average out-degree. Even though it is not possible to observe an exact linear dependence on the query length, we can clearly observe the difference between the two methods. Figure 3.7 is particularly interesting since it compares the two methods in a phonetic KWS setting. To obtain these graphs, we converted the EBN word lattices into phone lattices using the pronunciation dictionary of the ASR system and constructed phonetic indexes. Once mapped to their phonetic counterparts, we obtained longer query strings for search. As demonstrated by the graphs of Figure 3.7, in a sub-word scenario a deterministic index is crucial for high performance. 3.5 Summary Efficient indexing of ASR lattices (word or sub-word level) for KWS is not a straightforward task. We generalized the SUR indexing method of [4] by augmenting the index with timing information and used the resulting structure to perform single-stage KWS. Proposed index structure is deterministic, hence the search complexity is linear in the query length. As demonstrated by the comparisons given in Section 3.4.2, single-stage KWS schemes significantly improve the search time over the two-stage scheme. We further analyzed the differences between the single-stage methods and demonstrated that the TFT significantly outperforms the MFT as the query length increases. This fact becomes even more valuable in the case of sub-word indexing due to longer query strings. 
We presented the core index construction algorithm for the general problem of indexing lattices, but it can also be used to index approximate structures like CNs or PSPLs. Since approximate structures include 33 the posterior scores and the clustering information by construction, we no longer need the preprocessing step. The resulting index stores the timings/positions of the nodes in the case of CNs/PSPLs. Since the TFT inherently stores the proximity information (by means of time alignments), it can be utilized in other SR applications like spoken document retrieval. Furthermore, since the query can be any weighted automaton, we can search for complex relations between query words without changing the index. Any finite state relation, e.g. a regular expression, can be compiled into a query automaton and retrieved from the index. We gave an example to this type of search in [15] where we compiled weighted pronunciation alternatives into a query automaton to search the index for the OOV term occur- rences. Another possibility is to relax the exact string match objective and allow for gaps between query terms. Although this is not expected in KWS, it might be useful in other speech retrieval applications. Implementing such a search is trivial in our framework since we can easily modify the query automaton in such a way that an arbitrary number of words can be inserted between actual query terms. Searching for arbitrary permutations of query words, even allowing the insertion of other words in between these permutations, is yet another trivial extension which can be achieved without changing the index. 34 Chapter 4 ComputingN-gram Posteriors from Lattices Many complex speech and natural language processing (NLP) pipelines such as Automatic Speech Recog- nition (ASR) and Statistical Machine Translation (SMT) systems store alternative hypotheses produced at various stages of processing as weighted acyclic automata, also known as lattices. Each lattice stores a large number of hypotheses along with the raw system scores assigned to them. While single-best hypoth- esis is typically what is desired at the end of the processing, it is often beneficial to consider a large number of weighted hypotheses at earlier stages of the pipeline to hedge against errors introduced by various sub- components. Standard ASR and SMT techniques like discriminative training, rescoring with complex models and Minimum Bayes-Risk (MBR) decoding rely on lattices to represent intermediate system hy- potheses that will be further processed to improve models or system output. For instance, lattice based MBR decoding has been shown to give moderate yet consistent gains in performance over conventional MAP decoding in a number of speech and NLP applications including ASR [33] and SMT [81, 10, 28]. Most lattice-based techniques employed by speech and NLP systems make use of posterior quantities computed from probabilistic lattices. We are interested in two such posterior quantities: i)n-gram expected count, the expected number of occurrences of a particular n-gram in a lattice, and ii) n-gram posterior probability, the total probability of accepting paths that include a particularn-gram. 
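Before turning to lattices, the distinction between the two quantities is easy to see on a weighted n-best list, where both can be accumulated hypothesis by hypothesis. The sketch below uses made-up hypotheses and probabilities and is not part of the thesis pipeline; it only makes concrete that expected counts add up every occurrence, while posteriors count each hypothesis at most once.

```python
# Minimal sketch: expected count vs. posterior probability of an n-gram,
# computed from an explicit weighted hypothesis list (the easy case; the
# point of the chapter is doing this over lattices without enumeration).
from collections import defaultdict

# Hypothetical posterior-weighted hypotheses for one utterance.
nbest = [
    (["a", "b", "a"], 0.5),
    (["a", "b"],      0.3),
    (["b", "a"],      0.2),
]

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

expected_count = defaultdict(float)   # adds up every occurrence
posterior      = defaultdict(float)   # adds each hypothesis at most once

for hyp, prob in nbest:
    grams = ngrams(hyp, 1)
    for g in grams:                   # every occurrence contributes
        expected_count[g] += prob
    for g in set(grams):              # each hypothesis contributes once
        posterior[g] += prob

print(expected_count[("a",)], posterior[("a",)])   # 1.5 1.0: "a" repeats on the top path
```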
Expected counts have applications in the estimation of language model statistics from probabilistic input such as ASR lattices [3] and in the estimation of term frequencies from spoken corpora, while posterior probabilities come up in MBR decoding of SMT lattices [81], relevance ranking of spoken utterances and the estimation of document frequencies from spoken corpora [40, 16].

The expected count $c(x|A)$ of $n$-gram $x$ given lattice $A$ is defined as

$$c(x|A) = \sum_{y \in \Sigma^*} \#_y(x)\, p(y|A) \qquad (4.1)$$

where $\#_y(x)$ is the number of occurrences of $n$-gram $x$ in hypothesis $y$ and $p(y|A)$ is the posterior probability of hypothesis $y$ given lattice $A$. Similarly, the posterior probability $p(x|A)$ of $n$-gram $x$ given lattice $A$ is defined as

$$p(x|A) = \sum_{y \in \Sigma^*} 1_y(x)\, p(y|A) \qquad (4.2)$$

where $1_y(x)$ is an indicator function taking the value 1 when hypothesis $y$ includes $n$-gram $x$ and 0 otherwise.

While it is straightforward to compute these posterior quantities from weighted n-best lists by examining each hypothesis separately and keeping a separate accumulator for each observed $n$-gram type, it is infeasible to do the same with lattices due to the sheer number of hypotheses stored. There are efficient algorithms in the literature [3, 4] for computing $n$-gram expected counts from weighted automata that rely on weighted finite-state transducer operations to reduce the computation to a sum over $n$-gram occurrences, eliminating the need for an explicit sum over accepting paths. The seemingly innocent difference between Equations 4.1 and 4.2, $\#_y(x)$ vs. $1_y(x)$, makes it hard to develop similar algorithms for computing $n$-gram posteriors from weighted automata, since the summation of probabilities has to be carried out over paths rather than over $n$-gram occurrences [10, 28].

The problem of computing $n$-gram posteriors from lattices has been addressed by a number of recent works [81, 2, 10, 28] in the context of lattice-based MBR for SMT. In these works, it has been reported that the time required for lattice MBR decoding is dominated by the time required for computing $n$-gram posteriors. Our interest in computing $n$-gram posteriors from lattices stems from their potential applications in spoken content retrieval [20, 40, 16]. Computation of document frequency statistics from spoken corpora relies on estimating $n$-gram posteriors from ASR lattices. In this context, a spoken document is simply a collection of ASR lattices. The $n$-grams of interest can be word, syllable, morph or phoneme sequences. Unlike in the case of lattice-based MBR for SMT, where the $n$-grams of interest are relatively short (typically up to 4-grams), the $n$-grams we are interested in are in many instances relatively long sequences of sub-word units.

We present an efficient algorithm for computing the posterior probabilities of all $n$-grams in a lattice and constructing a minimal deterministic weighted finite-state automaton associating each $n$-gram with its posterior for efficient storage and retrieval. Our $n$-gram posterior computation algorithm builds upon the custom forward procedure described in [28] and introduces a number of refinements that significantly improve the time and space requirements:

• The custom forward procedure described in [28] computes unigram posteriors from an input lattice. Higher order $n$-gram posteriors are computed by first transducing the input lattice to an $n$-gram lattice using an order mapping transducer and then running the custom forward procedure on this higher order lattice.
We reformulate the custom forward procedure as a dynamic programming algorithm that computes posteriors for successively longern-grams and reuses the forward scores computed for the previous order. This reformulation subsumes the transduction of input lattices to n-gram lattices and obviates the need for constructing and applying order mapping transducers. • Comparing Eq. 4.1 with Eq. 4.2, we can observe that posterior probability and expected count are equivalent for ann-gram that do not repeat on any path of the input lattice. The key idea behind our algorithm is to limit the costly posterior computation to only thosen-grams that can potentially repeat on some path of the input lattice. We keep track of repeatingn-grams of ordern and use a simple impossibility argument to significantly reduce the number ofn-grams of ordern+1 for which posterior computation will be performed. The posteriors for the remainingn-grams are replaced with expected counts. This filtering ofn-grams introduces a slight bookkeeping overhead but in return dramatically reduces the runtime and memory requirements for longn-grams. 37 • We store the posteriors forn-grams that can potentially repeat on some path of the input lattice in a weighted prefix tree that we construct on the fly. Once that is done, we compute the expected counts for alln-grams in the input lattice and represent them as a minimal deterministic weighted finite- state automaton, known as a factor automaton [4, 54], using the approach described in [4]. Finally we use general weighted automata algorithms to merge the weighted factor automaton representing expected counts with the weighted prefix tree representing posteriors to obtain a weighted factor automaton representing posteriors that can be used for efficient storage and retrieval. 4.1 Computation of N-gram Posteriors In this section we present an efficient algorithm based on the n-gram posterior computation algorithm described in [28] for computing the posterior probabilities of alln-grams in a lattice and constructing a weighted factor automaton for efficient storage and retrieval of these posteriors. We assume that the input lattice is an "-free acyclic probabilistic automaton. If that is not the case, we can use general weighted automata "-removal and weight-pushing algorithms [53] to preprocess the input automaton. Algorithm 1 Compute N-gram Posteriors 1 forn 1;:::;N do 2 A n Min(Det(RmEps(ProjOut(A n )))) 3 [q] n (q);8 stateq2Q n 4 ~ [q][x] 0;8 stateq2Q n ;8 labelx2 n 5 p(xjA) 0;8 labelx2 n 6 for each states2Q n do . In topological order 7 for each arc (s;x;w;q)2E n do 8 [q] [q][s] w 9 ~ [q][x] ~ [q][x][s] w 10 for each labely2 ~ [s] do 11 ify66=x then 12 ~ [q][y] ~ [q][y] ~ [s][y] w 13 ifs2F n then 14 for each labelx2 ~ [s] do 15 p(xjA) p(xjA) ~ [s][x] n (s) 16 P Min(ConstructPrexTree(p)) 38 Algorithm 1 reproduces the original algorithm of [28] in our notation. Each iteration of the outer- most loop starting at line 1 computes posterior probabilities of all unigrams in then-gram latticeA n = ( n ;Q n ;I n ;F n ;E n ; n ; n ), or equivalently all n-grams of order n in the lattice A. The inner loop starting at line 6 is essentially a custom forward procedure computing not only the standard forward prob- abilities[q], the marginal probability of paths that lead to stateq, [q] = M 2 (I;q) (s[]) w[] (4.3) = M e2E t[e] =q [s[e]] w[e] (4.4) but also the label specific forward probabilities ~ [q][x], the marginal probability of paths that lead to state q and include labelx. 
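The bookkeeping in Algorithm 1 is easier to follow in a scalar sketch. The snippet below is a simplified rendition in plain probability space rather than the log semiring, restricted to unigram posteriors on a toy, topologically sorted, epsilon-free lattice; the lattice, its weights and the dictionary-based representation are invented for illustration, and higher orders are handled in the thesis by first mapping the lattice to an n-gram lattice.

```python
# Sketch of the custom forward procedure for unigram posteriors, written in
# plain probability space.  States are 0..num_states-1 in topological order,
# arcs[s] lists (label, weight, next_state), and initial/final hold the
# initial and final weights.  All numbers are made up.
from collections import defaultdict

num_states = 4
arcs = {
    0: [("a", 0.6, 1), ("b", 0.4, 2)],
    1: [("b", 1.0, 3)],
    2: [("a", 0.7, 3), ("b", 0.3, 3)],
    3: [],
}
initial = {0: 1.0}
final = {3: 1.0}

alpha = defaultdict(float, initial)                   # standard forward probabilities
alpha_lab = defaultdict(lambda: defaultdict(float))   # label-specific forward probabilities
posterior = defaultdict(float)

for s in range(num_states):
    for label, w, q in arcs[s]:
        alpha[q] += alpha[s] * w
        alpha_lab[q][label] += alpha[s] * w           # counts the label on this arc once
        for y, v in alpha_lab[s].items():
            if y != label:                            # a repeated label contributes only once
                alpha_lab[q][y] += v * w
    if s in final:
        for y, v in alpha_lab[s].items():
            posterior[y] += v * final[s]

print(dict(posterior))   # p("a"|A) = 0.88, p("b"|A) = 1.0 (up to float rounding)
```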
~ [q][x] = M 2 (I;q) 9u;v2 :i[] =uxv (s[]) w[] (4.5) = M e2E t[e] =q i[e] =x [s[e]] w[e] M e2E t[e] =q i[e]6=x ~ [s[e]][x] w[e] (4.6) Just like in the case of the standard forward algorithm, visiting states in topological order ensures that forward probabilities associated with a state has already been computed when that state is visited. At each states, the algorithm examines each arce = (s;x;w;q) and updates the forward probabilities for stateq in accordance with the recursions in Equations 4.4 and 4.6 by propagating the forward probabilities computed fors (lines 8-12). The conditional on line 11 ensures that the label specific forward probability ~ [s][y] is propagated to stateq only if labely is different from labelx, the label on the current arc. In other words, if a labely repeats on some path leading to stateq, then contributes to ~ [q][y] only once. This is exactly what is required by the indicator function in Equation 4.2 when computing unigram posteriors. Whenever a final state is processed, the posterior probability accumulator for each label observed on paths reaching 39 that state is updated by multiplying the label specific forward probability and the final weight associated with that state and adding the resulting value to the accumulator (lines 13-15). It should be noted that this algorithm is a form of marginalization [28], rather than a counting procedure, due to the conditional on line 11. If that conditional were to be removed, this algorithm would computen-gram expected counts instead of posterior probabilities. The key idea behind our algorithm is to restrict the computation of posteriors to only thosen-grams that may potentially repeat on some path of the input lattice and exploit the equivalence of expected counts and posterior probabilities for the remainingn-grams. It is possible to extend Algorithm 1 to implement this restriction by keeping track of repeatingn-grams of ordern and replacing the output labels of appropriate arcs in n+1 with " labels. Alternatively we can reformulate Algorithm 1 as in Algorithm 2. In this formulation we computen-gram posteriors directly on the input latticeA without constructing then-gram lattice A n . We explicitly associate states in the original lattice with distinct n-gram histories which is implicitly done in Algorithm 1 by constructing the n-gram lattice A n . This explicit association lets us reuse forward probabilities computed at ordern while computing the forward probabilities at ordern + 1. Further, we can directly restrict then-grams for which posterior computation will be performed. In Algorithm 2, [n][q][h] represents the history specific forward probability of stateq, the marginal probability of paths that lead to stateq and include lengthn stringh as a suffix. [n][q][h] = M 2 (I;q) 9z2 :i[] =zh (s[]) w[] (4.7) = M e2E t[e] =q g2 [n1][s[e]] gi[e] =h [n 1][s[e]][g] w[e] (4.8) [n][q][h] is the analogue of [q] in Algorithm 1. It splits the forward probability of state q (Equation 4.3), among lengthn suffixes (or histories) of paths that lead to stateq. We can interpret [n][q][h] as the forward probability of state (q;h) in then-gram latticeA n+1 . Here (q;h)2Q n+1 denotes the unique state 40 Algorithm 2 Compute N-gram Posteriors (Reformulation) 1 R[0] f"g 2 [0][q]["] [q];8 stateq2Q 3 forn 1;:::;N do 4 R[n] ; 5 [n][q][x] 0;8 stateq2Q;8 ngramx2 n 6 ^ [q][h][x] 0;8 stateq2Q;8 historyh2 n1 ;8 ngramx2 n 7 p(xjA) 0;8 ngramx2 n 8 for each states2Q do . 
In topological order 9 for each historyg2 [n 1][s] whereg2R[n 1] do 10 for each arc (s;i;w;q)2E do 11 x gi . Concatenate history and label 12 h x[1 :n] . Drop first label 13 ifh2R[n 1] then 14 [n][q][x] [n][q][x] [n 1][s][g] w 15 ^ [q][h][x] ^ [q][h][x] [n 1][s][g] w 16 for each ngramy2 ^ [s][g] do 17 ify66=x then 18 ^ [q][h][y] ^ [q][h][y] ^ [s][g][y] w 19 else 20 R[n] R[n][fyg 21 ifs2F then 22 for each historyg2 ^ [s] do 23 for each ngramx2 ^ [s][g] do 24 p(xjA) p(xjA) ^ [s][g][x] (s) 25 P 0 ConstructPrexTree(p) 26 C ComputeExpectedCounts(A;N) 27 P Min(Det(RmEps((C RmWeight(P 0 ))P 0 ))) corresponding to stateq in the original latticeA and stateh in the mapping transducer n+1 . ^ [q][h][x] represents the history andn-gram specific forward probability of stateq, the marginal probability of paths that lead to stateq, include lengthn 1 stringh as a suffix and includen-gramx as a substring. ^ [q][h][x] = M 2 (I;q) 9z2 :i[] =zh 9u;v2 :i[] =uxv (s[]) w[] (4.9) = M e2E t[e] =q g2 [jhj][s[e]] gi[e] =x [jhj][s[e]][g] w[e] M e2E t[e] =q g2 ^ [s[e]] gi[e]6=x ^ [s[e]][g][x] w[e] (4.10) 41 ^ [q][h][x] is the analogue of ~ [q][x] in Algorithm 1. R[n] represents the set ofn-grams of ordern that repeat on some path ofA. We start by definingR[0],f"g, i.e. the only repeatingn-gram of order 0 is the empty string ", and computing [0][q]["][q] using the standard forward algorithm. Each iteration of the outermost loop starting at line 3 computes posterior probabilities of alln-grams of ordern directly on the lattice A. At iteration n, we visit the states in topological order and examine each length n 1 historyg associated withs, the state we are in. For each historyg, we go over the set of arcs leaving state s, construct the currentn-gramx by concatenatingg with the current arc labeli (line 11), construct the lengthn1 historyh of the target stateq (line 12), and update the forward probabilities for the target state history pair (q;h) in accordance with the recursions in Equations 4.8 and 4.10 by propagating the forward probabilities computed for the state history pair (s;g) (lines 14-18). Whenever a final state is processed, the posterior probability accumulator for eachn-gram of ordern observed on paths reaching that state is updated by multiplying then-gram specific forward probability and the final weight associated with that state and adding the resulting value to the accumulator (lines 21-24). We track repeatingn-grams of ordern to restrict the costly posterior computation operation to only thosen-grams of ordern + 1 that can potentially repeat on some path of the input lattice. The conditional on line 17 checks if any of then-grams observed on paths reaching state history pair (s;g) is the same as the currentn-gramx, and if so adds it to the set of repeatingn-grams. At each iterationn, we check if the current lengthn 1 historyg of the state we are in is inR[n 1], the set of repeatingn-grams of order n 1 (line 9). If it is not, then non-gramx =gi can repeat on some path ofA since that would requireg to repeat as well. Ifg is inR[n 1], then for each arce = (s;i;w;q) we check if the lengthn 1 history h =g[1 :n 1]i of the next stateq is inR[n 1] (line 13). If it is not, then then-gramx =g[0]h can not repeat either. We keep the posteriorsp(xjA) forn-grams that can potentially repeat on some path of the input lattice in a deterministic WFSAP 0 that we construct on the fly.P 0 is a prefix tree where each path corresponds to ann-gram posterior, i.e. i[] =x =) w[] =(t[]) =p(xjA). 
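The filtering conditions on lines 9 and 13 of Algorithm 2 reduce to a simple membership test, sketched below with made-up sets. In the actual algorithm $R[n-1]$ is populated while processing order $n-1$, and the test decides whether the costly posterior accumulation is performed for a candidate n-gram or whether its expected count can be used instead.

```python
# Sketch of the repeat-filtering test: an n-gram x = g + (i,) is worth the
# costly posterior computation only if both its history g and the target-state
# history h (g with its first label dropped, plus i) are known to repeat at
# order n-1; otherwise x cannot repeat and its posterior equals its expected
# count.  R_prev stands for R[n-1]; the example sets are made up.

def needs_posterior(g, i, R_prev):
    h = g[1:] + (i,)          # history of the target state at order n-1
    return g in R_prev and h in R_prev

# Suppose only ("a",) was found to repeat among the unigrams of some lattice.
R_prev = {("a",)}
print(needs_posterior(("a",), "a", R_prev))   # True : "a a" might repeat
print(needs_posterior(("a",), "b", R_prev))   # False: "a b" cannot repeat
print(needs_posterior(("b",), "a", R_prev))   # False: its history "b" does not repeat
```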
Once the computation of posteriors for possibly repeatingn-grams is finished, we use the algorithm described in [4] to construct a weighted 42 factor automaton C mapping all n-grams observed in A to their expected counts, i.e. 8 in C, i[] = x =) w[] =c(xjA). We useP 0 andC to construct another weighted factor automatonP mapping all n-grams observed inA to their posterior probabilities, i.e.8 inP ,i[] =x =) w[] =p(xjA). First we remove then-grams accepted byP 0 fromC using the difference operation [53], C 0 =C RmWeight(P 0 ) then take the union of the remaining automatonC 0 andP 0 , and finally optimize the result by removing "-transitions, determinizing and minimizing P = Min(Det(RmEps(C 0 P 0 ))): 4.2 Experiments and Discussion In this section we provide experiments comparing the performance of Algorithm 2 with Algorithm 1 as well as a baseline algorithm based on the approach of [81]. All algorithms were implemented in C++ using the OpenFst Library [5]. Algorithm 1 implementation is a thin wrapper around the reference imple- mentation. All experiments were conducted on the 88K ASR lattices (total size: #states + #arcs = 33M, disk size: 481MB) generated from the training subset of the IARPA Babel Turkish language pack, which includes 80 hours of conversational telephone speech. Lattices were generated with a speaker dependent DNN ASR system that was trained on the same data set using IBM’s Attila toolkit [77]. All lattices were pruned to a logarithmic beam width of 5. Figure 4.1 gives a scatter plot of the posterior probability computation time vs. the number of lattice n-grams (up to 5-grams) where each point represents one of the 88K lattices in our data set. Similarly, Figure 4.2 gives a scatter plot of the maximum memory used by the program (maximum resident set size) during the computation of posteriors vs. the number of latticen-grams (up to 5-grams). Algorithm 2 43 0 2 4 6 8 10 x 10 4 0 20 40 60 80 100 120 140 160 180 Number of lattice n −grams Computation time (sec) Algorithm 2 Algorithm 1 Figure 4.1: Runtime comparison Table 4.1: Runtime Comparison Maxn-gram length 1 2 3 4 5 6 10 all log 10 (#n-grams) 3.0 3.8 4.2 4.5 4.8 5.1 6.3 11.2 Baseline (sec) 5 15 32 69 147 311 5413 - Algorithm 1 (sec) 0.5 0.6 0.9 1.6 3.9 16 997 - Algorithm 2 (sec) 0.7 0.8 0.9 1.1 1.2 1.3 1.7 1.0 Expected Count (sec) 0.3 0.4 0.5 0.6 0.7 0.8 1.0 0.5 requires significantly less resources, particularly in the case of larger lattices with a large number of unique n-grams. To better understand the runtime characteristics of Algorithms 1 and 2, we conducted a small experi- ment where we randomly selected 100 lattices (total size: #states + #arcs = 81K, disk size: 1.2MB) from our data set and analyzed the relation between the runtime and the maximumn-gram lengthN. Table 4.1 gives a runtime comparison between the baseline posterior computation algorithm described in [81], Algo- rithm 1, Algorithm 2 and the expected count computation algorithm of [4]. The baseline method computes posteriors separately for each n-gram by intersecting the lattice with an automaton accepting only the 44 0 2 4 6 8 10 x 10 4 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Number of lattice n −grams Max memory use (MB) Algorithm 2 Algorithm 1 Figure 4.2: Memory use comparison Table 4.2: Factor Automata Comparison FA Type Unweighted Expected Count Posterior #states + #arcs (M) 16 20 21 On disk size (MB) 219 545 546 Runtime (min) 5.5 11 22 paths including thatn-gram and computing the total weight of the resulting automaton in log semiring. 
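What the baseline computes can be spelled out in a few lines: one full pass over the lattice per query n-gram. In the sketch below the intersection and total-weight computation are emulated by explicit path enumeration on a toy set of weighted hypotheses, which is only feasible because the example is tiny; the baseline itself uses WFST intersection and a shortest-distance computation in the log semiring, and the hypotheses and weights here are invented for illustration.

```python
# Baseline-style posterior computation: the whole pass is repeated for every
# n-gram of interest, which is why its cost grows with the number of n-grams.
paths = [
    (("x", "y", "x"), 0.40),
    (("x", "y"),      0.35),
    (("y", "x", "y"), 0.25),
]

def contains(seq, gram):
    n = len(gram)
    return any(seq[i:i + n] == gram for i in range(len(seq) - n + 1))

def baseline_posterior(gram):
    # Discard the paths that do not include the n-gram, sum the weights of the rest.
    return sum(w for seq, w in paths if contains(seq, gram))

for gram in [("x",), ("y", "x"), ("x", "y", "x")]:
    print(gram, baseline_posterior(gram))   # every call redoes the full pass
```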
Runtime complexities of the baseline method and Algorithm 1 are exponential in N due to the explicit enumeration ofn-grams and we can clearly see this trend in the 3rd and 4th rows of Table 4.1. Algorithm 2 (5th row) takes advantage of the WFSA based expected count computation algorithm (6th row) to do most of the work for longn-grams, hence does not suffer from the same exponential growth. Notice the drops in the runtimes of Algorithm 2 and the WFSA based expected count computation algorithm when all n-grams are included into the computation regardless of their length. These drops are due to the expected count computation algorithm that processes alln-grams simultaneously using WFSA operations. Limiting the maximum n-gram length requires pruning long n-grams, which in general can increase the sizes of intermediate WFSAs used in computation and result in longer runtimes as well as larger outputs. 45 When there is no limit on the maximum n-gram length, the output of Algorithm 2 is a weighted factor automaton mapping each factor to its posterior. Table 4.2 compares the construction and storage requirements for posterior factor automata with similar factor automata structures. We use the approach described in [4] for constructing both the unweighted and the expected count factor automata. We construct the unweighted factor automata by first removing the weights on the input lattices and then applying the determinization operation on the tropical semiring so that path weights are not added together. The storage requirements of the posterior factor automata produced by Algorithm 2 is similar to those of the expected count factor automata. Unweighted factor automata, on the other hand, are significantly more compact than their weighted counterparts even though they accept the same set of strings. This difference in size is due to accommodating path weights which in general can significantly impact the effectiveness of automata determinization and minimization. 4.3 Related Work Efficient computation of n-gram expected counts from weighted automata was first addressed in [3] in the context of estimating n-gram language model statistics from ASR lattices. Expected counts for all n-grams of interest observed in the input automaton are computed by composing the input with a simple counting transducer, projecting on the output side, and removing "-transitions. The weight associated by the resulting WFSA to each n-gram it accepts is simply the expected count of that n-gram in the input automaton. Construction of such an automaton for all substrings (factors) of the input automaton was later explored in [4] in the context of building an index for spoken utterance retrieval (SUR) [72]. This is the approach used for constructing the weighted factor automatonC in Algorithm 2. While expected count works well in practice for ranking spoken utterances containing a query term, posterior probability is in theory a better metric for this task. The weighted factor automatonP produced by Algorithm 2 can be used to construct an SUR index weighted with posterior probabilities. 46 The problem of computingn-gram posteriors from lattices was first addressed in [81] in the context of lattice-based MBR for SMT. This is the baseline approach used in our experiments and it consists of building a separate FSA for eachn-gram of interest and intersecting this automaton with the input lattice to discard those paths that do not include thatn-gram and summing up the weights of remaining paths. 
The fundamental shortcoming of this approach is that it requires separate intersection and shortest distance computations for eachn-gram. This shortcoming was first tackled in [2] by introducing a counting trans- ducer for simultaneous computation of posteriors for alln-grams of ordern in a lattice. This transducer works well for unigrams since there is a relatively small number of unique unigrams in a lattice. However, it is less efficient forn-grams of higher orders. This inefficiency was later addressed in [10] by employing n-gram mapping transducers to transduce the input lattices ton-gram lattices of ordern and computing unigram posteriors on the higher order lattices. Algorithm 1 was described in [28] as a fast alternative to counting transducers. It is a lattice specialization of a more general algorithm for computingn-gram pos- teriors from a hypergraph in a single inside pass [30]. While this algorithm works really well for relatively shortn-grams, its time and space requirements scale exponentially with the maximumn-gram length. Al- gorithm 2 builds upon this algorithm by exploiting the equivalence of expected counts and posteriors for non-repeatingn-grams and eliminating the costly posterior computation operation for mostn-grams in the input lattice. 4.4 Summary We have described an efficient algorithm for computingn-gram posteriors from an input lattice and con- structing an efficient and compact data structure for storing and retrieving them. The runtime and memory requirements of the proposed algorithm grow linearly with the length of the n-grams as opposed to the exponential growth observed with the original algorithm we are building upon. This is achieved by limit- ing the posterior computation to only thosen-grams that may repeat on some path of the input lattice and using the relatively cheaper expected count computation algorithm for the rest. This filtering ofn-grams 47 introduces a slight bookkeeping overhead over the baseline algorithm but in return dramatically reduces the runtime and memory requirements for longn-grams. 48 Chapter 5 Computing Information Retrieval Statistics from Spoken Corpora Computing document similarity is a fundamental need in SCR just as in the case of text retrieval. Inverse document frequency (IDF) [68, 37] is an important term specificity measure used in almost every IR system in some form. In its most basic form, it is defined as log of the fraction of documents that include a term. IDF computation involves computing document frequency (DF), simply defined as the number of documents that include a term. Term frequency (TF) is another important measure, which in its most basic form is defined as the number of occurrences of a term in a document. The more general class of term weighting schemes known as TF-IDF, which involves multiplying an IDF measure by a TF measure, constitutes the basis for quantifying document similarity in almost every IR system. While it is fairly straightforward to compute TF and DF measures from textual documents, the same cannot be said for spoken ones since the computation should be carried out over probability distributions over strings, e.g. ASR lattices. Approximating TF and DF measures by the statistics of 1-best ASR output is a common strategy used in SCR. However, proper estimation of these measures over ASR lattices provides significant gains in performance over the 1-best baseline [40]. Vector space model is a classical information retrieval approach to modeling textual documents. 
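For reference, the text-side versions of these measures are computed the usual way; the toy sketch below uses a made-up corpus and the common IDF = log(N/DF) convention (one of several variants), whereas the rest of this chapter replaces raw counts with their expected counterparts computed over lattices.

```python
# Text-side warm-up (not the lattice-based computation of this chapter):
# TF, DF, IDF and TF-IDF from a toy textual corpus.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the cat and the dog".split(),
]

N = len(docs)
tf = [Counter(d) for d in docs]                    # term frequency per document
df = Counter(t for d in docs for t in set(d))      # number of documents containing the term

def tf_idf(term, doc_index):
    return tf[doc_index][term] * math.log(N / df[term])

print(df["cat"], df["the"])   # 2 3
print(tf_idf("cat", 0))       # small positive weight
print(tf_idf("the", 0))       # 0.0: "the" appears in every document
```

The vector space model mentioned above assembles exactly such term weights into document vectors.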
In this model, documents and query terms are represented as vectors and similarity is measured in terms of inner products. Each dimension corresponds to a particular term and the value of that dimension is known as the term weight. If a term occurs in a document, then the corresponding entry in the document vector is set 49 to a non-zero value, typically the TF-IDF value derived from the document collection. While representing documents as vectors of terms is a feasible approach when indexing textual documents, it does not scale very well to indexing spoken documents, specifically ASR lattices derived from those documents, since lattices used in SCR are typically fairly large and hence contain a very large number of weighted strings to be indexed. In our finite-state SCR framework, each spoken document is represented as a factor automaton. Each path in these automata corresponds to a particular term and the path weight is the term weight. Similarity is measured by intersecting weighted factor automata and performing a shortest distance computation in the log semiring. This operation effectively computes an inner product between weighted factor automata. Representing documents as weighted factor automata can be thought as a generalization of the vector space model to the case where each document is a probability distribution over strings. When we generalize documents to probability distributions over strings, the standard term weighting schemes such as TF, DF and TF-IDF only make sense in the expected sense. Hence in this framework all term weights are proper expectations over the probability distributions defined by lattices. Given a collection ofn spoken documentsfD i ji = 1;:::;ng, we view each spoken documentD i = fA ij jj = 1;:::;m i g as a collection of latticesA ij generated by an ASR system for each spoken utterance in that document. We will assume that the probability of a factor occurring in one lattice is independent of the probability of that factor occurring in another lattice. We further assume that we are given two separate weighted factor automata for each latticeA ij : the expected count factor automatonS EC ij [4] and the posterior factor automatonS P ij [17] (as described in Chapter 4). 5.1 Term Frequency Factor Automaton The expected term frequency factor automatonS TF i mapping each factor in a spoken documentD i to its expected frequency in that document is obtained simply by taking the WFSA union (sum) of the expected 50 count factor automatafS EC ij jj = 1;:::;m i g that we were given for the latticesfA ij jj = 1;:::;m i g in that document and applying weighted"-removal, determinization, and minimization over the log semiring. S TF i =Min(Det(RmEps( [ j S EC ij ))) 5.2 Document Posterior Factor Automaton We can construct the document posterior factor automatonS P i mapping each factor in a spoken document D i to the posterior probability of that factor occurring in D i by merging individual posterior factor au- tomatafS P ij jj = 1;:::;m i g that we were given for the latticesfA ij jj = 1;:::;m i g in that document. This merge operation is more involved than the simple automata union operation we used for merging ex- pected count factor automata since the posterior probabilities we are merging are not additive like expected counts. Consider the following base case. Given a document consisting of just two latticesD =fA 1 ;A 2 g and the posterior probabilitiesp(xjA 1 ) andp(xjA 2 ), we would like to compute the posterior probability p(xjD) of a factorx occurring inD. 
Under the assumption that the posterior probabilities $p(x|A_1)$ and $p(x|A_2)$ are independent of each other, $x$ occurring in $D$ is the union of the individual occurrence events and hence

$$p(x|D) = p(x|A_1) + p(x|A_2) - p(x|A_1)\, p(x|A_2).$$

The summation and multiplication operations are easily expressed in weighted automata language as WFSA union and WFSA intersection operations since they are always defined for semirings. The subtraction (or equivalently negation) operation, on the other hand, may or may not be defined for a semiring. Since negation is not defined for the log semiring, we cannot simply subtract the product $p(x|A_1)\, p(x|A_2)$ from the sum $p(x|A_1) + p(x|A_2)$ in the log semiring. To achieve this particular computation we first map everything to the signed log semiring $\mathcal{S}$, which keeps track of the sign information separately from the log weight using the mapping

$$x \rightarrow (\operatorname{sgn}(x),\; -\log(|x|)).$$

Let $w_1 = (s_1, y_1)$ and $w_2 = (s_2, y_2)$ be two weights in $\mathcal{S}$. The sum of these two weights is defined as

$$w_1 \oplus w_2 = (\operatorname{sgn}(z_1 + z_2),\; -\log(|z_1 + z_2|))$$

where $z_1 = s_1 e^{-y_1}$ and $z_2 = s_2 e^{-y_2}$. Once the subtraction is completed in the signed log semiring, we apply weighted $\epsilon$-removal, determinization, and minimization over the signed log semiring and finally map the resulting automata back to the standard log semiring. Since the weights we are subtracting from each other are guaranteed to result in a proper probability value, i.e. $p(x|D) \in [0, 1]$, mapping these weights back to the log semiring is always well defined. The procedure outlined here merges two posterior factor automata into a single one. We can iterate this procedure with additional posterior factor automata until all of them are merged into a single one.

5.3 Document Frequency Factor Automaton

We can efficiently construct the document frequency factor automaton $S^{DF}$ of the entire collection of spoken documents $\{D_i \mid i = 1, \ldots, n\}$ simply by taking the union $U$ of the individual document posterior factor automata $\{S^P_i \mid i = 1, \ldots, n\}$, concatenating a simple automaton

$$N = (\{\epsilon\}, \{\epsilon\}, \{0, 1\}, \{0\}, \{1\}, \{(0, \epsilon, \epsilon, \log(n), 1)\}, \bar{1}, \bar{1})$$

representing the size of the collection and $U$, and finally applying weighted $\epsilon$-removal, determinization, and minimization over the log semiring:

$$S^{DF} = Min(Det(RmEps(N \cdot \bigcup_i S^P_i))).$$

Table 5.1: Runtime Results: DF Factor Automata vs. Baseline
Max length          1     2     3     6     10     all
log10(# factors)    3.0   3.8   4.2   5.1   6.3    11.2
Baseline time (s)   5     15    32    311   5413   -
DF FA time (s)      -     -     -     -     -      0.1

5.4 TF-IDF Factor Automaton

We can combine each expected term frequency factor automaton $S^{TF}_i$ with the document frequency factor automaton $S^{DF}$ of the entire collection to construct a TF-IDF factor automaton $S^{TF,TFIDF}_i$ for each document $D_i$ where path weights represent (TF, TF-IDF) pairs. The desired relation between these automata can be expressed as

$$\llbracket S^{TF,TFIDF}_i \rrbracket(x) = \big(\llbracket S^{TF}_i \rrbracket(x),\; -\llbracket S^{TF}_i \rrbracket(x) \log \llbracket S^{DF} \rrbracket(x)\big).$$

This operation can be carried out with the weighted intersection operation over a special semiring structure known as the expectation (or entropy) semiring [31, 25]. The expectation semiring $\mathcal{E}$ is defined as follows:

$$\mathcal{E} = \big((\mathbb{R} \cup \{-\infty, +\infty\}) \times (\mathbb{R} \cup \{-\infty, +\infty\}),\; \oplus,\; \otimes,\; (0, 0),\; (1, 0)\big)$$
$$(x_1, y_1) \oplus (x_2, y_2) = (x_1 + x_2,\; y_1 + y_2)$$
$$(x_1, y_1) \otimes (x_2, y_2) = (x_1 x_2,\; x_1 y_2 + x_2 y_1)$$

Let $-\log A$ denote the weighted automaton derived from $A$ by replacing each weight $w \in \mathbb{R}_+$ by $-\log w$, and let $\Phi_1(A)$ and $\Phi_2(A)$ denote the weighted automata over the expectation semiring derived from $A$ by replacing each weight $w$ by the pair $(w, 0)$ and $(1, w)$ respectively.
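The semiring operations just defined can be sanity-checked in a few lines; the pairs below are toy numbers, and the last line shows why pairing a TF weight with an IDF weight under this multiplication produces exactly the (TF, TF-IDF) pairs the TF-IDF factor automaton stores. This is only an illustration of the arithmetic, not the WFSA intersection itself.

```python
# Quick sanity check of the expectation semiring operations, using toy numbers.

def oplus(a, b):
    return (a[0] + b[0], a[1] + b[1])

def otimes(a, b):
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

zero, one = (0.0, 0.0), (1.0, 0.0)
assert oplus(zero, (3.0, 4.0)) == (3.0, 4.0)      # (0, 0) is the additive identity
assert otimes(one, (3.0, 4.0)) == (3.0, 4.0)      # (1, 0) is the multiplicative identity

# Combining a TF weight (tf, 0) with an IDF weight (1, idf), as the mappings
# defined above do, yields the (TF, TF*IDF) pair.
tf, idf = 2.0, 0.7
print(otimes((tf, 0.0), (1.0, idf)))              # (2.0, 1.4)
```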
The factor automata representing TF, DF and (TF, TF-IDF) statistics satisfy the following identity in the expectation semiring: S TF;TFIDF i = 1 (S TF i )\ 2 ( logS DF ) Here the second term on the right can be recognized as the factor automaton representing IDF statistics. Hence, TF-IDF computation reduces to weighted automata intersection in the expectation semiring. Con- sider the vector space model defined over factors, i.e. each dimension corresponds to a factor. Inner product computation in this vector space between a query factor automaton 1 (S TF Q ) representing term frequen- cies over the expectation semiring and each factor automatonfS TF;TFIDF i ji = 1;:::;ng can be carried out by intersecting the two automata and then performing a single-source shortest-distance computation [51] over the entropy semiring: < 1 (S TF Q );S TF;TFIDF i >=d[ 1 (S TF Q )\S TF;TFIDF i ] One application for this inner product is the computation of cosine similarity between two spoken docu- ments, e.g. a voiced query and a spoken document. 5.5 Experiments and Discussion We conducted experiments on the training subset of the Turkish language pack provided by the IARPA Babel program which includes 80 hours of conversational telephone speech. Lattices were generated with 54 Table 5.2: Factor Automata Comparison FA Type UW TF DP DF TF-IDF i jS i j (M) 16 20 21 14 27 On disk (MB) 251 315 324 209 511 Time (min) 5 8 16 16+2 8+16+2+1 a speaker dependent DNN ASR system that was trained on the same data set using IBM’s Attila toolkit. All lattices were pruned to a logarithmic beam width of 5. Estimating document frequencies of single-word factors in a collection of lattices has previously been addressed in [40]. Their recipe for computing probability of occurrence consists of composing the input lattice with a simple finite-state filter that rejects the paths including the target word, computing the total probability of the remaining paths and complementing. Computation is carried out one factor at a time, i.e. factors are enumerated and each one is processed independently. We implemented a generalized ver- sion of this recipe which can be used with multi-word factors for our baseline results. Both this baseline algorithm and the factor automata construction algorithms in consideration were implemented using the OpenFst Library [5]. Table 5.1 gives a runtime comparison between the baseline and the DF factor au- tomata construction algorithm. We randomly selected 100 lattices from our data set (total size: #states + #arcs = 81K, disk size: 1.2MB) and compared the total runtime while changing the maximum factor length with a finite-state length restriction filter [3] for the baseline algorithm. Runtime complexity of the baseline method is exponential in the maximum factor length (or linear in the number of factors) due to the enumeration of factors. Proposed method takes advantage of weighted transducer algorithms to do the computation jointly for all factors. Table 5.2 compares the total runtime and storage requirements for various factor automata. For these experiments, we used the entire training set which includes 88K lattices (total size: #states + #arcs = 33M, disk size: 481MB) and a maximum factor length of 3. First column (UW) represents the unweighted factor automata obtained by removing all weights from the input lattices. Storage requirements seem to be comparable for the types of factor automata in consideration. 
The runtime for the construction of DF 55 factor automaton (18 min) includes the time spent for the construction of DP factor automata (16 min). Similarly the runtime for the construction of TF-IDF factor automata (27 min) includes the time spent for the construction of TF (8min) factor automata and DF (18min) factor automaton. 5.6 Summary We considered the problem of computing expected document frequency, and TF-IDF statistics for all substrings seen in a collection of lattices by means of factor automata. We presented a recipe to efficiently construct weighted factor automata representing DF and TF-IDF statistics for all substrings. Compared to the state-of-the-art in computing these statistics from lattices, our approach i) generalizes the statistics from single tokens to contiguous substrings, ii) provides significant gains in terms of run-time and storage requirements and iii) constructs efficient inverted index structures for retrieval of such statistics. 56 Chapter 6 Open Vocabulary Keyword Search Probably the hardest problem in ASR based KWS is that some word sequences might be missing from the ASR output due to model weaknesses, such as the deficiencies caused by limited system vocabulary, and beam search no matter how deep the lattices are. This is especially problematic for retrieving keywords containing rare or out-of-vocabulary (OOV) words that are of particular interest in information retrieval. Further, missing word sequences lead to a redistribution of posterior probability estimates among word sequences that are present in the ASR output. As a result, the posteriors computed from lattices tend to overestimate the posteriors for common word sequences, which are favored by the ASR system, and underestimate the posteriors for rare ones. This discrepancy can cause additional detection errors beyond the ones made when missing word sequences are searched. The issue of retrieving keyword occurrences missing from ASR output is typically addressed by look- ing for matches at the sub-word level [72, 75, 47, 18, 15, 14, 71]. The sub-word units used for this purpose can be phonemes/graphemes, syllables, morphemes, morphs or any other unit that is meaningful for the tar- get language and compatible with the frontend ASR system. In fact most high performance KWS systems utilize multiple types of sub-word units in addition to words. The exact process by which keywords are mapped to sub-word sequences depends both on the keywords themselves and the specific sub-word units being mapped to. For instance, if the sub-word units are phonemes, phonemic keyword representations are constructed using the ASR lexicon for in-vocabulary (IV) words and a grapheme-to-phoneme (G2P) 57 model for OOV words. Often, k-best sequences generated by the G2P model are considered for search and phoneme sequences are further expanded using phoneme-to-phoneme (P2P) confusion models that model typical phoneme errors introduced by the ASR system. The final set of phoneme sequences generated for a keyword are either directly looked up in a phoneme index built with phoneme lattices or converted back to IV proxy word sequences using the ASR lexicon and looked up in the word index [23]. While these methods are instrumental in finding keyword occurrences that are not in word lattices, they also introduce a large number of false positives since query sequences match occurrences of not only the keyword being searched but also other acoustically similar word sequences. 
Further, since the posterior scores associated with index hits no longer reflect the posterior belief of the ASR system about the specific keyword being searched, relying on hit posteriors to discriminate between false and true positives does not work nearly as well as in the case of words. Regardless that is typically what is done. The probabilities obtained from the models used in query generation, such as G2P and P2P, are multiplied with the hit posteriors to obtain a score for each hit and finally overlapping hits are merged by summing up their scores. This approach has the advantage of being fast but it leaves a lot to be desired as far as assigning proper scores to each candidate keyword occurrence is concerned. In this chapter, we describe a novel approach that builds upon the sub-word retrieval techniques widely used in ASR based KWS. Given a set of sub-word sequences, typically phonemes, derived from a keyword, we first expand them using a query expansion model. This query expansion model is trained on WFSTs representing weighted sub-word sequence alignments. These alignment WFSTs are extracted from ASR lattices produced for training utterances and the Viterbi alignments of corresponding reference texts. Each alignment WFST includes the sub-word level alignments between a keyword selected from the reference sub-word sequence and all sub-word sequences in the ASR lattice that minimally align with it. The weight associated with each alignment is the exact posterior computed for that alignment given the ASR lattice and the Viterbi alignment of the reference text. The query sub-word sequences are further restricted by requiring that they are valid sub-word sequences that can actually be produced by the ASR system. Once 58 query sequences are generated, we retrieve all matches from the sub-word level index and assign a poste- rior score to each candidate keyword occurrence using a scoring model. We consider two types of scoring models: context-free and context-aware. The context-free scoring models do not explicitly consider the keyword context and are trained on the same alignment WFSTs used for training the query expansion model. The context-aware scoring models, on the other hand, explicitly condition the probability of a keyword occurrence on its lattice context, i.e. the sub-word sequences surrounding the keyword. These models are trained on WFSTs representing weighted sub-word sequence alignments in context, which are derived from the ones used for training the query expansion model by concatenating them on the left/right with the reference sub-word sequences preceding/succeeding the selected keyword. Each candidate key- word occurrence is scored by first retrieving the sub-word lattice it was found in, composing this lattice with a keyword specific edit transducer, which aligns all lattice paths within the candidate region with the keyword sub-word sequences, composing the edited sub-word lattice with the scoring model and finally computing the total weight of the resulting WFST in the appropriate semiring. The rest of the chapter is structured as follows. In Section 6.1 we introduce our open vocabulary retrieval framework. Section 6.1.1 describes how we expand the query sequences. Section 6.1.2 describes how we retrieve the candidate keyword occurrences. Section 6.1.3 describes how we score each candidate keyword occurrence. 
In Section 6.2, we describe a novel sequence time alignment algorithm for joint computation of all weighted sequence alignments between an ASR lattice produced for a training utterance and the Viterbi alignment of the corresponding reference text. In Section 6.3, we describe how we train the query expansion and occurrence scoring models on weighted sub-word sequence alignments. In Section 6.4, we provide experiments on a standard KWS dataset from the BABEL program that demonstrate the effectiveness of our approach. In Section 6.5, we discuss how our approach relates to other work from literature. In Section 6.6, we summarize our approach and discuss possible extensions. 59 6.1 Retrieval Framework Confidence estimation plays an important role in ASR based KWS since the techniques used for improving the recall of keyword occurrences, such as lattice indexing and sub-word retrieval, result in a large number of false positives that need to be filtered out. A natural and theoretically-sound choice for the KWS confidence measure is the lattice posterior of a keyword occurrence P (k te t b jL;t) = P k 2(I;F) P (Oj k )P ( k ) P 2(I;F) P (Oj)P () whereP (k te t b jL;t) represents the posterior probability of the keywordk occurring within the closed time interval [t b ;t e ] given the lattice L = (; ;Q;I;F;E;;) and the lattice state times t, O represents the acoustic features,P (Oj) represents the acoustic model scores,P () represents the language model scores,2 (I;F ) represents an accepting path in the lattice, 2 (I;Q te t b ) represents a prefix path from the initial states I to the states Q te t b =fq 2 Qjt b t[q] t e g within the time interval [t b ;t e ], 2 (Q te t b ;F ) represents a suffix path from Q te t b to the final states F , and k 2 (Q te t b ;k;k;Q te t b ) represents a keyword path within the time interval. Basing the decision of whether a keyword occurs in a given time interval or not on the posterior probability minimizes the Bayes risk and the lattice posterior is a good approximation of the true posterior in most cases. Further, lattice posteriors for all word/sub-word sequences in a lattice can be efficiently computed and indexed for fast search [13] (see Chapter 3). Although lattice posterior generally performs well as a confidence measure for KWS, its performance is not uniform across keywords due to the weaknesses in ASR models and the beams used in decoding. Generally speaking, ASR models are biased towards word sequences that are common in the training cor- pus and this bias is further exacerbated by beam search, which only expands promising partial hypotheses during decoding. As a result, lattice posteriors tend to overestimate the posteriors of common word se- quences and underestimate the posteriors of rare ones. The extreme example of this bias is the case of OOV words. While sub-word level techniques enable the retrieval of rare and OOV keywords that are missing from the word level ASR output, they also introduce a large number of false positives since sub-word 60 query sequences match occurrences of other word sequences as well. Further, the posteriors computed from sub-word lattices are not the posteriors for the specific keyword being searched but the posteriors for the specific sub-word sequences being searched. This subtle distinction makes lattice posterior a less than ideal measure for filtering out false positives introduced by sub-word search. 
We try to address these problems related to lattice posteriors by scoring candidate keyword occurrences retrieved from the index within their lattice contexts. Specifically, we redistribute the lattice posteriors according to

$\tilde{P}(k_{t_b}^{t_e} \mid L, t) = \sum_{\pi \in (I,F)} P(k_{t_b}^{t_e} \mid \pi, t) \, P(\pi \mid L)$   (6.1)

$\tilde{P}(k_{t_b}^{t_e} \mid L, t) \approx \max_{\pi \in (I,F)} P(k_{t_b}^{t_e} \mid \pi, t) \, P(\pi \mid L)$   (6.2)

where $P(\pi \mid L)$ represents the lattice posterior of an accepting path $\pi$ and $P(k_{t_b}^{t_e} \mid \pi, t)$ represents the posterior probability of keyword $k$ occurring within the time interval $[t_b, t_e]$ given the lattice path $\pi$ and lattice state times $t$. Note that here we make a distinction between keywords and the label sequences on lattice paths. We treat keyword occurrences and label sequence occurrences as dependent but separate events, i.e. keywords and label sequences are separate random variables. The assumption behind this scoring model is that ASR errors are predictable and by modeling these errors we can estimate better posteriors. We assume that keyword occurrences missing from the ASR output are replaced with similar sounding word sequences that make sense within their specific utterance contexts. $P(k_{t_b}^{t_e} \mid \pi, t)$ is an explicit model of errors introduced by ASR within the keyword region $[t_b, t_e]$, taking into account an entire lattice path $\pi$ predicted by the system. We model $P(k_{t_b}^{t_e} \mid \pi, t)$ solely based on the time alignments between the keyword labels $k$ and the lattice path labels $i[\pi]$

$P(k_{t_b}^{t_e} \mid \pi, t) = \dfrac{P(k_{t_b}^{t_e}, \pi \mid t)}{P(\pi \mid t)} \approx \dfrac{\sum_{\omega_k \in A(x_k, k)} P(x', \omega_k, x'')}{P(x)}$   (6.3)

where $A(x_k, k)$ represents the set of alignments between the keyword labels $k$ and $x_k = i[\pi_k]$, the labels on $\pi$ within the keyword region $[t_b, t_e]$; $x' = i[\pi']$ represents the labels on $\pi$ preceding the keyword region, $x'' = i[\pi'']$ represents the labels on $\pi$ succeeding the keyword region, and finally $x = x' x_k x''$ represents the entire label sequence on $\pi$. The unique aspect of the scoring model in Equation 6.3 is that it scores alignments $\omega_k \in A(x_k, k)$ within the context of $x'$ and $x''$, i.e. label confusions are conditioned on the path labels preceding/succeeding the keyword region. The scoring model consists of two separate sequence models: $P(x', \omega_k, x'')$ and $P(x)$. $P(x', \omega_k, x'')$ models the distribution of time restricted alignments $\{\omega' \omega_k \omega'' \mid h(\omega') = (x', \varepsilon), \, \omega_k \in A(x_k, k), \, h(\omega'') = (x'', \varepsilon)\}$ between the path labels $x = x' x_k x''$ and the keyword labels $k$. We marginalize over the set of restricted alignments to compute the joint distribution $P(x', x_k, x'', k)$. $P(x)$ models the distribution of path labels, i.e. it is a language model, and is used for normalizing the scores obtained from the joint distribution.

In addition to the context-aware scoring model in Equation 6.3, we also consider simpler context-free scoring models that do not explicitly take keyword context into account. The first of these models removes the explicit dependence on the context sequences $x'$ and $x''$,

$P(k_{t_b}^{t_e} \mid \pi, t) \approx \dfrac{\sum_{\omega_k \in A(x_k, k)} P(\omega_k)}{P(x_k)}$   (6.4)

but is otherwise identical to the context-aware model in Equation 6.3. The second context-free model further simplifies the computation by replacing the sum operation in Equation 6.4 with the maximum operation,

$P(k_{t_b}^{t_e} \mid \pi, t) \approx \dfrac{\max_{\omega_k \in A(x_k, k)} P(\omega_k)}{P(x_k)}$   (6.5)

and can be plugged into Equation 6.2 to efficiently compute the Viterbi approximation of the lattice posterior defined in Equation 6.1.
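The toy Python sketch below shows how Equations 6.1-6.2 combine with the context-free scores of Equations 6.4-6.5: each accepting path contributes its lattice posterior weighted by an alignment-based estimate of the keyword probability in the candidate region. The alignment masses and the normalizer standing in for $P(x_k)$ are made-up numbers playing the role of the joint n-gram confusion model and language model introduced later in this chapter.

# Toy example: two accepting paths with lattice posteriors P(pi | L), each
# carrying a label sequence x_k inside the candidate keyword region.
paths = [
    {"posterior": 0.6, "x_k": ("k", "a", "t")},
    {"posterior": 0.4, "x_k": ("c", "a", "d")},
]

# Stand-in for the confusion model: alignment mass over A(x_k, k) as a sum and
# as a max, plus a stand-in language model probability P(x_k) for normalization.
align_scores = {
    ("k", "a", "t"): {"sum": 0.030, "max": 0.024, "lm": 0.040},
    ("c", "a", "d"): {"sum": 0.004, "max": 0.003, "lm": 0.020},
}

def keyword_region_prob(x_k, mode):
    # Eq. 6.4 (sum) or Eq. 6.5 (max); clipped to 1 for safety in this toy example.
    s = align_scores[x_k]
    return min(1.0, (s["sum"] if mode == "sum" else s["max"]) / s["lm"])

def redistributed_posterior(paths, mode="sum"):
    """Eq. 6.1 when mode='sum'; the Viterbi approximation (Eq. 6.2) when mode='max'."""
    contribs = [p["posterior"] * keyword_region_prob(p["x_k"], mode) for p in paths]
    return sum(contribs) if mode == "sum" else max(contribs)

print(redistributed_posterior(paths, "sum"))  # sum over paths (Eq. 6.1 with Eq. 6.4)
print(redistributed_posterior(paths, "max"))  # max over paths (Eq. 6.2 with Eq. 6.5)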
Apart from the scoring model used for assigning posterior scores to hypothesized keyword occurrences, we also use a query expansion model $P(\omega)$ to retrieve candidate keyword occurrences from the index. The query expansion model scores an alignment $\omega \in A(x, k)$ between the keyword $k$ and a label sequence $x$ without considering any context. It is used for generating query sequences that are likely to be confused with the keyword independent of context. Since ASR lattices typically contain a large number of alternative label sequences for any region a keyword might occur in, we do not need to consider a large number of query label sequences to retrieve most of the regions a keyword might occur in. Hence, we limit the query sequences to the most likely ones as predicted by $P(\omega)$. Since our retrieval framework is based on WFSTs, we assume that the query expansion model, the scoring model and the index [13] admit WFST representations so they can be easily integrated into the retrieval pipeline. In the next sections we describe how each stage of the retrieval pipeline works in our framework.

6.1.1 Query Expansion

To expand keyword label sequences into query label sequences we first compose $K$, the WFSA representing the keyword, $E$, the unweighted FST representing allowed edit operations, i.e. insertions, deletions and substitutions, and $X$, the unweighted FSA representing label sequences that can be found in ASR lattices. The edit transducer we use allows a subset of all possible edit operations to limit the number of sequences that we score with the query expansion model. We do not allow any insertions or deletions at keyword boundaries and limit the number of consecutive insertions. Also, we do not allow deletions after insertions. Figure 6.1 gives the edit transducer $E$ that allows at most two consecutive insertions.

[Figure 6.1: Edit transducer $E$ used for expanding keywords. It does not allow any insertions or deletions at keyword boundaries. At most two consecutive insertions are allowed. Deletions after insertions are not allowed. $\sigma$ is a special consuming symbol matching all symbols in the alphabet.]
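For illustration, the sketch below encodes one possible edit transducer with the constraints just described (no insertions or deletions at the keyword boundaries, at most two consecutive insertions, and no deletion immediately after an insertion) as a plain arc list, together with a function that checks whether a sequence of (keyword symbol, query symbol) edit operations is accepted. The state inventory and the sigma handling are simplifications of my own and are not meant to reproduce Figure 6.1 arc for arc.

EPS, SIGMA = "", "<sigma>"

# Arcs as (state, keyword_label, query_label, next_state).
# State 0: start; state 1: last op was a match/substitution (only accepting state);
# state 2: last op was a deletion; states 3 and 4: one/two pending insertions.
ARCS = [
    (0, SIGMA, SIGMA, 1),   # first operation must be a match/substitution
    (1, SIGMA, SIGMA, 1),   # match/substitution
    (1, SIGMA, EPS,   2),   # deletion
    (1, EPS,   SIGMA, 3),   # first insertion
    (2, SIGMA, SIGMA, 1),
    (2, SIGMA, EPS,   2),
    (2, EPS,   SIGMA, 3),
    (3, SIGMA, SIGMA, 1),   # after an insertion: no deletion allowed
    (3, EPS,   SIGMA, 4),   # second (last allowed) consecutive insertion
    (4, SIGMA, SIGMA, 1),
]
INITIAL, FINAL = 0, {1}

def _match(arc_label, symbol):
    if arc_label == EPS:
        return symbol == EPS
    return symbol != EPS      # sigma consumes any real symbol

def accepts(ops):
    """ops: sequence of (keyword_symbol_or_EPS, query_symbol_or_EPS) edit operations."""
    states = {INITIAL}
    for inp, out in ops:
        states = {t for (s, a, b, t) in ARCS
                  if s in states and _match(a, inp) and _match(b, out)}
    return bool(states & FINAL)

print(accepts([("k", "k"), ("a", "e"), ("t", "t")]))                    # True: one substitution
print(accepts([("k", "k"), (EPS, "x"), ("a", EPS), ("t", "t")]))        # False: deletion after insertion
print(accepts([("k", "k"), (EPS, "x"), (EPS, "y"), (EPS, "z"), ("t", "t")]))  # False: 3 insertions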
The specific form of the label restriction transducer $X$ depends on how sub-word lattices are generated. If the sub-word lattices are generated by a sub-word level ASR system, then all sub-word sequences can be observed in the lattices, hence $X$ is a simple sub-word loop allowing all sub-word sequences. Otherwise, if sub-word lattices are generated by converting word lattices to sub-word lattices, then $X$ is derived from the lexicon mapping words to sub-word sequences. Given an unweighted acyclic lexicon FST $D$ mapping each word to a set of sub-word sequences, we first project it to its output labels representing sub-words, then remove epsilons, determinize and minimize to obtain an FSA $\hat{X}$ accepting only those sub-word sequences that correspond to a word in the lexicon $D$. We then construct two new FSAs, $\hat{X}_p$ and $\hat{X}_s$, that accept only the prefixes and suffixes of paths in $\hat{X}$. The prefix FSA $\hat{X}_p$ is computed by setting all final states in $\hat{X}$ non-final and non-final states final. The suffix FSA $\hat{X}_s$ is computed by removing all arcs in $\hat{X}$ originating from the unique initial state and adding new epsilon arcs from the unique initial state to all other states. Finally, we construct $X$ by taking the Kleene closure of $\hat{X}$, concatenating it on the left with the suffix FSA and on the right with the prefix FSA, and optimizing the resulting machine:

$X = \text{Minimize}(\text{Determinize}(\text{RmEps}(\hat{X}_s \, \hat{X}^* \, \hat{X}_p)))$

The result of composing $K$, $E$ and $X$ is encoded into a WFSA, intersected with the WFSA $K2X$ representing the query expansion model, where each arc label represents an encoded transduction, and subsequently decoded to obtain a set of keyword query transductions

$KX = \text{Decode}(\text{Encode}(K \circ E \circ X) \cap K2X)$

The composition between $E$ and $X$ can be computed ahead of retrieval time since it does not depend on the keyword. Finally, the query label sequences $X_K$ are determined by projecting $KX$ to its output labels representing query sequences, removing epsilons and computing the n-shortest/best unique paths:

$X_K = \text{UniqueNShortestPaths}(\text{RmEps}(\text{ProjectOut}(KX)))$

Since we are interested in the n-shortest unique paths only, all of the operations up to this point that must be carried out at retrieval time can also be done lazily [5] by expanding only the paths needed for computing the n-shortest paths, hence avoiding unnecessary computations.

6.1.2 Search

We assume that the KWS index WFST $I$ can be efficiently composed with a WFSA representing the query sequences. Here we assume we are working with the index described in [13]. Once n-shortest paths are computed, we can optionally optimize the resulting query machine

$X_K = \text{Determinize}(\text{RmEps}(X_K))$

before composing it with the index to retrieve the index hits

$H_K = X_K \circ I$

Each path in $H_K$ represents a (utterance-id, begin-time, end-time, score) tuple. While overlapping and identical label sequences in ASR lattices are merged before they are put into the index, the tuples we retrieve can include overlapping entries since different sequences in our query automaton can match the same lattice region. After deduplicating these tuples based on time overlap, we end up with a list of candidate keyword occurrence tuples.

The query expansion model assigns a score to each query sequence and each index entry is weighted with a lattice posterior. The scores in index hit tuples represent the multiplication of these two. Although these hit scores are not used to decide whether a keyword occurs in a hit region (we use a separate scoring model), they can still be used to limit the index hits that will be scored.
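The deduplication step above can be realized in several ways; the following self-contained Python sketch keeps, within each group of time-overlapping hits from the same utterance, only the highest-scoring tuple. The greedy policy and the overlap threshold are illustrative assumptions; the framework only requires that overlapping hits be collapsed into a single candidate occurrence.

def _overlap_ratio(b1, e1, b2, e2):
    inter = max(0.0, min(e1, e2) - max(b1, b2))
    shorter = min(e1 - b1, e2 - b2)
    return inter / shorter if shorter > 0 else 0.0

def dedup_hits(hits, min_overlap=0.5):
    """hits: iterable of (utterance_id, begin_time, end_time, score) tuples.
    Greedily keep the best-scoring hit and drop hits that overlap a kept one."""
    kept = []
    for utt, tb, te, score in sorted(hits, key=lambda h: -h[3]):
        if not any(u == utt and _overlap_ratio(tb, te, b, e) >= min_overlap
                   for u, b, e, _ in kept):
            kept.append((utt, tb, te, score))
    return kept

hits = [
    ("utt1", 1.20, 1.65, 0.31),   # same region matched by two different query sequences
    ("utt1", 1.22, 1.70, 0.54),
    ("utt1", 4.10, 4.55, 0.12),
    ("utt2", 0.80, 1.30, 0.77),
]
print(dedup_hits(hits))
# Keeps the 0.77, 0.54 and 0.12 hits; the 0.31 hit is absorbed by the overlapping 0.54 hit.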
6.1.3 Scoring

The scoring model (Equation 6.3) consists of two separate sequence models: $X2K$, which models the distribution of alignments between the path label sequences and the keyword, and $N$, which is simply a language model scoring lattice paths. We use $N$ to normalize the scores of $X2K$. Since we do not need to know the keyword to apply $N$ to a lattice, we rescore all lattices in the test corpus ahead of retrieval time by intersecting them with $N'$, which is the WFSA obtained by inverting the weights on $N$, i.e. negated weights in the log domain.

Once candidate keyword occurrences are retrieved from the index, we score each one with the scoring model. To score each occurrence with its lattice context, we need to encode each lattice path in such a way that each path represents, in order, the lattice path labels preceding the keyword region, an alignment between the keyword labels and the lattice path labels in the keyword region, and the lattice path labels succeeding the keyword region. We first replace the output labels of lattice arcs in the candidate keyword region with a unique keyword label. Then we compose the relabeled lattice $L$ with an edit transducer $E'$ and the FSA representing the keyword $K$:

$L_K = L \circ E' \circ K$

The edit transducer $E'$ aligns the lattice paths with the keyword paths, allowing only lattice paths labeled with $\kappa$ symbols to be aligned with the keyword paths.

[Figure 6.2: Edit transducer $E'$ used for aligning lattice arcs with keyword arcs. It does not allow deletions at the beginning of the keyword. $\sigma$ is a special symbol matching any symbol in the alphabet. $\rho$ is a special symbol matching any symbol in the alphabet if there are no other matches at a state. $\kappa$ is a regular symbol matching symbols on the lattice arcs.]

We encode the processed lattice $L_K$ as a WFSA and intersect it with $X2K$ to score each alignment. Finally, we compute the posterior given in Equation 6.1 by computing the total weight of the resulting machine in the log semiring, which also takes care of the marginalization over alignments.
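Computing the total weight of an acyclic WFSA in the log semiring is a single-source shortest-distance computation in which weights are negative log probabilities, the times operation is addition and the plus operation is log-add. The self-contained Python sketch below shows this on a small hand-built acyclic graph; the graph and its weights are an arbitrary toy example, not output of the actual pipeline.

import math
from collections import defaultdict

def log_add(a, b):
    """Plus of the log semiring over -log probabilities: -log(e^{-a} + e^{-b})."""
    if math.isinf(a):
        return b
    if math.isinf(b):
        return a
    m = min(a, b)
    return m - math.log1p(math.exp(-abs(a - b)))

def total_weight(num_states, start, finals, arcs):
    """Total -log probability mass of all accepting paths of an acyclic WFSA.
    arcs: list of (src, dst, weight) with src < dst, i.e. already topologically ordered."""
    out = defaultdict(list)
    for src, dst, w in arcs:
        out[src].append((dst, w))
    dist = [math.inf] * num_states        # shortest distance from the start state
    dist[start] = 0.0                     # the semiring one
    for q in range(num_states):           # states visited in topological order
        for nxt, w in out[q]:
            dist[nxt] = log_add(dist[nxt], dist[q] + w)
    total = math.inf
    for f, final_w in finals:
        total = log_add(total, dist[f] + final_w)
    return total

# Two accepting paths with probabilities 0.6 and 0.3; total weight is -log(0.9).
arcs = [(0, 1, -math.log(0.6)), (0, 2, -math.log(0.3)), (1, 3, 0.0), (2, 3, 0.0)]
print(math.exp(-total_weight(4, 0, [(3, 0.0)], arcs)))   # about 0.9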
6.2 Weighted Sequence Alignments

The models we use for query expansion and keyword occurrence scoring are trained on WFSTs representing weighted sequence alignments seen in the training data. To efficiently compute and index these alignments, we developed a new algorithm which jointly computes all sequence alignments between an ASR lattice produced for a training utterance and the Viterbi alignment of the corresponding reference text. The output of the algorithm is an alignment index WFST which can be queried to efficiently retrieve all sequences in the ASR lattice that are aligned with the query sequence, which can be any valid subsequence/factor of the reference sequence.

The weighted sequence alignment algorithm we describe next does not perform any string alignment in the traditional sense. The lattice and reference arcs are aligned solely based on their time spans, i.e. arc labels are not considered. We use the input labels of the paths in the alignment index WFST to store the labels on lattice arcs, the output labels to store the labels on reference arcs, and the path weights to store the posteriors of associated lattice paths. The intuition behind the algorithm is that we want to pair each reference path $\pi_R$, a subsequence of reference arcs, with all lattice paths $\{\pi_L\}$ that minimally align with $\pi_R$ and the posteriors computed for those lattice paths. Once we make these pairings for all reference subsequences, we use WFST algorithms to merge identical alignments in the log semiring, i.e. we sum the posteriors of identical paths. At the end of this process, we obtain an alignment index WFST mapping each reference factor $y$ to a posterior probability distribution over lattice factors that minimally align with $y$. The alignment index WFST can be viewed as a weighted factor automaton over encoded input-output label pairs, i.e. it is a deterministic and minimal WFSA over the encoded label set accepting all alignment sequences between the lattice and the reference.

We construct the alignment index WFST by examining lattice arcs one by one and for each lattice arc $e$ adding new paths that pair $e$ with overlapping reference paths aligned with $e$ in time. If a lattice arc $e$ is the first lattice arc of an alignment path, we add an $\varepsilon$ arc from the unique initial state to the first state on the alignment path. This $\varepsilon$ arc is weighted with the forward probability leading up to the origin/source state of $e$. Similarly, if a lattice arc $e$ is the last lattice arc of an alignment path, we add an $\varepsilon$ arc from the last state on the alignment path to the unique final state. This $\varepsilon$ arc is weighted with the backward probability stemming from the destination/next state of $e$. If a reference path $\pi$ aligned with a lattice arc $e$ includes more than one arc, we add input-$\varepsilon$ arcs consuming additional reference labels, as opposed to merging multiple labels into a multigram [29] as typically done in G2P models [8].

For each lattice arc $e$, we consider two overlapping sequences of reference arcs:

1. $\pi_B$: sequence of reference arcs that begin within the time span of $e$
2. $\pi_E$: sequence of reference arcs that end within the time span of $e$

These are used to decide which sequence of reference arcs will be aligned with $e$ in each of the four distinct types of alignment paths we add to the alignment index:

1. In alignment paths that neither begin nor end with $e$, $e$ is aligned with the arc sequence $\pi_E$.
2. In alignment paths that do not begin but end with $e$, $e$ is aligned with each prefix of the arc sequence $\pi_E$.
3. In alignment paths that begin but do not end with $e$, $e$ is aligned with each suffix of the arc sequence $\pi_B \cap \pi_E$.
4. In alignment paths that begin and end with $e$, $e$ is aligned with each contiguous subsequence of the arc sequence $\pi_B \cap \pi_E$.

The important thing to notice here is that we do not explicitly enumerate the lattice paths. The lattice paths aligned with each reference path are constructed implicitly by pairing each lattice arc with the reference arcs it is aligned with. This allows us to efficiently compute all alignments between reference factors and lattice factors that are minimally aligned with them.
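The following self-contained Python sketch shows how the reference arc sequences $\pi_B$, $\pi_E$ and $\pi_I$ are determined for a single lattice arc from state times alone, mirroring the conditions used in lines 17-19 of Algorithm 3 below. The reference labels and times are toy values chosen for illustration.

# Reference arcs as (label, begin_time, end_time); times come from the
# Viterbi alignment of the reference text.
ref_arcs = [("sil", 0.00, 0.30), ("k", 0.30, 0.45), ("a", 0.45, 0.60),
            ("t", 0.60, 0.80), ("s", 0.80, 1.00)]

def reference_spans(arc_begin, arc_end, ref_arcs):
    """Return (pi_B, pi_E, pi_I) for a lattice arc spanning [arc_begin, arc_end]."""
    pi_B = [e for e in ref_arcs if arc_begin <= e[1] < arc_end]   # begins inside the span
    pi_E = [e for e in ref_arcs if arc_begin < e[2] <= arc_end]   # ends inside the span
    pi_I = [e for e in pi_B if e in pi_E]                         # begins and ends inside
    return pi_B, pi_E, pi_I

# A lattice arc covering [0.40, 0.85]:
pi_B, pi_E, pi_I = reference_spans(0.40, 0.85, ref_arcs)
print([e[0] for e in pi_B])   # ['a', 't', 's']
print([e[0] for e in pi_E])   # ['k', 'a', 't']
print([e[0] for e in pi_I])   # ['a', 't']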
Formally, the weighted sequence alignment algorithm takes as input two acyclic WFSAs $L$ and $R$, representing an ASR lattice and a reference alignment produced for the same utterance, along with corresponding state time lists $t_L$ and $t_R$, and outputs an alignment index WFST $A$. We require $L$ to be an $\varepsilon$-free acceptor over the log semiring where the time of all initial states is 0 and the time of all final states is $T$ (the length of the utterance, typically the number of acoustic frames). Without loss of generality we require that $L$ is a stochastic automaton. If this is not the case, $L$ can be made stochastic using the general WFSA weight pushing algorithm over the log semiring [53]. We require $R$ to be an unweighted linear chain acceptor, i.e. a single path from the unique initial state with time 0 to the unique final state with time $T$. We say that a lattice factor $x$ is minimally aligned with a reference factor $y$ if there is a lattice path $\pi_L$ labelled with $x$ and a reference path $\pi_R$ labelled with $y$ such that the time span of $\pi_L$ encloses the time span of $\pi_R$ and the time span of no proper sub-path of $\pi_L$ encloses the time span of $\pi_R$. Each path in the alignment index $A$ represents the time alignment $\omega$ between its output label sequence $y$, a reference factor, and its input label sequence $x$, a lattice factor that is minimally aligned with $y$, with the path weight giving the exact posterior probability of $\omega$ given $L$, $R$, $t_L$ and $t_R$.

The weighted sequence alignment algorithm (see Algorithm 3) works in two stages: construction and optimization.

Algorithm 3 Compute Weighted Sequence Alignments
Input: $L = (\Sigma, \Delta, Q_L, I_L, F_L, E_L, \lambda_L, \rho_L)$, $t_L$
Input: $R = (\Sigma, \Delta, Q_R, I_R, F_R, E_R, \lambda_R, \rho_R)$, $t_R$
Output: $A = (\Sigma, \Delta', Q_A, I_A, F_A, E_A, \lambda_A, \rho_A)$
 1  function ADDPATH(p, i, o, w, q)
 2      k <- |o|
 3      if k = 0 then
 4          E_A <- E_A ∪ {(p, i, ε, w, q)}
 5      else
 6          for j in 1 ... k-1 do
 7              r <- MAKESTATE(Q_A)
 8              Q_A <- Q_A ∪ {r}
 9              E_A <- E_A ∪ {(p, ε, o[j], 1̄, r)}
10              p <- r
11          E_A <- E_A ∪ {(p, i, o[k], w, q)}
12  COMPUTEFORWARD(L)
13  (Δ', Q_A, I_A, F_A, E_A) <- (Δ, Q_L ∪ {q_I, q_F}, {q_I}, {q_F}, ∅)
14  λ_A[q] <- 1̄ for all states q ∈ I_A; 0̄ for all states q ∈ Q_A \ I_A
15  ρ_A[q] <- 1̄ for all states q ∈ F_A; 0̄ for all states q ∈ Q_A \ F_A
16  for each arc (p, l, l, w, q) ∈ E_L do
17      π_B <- (e ∈ E_R | t_L[p] <= t_R[p[e]] < t_L[q])
18      π_E <- (e ∈ E_R | t_L[p] < t_R[n[e]] <= t_L[q])
19      π_I <- π_B ∩ π_E
20      ADDPATH(p, l, o[π_E], w, q)
21      for e_j in π_E where i[e_j] != ε do
22          r <- STATEPAIRMAP(q, n[e_j])
23          Q_A <- Q_A ∪ {r}
24          τ <- MAKELABEL(t_R[n[e_j]])
25          Δ' <- Δ' ∪ {τ}
26          E_A <- E_A ∪ {(r, ε, τ, 1̄, q_F)}
27          ADDPATH(p, l, o[e_1 ... e_j], w, r)
28      for e_j in π_I where i[e_j] != ε do
29          r <- STATEPAIRMAP(p, p[e_j])
30          Q_A <- Q_A ∪ {r}
31          τ <- MAKELABEL(t_R[p[e_j]])
32          Δ' <- Δ' ∪ {τ}
33          E_A <- E_A ∪ {(q_I, ε, τ, α[p], r)}
34          ADDPATH(p, l, o[e_j ... e_{|π_I|}], w, r)
35      for e_j in π_I where i[e_j] != ε do
36          r_B <- STATEPAIRMAP(p, p[e_j])
37          for ê_k in π_I where i[ê_k] != ε do
38              r_E <- STATEPAIRMAP(q, n[ê_k])
39              if t_R[p[e_j]] < t_R[n[ê_k]] then
40                  ADDPATH(r_B, l, o[e_j ... ê_k], w, r_E)
41  A <- REMOVEEPSILON(A)
42  A <- ENCODE(A)
43  A <- DETERMINIZE(A)
44  A <- MINIMIZE(A)
45  A <- DECODE(A)

We start the construction stage by computing the forward probabilities for all states in $L$ (line 12). Since $L$ is assumed to be stochastic, we do not need to compute backward probabilities, which are all equal to $\bar{1}$. We start the construction of $A$, which is initially an empty WFST, by copying the states from $L$. After adding all lattice states to $A$ with the same state IDs, we also add a unique initial state $q_I$ and a unique final state $q_F$. Then we iterate over the lattice arcs and for each lattice arc $(p, l, l, w, q)$ perform the following steps:

1. Find the longest reference path $\pi_B$ where all component arcs begin within the time span of the current lattice arc (line 17).
2. Find the longest reference path $\pi_E$ where all component arcs end within the time span of the current lattice arc (line 18).
3. Find the longest reference path $\pi_I = \pi_B \cap \pi_E$ that is a subsequence of both $\pi_B$ and $\pi_E$ (line 19).
4. Add a path $\pi$ from $p$ to $q$ with input label sequence $i[\pi] = l$, output label sequence $o[\pi] = o[\pi_E]$, and weight $w[\pi] = w$ (line 20).
5. For each non-$\varepsilon$ reference arc $e_j$ in $\pi_E$:
   (a) Add a new state $r$ representing the state pair $(q, n[e_j])$ if it hasn't already been added (lines 22-23). If $r$ is just added, also add a new arc $(r, \varepsilon, \tau, \bar{1}, q_F)$ (line 26) where $\tau$ is a new label representing the time instance $t_R[n[e_j]]$.
   (b) Add a path $\pi$ from $p$ to $r$ with input label sequence $i[\pi] = l$, output label sequence $o[\pi] = o[e_1 \cdots e_j]$, and weight $w[\pi] = w$ (line 27).
6. For each non-$\varepsilon$ reference arc $e_j$ in $\pi_I$:
   (a) Add a new state $r$ representing the state pair $(p, p[e_j])$ if it hasn't already been added (lines 29-30). If $r$ is just added, also add a new arc $(q_I, \varepsilon, \tau, \alpha[p], r)$ (line 33) where $\tau$ is a new label representing the time instance $t_R[p[e_j]]$.
   (b) Add a path $\pi$ from $p$ to $r$ with input label sequence $i[\pi] = l$, output label sequence $o[\pi] = o[e_j \cdots e_{|\pi_I|}]$, and weight $w[\pi] = w$ (line 34).
7. For each non-$\varepsilon$ reference arc $e_j$ in $\pi_I$:
   (a) Look up state $r_B$ representing the state pair $(p, p[e_j])$ (line 36).
   (b) For each non-$\varepsilon$ reference arc $\hat{e}_k$ in $\pi_I$:
       i. Look up the state $r_E$ representing the state pair $(q, n[\hat{e}_k])$ (line 38).
       ii. If $t_R[p[e_j]] < t_R[n[\hat{e}_k]]$, add a path $\pi$ from $r_B$ to $r_E$ with input label sequence $i[\pi] = l$, output label sequence $o[\pi] = o[e_j \cdots \hat{e}_k]$, and weight $w[\pi] = w$ (lines 39-40).
The $\tau$ symbol on each initial arc $(q_I, \varepsilon, \tau, \alpha[p], r)$ (line 33) represents the beginning time of reference sequences on alignment paths that begin at state $r$. Similarly, the $\tau$ symbol on each final arc $(r, \varepsilon, \tau, \bar{1}, q_F)$ (line 26) represents the ending time of reference sequences on alignment paths that end at state $r$. These labels are optional and can be replaced with $\varepsilon$ labels if the sequence time alignments are not needed. We include them for two reasons:

1. They keep the alignments for reference factors that repeat within the reference label sequence separate.
2. They allow us to query the alignment index in a deterministic fashion so that all alignments within a specific reference time interval can be retrieved by expanding only the states and arcs that will be included in the result.

We exploit these in the next section when constructing alignments for each keyword we select from the reference text.

In the optimization stage (lines 41-45), we remove the epsilon arcs in $A$, encode the input-output labels, apply weighted determinization and minimization, and finally decode the labels to obtain the alignment index with the desired characteristics. Note that all optimization operations are done over the log semiring to sum up posteriors for identical alignments that are merged into a single path in the output.

6.3 Models

We model sub-word confusions using joint n-gram models [32, 24, 9, 14], which are among the most popular models for sequence-to-sequence transliteration tasks, such as G2P and P2P. These models can be represented as WFSTs and trained on fractional counts, which makes them very attractive in our retrieval framework. As far as the rest of the retrieval framework is concerned, the only requirement for the specific model type used for scoring sub-word confusions is that it admits a WFST representation. Hence, the joint n-gram models that we use can be supplanted by other sequence-to-sequence models that admit either an explicit or implicit WFST representation, such as the neural models described in [66, 85]. The joint n-gram model used for computing the numerator of the context-aware alignment scoring model is unique in the sense that although it uses the same machinery as other joint n-gram models, it is in fact a hybrid n-gram model scoring sub-word confusions in a given context. It works like a regular n-gram language model in the context regions and like a joint n-gram confusion model in the keyword region.

We train the joint n-gram confusion models on fractional counts computed from WFSTs representing weighted sequence alignments after encoding the input-output label pairs on each arc. This is the same approach taken in [60]. The alignment WFSTs are extracted from ASR lattices produced for the training utterances and Viterbi alignments of the corresponding reference texts using the alignment index WFSTs described in Section 6.2. Each alignment WFST $A_K$ used for training the query expansion model includes the weighted label sequence alignments between a keyword $K$ selected from a reference label sequence $R$ and all label sequences in the corresponding ASR lattice $L$ that minimally align with it. Given an FST $K'$ representing a time-anchored keyword selected from a reference label sequence $R$, we compute the alignment WFST $A_K$ by composing the alignment index WFST $A$ with $K'$ on the right:

$A_K = A \circ K'$

Keywords are anchored in time by appending the appropriate $\tau$ labels, which represent the begin and end time of the keyword, to the input side of $K$, i.e.
$i[K'] = \tau_B \, k_1 \cdots k_n \, \tau_E$ and $o[K'] = \varepsilon \, k_1 \cdots k_n \, \varepsilon$, where $\tau_B$ is the label representing the begin time, $\tau_E$ is the label representing the end time, and $k_1 \cdots k_n$ are the keyword labels. Each alignment WFST $A_K$ obtained in this way is a probabilistic WFST where path weights, which represent the posteriors for lattice label sequences, sum up to one in the probability semiring. Each context augmented alignment WFST $A'_K$ used for training the context-aware scoring model is derived from $A_K$ by concatenating $A_K$ on the left/right with FSTs representing the reference label sequences preceding/succeeding the selected keyword. Note that the resulting context augmented alignment WFSTs are also probabilistic since this concatenation operation does not affect the weights assigned to successful paths.

We generate the data used for training the joint n-gram confusion models in three steps. First we build the alignment index WFSTs $\{A_i \mid i = 1 \ldots N\}$ for all utterances $\{u_i \mid i = 1 \ldots N\}$ in the training corpus. Then we randomly segment the reference label sequences $\{R_i \mid i = 1 \ldots N\}$ into valid keywords $\{K_{ij} \mid i = 1 \ldots N, j = 1 \ldots M_i\}$. Finally, we produce the alignment WFSTs used in training, $\{(A_{K_{ij}}, A'_{K_{ij}}) \mid i = 1 \ldots N, j = 1 \ldots M_i\}$, by composing each time aligned keyword FST $K'_{ij}$ with the corresponding alignment index WFST $A_i$.

In addition to the joint n-gram models used for modeling sub-word confusions, we also train regular n-gram models on the reference sub-word sequences $\{R_i \mid i = 1 \ldots N\}$. These n-gram models represent the denominator portions of the scoring models in Equations 6.3, 6.4, 6.5 and are used for normalizing the scores produced by the joint n-gram models.
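A minimal sketch of the training data generation loop described above, assuming the reference alignment is available as a list of timed labels: reference sequences are randomly segmented into keyword spans and each span is turned into the time-anchored input/output label pair $i[K']$, $o[K']$. The time-label naming scheme and the segmentation policy are illustrative assumptions rather than the exact implementation.

import random

def segment_into_keywords(ref_labels, min_len=2, max_len=4, seed=0):
    """Randomly segment a reference label sequence into keyword spans (start, end)."""
    rng = random.Random(seed)
    spans, i = [], 0
    while i < len(ref_labels):
        j = min(len(ref_labels), i + rng.randint(min_len, max_len))
        spans.append((i, j))
        i = j
    return spans

def anchor_keyword(ref_labels, ref_times, start, end):
    """Build i[K'] and o[K'] for one keyword: time labels appear on the input side only."""
    tau_b = "tau@%.2f" % ref_times[start][0]      # begin time of the first keyword label
    tau_e = "tau@%.2f" % ref_times[end - 1][1]    # end time of the last keyword label
    labels = list(ref_labels[start:end])
    return [tau_b] + labels + [tau_e], ["<eps>"] + labels + ["<eps>"]

ref_labels = ["k", "a", "t", "s", "a", "r", "e"]
ref_times = [(0.30, 0.45), (0.45, 0.60), (0.60, 0.80), (0.80, 1.00),
             (1.00, 1.10), (1.10, 1.25), (1.25, 1.40)]

for s, e in segment_into_keywords(ref_labels):
    i_K, o_K = anchor_keyword(ref_labels, ref_times, s, e)
    print(i_K, o_K)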
6.4 Experiments

In this section we provide experiments comparing our open vocabulary KWS approach with the KWS approach [80] implemented by the Kaldi speech recognition toolkit [64]. Our implementation is also based on Kaldi. Both retrieval pipelines make use of FST representations and operations provided by the OpenFst software library [5]. The differences between the two retrieval setups are limited to the query generation and candidate keyword occurrence scoring steps.

All experiments were conducted on the IARPA BABEL Turkish Full language pack using a graphemic lexicon automatically induced from the training text [80]. All retrieval experiments were performed on the development set, which is standard practice for this setup. The Kaldi based graphemic ASR system was trained on the training set, which includes roughly 80 hours of telephone speech. The acoustic model used for generating ASR lattices is a TDNN model [62, 80] trained with the LF-MMI objective [65]. The language model used for decoding the development set is a standard trigram language model trained on the reference text for the training data. The word error rate (WER) of the ASR system on the development set is 40.6%. We used a unigram language model for generating the training lattices [63]. With the unigram language model, the WER of the ASR system on the training set is 36.4%. The query expansion and scoring models were trained on fractional counts computed from 88K ASR lattices and Viterbi alignments generated for the training set. We used the OpenGrm software library [67] for estimating the n-gram models on fractional counts. All n-gram models used in the experiments were estimated using the Witten-Bell method (k = 10) and were shrunk with the relative entropy pruning technique (θ = 1.0e−6) [67].

6.4.1 Results

We evaluated the performance of the novel weighted sequence alignment algorithm described in Section 6.2 by running it on the 88K ASR lattices and Viterbi alignments generated for the training set. Figure 6.3 provides a scatter plot of the runtime of the algorithm vs lattice size (total number of states and arcs). Note that the runtime of the proposed algorithm is worst case exponential in the size of the input lattice since it relies on WFSA determinization. However, we can observe from the scatter plot that for almost all input lattices in our dataset the algorithm runs in approximately linear time even though it computes weighted alignments between exponentially many label sequence pairs.

[Figure 6.3: Weighted sequence alignment runtime (seconds) vs lattice size (total number of states and arcs).]

We evaluated the KWS performance using the development and evaluation keyword sets provided in the IARPA BABEL Turkish language pack. The evaluation was limited to keywords which include at least 5 graphemes, which resulted in a total of 284 development keywords (280 IV + 4 OOV) and 1700 evaluation keywords (1664 IV + 36 OOV). For each KWS experiment, we present the number of correct detections, false alarms (FA) and misses, as well as the term weighted value (TWV) metrics typically used in KWS evaluations [1]. The results for IV and OOV keywords are given separately although the retrieval system does not make a distinction between them, i.e. all keywords are processed as grapheme sequences and are searched in a collection of graphemic ASR lattices.

The first experiment evaluates all of the scoring models described in Section 6.1 using the development keyword set. The results are given in Table 6.1. The alignment WFSTs used for training the query expansion and scoring models used in this experiment were derived from training lattices that were pruned to a logarithmic beam width of 5. All n-gram models employed by the query expansion and scoring models are 5-gram models trained on fractional counts. The baseline results were obtained with the default phonetic/graphemic search implementation provided by Kaldi [23]. The fundamental difference between this baseline implementation and the proposed systems is that the baseline system relies on a simple grapheme-to-grapheme confusion model for expanding and scoring query grapheme sequences. This model is estimated by aligning best ASR hypotheses generated for training utterances with the reference transcripts using minimum edit distance and counting the grapheme-to-grapheme confusions. In Table 6.1, we refer to the scoring models defined in Equations 6.3, 6.4 and 6.5 as "Ctx Scoring", "Sum Scoring" and "Max Scoring" respectively. The "No Scoring" model is an approximate implementation of the "Max Scoring" model. In this approach we normalize the scores produced by the query expansion model using the denominator component of the "Max Scoring" model before composing the query FST with the index FST. This operation is equivalent to using the "Max Scoring" model along with the approximate lattice posterior in Equation 6.2. The results in Table 6.1 suggest that all of the proposed methods perform markedly better than the baseline method, yet there is no significant difference between them as far as detection performance is concerned.

Table 6.1: KWS results for the development keywords.
Model         Queries  #Correct  #FA    #Miss  ATWV     MTWV    OTWV    STWV
Baseline      IV       600       624    822    0.4301   0.4312  0.5183  0.6174
              OOV      0         10     6      -0.0690  0.0917  0.2155  0.3333
No Scoring    IV       794       1423   628    0.6081   0.6081  0.7421  0.8888
              OOV      2         29     4      0.2998   0.2998  0.4187  0.7500
Max Scoring   IV       839       1474   583    0.5967   0.5967  0.7275  0.8888
              OOV      1         29     5      0.0498   0.2323  0.5498  0.7500
Sum Scoring   IV       842       1507   580    0.5907   0.5967  0.7288  0.8888
              OOV      1         32     5      0.0291   0.2875  0.5429  0.7500
Ctx Scoring   IV       833       1552   580    0.6037   0.6113  0.7387  0.8880
              OOV      2         43     4      0.2032   0.3151  0.5636  0.7500

We further evaluated some of the proposed scoring models using the significantly larger evaluation keyword set. The results are given in Table 6.2. The "Baseline", "No Scoring" and "Max Scoring" models used in this experiment are the same models used in the previous experiment. The "No Scoring (No LM)" model is identical to the "No Scoring" model except for the fact that the language model (LM) scores in the ASR lattices were removed before computing the alignment WFSTs used for training the query expansion and scoring models. This approach was inspired by the often used KWS technique of removing or downscaling language model scores in sub-word ASR lattices when estimating the posteriors for OOV keywords. The results in Table 6.2 are consistent with the results obtained for the development keywords. Further, they suggest that removing, or in general downscaling, LM scores in ASR lattices can lead to better confusion models.

Table 6.2: KWS results for the evaluation keywords.

Model               Queries  #Correct  #FA    #Miss  ATWV    MTWV    OTWV    STWV
Baseline            IV       3560      6000   5005   0.3778  0.3789  0.4875  0.6418
                    OOV      18        107    62     0.1508  0.1845  0.2656  0.3949
No Scoring          IV       4194      10383  4371   0.4414  0.4426  0.6009  0.8419
                    OOV      24        178    56     0.2739  0.2951  0.4585  0.7137
No Scoring (No LM)  IV       4325      10226  4240   0.4513  0.4513  0.6036  0.8445
                    OOV      24        186    56     0.3067  0.3503  0.4886  0.7315
Max Scoring         IV       3987      9021   4578   0.4330  0.4350  0.5932  0.8419
                    OOV      25        165    55     0.2799  0.2799  0.4400  0.7137

To better understand where the gains in KWS performance are coming from, we conducted another experiment using the evaluation keyword set. The results are given in Table 6.3. The top block of results were obtained with confusion models trained using the single best ASR hypotheses produced for the training utterances. The bottom block of results were obtained with the same training lattices used for generating Table 6.2. The first observation we can make from these results is that the proposed models significantly outperform the baseline model in Table 6.2 even if the models are trained on alignment WFSTs derived from single best ASR hypotheses. The second observation is that confusion models trained on alignment WFSTs derived from large training lattices are significantly more effective when the keywords are OOV. The third observation is that high order n-gram confusion models are good for OOV keywords while low order n-gram confusion models are good for IV keywords.

Table 6.3: ATWV/MTWV results for the evaluation keywords for different n-gram orders.

Model                       Queries  1-gram         3-gram         5-gram         6-gram
No Scoring (1best)          IV       0.5311/0.5311  0.4117/0.4136  0.4385/0.4385  NA/NA
                            OOV      0.2130/0.2629  0.2561/0.2796  0.2484/0.2719  NA/NA
No Scoring (No LM, Beam=5)  IV       0.5312/0.5321  0.4345/0.4351  0.4513/0.4513  0.4507/0.4507
                            OOV      0.2622/0.2843  0.2626/0.2879  0.3067/0.3503  0.3044/0.3472
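For reference, the ATWV numbers reported in Tables 6.1-6.3 follow the standard NIST term weighted value definition; the sketch below computes it from per-keyword counts, assuming the usual cost/value ratio beta = 999.9 and the convention that keywords with no reference occurrences are excluded from the average. MTWV, OTWV and STWV differ only in how the detection threshold is chosen (best global threshold, best per-keyword threshold, and ignoring false alarms, respectively). The counts below are toy numbers.

def atwv(per_keyword_counts, speech_duration_sec, beta=999.9):
    """Actual TWV: 1 - mean over keywords of (P_miss + beta * P_FA).
    per_keyword_counts: list of (n_true, n_correct, n_false_alarm) per keyword."""
    terms = []
    for n_true, n_correct, n_fa in per_keyword_counts:
        if n_true == 0:
            continue  # keywords with no reference occurrences are excluded
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_fa / (speech_duration_sec - n_true)   # non-target trials ~ seconds of speech
        terms.append(p_miss + beta * p_fa)
    return 1.0 - sum(terms) / len(terms)

# Toy example: three keywords scored against 10 hours of speech.
counts = [(10, 8, 3), (4, 2, 0), (25, 20, 12)]
print(round(atwv(counts, speech_duration_sec=10 * 3600), 4))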
6.5 Related Work

ASR hypotheses are typically stored in the form of weighted directed acyclic graphs known as lattices. Using weighted finite-state transducers (WFSTs) [57] to represent ASR lattices has much appeal due to the general-purpose search, optimization and combination algorithms supplied by the WFST framework. Using WFSTs to index and search ASR lattices for KWS has a similar appeal since they provide a powerful mathematical and computational framework for efficiently retrieving and scoring soft/probabilistic matches from a large collection of lattices [3, 13]. Our retrieval approach is based on WFST algorithms, index structures represented as WFSTs and models that admit WFST representations to provide a powerful and flexible framework for open vocabulary KWS. We build upon the sub-word retrieval techniques widely used in KWS [72, 75, 47, 18, 15, 14, 21, 71, 41, 58, 42].

While ASR-based KWS systems typically provide state-of-the-art results for IV keywords, the performance degrades significantly for OOV keywords. This degradation is expected since OOV keywords are not natively hypothesized by the ASR system and are retrieved based on their similarity to IV token sequences. There are other KWS approaches in the literature that do not rely on ASR lattices, such as point process models [46] and posteriorgram based techniques [34], that have been shown to work well for OOV keywords. When used in combination with an ASR-based KWS system, these approaches can often improve the overall KWS performance.

Estimation and normalization of confidence measures for KWS have been widely studied in the literature [83, 84, 82, 79, 73, 78]. A thorough overview and discussion of numerous confidence measures used in KWS can be found in [82]. We try to address the problems related to ASR lattice posteriors by scoring candidate keyword occurrences retrieved from the index within their lattice contexts. We explicitly model the errors made by the ASR system and redistribute the posteriors based on sub-word level confusions between the ASR hypotheses and the keyword. Although the models we use do not use any features beyond sub-word level alignments, this is not a limitation imposed by the retrieval framework. The only requirement for the models is that they can be represented as WFSTs. We believe the query generation and occurrence scoring steps can be improved by using stronger sequence models that take into account other information available in ASR lattices such as arc durations or state level alignments. Even neural network models that are currently the state-of-the-art in a number of sequence-to-sequence transduction tasks can be used in our framework as long as they admit either explicit or implicit WFST representations, such as the models used in [66, 85].

The query expansion and keyword occurrence scoring models we use are joint n-gram models [32, 24, 9, 14]. We train these with fractional counts computed on a complete set of weighted sequence alignments derived from ASR lattices. To our knowledge, training sub-word confusion models in this way has not been tried before. In this regard, the most similar approach from the literature is the training of G2P models on weighted alignment lattices that contain all possible alignments between the grapheme and phoneme sequences, where the weights are the scores estimated by the Expectation Maximization procedure [60]. Our weighted alignment lattices are limited to the errors made by the ASR system for each keyword region and the weights are lattice posteriors.
The weighted sequence alignment algorithm we described does not do any string alignment between input automata in the traditional sense [44, 52]. It is closely related to the algorithms used for computing weighted factor automata of other automata [3, 13, 16, 17]. The output of the algorithm is an efficient posterior-weighted sequence alignment index that can be used to retrieve all sequences aligned with a query sequence. Although it has not been explored here, it is possible to extend this sequence alignment index structure to store additional information about the aligned sequences, such as feature vectors describing the sequences or their context, which can then be used in model training and retrieval.

6.6 Summary

We described an open vocabulary KWS framework based on WFSTs which was designed with the goal of computing theoretically-sound lattice-based posteriors for candidate keyword occurrences independent of keyword characteristics. Our retrieval approach attempts to achieve this task by explicitly modeling the errors made by the ASR system and redistributing the posteriors at retrieval time based on sub-word level confusions between the word sequences hypothesized by the ASR system and the keyword. The retrieval pipeline employs two separate sequence confusion models represented as WFSTs, a query expansion model and a keyword occurrence scoring model, to efficiently retrieve candidate keyword occurrences and score them in context. We described the algorithms used for computing all of these quantities within the WFST framework. We also described a novel sequence alignment algorithm for joint computation of all weighted sequence alignments between an ASR lattice and the Viterbi alignment of the corresponding reference text. This sequence alignment algorithm was used to efficiently compute the weighted sequence alignments used in training the joint n-gram confusion models on fractional counts. We provided experiments on a standard KWS dataset from the BABEL program that demonstrate the effectiveness of our approach.

Chapter 7

Conclusion

In this dissertation, we presented an efficient, flexible and theoretically-sound framework for SCR based on weighted finite-state transducers. We described novel algorithms and data structures for efficiently computing and indexing statistics and other information meaningful for SCR applications from input lattices. We laid out a novel approach for performing open vocabulary KWS by explicitly modeling ASR errors and redistributing lattice-based posterior estimates based on sub-word level confusions. We provided experimental results demonstrating the advantages of the algorithms, data structures and techniques we described. While the focus was on the KWS problem, the algorithms and representations presented here are applicable to a wide range of content-based retrieval problems where the content to be indexed and searched is sequential and the output of the frontend recognition component can be represented as a weighted lattice of hypothesized sequences.

We believe our open vocabulary KWS framework (Chapter 6) can be extended in a number of ways to better model the keyword posteriors. The scoring model we employed is entirely based on the alignments between the keyword sub-word sequence and the sub-word sequences on lattice paths. This model can be enriched by considering additional features such as sub-word durations or long distance relationships between keywords and lattice sequences.
Also, the lattices produced when scoring keyword occurrences can be rescored with stronger language models based on recurrent neural networks to better gauge the fit between the hypothesized keyword occurrence and its context. There are also a number of opportunities for optimizing the retrieval pipeline. While it is often beneficial to produce and index deep lattices for retrieving candidate keyword occurrences, the lattices used for scoring keyword occurrences do not need to be very large and can be pruned with a small beam width without a large degradation in performance. Also, the number of query sequences searched in the index and the number of keyword occurrences scored can be reduced by applying judicious pruning beams to intermediate results. A further optimization can be made by approximating the lattice based scoring strategy employed here with a single best context path for each keyword occurrence. In this approximate scoring scenario, the best context paths for each candidate keyword region can be computed and indexed ahead of time to significantly reduce the computation that must be done at retrieval time.

Learning acoustic word embeddings from supervised as well as unsupervised speech data is a research direction that has recently shown promise in a number of speech tasks related to KWS, including keyword spotting [22], zero- and very-low-resource query-by-example keyword search [45, 74], unsupervised segmentation and clustering of speech [39], and cross-view word discrimination [38]. While acoustic word embeddings can also be used for KWS [7, 69] by embedding the keywords and potential speech segments into a common representation space, existing approaches are not yet competitive with ASR-based KWS. Unlike in the case of the closely related keyword spotting task, where keywords of interest are known ahead of time, training occurrences are plentiful or relatively easy to acquire and test occurrences are to be detected in an online fashion, in the KWS setting it is hard to learn acoustic word embedding models that can compete with ASR. Further, finding occurrences in continuous speech that best match a keyword by comparing representations is a costly operation that, if done naively, scales linearly with the size of the corpus. Since candidate occurrence boundaries are not known, the retrieval operation needs to consider a large number of overlapping segments of varying duration extracted from speech utterances and find matching segments in a short period of time. Employing methods like randomized approximate nearest neighbor search [36, 45] can help in this task, but the viability of these methods in the KWS setting, especially in the context of highly competitive ASR-based approaches, remains to be seen.

Reference List

[1] Draft KWS16 Keyword Search Evaluation Plan. Technical report, National Institute of Standards and Technology (NIST), 2016.

[2] Cyril Allauzen, Shankar Kumar, Wolfgang Macherey, Mehryar Mohri, and Michael Riley. Expected sequence similarity maximization. In HLT-NAACL, pages 957–965. The Association for Computational Linguistics, 2010.

[3] Cyril Allauzen, Mehryar Mohri, and Brian Roark. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 40–47, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

[4] Cyril Allauzen, Mehryar Mohri, and Murat Saraclar.
General indexation of weighted automata: Application to spoken utterance retrieval. In HLT-NAACL Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, pages 33–40, Boston, MA, USA, 2004. [5] Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. OpenFst: A general and efficient weighted finite-state transducer library. In Proceedings of the Ninth Interna- tional Conference on Implementation and Application of Automata, (CIAA 2007), volume 4783 of Lecture Notes in Computer Science, pages 11–23. Springer, 2007. http://www.openfst.org. [6] E. Arısoy, Do˘ gan Can, Has ¸im Sak, Sıddıka Parlak, and Murat Sarac ¸lar. Turkish broadcast news transcription and retrieval. IEEE Transactions on Speech and Audio Processing, 12(2):291–301, June 2009. [7] Kartik Audhkhasi, Andrew Rosenberg, Abhinav Sethy, Bhuvana Ramabhadran, and Brian Kings- bury. End-to-end asr-free keyword search from speech. IEEE Journal of Selected Topics in Signal Processing, 11(8):1351–1359, 2017. [8] Maximilian Bisani and Hermann Ney. Multigram-based grapheme-to-phoneme conversion for lvcsr. In Eighth European Conference on Speech Communication and Technology, 2003. [9] Maximilian Bisani and Hermann Ney. Joint-sequence models for grapheme-to-phoneme conversion. Speech communication, 50(5):434–451, 2008. [10] Graeme Blackwood, Adri` a de Gispert, and William Byrne. Efficient path counting transducers for minimum bayes-risk decoding of statistical machine translation lattices. In Proceedings of the ACL 2010 Conference Short Papers, pages 27–32, Uppsala, Sweden, July 2010. Association for Compu- tational Linguistics. [11] A. Blumer, J. Blumer, A. Ehrenfeucht, D. Haussler, M. T. Chen, and J. Seiferas. The smallest automaton recognising the subwords of a text. Theoretical Computer Science, 40:31–55, 1985. [12] A. Blumer, J. Blumer, D. Haussler, R. McConnell, and A. Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM, 34(3):578–595, 1987, July. 84 [13] D. Can and M. Saraclar. Lattice indexing for spoken term detection. Audio, Speech, and Language Processing, IEEE Transactions on, 19(8):2338–2347, Nov 2011. [14] Dogan Can, Erica Cooper, Arnab Ghoshal, Martin Jansche, Sanjeev Khudanpur, Bhuvana Ramab- hadran, Michael Riley, Murat Saraclar, Abhinav Sethy, Morgan Ulinski, and Christopher White. Web derived pronunciations for spoken term detection. In SIGIR ’09: Proceedings of the 32th Annual In- ternational ACM SIGIR Conference on Research and Development in Information Retrieval, pages 83–90, 2009. [15] Dogan Can, Erica Cooper, Abhinav Sethy, Chris White, Bhuvana Ramabhadran, and Murat Saraclar. Effect of pronunciations on oov queries in spoken term detection. In Proceedings of the IEEE Inter- national Conference on Acoustics, Speech, and Signal Processing, pages 3957–3960, Los Alamitos, CA, 2009. IEEE Computer Society. [16] Dogan Can and Shrikanth Narayanan. On the computation of document frequency statistics from spoken corpora using factor automata. In INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, pages 6–10, Lyon, France, 2013. [17] Dogan Can and Shrikanth S. Narayanan. A dynamic programming algorithm for computing n-gram posteriors from lattices. In Proceedings of the 2015 Conference on Empirical Methods in Natu- ral Language Processing (EMNLP), pages 2388–2397. Association for Computational Linguistics, September 2015. [18] U.V . Chaudhari and M. Picheny. 
Improvements in phone based audio search via constrained match with high order confusion estimates. In IEEE Workshop on Automatic Speech Recognition & Under- standing, pages 665–670, Dec. 2007. [19] Ciprian Chelba and Alex Acero. Position specific posterior lattices for indexing speech. In ACL ’05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 443–450, Morristown, NJ, USA, 2005. Association for Computational Linguistics. [20] Ciprian Chelba, Timothy J. Hazen, and Murat Saraclar. Retrieval and browsing of spoken content. Signal Processing Magazine, IEEE, 25(3):39–49, May 2008. [21] Guoguo Chen, Sanjeev Khudanpur, Daniel Povey, Jan Trmal, David Yarowsky, and Oguz Yilmaz. Quantifying the value of pronunciation lexicons for keyword search in lowresource languages. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8560–8564. IEEE, 2013. [22] Guoguo Chen, Carolina Parada, and Tara N Sainath. Query-by-example keyword spotting using long short-term memory networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5236–5240. IEEE, 2015. [23] Guoguo Chen, Oguz Yilmaz, Jan Trmal, Daniel Povey, and Sanjeev Khudanpur. Using proxies for oov keywords in the keyword search task. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 416–421. IEEE, 2013. [24] Stanley F Chen. Conditional and joint models for grapheme-to-phoneme conversion. In Eighth European Conference on Speech Communication and Technology, 2003. [25] Corinna Cortes, Mehryar Mohri, Ashish Rastogi, and Michael Riley. On the computation of the relative entropy of probabilistic automata. Int. J. Found. Comput. Sci., 19(1):219–242, 2008. [26] M. Crochemore. Transducers and repetitions. Theoretical Computer Science, 45(1):63–86, 1986. 85 [27] Maxime Crochemore and Wojciech Rytter. Jewels of stringology. World Scientific Publishing Co. Inc., River Edge, NJ, 2003. [28] Adri` a de Gispert, Graeme Blackwood, Gonzalo Iglesias, and William Byrne. N-gram posterior prob- ability confidence measures for statistical machine translation: an empirical study. Machine Trans- lation, 27(2):85–114, 2013. [29] Sabine Deligne and Fr´ ed´ eric Bimbot. Inference of variable-length linguistic and acoustic units by multigrams1. Speech Communication, 23(3):223–241, 1997. [30] John DeNero, Shankar Kumar, Ciprian Chelba, and Franz Och. Model combination for machine translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, pages 975–983, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. [31] Jason Eisner. Expectation semirings: Flexible EM for finite-state transducers. In Gertjan van Noord, editor, Proceedings of the ESSLLI Workshop on Finite-State Methods in Natural Language Process- ing (FSMNLP), Helsinki, August 2001. Extended abstract (5 pages). [32] Lucian Galescu and James F Allen. Bi-directional conversion between graphemes and phonemes us- ing a joint n-gram model. In 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis, 2001. [33] Vaibhava Goel and William J Byrne. Minimum bayes-risk automatic speech recognition. Computer Speech & Language, 14(2):115–135, 2000. [34] Batuhan G¨ undo˘ gdu and Murat Sarac ¸lar. Distance metric learning for posteriorgram based keyword search. 
Abstract
Spoken Content Retrieval (SCR) integrates Automatic Speech Recognition (ASR) and Information Retrieval (IR) to provide access to large multimedia archives based on their contents. Several tasks of varying difficulty fall under the SCR umbrella. Among them, Keyword Search (KWS) is one of the harder ones: the goal is to locate exact matches to an open-vocabulary query term in a large, heterogeneous speech corpus. Retrieval must be fast, so the data has to be indexed ahead of time. Since ASR transcripts are often highly erroneous in real-world scenarios due to model weaknesses, especially in languages and domains where supervised resources are limited, all of these requirements must be met with imperfect information about which words occur where in the corpus.

We present an efficient, flexible and theoretically sound framework for SCR based on weighted finite-state transducers. While we mainly focus on the challenging KWS task, the algorithms and representations we propose are applicable in a wide variety of scenarios where the inputs can be represented as lattices, i.e., acyclic weighted finite-state automata. Our contributions include: i) novel techniques for indexing and searching a collection of ASR lattices for KWS; ii) a new algorithm for computing and indexing exact posterior probabilities for all substrings in a lattice; iii) a recipe for computing and indexing probabilistic generalizations of statistics widely used in IR, such as term frequency (TF), inverse document frequency (IDF) and TF-IDF, for all substrings in a collection of lattices; iv) a new algorithm for computing and indexing posterior-weighted alignments between substrings in a time-aligned reference string and substrings in an ASR lattice; and v) a novel approach to open-vocabulary KWS that explicitly models ASR errors and redistributes lattice-based posterior estimates based on sub-word-level confusions.
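To make the notion of lattice posteriors concrete, the sketch below computes arc-level posterior probabilities in a toy acyclic lattice using a forward-backward pass in the log domain. It is only a minimal illustration of the quantity being indexed: the lattice encoding, state numbering and weights are hypothetical, and the dissertation's actual algorithms operate on weighted factor transducers rather than this ad hoc representation.

import math
from collections import defaultdict

# Toy lattice: an acyclic weighted automaton stored as adjacency lists.
# Each arc is (next state, word label, negative log probability).
# This encoding, the state numbering and the weights are all hypothetical;
# they serve only to illustrate the posterior computation.
ARCS = {
    0: [(1, "keyword", 0.22), (1, "key", 1.61)],
    1: [(2, "search", 0.11), (2, "searched", 2.30)],
    2: [],
}
START, FINAL = 0, 2

def logsumexp(values):
    # Stable log(sum(exp(v))) over log-domain values.
    values = list(values)
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_backward(arcs, start, final):
    # Forward (alpha) and backward (beta) log scores for every state,
    # assuming states are numbered in topological order.
    states = sorted(arcs)
    alpha = defaultdict(lambda: float("-inf"))
    beta = defaultdict(lambda: float("-inf"))
    alpha[start] = 0.0
    for s in states:                       # forward pass
        for nxt, _, neglogp in arcs[s]:
            alpha[nxt] = logsumexp([alpha[nxt], alpha[s] - neglogp])
    beta[final] = 0.0
    for s in reversed(states):             # backward pass
        for nxt, _, neglogp in arcs[s]:
            beta[s] = logsumexp([beta[s], beta[nxt] - neglogp])
    return alpha, beta

def arc_posteriors(arcs, start, final):
    # Posterior of each arc: total probability of all paths through the arc
    # divided by the total probability mass of the lattice.
    alpha, beta = forward_backward(arcs, start, final)
    total = beta[start]
    return {
        (s, word, nxt): math.exp(alpha[s] - neglogp + beta[nxt] - total)
        for s in arcs
        for nxt, word, neglogp in arcs[s]
    }

if __name__ == "__main__":
    for arc, p in sorted(arc_posteriors(ARCS, START, FINAL).items()):
        print(arc, round(p, 3))

The sketch stops at individual arcs; the dissertation's contribution, as summarized above, is to extend such posterior computations to all substrings (factors) of a lattice and to index the results for fast retrieval.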
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Weighted tree automata and transducers for syntactic natural language processing
Speech recognition error modeling for robust speech processing and natural language understanding applications
Asset Metadata
Creator: Can, Doğan (author)
Core Title: Weighted factor automata: A finite-state framework for spoken content retrieval
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 07/26/2018
Defense Date: 05/09/2018
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: factor automata, keyword search, lattice indexing, OAI-PMH Harvest, spoken content retrieval, weighted finite-state transducers
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Narayanan, Shrikanth (committee chair), Georgiou, Panayiotis (committee member), Knight, Kevin C. (committee member)
Creator Email: dogancan@usc.edu, dogancanbaz@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-33463
Unique identifier: UC11671744
Identifier: etd-CanDoan-6516.pdf (filename), usctheses-c89-33463 (legacy record id)
Legacy Identifier: etd-CanDoan-6516.pdf
Dmrecord: 33463
Document Type: Dissertation
Rights: Can, Doğan
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA