A PARALLEL COMPUTATIONAL MODEL FOR INTEGRATED SPEECH AND NATURAL LANGUAGE UNDERSTANDING

by

Sang-Hwa Chung

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

August 1993

Copyright 1993 Sang-Hwa Chung

UMI Number: DP22862. All rights reserved.

INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition (c) ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Sang-Hwa Chung under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Date: August 5, 1993

Dedication

To my parents Taeyoung Chung and Jungja Jin, my wife Gyungeun, and my son Kenny.

Acknowledgements

I am deeply grateful to my advisor, Professor Dan Moldovan, for his guidance, support, and encouragement throughout my graduate studies at the University of Southern California. His thorough scientific approach and unending quest for excellence have been inspirational during the years of my thesis research.

I would like to thank Professor Jean-Luc Gaudiot and Professor Kevin Knight for serving on my dissertation committee. I sincerely appreciate the time and guidance they provided in the completion of my dissertation. I would also like to thank Professor Dennis McLeod, Professor Keith Price, and Professor Alexander Sawchuk for their valuable comments and discussions while serving on my guidance committee.

Many thanks are due to everyone who participated in the SNAP project. Wing Lee, Ron DeMara, Eric Lin, Hirendu Vaishnav, Tara Poppen, Adrian Moga, Chinyew Lin, Traian Mitrache, and Mihai Petrescu provided a nice environment for various experiments while developing the SNAP system software and hardware. Minhwa Chung, Juntae Kim, Seungho Cha, Ken Hendrickson, Steve Kowalski, Tony Gallippi, Haigaz Farajian, and Sanda Harabagiu have worked closely with me on the development of SNAP applications; without their help, I could not possibly have finished my thesis by now. I also thank our secretary Dawn Ernst for help with administrative tasks and for proofreading this thesis.

Several organizations and individuals made it possible to obtain the equipment used in the speech front-end. Speech Systems, Inc. (SSI) provided codebooks for the ATC and Radiology domains, and speaker model tapes for the Phonetic Engine. I am grateful to Dr. David Trawick of SSI for his comments and suggestions regarding the experiment with the Phonetic Engine.
Dr. Phil Shinn of Citicorp/TTI was very generous to provide the Phonetic Engine hardware module. Dr. Cristine Montgomery of Language Systems, Inc. provided the required manuals and installation advice. I would also like to acknowledge the support of National Science Foundation grants MIP-89-02426 and MIP-90-09109.

Last but not least, I thank my parents Taeyoung Chung and Jungja Jin for their love and constant support, my wife Gyungeun for her selfless sacrifice, confidence and encouragement, and my son Kenny for his smile.

Contents

Dedication
Acknowledgements
List of Figures
Abstract

1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 Organization of Dissertation

2 Background
  2.1 State of the Art in Speech Understanding Research
    2.1.1 Low-level Speech Recognition Research
    2.1.2 Integrated Speech and Natural Language Understanding Research
    2.1.3 Parallel Processing Approach
  2.2 Parallel Marker-passing Systems
    2.2.1 Semantic Network Representation
    2.2.2 Parallel Inferencing Mechanism
  2.3 Memory-based Parsing
  2.4 SNAP Parallel Computer
    2.4.1 SNAP-1 Prototype
    2.4.2 Instruction Set
    2.4.3 Propagation Rules
    2.4.4 Knowledge Representation on SNAP

3 PASS: A Parallel Marker-passing Approach for Speech Understanding
  3.1 System Overview
    3.1.1 Hierarchical Knowledge Base
    3.1.2 Basic Algorithm
  3.2 Speech-Specific Problems in Phoneme Sequence Recognition
    3.2.1 Insertion, Deletion and Substitution Problems
    3.2.2 Ambiguous Word Boundaries
  3.3 The Alignment Scoring Model
    3.3.1 Alignment between Phonetic Codes and Phoneme Sequences
    3.3.2 X, Y, and Z Matrices
  3.4 The Parallel Speech Understanding Algorithm
    3.4.1 The Functional Structure of PASS
    3.4.2 Program Flow and Parallelism in PASS
  3.5 Marker-passing Solutions for Speech-Specific Problems
    3.5.1 Insertion, Deletion and Substitution Problems
    3.5.2 Word Boundary Problem
  3.6 Multiple Hypotheses Resolution
  3.7 Marker-passing Techniques to Improve Performance
    3.7.1 Triple Prediction Window
    3.7.2 Expectation Adjustment
  3.8 Summary

4 Adding Contextual Knowledge to PASS
  4.1 The Need to Use Contextual Knowledge
  4.2 Applying Contextual Knowledge to Memory-based Speech Understanding
    4.2.1 Embedding Contextual Knowledge into Knowledge Base
    4.2.2 Utilization of Embedded Contextual Knowledge
    4.2.3 Multiple Contexts and Activeness Marker
  4.3 Ambiguity Resolution Based on Contextual Knowledge
    4.3.1 The PASS System with Contextual Processing
    4.3.2 An Example of Ambiguity Resolution Using Contextual Knowledge
    4.3.3 Applying Contextual Constraints to CS Element Level
  4.4 Summary

5 Experimental Results
  5.1 Experimental Environment
    5.1.1 SNAP-1 Simulator
    5.1.2 Speech Front-end
    5.1.3 Implementation of PASS on SNAP
    5.1.4 ATC Domain
  5.2 Analysis on SNAP Program
    5.2.1 Parallelism in SNAP Program
    5.2.2 Instruction Frequency and Execution Time
  5.3 Performance Analysis
    5.3.1 Recognition Accuracy
    5.3.2 Response Time and Scale-up
    5.3.3 Components of Execution Time
    5.3.4 Processor Speed-up
    5.3.5 Input Size Performance
  5.4 Performance of PASS with Contextual Processing
    5.4.1 Accuracy
    5.4.2 Response Time
  5.5 Summary

6 Conclusion
  6.1 Results of Dissertation
  6.2 Future Research
  6.3 Final Remarks

Appendix A: Detailed Flow Chart of the PASS Algorithm
Appendix B: Example of the X, Y and Z Matrices
  B.1 Example of the X Matrix
  B.2 Example of the Y Matrix
  B.3 Example of the Z Matrix
Appendix C: Example Test Sentences for the ATC Domain

Bibliography

List of Figures

2.1 An example of semantic network.
2.2 An example of memory-based parsing.
2.3 SNAP-1 prototype [23].
2.4 SNAP instruction set [62].
3.1 The PASS environment.
3.2 Hierarchical knowledge base including phoneme sequence level - ATC domain.
3.3 A complete CS hierarchy for increase-speed-event in ATC domain.
3.4 Parsing based on prediction and activation.
3.5 An example of CSI generated as a result of parsing.
3.6 An example of insertion, deletion, and substitution.
3.7 A possible alignment between input codes and a target transcription.
3.8 An example of applying X, Y and Z matrices.
3.9 Modules within PASS.
3.10 Parameters affecting concurrency.
3.11 Program flow and parallelism in PASS.
3.12 Parallelism for sample target sentence.
3.13 Alignments including insertion and deletion problems.
3.14 Handling insertion and deletion problems.
3.15 Word boundary problem.
3.16 Multiple hypotheses resolution.
3.17 Triple prediction window.
3.18 Expectation Adjustment based on the Activation Score by A-marker (sub-phoneme sequence: L-M-N).
4.1 An overview of CS & CSI memories with embedded contextual knowledge.
4.2 A detailed example of embedding contextual knowledge into the knowledge base.
4.3 System flow chart.
4.4 An example of ambiguity resolution using contextual knowledge.
4.5 Accumulated activeness scores for Uttr 5 & 6.
4.6 An example of applying contextual constraints to CS element level.
5.1 The PASS system implemented on the SNAP-1 prototype.
5.2 A simple example of SNAP program.
5.3 Instruction frequency and execution time.
5.4 Accuracy vs. sentence length.
5.5 Definition of execution time and response time.
5.6 Response time.
5.7 Execution time.
5.8 Logarithmic performance in executing inferencing algorithms on a tree-shaped semantic network.
5.9 Response time vs. array size (KB size: 1.4K).
5.10 Increase in number of propagate instructions.
5.11 Speed-up with increasing knowledge base size.
5.12 Response time vs. target sentence length on 16 clusters.
5.13 Sentence recognition accuracy.
5.14 Semantic accuracy.
5.15 Response time vs. target sentence length on 16 clusters.
A.1 Flow chart of the PASS algorithm.
B.1 The X Matrix.
B.2 The Y Matrix.
B.3 The Z Matrix.

Abstract

This dissertation presents a massively parallel computational model for the efficient integration of speech and natural language understanding. The integration of speech and natural language understanding is a key issue in large-scale spoken language processing. However, integration involving multiple layers of knowledge sources, such as the phonetic, lexical, syntactic, semantic, and contextual layers, is accompanied by substantial computational overhead. Therefore, an integrated system working in a uniprocessor environment will face a scalability problem as the knowledge base size is increased.

We describe how the scalability problem can be addressed in an integrated Parallel Speech understanding System called PASS. PASS adopts a hierarchically-structured semantic network knowledge base and a memory-based parsing technique that employs parallel marker-passing as an inference mechanism. Within this paradigm, parallel solutions are provided for major speech-specific problems such as insertion, deletion, substitution, and word boundary detection, through tightly-coupled interaction between low-level phoneme sequences and higher-level concepts. A score-based ambiguity resolution scheme utilizing syntactic, semantic, and contextual knowledge as constraints is developed using the memory-based parsing and parallel marker-passing paradigms.
Beyond simply recognizing speech and converting it into text, PASS constructs the underlying meaning representation through parallel speech understanding. PASS has been implemented on the Semantic Network Array Processor (SNAP), a massively parallel computer developed at the University of Southern California. SNAP is designed for semantic network processing with a reasoning mechanism based on marker-passing. Experimental results on SNAP show an 86% sentence recognition rate for the Air Traffic Control (ATC) domain. Furthermore, a speed-up of up to 15-fold is obtained from the parallel platform, which provides response times of a few seconds per sentence for the ATC domain.

Chapter 1

Introduction

1.1 Motivation

AI and Parallel Processing

As current AI technology migrates from its early development stage to realistic, large-scale applications, the issue of how to reduce excessive search and computation has become a major part of AI research. The traditional AI approach to this problem is to use more complex and clever strategies to minimize the search space and cut unnecessary computation. However, as application size increases, simply applying more strategies or heuristics is not enough to control the exponential growth of the search space, mainly because the traditional approach is based on the concept of sequential computation.

Applying the concept of parallelism to AI, on the other hand, provides a new perspective on the problem: we may be able to develop parallel algorithms that could hardly be conceived in terms of sequential processing. With advances in parallel processing technology, these parallel algorithms can actually be implemented on parallel machines instead of being simulated on sequential ones.

Parallel processing technology is poised to play a major role in overcoming the limitations of the conventional AI approach. With the availability of massively parallel computers, it is feasible to exploit the abundant parallelism in AI applications. The need for parallel processing is even more urgent for applications requiring real-time performance, such as speech understanding. In this dissertation, we argue that parallel processing can provide important innovations to speech understanding.

Integration of Speech and Natural Language Understanding

Despite several decades of research activity, speech understanding remains a difficult field. The ultimate goal of speech research is to create an intelligent assistant which listens to what a user tells it and then carries out the instructions. An apparently simpler goal is the listening typewriter, a device which merely transcribes whatever it hears with only a few seconds' delay. The listening typewriter seems simple, but in reality the process of transcription requires almost complete understanding as well. Today, we are still quite far from these ultimate goals, but progress is being made.

A key issue in speech understanding is the integration of speech and natural language understanding. Their effective integration offers higher potential recognition rates than recognition using word-level knowledge alone. Without higher-level knowledge, error rates for even the best currently available systems are fairly large. For example, CMU's Sphinx system [49] is considered to be one of the best speaker-independent continuous-speech recognition systems today.
However, even the Sphinx system has a word recognition accuracy of only 70.6% for speaker-independent, continuous-speech recognition with a vocabulary of 1000 words when recognizing individual words without using syntactic or semantic information [49]. With a word recognition rate of only 70%, the overall sentence recognition accuracy will be quite low.

Clearly, we need to apply high-level knowledge to better understand continuous speech. The integration of speech and natural language understanding resolves multiple ambiguous hypotheses using syntactic, semantic and contextual knowledge sources. Since this requires sizable computation involving multiple levels of knowledge sources, speed can degrade considerably on realistic knowledge bases suitable for broad and complex domains. Therefore, an integrated system implemented in a uniprocessor environment will face a scalability problem in the event that insufficient processing power is available as the knowledge base size is increased.

1.2 Objective

The scalability problem can be addressed using parallel processing approaches. In this dissertation, we present a massively parallel computational model for the efficient integration of speech and natural language understanding. The model adopts a hierarchically-structured semantic network knowledge base to efficiently combine multiple levels of knowledge sources, such as the phonetic, lexical, syntactic, semantic, and contextual layers. The model also adopts a memory-based parsing technique that employs parallel marker-passing as an inferencing mechanism, which has proved to be a viable approach for natural language processing in general [12], [13], [66], [38], [35].

Within this paradigm, major speech-specific problems such as insertion, deletion, substitution, and word boundary detection were analyzed and parallel solutions were provided. Close interaction between the low-level speech recognition and high-level natural language processing modules helps in resolving these speech-specific problems. A score-based ambiguity resolution scheme utilizing syntactic, semantic, and contextual knowledge as constraints was developed using the memory-based parsing and parallel marker-passing paradigms.

We developed an integrated Parallel Speech understanding System called PASS [15], [16], [17], [19], [20] based on the above computational model. PASS was implemented on the Semantic Network Array Processor (SNAP) [61], [62], [63], a marker-passing massively parallel computer developed by our Parallel Knowledge Processing Laboratory at the University of Southern California. SNAP is dedicated to AI applications, particularly knowledge processing. The Phonetic Engine [60] developed by Speech Systems Incorporated (SSI) is used as the speech front-end. The Phonetic Engine is one of the best commercially available speech front-ends [70], and can process speaker-independent continuous speech input in real time. Beyond simply recognizing speech and converting it into text, PASS constructs the underlying meaning representation through parallel speech understanding. Here we describe an operational implementation and analyze its performance on SNAP.

1.3 Organization of Dissertation

Chapter 1 explains the motivation for applying parallel processing to speech understanding and gives an overview of the thesis. Chapter 2 provides the background, including speech research in general, integrated speech and natural language processing research, and parallel processing approaches.
There we also describe the basic concepts of parallel marker-passing and memory-based parsing, and present an overview of the SNAP parallel computer. Chapter 3 presents the parallel speech understanding algorithm in depth, along with marker-passing solutions for the speech-specific problems; we also discuss the potential parallelism contained in the parallel marker-passing based speech understanding algorithm. In Chapter 4, we add contextual knowledge to PASS to improve its performance, explaining the need for contextual knowledge and the implementation details. Chapter 5 describes the implementation of PASS on SNAP and shows the performance obtained for a moderately-sized application domain; instruction parallelism, recognition accuracy, execution time, speed-up, scale-up, and input size performance are analyzed and discussed. Chapter 6 presents a summary and the major conclusions from this research, as well as an outline of future work.

Chapter 2

Background

As discussed in the Introduction, the goal of this dissertation is to develop a parallel marker-passing based computational model for integrated speech and natural language understanding. In this chapter, before entering into the main topic, we briefly review speech understanding research in general, parallel marker-passing, and memory-based natural language understanding. We also present an overview of the SNAP parallel computer.

2.1 State of the Art in Speech Understanding Research

Since ARPA initiated the Speech Understanding Project [45] in 1971, research in continuous-speech recognition has proceeded in several directions. In this section, we present an overview of representative directions in speech recognition research, integrated speech and natural language understanding research, and the parallel processing approach.

2.1.1 Low-level Speech Recognition Research

Speech recognition research has developed along two main directions: the template-based approach and the stochastic approach. The connectionist approach has also had some success, but it is the most recent development in speech recognition and is still the subject of much controversy.

Template-based Approach

In the template-based approach, units of speech are represented by templates, and recognition is carried out by matching an unknown spoken utterance against reference templates and selecting the best matching pattern. Dynamic programming is used to solve the problem of temporal variability. Timing differences between two speech patterns are eliminated by warping the time axis of one pattern so that maximum coincidence with the other is attained; the time-normalized distance is then calculated as the minimized residual distance between them. This minimization is carried out very efficiently using dynamic programming. Dynamic programming based recognition techniques (or dynamic time warping) were first proposed by Sakoe and Chiba [83] in their pioneering work. Since then, several techniques for speech recognition using dynamic time warping have been proposed [84], [64].
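To make the time-warping idea concrete, the sketch below computes a dynamic time warping distance between two sequences of feature vectors. It is a minimal illustration of the general technique, not the algorithm of [83] or of any system described in this chapter; the Euclidean local distance, the unconstrained warping path, and the length normalization are simplifying assumptions chosen for clarity.

    import math

    def dtw_distance(a, b):
        # a, b: sequences of feature vectors (tuples of floats)
        INF = float("inf")
        n, m = len(a), len(b)
        # d[i][j]: minimum accumulated distance aligning a[:i] with b[:j]
        d = [[INF] * (m + 1) for _ in range(n + 1)]
        d[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = math.dist(a[i - 1], b[j - 1])   # local frame distance
                # warping moves: diagonal match, or stretch either time axis
                d[i][j] = cost + min(d[i - 1][j - 1], d[i - 1][j], d[i][j - 1])
        return d[n][m] / (n + m)                        # time-normalized distance

    # Recognition picks the reference template with the smallest warped distance.
    def recognize(utterance, templates):
        return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

In early template-based systems, templates for entire words were constructed and each word had its own full reference template. Template preparation and matching became prohibitively expensive or impractical as vocabulary size increased beyond a few hundred words [91]. To handle this problem, templates must be further divided into phoneme-like units.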
However, template-matching systems are unable to perform fine phonetic distinctions, because spectral templates do not capture the acoustic-phonetic events that are necessary to identify most phonetic segments. In order to perform fine phonetic distinctions, it is necessary to extract more acoustic features from the speech signal. Since these features occur at different times in the signal and affect each other, the information needed to perform fine phonetic distinctions cannot be captured by comparing individual spectral slices.

To overcome this difficulty, researchers have proposed a feature-based approach [21], [105] to speech recognition, in which one must first identify a set of acoustic features that capture the phonetically relevant information in the speech signal. The input speech signal is transformed into a representation that takes these acoustic features into account. From various stages of this transformation, acoustic parameters are extracted and used to classify the utterance into broad phonetic categories. The coarse classification can also include prosodic analysis [90], which identifies regions where the speech signal is likely to be more robust. The outcome of this analysis is used for lexical access.

Stochastic Approach

The stochastic approach is somewhat similar to the template-based approach. One major difference is that stochastic modeling entails the use of probabilistic models to deal with uncertain or incomplete information. In speech recognition, uncertainty and incompleteness arise from many sources, for example confusable sounds, speaker variability, contextual effects, and homophone words. Thus, the stochastic model is a particularly suitable approach to speech recognition.

The most popular stochastic approach today is the Hidden Markov Model (HMM) [73], [74], [49], [50]. An HMM is characterized by a finite-state Markov model and a set of output distributions. The output distribution in the Markov chain models the parametric distribution of speech events, and the transition distribution models the duration of events. HMMs can be used to represent any unit of speech; since there are strong temporal constraints in speech, left-to-right models are usually used. The HMM approach provides a framework which includes an efficient decoding algorithm (the Viterbi algorithm) and an automatic supervised training algorithm (the forward-backward, or Baum-Welch, algorithm).

Compared to the template-based approach, the HMM is more general and has a firmer mathematical foundation. However, the first-order Markov assumption [49] makes it difficult to model coarticulation directly, and HMM training algorithms cannot currently learn the topological structure of word or sub-word models [54]. Nevertheless, HMM models are widely used in current research systems because this technique produces better results for continuous speech with limited-size vocabularies.
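For reference, the following sketch shows Viterbi decoding for a small discrete-output HMM: the dynamic-programming recurrence that recovers the most likely state sequence for an observation sequence. It illustrates the standard algorithm in generic form; the dictionary-based model layout is an assumption made for this illustration and does not come from any system discussed here.

    def viterbi(obs, states, start_p, trans_p, emit_p):
        # delta[t][s]: probability of the best path ending in state s at time t
        delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            delta.append({})
            back.append({})
            for s in states:
                # best predecessor under the first-order Markov assumption
                prev = max(states, key=lambda r: delta[t - 1][r] * trans_p[r][s])
                delta[t][s] = delta[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
                back[t][s] = prev
        # backtrack from the most probable final state
        last = max(states, key=lambda s: delta[-1][s])
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.insert(0, back[t][path[0]])
        return path

The Baum-Welch training algorithm mentioned above re-estimates start_p, trans_p and emit_p from data using the related forward-backward recurrences, which share this same trellis structure.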
SSI's Approach

Since SSI's Phonetic Engine is used as the speech front-end for our speech understanding research, we introduce SSI's approach briefly. The SSI speech system is composed of two modules: the Phonetic Encoder (PE, or Phonetic Engine) and the Phonetic Decoder (PD) [3], [4]. The speech recognition algorithm developed by SSI is based on the feature-based approach. The PE can handle speaker-independent continuous speech input with a large vocabulary of up to 40,000 words in real time, and is one of the best commercially available speech recognition systems [70].

Like other feature-based systems, the PE performs feature extraction, segmentation and labeling to transform speech signals into a sequence of phonemic symbols. A continuous speech utterance is first quantized by the acoustic front-end into a time-sequence of acoustic frames. Next, the acoustic frames are coded by a neural-net-based frame encoder [3] trained to maximize the average mutual information between its code alphabet and the alphabet of the broad phonetic classes (which are used for frame segmentation). The time-sequence of coded acoustic frames is then processed by a segmenter which forms acoustic segments by merging time-contiguous blocks of frames, using their codes and the code/broad-phonetic-class statistics to make segmentation decisions. Finally, the resulting acoustic segments are coded (or labeled) by a neural-net-based segment encoder [3] trained to maximize the mutual information between its code alphabet and the alphabet of the dictionary phonemic classes (which are used for transcribing words in the dictionary).

To convert the segment-code sequence into sentence text, the PD uses a version of the dynamic programming beam search method. An utterance is decoded left-to-right by progressively aligning its segment-code sequence with the multiple transcriptions found in the sentence transcription graph, a directed graph whose nodes are associated with certain word categories; paths in the graph correspond to sentences allowed by the grammar. The alignment process is done one by one on a uniprocessor. After the decoding process reaches the end of the segment-code sequence, the complete sentence transcription yielding the highest-scoring alignment with the complete code sequence is found. Since the PD uses a left-to-right beam-search algorithm which discards partial transcriptions of the speech based upon their scored likelihood, the search is not exhaustive; thus, the PD does not always find the best-scoring sentence. In general, the PD based on the dynamic programming beam search method works well in relatively small domains, but faces the scalability problem as the size of the application domain grows.
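The pruning behavior just described can be sketched generically: a left-to-right beam search extends partial hypotheses one input code at a time and keeps only the best-scoring ones. This is an illustration of beam search over partial transcriptions in general, not the PD's actual implementation; the extend and score callbacks and the beam width are placeholders standing in for the transcription graph and its alignment scoring.

    def beam_search(codes, extend, score, beam_width=10):
        # Each hypothesis is a (partial_transcription, score) pair.
        beam = [((), 0.0)]
        for code in codes:
            candidates = []
            for hyp, s in beam:
                # extend(hyp, code) yields the continuations permitted
                # by the grammar (here, an abstract callback)
                for nxt in extend(hyp, code):
                    candidates.append((nxt, s + score(nxt, code)))
            # prune: keep only the best-scoring partial transcriptions,
            # which makes the search fast but no longer exhaustive
            candidates.sort(key=lambda c: c[1], reverse=True)
            beam = candidates[:beam_width]
        return max(beam, key=lambda c: c[1])[0]

Because a partial transcription discarded early can never be recovered, the globally best sentence may be lost, which is exactly the non-exhaustiveness noted above.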
2.1.2 Integrated Speech and Natural Language Understanding Research

The previous subsection described approaches concentrating on the problems of acoustic representation and modeling, the usual focus of speech recognition research. However, acoustic-modeling research alone will not reach human-like performance on large-vocabulary tasks, because humans make use of many non-acoustic sources of information, including syntax, semantics, pragmatics, dialogue, and knowledge of the speaker. It is therefore extremely important to utilize non-acoustic knowledge sources to improve speech recognition. Previous research efforts have been made towards integrating linguistic information with speech recognition using various levels of knowledge sources. The following describes the major knowledge sources used by different speech systems.

Syntactic Grammars

A syntactic grammar first divides all words into syntactic categories. Instead of using transition probabilities between word pairs or triples, a syntactic grammar specifies all possible sequences of syntactic word categories for a sentence [97]. A set of acceptable sentences can be specified in terms of a finite-state grammar [55], a context-free grammar [65], or a unification grammar [33]. The selected grammar is then integrated into the speech recognizer so that only legal sequences are hypothesized and evaluated. Paeseler [69] applied a chart-parsing algorithm to speech recognition. Seneff [82] developed a probabilistic syntactic parser (TINA) for speech understanding systems.

Semantic Grammars

Semantic grammars have been the most popular form of sentential information encoded in a speech recognition system. The grammar rules are similar to those of syntactic grammars, but words are categorized by a combination of syntactic class and semantic function. Only sentences that are both syntactically well formed and meaningful in the context of the application will be recognized by a semantic grammar. Semantic grammars express stronger constraints than syntactic grammars, but also require more rules; compared to syntactic grammars, they are even more difficult to construct by hand. Nevertheless, most speech systems which use high-level knowledge have chosen semantic grammars as their main sentential knowledge source [47], [98]. Some speech recognition systems emphasized semantic structure while minimizing syntactic dependencies [32]. This approach results in a large number of choices due to the lack of appropriate constraints, and recognition performance therefore suffers from the increased ambiguity.

Contextual Knowledge

While not many speech recognition systems account for constraints beyond the sentence level, some systems utilize knowledge beyond single sentences. Barnett [5] describes a speech recognition system which uses a thematic memory. The system keeps track of previously recognized content words and predicts that they are likely to recur. The possibility of using a dialogue structure is mentioned by Barnett, but no results or implementation details are reported. The thematic memory idea was picked up again by Tomabechi and Tomita [88], who demonstrated an actual implementation in a frame-based system. Both speech recognition systems use an utterance to activate a context; this context is then transformed into word expectations which prime the speech system for the next utterance.

Biermann et al. [7] implemented a system that used a dialogue feature to correct errors made in speech recognition. Their system was strictly history-based: it remembered previously recognized meanings (i.e., semantic structures) of sentences. If the currently analyzed utterance was semantically similar to one of the stored sentence meanings, and the system was at a similar state of the dialogue at that time, the stored meaning could be used to correct the recognition of the new utterance. The history constraint was only applied after a word recognition module had processed the utterance, in an attempt to correct possible errors.

Strategy knowledge was applied as a constraint in the voice chess application of Hearsay-I [75]. Specifically, the principles of using a user model, task semantics, and situational semantics were applied as constraints. The task domain was defined by the rules of chess, and the situational semantics were given by the current board position. Based on this information, a list of legal moves could be formulated which represented plausible hypotheses of what a user might say next. In addition, a user model was defined to order the moves in terms of their appropriateness.
Young et al. [101] implemented the MINDS system, which exploits knowledge about users' domain knowledge, problem-solving strategy, goals and focus, as well as the general structure of a dialogue, to constrain speech recognition down to low-level speech processing. Contextual knowledge sources were used predictively to circumscribe the search space for word candidates.

2.1.3 Parallel Processing Approach

Applying parallel processing to speech understanding is a new approach. The majority of research on parallel parsing emphasizes written language, although some recent work has addressed integrated speech understanding. Huang and Guthrie [39] proposed a parallel model for natural language parsing based on a combined syntax and semantics approach. Waltz and Pollack [92] investigated parallel approaches under paradigms related to the connectionist model. Giachin and Rullent [25] implemented a parallel parser for spoken natural language on a Transputer-based distributed architecture. They used a case-frame-based parsing scheme and reported a sentence recognition accuracy of about 80% on continuously-uttered sentences, on average 7 words long, with a dictionary of 1000 words. While considerable progress has been made in speech understanding, the state of the art is still far from providing real-time speech understanding on broader domains.

2.2 Parallel Marker-passing Systems

A parallel marker-passing system consists of a semantic network knowledge representation scheme and a parallel inferencing mechanism. The semantic network knowledge base contains knowledge about a particular domain and is organized hierarchically. Inferencing operations are defined in terms of 1) the state of concepts, and 2) operations involving global or local exchange of information between concepts. This is consistent with a memory-based reasoning approach [86], which is well suited to parallel processing.

2.2.1 Semantic Network Representation

The issue of knowledge representation is central to knowledge processing. One commonly used approach is the semantic network representation, first introduced to the AI community as a semantic memory by Quillian [71] in 1968. Semantic network systems can represent a broad range of knowledge and support various reasoning mechanisms.

Semantic networks express knowledge in terms of concepts, their properties, and the hierarchical relationships between concepts. Concepts are connected to their subsuming concepts (super concepts) by is-a or subsumption links, and to their properties by appropriately labeled links. Concept nodes are organized hierarchically; that is, more abstract or general concepts are placed higher in the hierarchical representation. Properties of more abstract concepts are inherited by all concepts subsumed by higher concepts, except when specifically overridden, so that common properties do not have to be specified for each concept individually.

As an illustration, consider a semantic network for the concept computer in Figure 2.1. In this example, computer subsumes parallel_computer, and parallel_computer has the property (has-part multiple_processors). parallel_computer is an instantiation of computer in the concept hierarchy, and inherits computer's properties: (has-part memory) and (has-part io-device).

To solve realistic AI problems, such as those encountered in natural language processing, a very large knowledge base is required; in practice, tens of thousands of nodes are needed even for a limited domain. Since a semantic network knowledge base maintains the permanent knowledge used in the reasoning system, it can be regarded as a static infrastructure which allows the transfer of information between the background-knowledge concepts of the domain. Notable systems using semantic networks include FRL [80], KRL [9], NETL [24], KL-ONE [10], and others.

[Figure 2.1: An example of semantic network. The hierarchy descends from abstract thing through machine and computer to parallel_computer and sequential_computer, with instances such as SNAP, SUN Sparc, and NeXTstation; computer carries has-part links to memory and io-device.]
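To make the representation of Figure 2.1 concrete, here is a minimal sketch of such a network: nodes with is-a links and property inheritance along the subsumption chain. The node and property names follow the computer example above, but the class layout itself is an illustrative assumption, not the representation used by any particular system in this chapter.

    class Node:
        def __init__(self, name, isa=None):
            self.name = name
            self.isa = isa       # link to the subsuming (super) concept
            self.props = []      # property links, e.g. ("has-part", "memory")

        def properties(self):
            # Inheritance: collect own properties plus those up the is-a chain.
            inherited = self.isa.properties() if self.isa else []
            return self.props + inherited

    computer = Node("computer", isa=Node("machine"))
    computer.props = [("has-part", "memory"), ("has-part", "io-device")]

    parallel_computer = Node("parallel_computer", isa=computer)
    parallel_computer.props = [("has-part", "multiple_processors")]

    # parallel_computer inherits memory and io-device from computer:
    print(parallel_computer.properties())
    # [('has-part', 'multiple_processors'), ('has-part', 'memory'),
    #  ('has-part', 'io-device')]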
In practice, tens of thousands | of nodes are required even for a limited domain. Since a semantic network knowl- ' edge base maintains the permanent knowledge used in the reasoning system, it can be regarded as a static infrastructure which allows the transfer of information between concepts in the domain consisting of background knowledge. Notable systems using semantic networks are FRL [80], KRL [9], NETL [24], KL-ONE ■ [10], and others. abstract thing anim al information is-aT Chuman machine memory uses ■► (com puter io-device parallel^ computej, sequential computer SNAP ^5U N Span^ ^Nextstatioi Figure 2.1: An example of semantic network. 2.2.2 Parallel Inferencing M echanism A semantic network knowledge base is well suited to parallel inferencing mech- , anisms, which are based on marker-passing. Marker-passing is a technique de- | veloped for utilizing parallelism to find connections between objects in semantic ; i networks. Markers are data patterns associated with each node and act as dy- | namic agents of inference to exchange information. Markers are used to represent properties of nodes, membership in different sets, and reflect the state of a hypoth esis as they travel in parallel through semantic networks. The basic inferencing mechanism in which markers operate is explained as follows: • Selection Phase: Set initial conditions by placing markers on origin nodes corresponding to input objects. • Propagation Phase: Markers are propagated to other nodes through con nected links. While initiated on a global basis, markers move autonomously and only need to be propagated under local control. • Collection Phase: After propagation term inates, the marked nodes are re- ■ trieved and used for the inference by inspecting the relationship between those objects of concern. While the semantic network represents the permanent knowledge, tem porary ' knowledge is represented as markers that are attached to nodes, and marker- passing is the mechanism which changes the state of the knowledge base. When- ever a marker encounters new nodes, it may change the state of knowledge as sociated with these nodes. Complex reasoning operations can be achieved by controlling the movement of markers through semantic networks as determined i by propagation rules which are attached to each marker. Propagation rules al low markers to individually select which paths to follow and which to avoid. To quantify properties, markers also carry a floating-point weight which is used to ' evaluate alternative hypotheses encountered during processing. Marker-passing was first introduced by Quillian [71] as a spreading activation 1 ! on semantic memory. Fahlman designed a hardware implementation of marker- passing scheme, called NETL [24]. More elaboration on path finding and its ap plication to inferencing was done by Charniak [12], [13]. Since Charniak’s work, marker-passing has become an accepted technique for inferencing in natural lan guage processing. Norvig [66] used a constrained marker-passing scheme using ! specific link types to prevent irrelevant propagation. Hendler [35] proposed a i method which filters out the paths violating certain restrictions. Yu and Sim- I j mons [104] presented a scheme which controls marker propagation by specifying complex conditions on markers. J i The major advantage of marker-passing is its genuine parallelism which al lows us to implement it in parallel machines. 
The introduction of parallelism can solve the problem which most knowledge-based AI systems implemented on serial machines encounter: adding a new piece of knowledge would degrade the perfor mance of a system unlike in cases of human beings. In parallel computation, it is possible to assign more processing elements to the new piece of knowledge, and . prevent the whole system from slowing down. 2.3 M em o ry -b a sed P arsin g The idea of representing knowledge as a network of concepts and performing inference by passing markers on the network introduced a new approach to natu- | ral language processing, called memory-based parsing. Marker-passing techniques provide a reasoning mechanism for memory-based parsing approaches. Memory- ' based parsing emphasizes a large case memory over sophisticated parsing rules I i or grammars. Parsing is viewed as a memory-intensive process which identifies patterns in the memory network to provide an interpretation based on embedded syntactic and semantic relations. Processing is performed using a large number of i markers which propagate concurrently through the memory network. Due to the j use of a large semantic network and many markers, this approach is amenable to parallel computer systems. Memory-based parsing can be considered as an application of memory-based : reasoning [86] and case-based reasoning [79] to language understanding. This ' view, however, differs from the traditional idea of extensive rule application to build up a meaning representation. The memory-based parsing approach was first introduced by Riesbeck and Martin [77] as Direct Memory Access Parsing (DMAP), where parsing is performed directly in the knowledge base by marker- passing memory search. Based on DMAP, similar memory-based systems such as DMTRANS [87] and <&Dm-Dialog [41], [42] were developed. In 3>Dm-Dialog, Kitano applied memory-based parsing in a Japanese-to-English speech translation system. As shown in Figure 2.2, in a memory-based parsing system, linguistic knowl edge is stored in memory as phrasal patterns called concept sequences such as [a g e n t, m arry, o b je c t]. A concept sequence is the basic building block in memory-based parsing, and represented by a cluster of a concept sequence root (CSR) node and a set of concept sequence element (CSE) nodes. For example, as shown in Figure 2.2, m arry-event is a CSR node, and ag en t, m arry & o b je c t are CSE nodes. CSR and CSE nodes are connected by first, next and last links. The knowledge base is a hierarchically organized semantic network of concepts and concept sequences. When an input sentence is accepted, the linguistic pattern in the input sen tence is matched against stored concept sequences in the knowledge base by a sequence of marker-passing operations. The parsing algorithm is based on top- down prediction and bottom-up activation. In other words, the algorithm relies on marking concepts with prediction and activation markers, and advancing pre diction markers when prediction markers collide with activation markers. At the beginning, all CSRs are predicted as potential hypotheses, and their first CSEs and all subsumed concepts in the knowledge base are also predicted as hypotheses as well. When an input word is accepted, the corresponding node in the knowledge base is activated. Activation markers are propagated up to the subsuming nodes in the knowledge base. 
W hen a predicted CSE receives an activation marker, the prediction marker is advanced to the next CSE and propagated to all subsumed concepts. This process repeats with a new input ( 16 I 1 I (agent, MARRY, object] many-evenO first next next object MARRY 'agent I i t agent human human, Text input married Tom Susan Figure 2.2: An example of memory-based parsing. [ ; I ' word. Even though many candidates are predicted at the beginning, only a few ■ candidates are activated, and further narrowed down as more input words are processed. Once a whole concept sequence is recognized, an interpretation for the : sentence is generated, and stored in the semantic network knowledge base. ( Figure 2.2 shows an example of memory-based parsing. By applying the above i parsing algorithm, the concept sequence: m a r r y -e v e n t is recognized from the . ’ input sentence: “Tom m a rried S u san ”. As a result of parsing, the corresponding , I i semantic interpretation: [m a r r y -e v e n t# l ( a g e n t : Tom#2) ( o b j e c t : S u sa n # 3 )] ; , is generated in the knowledge base. I I t 2.4 S N A P P a ra llel C om p u ter i . In the previous sections, the concepts of semantic network and marker-passing * have been explained. Based on these concepts, memory-based parsing approaches ! have also been explained. Although these approaches are suitable to parallel I I 17 1 i i processing, most of the previous work in this area has been simulated on sequential computers, because the parallel hardware implementation supporting the above concepts was not available. 1 The three year SNAP (Semantic Network Array Processor) project [61], [62], I [63], funded by NSF was started in 1990 at USC to develop a specialized massively J parallel computer to provide realistic solutions to knowledge processing problems. ! SNAP is a highly parallel array processor fully optimized for semantic network processing with a reasoning mechanism based on parallel marker-passing. In order ! to facilitate efficient propagation of markers and to ease development of applica tions, a set of marker propagation instructions has been microcoded. SNAP sup- i ports propagation of markers containing bit-vectors, address and numeric value. By limiting the content of the markers, significant reduction in cost and resource ! has been attained without undermining performance requirements for knowledge processing. Several AI applications such as PASS [19] (a speech understanding system), PARALLEL [14] & DMSNAP [43] (natural language processing systems), ; and PALKA [40] (a semantic knowledge acquisition system) have been developed on SNAP. There are some other efforts on parallel implementation of marker-passing ap- : proaches. As introduced previously, Fahlman was one of the first to propose a massively parallel implementation of a marker-passing scheme (NETL), but the , system was actually simulated in software. The Connection Machine was devel oped by Hillis [37] as a large scale SIMD machine capable of performing the types ' of operations Fahlman envisioned, as well as other tasks outside of AI. The CM-2 i contains up to 64,000 single-bit processors, with each processor having 4K bits of memory and a serial ALU. We modeled various inferencing algorithms for seman- ’ tic network processing on the CM-2 as a comparison study [18]. The performance comparison between SNAP and CM-2 was reported in [23]. 
A hardware/software ; system specifically developed for inferencing operations using semantic networks was designed and constructed by Higuchi [36]. The processor is called IXM-2 and contains 64 Inmos Transputers, each with attached associative memory. IXM-2 provides advantages of both an array processor architecture and associative mem ory. Other applicable parallel processors for AI include the Fairchild AI machine (FAIM-1), developed by researchers at Fairchild labs, which featured a scalable ' hexagon mesh [2]. i 2.4.1 SN A P -1 prototype For the performance evaluation and the testing of the original SNAP design, a prototype machine has been built. The SNAP-1 prototype [23] is based on a j multiprocessing array and a dual-processor array controller, as shown Figure 2.3. | The array stores a semantic network of up to 32K nodes and 320K links. The SNAP-1 array consists of 144 Texas Instrum ents TMS320C30 DSP micro- j ; processors, which act as Processing Elements (PEs). The array is organized as 32 tightly-coupled clusters of 4 to 5 PE each1. Each cluster manages 1024 semantic network nodes. The clusters are interconnected via a modified hypercube network. 1 The array controller consists of two pipelined TMS320C30 microprocessors, called the program control processor (PCP) and the sequence control processor (SCP). The PC P executes the SNAP application program, and controls the overall program flow. The SNAP instruction stream is passed to the SCP via a FIFO ’ queue implemented by a dual-port RAM. The SCP is responsible for instantiating operands in each SNAP instruction. It sequences and broadcasts the instructions for parallel execution in the array. When the pipeline is empty, housekeeping is performed, including node management and garbage collection. The controller interfaces the array with a SUN4/280 host where application ■ programs are written and compiled in SNAP-C using libraries provided for marker- passing. The parallel operations are initiated through a global bus from the con- j troller which begins each propagation cycle by broadcasting marker instructions to ! the array. The m ajority of computation is performed locally though the propaga tion of markers within the cluster. Several instructions and multiple propagations , can be performed simultaneously to explore multiple paths in parallel. 1 P r e s e n t l y , 1 6 c l u s t e r s a r e i m p l e m e n t e d i n t h e f u l l 5 P E c o n f i g u r a t i o n , w h i l e t h e r e m a i n i n g 1 6 c l u s t e r s h a v e 4 P E s e a c h , t o t a l i n g 1 4 4 P E s . 19 Host Computer Hardware Software Environment Environment Program development using SNAP instruction set Host Physical Design SUN 4/280 SNAP-1 Controller VME Bus Controller Compiled SNAP code Program Control Processor Sequence Control Processor ..Qo.e.9y^sKe.board SNAP-1 Array Custom Backplane 144 Processor Array Knowledge base SNAP instruction execution Eight 9U-size boards Four clusters per board our to five processors per cluster Figure 2.3: SNAP-1 prototype [23]. 2.4.2 Instruction Set j A set of 30 high-level instructions specific to semantic network processing are ; implemented directly in the hardware. The controller performs global inferencing I by issuing instructions to all processors in the array. The SNAP instructions are divided into seven core operations: node maintenance, marker maintenance, | logical, search, marker, marker supplemental, and retrieval instructions. 
The instruction set can be called from C language so that users can develop applications with an extended version of C language. From the programming level, SNAP provides a data-parallel programming environment similar to C* of the Connection Machine, but is specialized for semantic network processing with marker passing. 20 Instruction Argument Action CREATE DELETE s-node, <relation>, <weight>, e-node s-node, <reIatioin>> e-node Create new <relation> with <weight> between s-node & e-node. Delete <relation> between s-node & e-node. MARKER-CREATE MARKER-DELETE marker, <s-relation>, e-node, <e-relation> marker, <s-relation>, e-node, <e-relation> Create new <s-relation> between nodes with marker and e-node. <e-relation> is the relation type for e-node. Opposite of marker-create. TEST AND OR NOT marker-1, <marker-2>, <value>, <cond> marker-1, marker-2, < marker-3>, <func> marker-1, marker-2, < marker-3>, <func> marker-1, <marker-2> For all nodes with marker-1, set marker-2 if marker-1 <value> meets <cond>. For all nodes with marker-1 and marker-2, set <marker-3>. The marker values are handled by <fiinc>. For all nodes with either marker-1 or marker-2, set <marker-3>. For all nodes with marker-1, invert <marker-2>. SEARCH node, <marker>, <value> Set <marker> with <value> in node. PROPAGATE marker-1, <marker-2>, <rule>, <func> For all nodes with marker-1, begin propagating <marker-2> with <rule> and <func>. SET-MARKER-VALUE CLEAR-MARKER FUNC-MARKER marker, <value> marker marker, <func> For all nodes with marker, set <value>. Clear marker in all nodes. Assign a node function for dealing with duplicate markers in all nodes with maker. COLLECT-MARKER COLLECT-RELATIO N marker marker, relation Get results from all nodes with <marker>. Get relation information from all nodes with marker. Figure 2.4: SNAP instruction set [62]. 21 2.4.3 Propagation R ules Several marker propagation rules are provided to govern the movement of markers. Marker propagation rules enable us to implement guided, or constrained marker passing. This is done by specifying the type of links that markers can propagate. The following are the propagation rules of SNAP: • SEQ(R1, R2): the SEQuence propagation rule allows the marker to propa gate through R1 once, then through R2 once. • SPREAD(R1, R2): the SPREAD propagation rule allows the marker to traverse through a chain of R1 links. For each cell in the R1 path, if there exists any R2, the marker switches to R2 link and continues to propagate until the end of the R2 link. • C0M B(R1, R2): the COMBine propagation rule allows the marker to prop agate to all R1 and R2 links without limitation. END-SPREAD(R1, R2): This propagation rule is the same as SPREAD except that it marks only the last cells in the paths. END-C0M B(R1, R2): This propagation rule is the same as COMB except that it marks only the last cells in the paths. 2.4.4 K now ledge R epresentation on SN A P For knowledge representation, SNAP provides the following elements: node, link, node color, link weight, and marker. The nodes represent concepts, while the links represent relationships between connected nodes. The node color indicates the type of node, and is used for set classification of nodes. The link weight indicates the strength of inter-concept relations. A marker consists of a marker bit, source address, and marker value. 
While the semantic network represents permanent knowledge, temporary knowledge is represented as markers attached to nodes, and marker-passing is the mechanism which changes the state of the knowledge base. Link weights and node values provide a strength indicator for relations and nodes, serving as a measure of belief during multiple hypotheses resolution.

Chapter 3

PASS: A Parallel Marker-passing Approach for Speech Understanding

As described previously, the objective of this thesis is to develop a parallel computational model based on the marker-passing paradigm for speech understanding. In this chapter we present the PASS system, which understands speech using techniques that exploit parallelism to increase processing speed and tractable domain size.

3.1 System Overview

As shown in Figure 3.1, PASS contains the natural language understanding (NLU) module, the speech understanding (SU) module, and the knowledge base. The inputs to PASS are provided by the Phonetic Engine manufactured by SSI. The Phonetic Engine provides a stream of input phonetic codes for processing. It can perform signal processing on speaker-independent continuous speech in real time.

The input codes provided by the Phonetic Engine are evaluated in the SU module to find the matching phoneme sequences. The predictions provided by the NLU module allow the SU module to handle multiple hypotheses efficiently. The NLU module guides the scope of the search space. Word candidates activated by the SU module are further evaluated in the NLU module to construct meaning representations and generate a sentence output. The predictions and activations are performed in parallel by markers throughout the knowledge base.

[Figure 3.1: The PASS environment. The Phonetic Engine converts speech input into phonetic codes for the SU module; the SU and NLU modules exchange phoneme and word predictions and activations through the knowledge base (KB), and the NLU module produces the sentence output.]

3.1.1 Hierarchical Knowledge Base

We use hierarchically organized knowledge bases to support close interaction between several levels of knowledge sources. A concept sequence (CS) is a basic building block in memory-based parsing. Each CS represents the underlying meaning of a possible phrase or sentence within a domain. In each CS, concept sequence element (CSE) nodes are connected by first, next and last links. Similarly, in each phoneme sequence (PS), phoneme sequence element (PSE) nodes are connected by p-first, p-next and p-last links to form words. More general CSs are placed at higher levels, and more specific CSs are placed at lower levels. This type of memory network is called a CS hierarchy. Phoneme sequences, which are attached to the corresponding concept nodes, reside at the lowest level of the concept sequence hierarchy.

Figure 3.2 shows part of a hierarchical knowledge base from the Air Traffic Control (ATC) domain developed for training air traffic controllers [68]. We have adapted the ATC domain to support a vocabulary of approximately 200 words using a hierarchical semantic network of approximately 1400 nodes. In Figure 3.2, tiger-616 is a CS root node and tiger, six and sixteen are the corresponding CS element nodes. Phoneme sequences are attached to these CS element nodes as shown in Figure 3.2.

[Figure 3.2: Hierarchical knowledge base including the phoneme sequence level (ATC domain).]
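As a rough illustration of how such a hierarchy could be laid down in memory, the sketch below records the tiger-616 fragment of Figure 3.2, together with the phoneme sequence for tiger used later in this chapter (t t< A' y g g< e r). The connect helper and the link constants are hypothetical stand-ins for the SNAP node and link creation instructions.

#include <stdio.h>

typedef enum { FIRST, NEXT, LAST, P_FIRST, P_NEXT, P_LAST } link_t;
static const char *link_name[] =
    { "first", "next", "last", "p-first", "p-next", "p-last" };

/* hypothetical helper: record a typed link between two named concepts */
static void connect(const char *from, link_t type, const char *to) {
    printf("%s --%s--> %s\n", from, link_name[type], to);
}

int main(void) {
    /* CS root tiger-616 and its concept sequence elements */
    connect("tiger-616", FIRST, "tiger");
    connect("tiger",     NEXT,  "six");
    connect("six",       NEXT,  "sixteen");
    connect("tiger-616", LAST,  "sixteen");

    /* phoneme sequence t-t<-A'-y-g-g<-e-r attached under concept tiger */
    connect("tiger", P_FIRST, "t");
    connect("t",  P_NEXT, "t<");
    connect("t<", P_NEXT, "A'");
    connect("A'", P_NEXT, "y");
    connect("y",  P_NEXT, "g");
    connect("g",  P_NEXT, "g<");
    connect("g<", P_NEXT, "e");
    connect("e",  P_NEXT, "r");
    connect("tiger", P_LAST, "r");
    return 0;
}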
A complete CS hierarchy for increase-speed-event in the ATC domain is shown in Figure 3.3. A layered structure makes it possible to process knowledge from the phonetic level to the contextual level by representing knowledge using a layered memory network. After phonemes are processed and word hypotheses are formed, linguistic analysis can be performed based on the syntactic, semantic and contextual constraints embedded in the knowledge base.

[Figure 3.3: A complete CS hierarchy for increase-speed-event in the ATC domain, covering events such as turn-right, turn-left, climb, maintain and increase-speed, down to the word-level nodes of the target sentence "Tiger six sixteen, increase speed to one five zero knots".]

3.1.2 Basic Algorithm

As shown in Figure 3.1, the SU module performs phoneme prediction and phoneme activation, while the NLU module performs word prediction and word activation. The algorithm is based on a combination of top-down prediction and bottom-up activation. Top-down prediction locates candidates to be evaluated next. Bottom-up activation locates possible sets of phonemes from the given input codes.

The basic parsing mechanism using prediction and activation is shown in Figure 3.4, where a simplified CS hierarchy is depicted. Figure 3.1 shows a circular path existing between the NLU module and the SU module: word prediction -> phoneme prediction -> phoneme activation -> word prediction. The operation starts from the NLU module by predicting the first CSEs in possible CSs. This, in turn, prompts the prediction of the first PSEs for these CSEs in the SU module, as shown in Figure 3.4-a. Next, the SU module accepts a phonetic code from the Phonetic Engine and activates all relevant PSEs. The collision of prediction and activation in PSEs triggers further phoneme predictions, as shown in Figure 3.4-b and c. This process repeats until the last PSE is activated in each PS candidate, as shown in Figure 3.4-d. This implies that a new word hypothesis is formed, and the process moves to the NLU module by activating the corresponding CSE and predicting the next CSE in turn, as shown in Figure 3.4-e and f. Here, through a process similar to the one at the PS level, the collision of prediction and activation triggers further word predictions. These prediction and activation operations are performed in parallel by markers on the circular path until all input codes are processed. Once a whole CS is recognized, a CS instance (CSI) is generated as an interpretation of the speech input and stored in the knowledge base.

As shown in Figure 3.5, a sequence of input codes (details of phonetic codes will be explained in the following subsections) is matched against the CS increase-speed-event in the CS memory using the above algorithm, and the CS instance (CSI) increase-speed-event#3 is dynamically generated in the CSI memory as a result of parsing. The CS and CSI are connected by a CS instance link. From the CSI, the sentence output "Tiger six sixteen, increase speed to one five zero knots" is generated, as shown in Figure 3.5.

[Figure 3.4: Parsing based on prediction and activation. Panels: (a) initial prediction; (b) collision of prediction and activation at the first PSE; (c) movement of prediction in the PS; (d) collision of prediction and activation at the last PSE; (e) collision of prediction and activation at the first CSE; (f) movement of prediction in the CS.]

[Figure 3.5: An example of a CSI generated as a result of parsing. The input phonetic codes 240 721 1282 1368 1327 93 967 1619 1066 845 631 857 989 239 576 834 1244 505 122 611 972 1002 297 1006 343 238 763 976 880 628 197 444 969 258 461 1025 1040 1575 1356 354 660 25 1387 1325 83 313 878 1176 1626 1458 518 249 are parsed into the CSI increase-speed-event#3, connected to the CS increase-speed-event by a CS instance link, yielding the output sentence "Tiger six sixteen, increase speed to one five zero knots".]
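Before turning to the speech-specific problems, the control structure of this circular path can be summarized in C. Every function below is a stub invented for illustration, standing in for the marker operations just described; the sample codes are the first few from Figure 3.5.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical hooks for the marker operations of the basic algorithm,
 * stubbed out so the control flow compiles and runs. */
static void predict_first_cses(void)       { puts("P-markers -> first CSEs"); }
static void predict_first_pses(void)       { puts("P-markers -> first PSEs"); }
static void activate_phonemes(int c)       { printf("A-markers for code %d\n", c); }
static void advance_predictions(void)      { puts("P&A collisions move P-markers"); }
static bool word_hypothesis_formed(void)   { return false; } /* placeholder */
static void activate_and_predict_cse(void) { puts("A-marker up, P-marker on"); }
static void generate_cs_instance(void)     { puts("CSI stored in knowledge base"); }

int main(void) {
    int codes[] = { 240, 721, 1282, 1368, 1327, 93 };  /* sample input codes */
    predict_first_cses();
    predict_first_pses();
    for (int i = 0; i < 6; i++) {
        activate_phonemes(codes[i]);   /* bottom-up activation            */
        advance_predictions();         /* top-down prediction advances    */
        if (word_hypothesis_formed())  /* last PSE activated in a PS      */
            activate_and_predict_cse();
    }
    generate_cs_instance();            /* once a whole CS is recognized   */
    return 0;
}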
3.2 Speech-Specific Problems in Phoneme Sequence Recognition

In this section, before presenting the complete PASS system, we explain some speech-specific problems which must be solved in the phoneme sequence recognition process. Automated understanding of natural continuous speech presents some unique problems, in addition to those already present in written natural language, such as insertion, deletion and substitution errors, and word boundary detection.

3.2.1 Insertion, Deletion and Substitution Problems

In continuous speech, matching the expected phonemic realizations of a given input involves an unknown phonemic string that may contain insertion, deletion and substitution errors. The influence of surrounding vowels, consonants and stress patterns can lead to the insertion and deletion of segments relative to the expected acoustical characteristics. Input codes may even have substituted phonemes when compared to an ideal transcription which contains only expected phonemes.

For example, some elements in the sequence of input codes may be a spurious result of incorrect segmentation (insertion). Also, some of the expected phonemes may be missing as a result of incorrect segmentation, or as a result of being omitted by the speaker (deletion). In addition, a poorly articulated segment can be identified as a different phoneme type (substitution). Figure 3.6 shows an example of insertion, deletion and substitution problems for the word tiger.

[Figure 3.6: An example of insertion, deletion, and substitution, comparing the phonemic string for the word tiger with an expected phonemic realization of a given input.]

3.2.2 Ambiguous Word Boundaries

Continuous speech recognition is considerably more difficult than isolated word recognition. First, word boundaries are typically not detectable in continuous speech. This results in additional confusable words and phrases (for example, "I scream" and "ice cream"), as well as an exponentially larger search space.

The second problem is that there is much greater variability in continuous speech due to stronger coarticulation (or inter-phoneme effects) and poor articulation. Often words are coarticulated such that their boundaries become merged and unrecognizable.
For example, in the phrases "some milk" and "six sixteen", the two words overlap without a clear boundary. The phrase "did you" becomes "didja" when poorly articulated in continuous speech.

3.3 The Alignment Scoring Model

It is difficult to correctly align input phonetic codes with target phoneme sequences because the phonetic codes contain segmentation problems such as insertion, deletion and substitution. The code/phoneme statistics collected by SSI provide the necessary information for the alignment process.

3.3.1 Alignment between Phonetic Codes and Phoneme Sequences

An initial task to be performed in speech understanding is finding the best correspondence of input codes to phoneme sequences. To evaluate each match, a codebook is used, which was derived from automatically labelled speech data collected from several speakers. The codes represent acoustic events having some ambiguity with respect to phonemes. That is, two or more successive phonemes may be time-aligned with a single code, and two or more successive codes may be time-aligned with a single phoneme. Our system accepts 1644 different speech input codes generated by the Phonetic Engine, which map to 49 phonemes. Each input code is assigned an integer between 0 and 1643.

The above can be described in terms of an alignment scoring model [4]. A sequence S consists of separate input codes c(i) and is denoted by S = {c(i) : i = 0, ..., N}. To find the sentence which produced S, the memory network is searched for a sentence transcription T = {p(j) : j = 0, ..., M} consisting of phonemes, each labeled p(j). The correspondence of S to T which maximizes the alignment score is chosen as output.

A subsequence of codes {c(i) : i = i0, ..., in}, i0 <= in, can be time-aligned with a single phoneme p(j). Conversely, a subsequence of phonemes {p(j) : j = j0, ..., jn}, j0 <= jn, can be time-aligned with a single code c(i). A possible alignment is illustrated in Figure 3.7. Here c(0) is aligned with p(0); p(1) is aligned with {c(1), c(2)}; c(3) is aligned with {p(2), p(3), p(4)}; p(4) (the last phoneme of the previous subsequence) is also aligned with c(4); c(4) (the last code of the previous subsequence) is also aligned with p(5); and finally, p(5) is also aligned with c(5).

[Figure 3.7: A possible alignment between input codes c(0)...c(5) and a target transcription p(0)...p(5).]

3.3.2 X, Y, and Z Matrices

To compute the alignment score between S and T, score values are computed for the time-aligned subsequences of S and T. The model accounts for: 1) each alignment between a code and a phoneme, 2) the number of successive phonemes aligned with the same code, and 3) the number of successive codes aligned with the same phoneme. To express the score of an alignment, three matrices are required:

• X(code, phoneme): each element x_ij is the score for aligning code i with phoneme j. The X matrix is generally known as a confusion matrix.

• Y(code, #phonemes): each element y_ij is the score for aligning code i with a number j of successive phonemes.

• Z(phoneme, #codes): each element z_ij is the score for aligning phoneme i with a number j of successive codes.
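In C, consulting these three tables might look as follows. The array names, the run-length bound MAX_RUN, and the shift from the text's 1-based counts to 0-based indices are assumptions for illustration; the dimensions and the 8-bit score type follow the text (1644 codes, 49 phonemes, scores in an 8-bit range).

#include <stdint.h>

#define NUM_CODES    1644
#define NUM_PHONEMES 49
#define MAX_RUN      8   /* assumed bound on successive codes/phonemes */

/* Scores are logarithms of probabilities scaled to 8-bit integers. */
static int8_t X[NUM_CODES][NUM_PHONEMES];  /* code-phoneme confusion  */
static int8_t Y[NUM_CODES][MAX_RUN];       /* code vs. #phonemes      */
static int8_t Z[NUM_PHONEMES][MAX_RUN];    /* phoneme vs. #codes      */

/* Score contribution for aligning one code with one phoneme, where the
 * code spans n_ph successive phonemes so far and the phoneme spans
 * n_c successive codes so far (both counts 1-based, as in the text). */
static int align_score(int code, int ph, int n_ph, int n_c) {
    return X[code][ph] + Y[code][n_ph - 1] + Z[ph][n_c - 1];
}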
Figure 3.8 shows an example of applying the X, Y and Z matrices. In Figure 3.8-a, the phonetic code 93 activates the first phonemes of the PSs bar and gull, along with other phonemes in the knowledge base, with different X matrix scores. Alignment examples of applying the Y and Z matrices are shown in Figure 3.8-b and c, respectively.

[Figure 3.8: An example of applying the X, Y and Z matrices: (a) X(code, phoneme), e.g. X(93, b) = score1 and X(93, g) = score2; (b) Y(code, #phonemes), e.g. Y(93, 2) = score3; (c) Z(phoneme, #codes), e.g. Z(A', 2) = score4 for codes 1282 and 1368.]

For each input code, alignment scores are calculated by consulting the X, Y and Z matrices, and the score of an entire utterance is the sum of the scores of the time-aligned subsequences in the utterance. For example, the score for Figure 3.7 is computed as:

Score = { X(c(0),p(0)) + Y(c(0),1) + Z(p(0),1) }
      + { X(c(1),p(1)) + X(c(2),p(1)) + Y(c(1),1) + Y(c(2),1) + Z(p(1),2) }
      + { X(c(3),p(2)) + X(c(3),p(3)) + X(c(3),p(4)) + Y(c(3),3) + Z(p(2),1) + Z(p(3),1) + Z(p(4),2) }
      + { X(c(4),p(4)) + X(c(4),p(5)) + Y(c(4),2) + Z(p(5),2) }
      + { X(c(5),p(5)) + Y(c(5),1) }

Each set of brackets indicates a subgroup of possible alignments. For instance, the first set of brackets indicates the alignment between c(0) and p(0).

The individual scores for the X, Y and Z matrices are logarithms of probabilities and/or probability ratios [4]. They are scaled (by choosing an appropriate base of the logarithm) into the range of an 8-bit integer between -128 and 127. The scores for the Y and Z matrices are offset by the single-alignment scores, such as Y(c(i),1) and Z(p(j),1), to avoid unnecessary computations during the scoring process. Actual examples of the X, Y and Z matrices we used for the experiments are shown in Appendix B.

Although we have shown the final alignment scores in the above example, actual computations include some intermediate scores. For example, in Figure 3.7, suppose we have already processed the phonetic codes c(0) and c(1), and p(1) is aligned with a single code with the score Z(p(1),1). When the next code, c(2), is accepted, a new alignment is formed with the score Z(p(1),2). In this case, to extend the alignment matching a single code to two successive codes, we only need to add Z(p(1),2), instead of also subtracting Z(p(1),1). This is possible because the individual scores for the Y and Z matrices are already offset.

Insertion, deletion and substitution can be handled in terms of the alignment scoring model [4]. Specifically, insertion problems are handled by the Z matrix, deletion problems are handled by the Y matrix, and substitution problems are handled by the X matrix.
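The offset convention can be captured in a small sketch: extending a hypothesis by one more code aligned with the same phoneme only adds new terms, with nothing subtracted. The hyp_t record and the helper below are invented for illustration and reuse the matrices of the earlier sketch.

#include <stdint.h>

extern int8_t X[1644][49], Y[1644][8], Z[49][8];  /* from the earlier sketch */

typedef struct {
    int score;    /* accumulated alignment score for this candidate */
    int ph;       /* phoneme currently being aligned                */
    int n_codes;  /* successive codes aligned with that phoneme     */
} hyp_t;

/* Extend an alignment from n to n+1 successive codes on the same
 * phoneme: because Y and Z entries are stored as offsets against the
 * single-alignment scores, only the new terms X(code,ph) + Y(code,1)
 * + Z(ph,n+1) are added, and the previous Z(ph,n) is not canceled.  */
static void extend_same_phoneme(hyp_t *h, int code) {
    h->n_codes += 1;
    h->score += X[code][h->ph] + Y[code][0] + Z[h->ph][h->n_codes - 1];
}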
3.4 The Parallel Speech Understanding Algorithm

In the previous sections, we described a system overview, the speech-specific problems, and the alignment scoring model as a foundation for handling the alignment problems. In this section, we present the complete PASS system and analyze its parallelism.

3.4.1 The Functional Structure of PASS

The complete system organization is shown in Figure 3.9. Dark arrows indicate execution flow and light arrows show information flow. The SU module performs Phoneme Prediction, Phoneme Activation, Word Boundary Detection, Insertion Control, and Deletion Control. The NLU module performs Word Prediction, Word Activation, Multiple Hypotheses Resolution, Meaning Representation Construction, and Sentence Generation.

[Figure 3.9: Modules within PASS. The Phonetic Engine feeds phonetic codes to the SU module (phoneme prediction and activation, word boundary detection, insertion and deletion control), which exchanges information with the NLU module (word prediction and activation, multiple hypotheses resolution, meaning representation construction, sentence generation) through the knowledge base of concept sequences, instances and phoneme sequences; the X, Y and Z matrices supply the scoring information.]

The knowledge base containing concept sequences and phoneme sequences is partitioned and distributed to several processing elements. The scoring information, including the X, Y and Z matrices, is maintained in a host machine or central controller.

As explained previously, the algorithm is based on a combination of top-down prediction and bottom-up activation. The circular path explained with the basic algorithm is expanded with the full set of functional modules, as shown in Figure 3.9. The new circular path existing between the NLU module and the SU module is: word prediction -> phoneme prediction -> phoneme activation -> word boundary detection -> word activation -> multiple hypotheses resolution -> word prediction.

The operation starts in the NLU module by predicting the first words in possible concept sequences. This, in turn, prompts the prediction of the first phonemes for these words. Next, the system accepts an input code as speech input, and by consulting the X, Y and Z matrices, all relevant phonemes are activated. The candidates of predicted and activated phonemes trigger further phoneme predictions. This process repeats until word boundaries are detected and new word hypotheses are formed. The movement of prediction and activation markers on PS candidates involves a complicated alignment process to handle insertion, deletion and substitution problems. Once the word hypotheses are formed, the process moves to the NLU module. Here, through a process similar to the one at the SU level, only the coincidence of predicted and activated words triggers further word predictions. Multiple hypotheses are resolved using a score-based ambiguity resolution scheme with the aid of high-level information. Details of insertion, deletion and substitution control, word boundary detection, and multiple hypotheses resolution will be presented in the following sections. (The detailed flow chart for the PASS algorithm is shown in Appendix A.)

Parallel Markers

To implement the above algorithm using marker-passing, we need approximately 30 different types of markers. Both fat-markers and simple bit-markers are used. Fat-markers carry scoring information as they move through the memory network, while bit-markers only convey the set membership characteristics of nodes. Some important fat-markers include:

• P-markers: indicate the next possible nodes (or phonemes) to be activated in the concept sequence (or phoneme sequence). They are initially placed on the first nodes of concept sequences and phoneme sequences, and move through the next (or p-next) link.

• A-markers: indicate activated nodes. They propagate upward through the concept sequence hierarchy.
• I-markers: indicate instantiations of activated nodes. Activated concept nodes are finally identified by I-markers.

• C-markers: indicate canceled nodes. Because of multiple hypotheses, some I-markers may be canceled or invalidated later on as their scores become inferior.

Parameter        Description                                                             Observed Value
#CS              total number of concept sequences                                       45
#PN              total number of phoneme nodes                                           957
ins              average number of operations introduced by insertions per iteration    2.3
del              average number of operations introduced by deletions per iteration     1.1
0 <= f_wp <= 1   fraction of concept sequences active during word prediction            0.22 (avg.)
0 <= f_wa <= 1   fraction of concept sequences active during word activation            0.71 (avg.)
0 <= f_pp <= 1   fraction of phoneme sequences active during phoneme prediction         0.37 (avg.)
0 <= f_t <= 1    fraction of phonemes above threshold                                   0.39 (avg.)
0 <= f_pa <= 1   fraction of (predicted, activated) phonemes with PA-collision          0.67 (avg.)
0 <= f_m <= 1    fraction of candidates remaining after multiple hypotheses resolution  0.65 (avg.)

Figure 3.10: Parameters affecting concurrency.

3.4.2 Program Flow and Parallelism in PASS

The motivation for a marker-passing approach is to exploit the concurrency in the speech understanding algorithm. Figure 3.11 shows the execution flow and the parallelism obtained.

[Figure 3.11: Program flow and parallelism in PASS. The per-module parallelism expressions are f_wp·#CS for word prediction, f_pp·#CS + ins + del for phoneme prediction, f_t·#PN for phoneme activation, and f_pa·min{f_pp·#CS + ins + del, f_t·#PN} for the PA-collision phase; the flow passes through word prediction, phoneme prediction, phoneme activation, insertion/deletion control, word boundary detection, word activation, multiple hypotheses resolution, meaning representation construction and sentence generation.]

The amount of parallelism depends on the size of the knowledge base and the degree of ambiguity in the input code sequence. The key variables are shown in Figure 3.10. The knowledge base size is described by #CS, which denotes the number of concept sequences, and #PN, which denotes the number of phoneme nodes in the memory network. The parameters ins and del correspond to extra operations introduced by insertions and deletions, respectively. Finally, the various fractional parameters in Figure 3.10 correspond to each phase of the algorithm. They represent the fraction of possible concurrent operations which actually occur during each phase of Figure 3.11, as described below:

1. Word Prediction: P-markers are propagated to words which can possibly appear next in the concept sequence hierarchy. During word prediction, f_wp·#CS nodes are activated.

2. Phoneme Prediction: P-markers flow down the hierarchy to the phoneme sequence level. For the initial prediction, the first phoneme in the phoneme sequence is predicted. As processing continues, subsequent phonemes within the sequence become predicted. Thus, the parallelism is determined by the number of phonemes predicted (f_pp·#CS) and the average number of operations introduced by insertions and deletions. The parallelism is dominated by the number of concept sequences rather than the number of phoneme sequences. In particular, phoneme sequences attached to corresponding concept words in each concept sequence cannot be evaluated at the same time, and processing is guided by concept sequences.

3. Phoneme Activation: for each input code, phoneme types in the X matrix are compared with a pre-determined threshold. This determines the phonemes to be activated with A-markers. Initially, a large fraction of them are activated in parallel (f_t·#PN), but many are irrelevant and rapidly eliminated. Candidates remaining are those which were predicted and activated (PA-collision).
Thus, the available concurrency will approach the lesser of the number of predicted and activated phonemes, min{f_pp·#CS + ins + del, f_t·#PN}, multiplied by f_pa, the fraction of phonemes with PA-collisions (a short sketch at the end of this subsection makes this estimate concrete).

4. Insertion/Deletion Control: the prediction window is adjusted to handle insertions and deletions. The window size is adjusted for phoneme sequence candidates containing phonemes with PA-collisions. Insertion/Deletion Control is performed along with Phoneme Prediction and Phoneme Activation for the jth iteration, based on the codes received.

5. Word Boundary Detection: when P-markers reach the last phonemes of the phoneme sequence candidates, possible word boundaries are detected. After the inner loop is completed, processing returns to the word level with the same degree of parallelism available as in Word Prediction.

6. Word Activation: A-markers are propagated up through p-last links to the corresponding concept sequence nodes. An I-marker is placed on the concept sequence element to indicate that the node is a possible instance; it will be canceled later if it is not selected. Words predicted but not activated are eliminated. The fraction of predicted and activated words, denoted by f_wa, remain as parallel candidates.

7. Multiple Hypotheses Resolution: A-markers are propagated up through the concept sequence hierarchy to the concept sequence roots and their subsuming concepts. When multiple A-markers arrive at the same node, the concept sequence candidate with the best score is selected. Steps 1-7 are performed once for each word activation. After the ith iteration, the fraction of multiple candidates remaining is f_m.

8. Meaning Representation Construction / Sentence Generation: for the selected concept sequences, instances are generated using the existing I-markers. The new concept sequence instances record a particular occurrence of a generic concept sequence, and together they form the semantic meaning representation. A natural language sentence is generated by analyzing the semantics of all concept sequence instances.

Through experiment, we also measured the parallelism available by counting the actual number of activations for each type of marker. The results are shown in Figures 3.10 and 3.12. Data was collected for the algorithm on a phase-by-phase basis for speech input codes corresponding to an entire target sentence. The results are presented for the ATC domain with #CS = 45 and #PN ~ 1000. The vertical axis in Figure 3.12 indicates the average number of concurrent operations available within each module over the sentence as a whole. An average value is reported, since many iterations of i and j are required, and the fractional parameters vary with each iteration. The horizontal axis indicates the relative number of processing steps required for each module, based on its critical path of computation. For this small knowledge base, the average parallelism ranges from 8 to 20 concurrent operations. A large spike of over 300 operations occurs at the beginning of each iteration of phoneme activation, because many candidates are initially active but then quickly fail the PA-marker collision test. Overall, about 62% of the execution flow is spent in phoneme activation, which has sustained parallelism of degree 14.
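To make the concurrency estimate of the phoneme activation phase concrete, the sketch below evaluates the expression from Figure 3.11 with the observed averages of Figure 3.10; the variable names are ours.

#include <stdio.h>

/* Observed parameters from Figure 3.10 (ATC domain). */
#define N_CS 45
#define N_PN 957

int main(void) {
    double f_pp = 0.37, f_t = 0.39, f_pa = 0.67, ins = 2.3, del = 1.1;

    double predicted = f_pp * N_CS + ins + del;   /* predicted phonemes */
    double activated = f_t * N_PN;                /* activated phonemes */
    double lesser    = predicted < activated ? predicted : activated;

    /* Concurrency during phoneme activation approaches the lesser of
     * the predicted and activated sets, scaled by the PA-collision
     * fraction (cf. the expression in Figure 3.11). */
    printf("phoneme activation parallelism ~ %.1f\n", f_pa * lesser);
    return 0;
}

With these values the estimate comes out near 13-14 concurrent operations, consistent with the sustained parallelism of degree 14 reported above.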
In particu lar, knowledge bases up to 7-fold larger size (9000 nodes) had fractional param eters close to those shown in Figure 3.10. If these parameters are relatively constant then the available parallelism will be proportional to the number of concept se quences and phoneme nodes. Thus parallelism will increase along with knowledge 1 base size, as described by the equations in Figure 3.11. In contrast with a sequen tial implementation, the weaker activated candidates do not need to be pruned early in the evaluation process. This can further improve recognition accuracy by ; providing a chance to recover from early expectations which may later turn out to be incorrect. Through experiment, a compromise can be obtained between the parallel resources required, execution speed and recognition accuracy. , i : : i 3.5 M ark er-p assin g S o lu tio n s for S p eech S p ecific P ro b lem s In this section, we present marker-passing solutions for the speech-specific prob- ! lems such as insertion, deletion, substitution, and ambiguous word boundaries. 43 , i I 1000 & o > * - § cd C L , 0 1 o u 100 10 * o M ■ o 5 = rs u a o c o .C C h P honem e A ctivation M u l t Hypo. 4% 13% |2 % |« — 12% — > | 62 % ' Execution flow [% of sequential steps] Figure 3.12: Parallelism for sample target sentence. 44 3.5.1 Insertion, D eletion and S u b stitution Problem s ■ i : Using the X, Y and Z matrices of the alignment scoring model, a sequence of input codes must be aligned with multiple phoneme sequence transcriptions. However, i ; multiple candidates should be evaluated at the same time and subsequent phoneme I activations cannot be foreseen while the current input code is being processed. | ! . . 1 I Therefore, when we advance the prediction markers, based on the current scoring ' information, we must consider how to recover from bad expectations in some ; phoneme sequence candidates, without affecting the good expectations in other | candidates. Various alignment examples are illustrated in Figure 3.13. The left-hand side [ ' f describes the state before a new code is processed, and the right-hand side de scribes the state right after phoneme candidates have been activated by the input code. Dual prediction markers, P (-l) and P(0), are used to keep the previously used P-marker as well as the current P-marker. Figure 3.13-a shows a normal ' alignment where the collision between P(0) and A exists. In this case, we can nor- ! mally advance P (-l) and P(0) to phonemes p2 and p3 to predict the next phoneme ! activations. i Dual predictions are required to handle the insertion problem. When there exists a strong activation (an activation beyond a threshold) to a phoneme with P (-l), we regard the current code input as oversegmented. Figure 3.13-c shows the insertion problem. In this case, we cannot advance P(0) to phoneme p3, because no collision exists between P(0) and A in phoneme p2. However, a new P(0) will be sent to phoneme p2 from phoneme p i to reflect the newly calculated score, by J adding the score of the A-marker. ! . . . . , | An example of the deletion problem is shown in Figure 3.13-e. Two consecutive i phonemes, p2 and p3, are activated together from code c2, but phoneme p3 does ' not contain a prediction marker. In this case, the code input is regarded as ’ undersegmented. The single code input covers phonemes p2 and p3. Therefore, P (-l) and P(0) can be advanced two steps through p-next links to predict the next ; phoneme activations. 
In Figure 3.13-g, we show a complicated alignment where both the insertion and deletion problems exist. Two consecutive phonemes, p i and p2> are activated c l c2 "'•'cl ' '*'c2 a) normal alignment P(-l) P(0) * p l *® p2 ^Wp3 P(-1)&A P(0) B SSB 0S- I \ >%--*%•- *+p 2 ► • p^ c) insertion P^-l) _P(0) ^ ^ P£-l) P(ff)&A _ _ A * * p l >® p2 * » p l ^ p 2 *J®p3^ I I S S S H S ^ T T / ■m c r m c T > ..... e) deletion P H ) P(0) P(-12&A P(0)&A ■•JpT^pa- I W 6 S $ » I \ j g) insertion & deletion * ^ P l Q f phoneme sequence ' ^ c i c2 ^ ' ^ >alt P * * 006* * 0 co^e * nPut tijr r ......p s g a r r^r.... b) normal alignment with word boundary problem through concept hierarchy through concept hierarch P(-l)& d) insertion with word boundary problem through concept P (-l)l hierarchy * f) deletion with word boundary problem through concept hierarchy through hierarchy P(0)&A h) insertion & deletion with word boundary problem P (0 ): current P-Marker PC-1): previous P-Marker A : A-Marker Figure 3.13: Alignments including insertion and deletion problems. tiger-616 first Iasi last next next next, heavy tiger -K sixteen six plast pfirst deletion t< : a sequence o f phonetic codes 721 1282 1368 1327 967 1619 1066 insertion Figure 3.14: Handling insertion and deletion problems. together from code c2. First, we detect the insertion by the collision of P (-l) | and A. When a new P(0) is sent to phoneme p2 to update the score, we detect a ; deletion by the collision of P(0) and A. The deletion implies th at there exists no more insertion to phoneme p i. Thus, we can advance P (-l) and P(0) one step to cover the deleted phoneme p2, and predict the next phoneme activations. An example of handling insertion and deletion problems is shown in Fig- | ure 3.14, where only a part of the concept sequence hierarchy is depicted. We illustrate the time alignment between the subsequence of input codes: 721 1282 ' | 1368 1327 93 967 1619 1066, and the phoneme sequence for the word tiger, t j ! t < A' y g g < e r. Assume that we have processed the subsequence of codes: 1 721 1282 1368 1327 93. The remaining codes can be handled as follows: ; 1. Code 967 is now produced from the Phonetic Engine. By consulting the X m atrix, phoneme e is activated. The scoring process is as follows: Score(P(0)) . = Previous_Score(P(0)) + Score(A), where Score(A) = X(967, e)+ Y (967,1)+; Z(e, 1). Then, P(0) is propagated to phoneme r from phoneme e through the p-next link. Phoneme e also keeps P (-l) to prepare against a possible insertion. I 2. When code 1619 arrives, phoneme e is activated again, instead of phoneme r, so an insertion exists. The insertion handling routine calculates the score of the A-marker: Score(A) = Y(1619,e) + (1619,1) + Z(e, 2), where the previous Z(e,l) need not be canceled, because scores in the Y and Z matrices only contain offset values to avoid unnecessary computations. After adding the score of A to P (-l), a new P(0) is propagated again to phoneme r. 3. When code 1066 arrives, phoneme e and phoneme ra re activated together. T hat is, both an insertion and a deletion occur at the same time. By this, | we can assume th at there exist no more insertions to phoneme e. Now, the j j score of the A-marker is calculated as: Score(A) = A"(1066, e) + K (1066,1) + Z(e, 3). Again, a new P(0) adding the score of A is propagated to phoneme r. Here, a collision between P(0) and A exists, and the scoring process for phoneme r begins. 
Because phoneme r is the last phoneme activated in the phoneme sequence, an A-marker propagates upward through the concept sequence hierarchy, and finally a new P(0) arrives at the first phoneme of the phoneme sequence for the concept six.

Substitution problems are not shown in Figure 3.14. When a substitution occurs, the score in the X matrix for the substituted code-phoneme pair will be low, perhaps even below threshold. Thus, when a substitution problem occurs in a phoneme sequence candidate, the score of the candidate is decreased. This candidate may be rejected later when other hypotheses get better scores.

3.5.2 Word Boundary Problem

The word boundary problem occurs when the last phoneme of a phoneme sequence is activated and the first phoneme of the next phoneme sequence is predicted. When no coarticulation exists between two consecutive words, it is the same as either the normal alignment or the insertion problem, except that A- and P-markers are moving through the concept sequence hierarchy instead of just through the p-next link. When a coarticulation exists between two consecutive words, it contains deletion or combined deletion/insertion problems. Some examples are shown in Figure 3.13-b, d, f and h.

We handle the word boundary problem with the aid of high-level information embedded in the concept sequence hierarchy. An example describing the coarticulated phrase six sixteen is shown in Figure 3.15. In this example, the last phoneme of the concept six and the first phoneme of the concept sixteen are located on the word boundary, and are represented by the same phoneme type s. Let us assume that the dual prediction markers P(-1) and P(0) are located on the last two phonemes of the concept six, respectively.

[Figure 3.15: Word boundary problem. On the hierarchy around tiger-616, codes 845 and 631 activate the phonemes on the boundary between six and sixteen.]

As shown in Figure 3.15, when code 845 is evaluated, the two phonemes on the boundary are activated together, indicating a deletion problem across the boundary. Because both phonemes have P&A collisions, the dual prediction markers are advanced twice, reflecting newly calculated scores. As a result, P(-1) and P(0) are located on the first and second phonemes of the concept sixteen. The next input code, 631, activates those two phonemes on the boundary again, as shown. However, only the first phoneme of the concept sixteen gets a P&A collision, indicating that the word boundary problem is resolved and the alignment process for the concept sixteen begins.

The last phoneme of each phoneme sequence contains an L-marker indicating a word boundary. During the alignment process, a word boundary can be detected by a P&A&L collision. When a word boundary is detected, A- and P-markers need to travel through the concept sequence hierarchy. In the case of six sixteen, markers need to move through the p-last, next and p-first links. When a word boundary problem exists across two different concept sequences, markers need to move multiple steps up and down through the concept sequence hierarchy, depending on where those concept sequences are located in the hierarchy. Marker propagation through the concept sequence hierarchy involves the following steps:

1. A-markers are propagated up through the concept sequence hierarchy, starting from the last element of the previously evaluated phoneme sequence.

2. P-markers are propagated in one step through the next link from the concept nodes with A-markers.
3. P-markers are further propagated down through the concept sequence hierarchy to arrive at the first elements of the next possible phoneme sequences.

It is very important to implement the above marker propagation steps efficiently, because such marker propagations through the concept sequence hierarchy occur frequently in the PASS algorithm. We have rewritten the above marker propagation steps more specifically, using the marker propagation rules defined in Chapter 2, as follows:

For all phonemes with P&A&L
    Propagate A by Seq(plast)
For all concepts with A
    Propagate A by End-Comb(isa, last)
For all concepts with A
    Propagate P by Seq(next)
For all concepts with P
    Propagate P by End-Comb(r-isa, first)
For all concepts with P
    Propagate P by Seq(pfirst)

In the above algorithm, r-isa is the reverse link of is-a. The marker propagation rules are effectively used to limit the spreading of unnecessary markers. For example, End-Comb marks only the last elements in the propagation paths, because the intermediate elements do not need to be marked in this case.
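Under an assumed C binding of the PROPAGATE instruction from Section 2.4.2, the listing above could be issued roughly as follows. The snap_propagate function and the rule constructors are illustrative stand-ins, not the documented SNAP-C interface.

#include <stdio.h>

typedef struct { const char *r1, *r2; } rule;
static rule seq(const char *r1)                     { rule r = { r1, 0 };  return r; }
static rule end_comb(const char *r1, const char *r2){ rule r = { r1, r2 }; return r; }

/* Stub standing in for the hardware PROPAGATE instruction. */
static void snap_propagate(const char *marker, rule r) {
    printf("PROPAGATE %s via %s%s%s\n", marker, r.r1,
           r.r2 ? "," : "", r.r2 ? r.r2 : "");
}

void cross_word_boundary(void) {
    snap_propagate("A", seq("plast"));             /* from phonemes with P&A&L */
    snap_propagate("A", end_comb("isa", "last"));  /* up to CS roots           */
    snap_propagate("P", seq("next"));              /* from concepts with A     */
    snap_propagate("P", end_comb("r-isa", "first"));
    snap_propagate("P", seq("pfirst"));            /* down to first phonemes   */
}

int main(void) { cross_word_boundary(); return 0; }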
As a result, the concept node gets a new I-marker containing the highest score. In Figure 3.16, the concept aircraft id was previously activated by the hypoth esis: t i g e r s ix te n , and the score of the I-marker for the node is the value 4252. Because this hypothesis was not a correct one, the activation score was poor. Although t i g e r s ix te n is apparently different from t i g e r s ix s ix te e n , it is still possible to activate this hypothesis with a low score. Basically, any phoneme with the X M atrix score greater than a threshold can be activated from an input code regardless of its meaning. So, it is critical to 1 set the threshold for each level of the hierarchy through experiment to prevent unnecessary activations without losing meaningful information. There exists a trade-off between the recognition accuracy and the speed performance. That is, these settings affect not only the accuracy but also the speed, because some unnecessary activations cause congestion at concept nodes with multiple incoming links. When a new hypothesis: t i g e r s ix s ix te e n arrives at the node aircraft id I with the score of the A-marker equal to a value of 741, the previous hypothesis is rejected because of its lower score value of 425. C-markers are simultaneously ! propagated down through all possible links in the concept sequence hierarchy 2 T h i s v a l u e i s o b t a i n e d t h r o u g h e x p e r i m e n t s , a n d i n d i c a t e s a n a c c u m u l a t e d a c t i v a t i o n s c o r e f o r t h i s h y p o t h e s i s f r o m t h e l o w - l e v e l p h o n e m e s e q u e n c e p r o c e s s i n g . I i 52 event : is-a link Increa! event ent first first 1(425) > 0 l y tiger-616 ) | next aircraft Tclimb pfirst increase. next pfirst Jjtiger-610 ) first/^^'' /N s next V S next first last last six > 5 5 l K sUt“ n heavy tiger ne; first last thai air six ne: next Figure 3.16: Multiple hypotheses resolution. 53 except the link to the newly selected hypothesis. W hen C-markers collide with I-markers during the propagation, the I-markers are canceled. To protect partially evaluated hypotheses, a C-marker in each node stops its propagation when the node does not contain an I-marker. For example, th a i a i r s ix sev en teen is still in the middle of evaluation as shown, while later, it may turn out to be the hypothesis with the highest score. Thus, the I-markers on the nodes thai, air and i six need to be retained. After resolving multiple hypotheses, P-markers are propagated to the concept nodes such as increase, climb and turn through next links. From these nodes, P-markers are further propagated down to the first phonemes of corresponding phoneme sequences. The activation, cancellation and prediction operations are j performed simultaneously for all possible hypotheses. t i I 3.7 M ark er-p assin g T ech n iq u es to Im p rove P erform an ce In the previous sections, we have described the complete PASS system, its paral- I lelism, marker-passing solutions for the speech-specific problems and the multiple : i i hypotheses resolution scheme. In this section, we present two im portant marker- i passing techniques necessary to improve performance. I 3.7.1 Triple P rediction W indow As we discussed before, the insertion problem requires th at two consecutive phonemes in a phoneme sequence be predicted at the same time. We have realized | that dual prediction is insufficient when the deletion and insertion problems are I intertwined. 
For instance, Figure 3.17 shows a new scenario for Figure 3.14. W hen the code 967 is processed, we observe that both phoneme e and phoneme r are activated together. The activation of phoneme r is very weak with the score I ! of -34. This was not shown in Figure 3.14. Actual phoneme activations depend on how we set the threshold value for the phoneme level of the hierarchy. The deletion handling routine regards this as a possible deletion problem and advances i 54 Code Activations relevant to the current hypothesis (with score) New predictions after processing current activations P(-2) P(-l) P(0) 240 #(69) # t 721 t(16) t<(22) t t< A’ 1282 A’(54) y(-40) A’ y g 1368 A’(77) y(34) A’ y g 1327 y(88) g(18) y g g< 93 g(89) g<(66) g g< e 967 e(ll) r(-34) e r s 1619 e(28) r(75) e r s 1066 e(32) m e r s 845 s(14) 1(12) s I k 631 s(82) I(-25) s I k 857 s(86) I(-19) s I k 989 1(95) k(-38) I k k< 239 k(104) k<(-37) k k< s 576 k<(25) k< s 834 k<(36) s(ll) k< s E’ Figure 3.17: Triple prediction window. P-markers two steps. However, when the code 1619 is processed, we observe that instead of phoneme s, the previous two phonemes e and r are activated with strong scores. An insertion to the second previous phoneme e exists and the deletion assumed before is wrong. As shown in Figure 3.17, a P(-2) keeps the P-marker for the second previous phoneme, and the system can recover from the wrong expectation using the P(-2). By the prediction with a window size of 3, the combined problem of deletion and insertion can be handled elegantly. 3.7.2 E xpectation A djustm ent Sometimes, activations do not conform to expectations. Whenever a new code activates a set of phonemes in the phoneme sequences, we need to adjust the j previous expectations based on the current activation scores. The expectation adjustm ent table for the sub-phoneme sequence: L-M -N is shown in Figure 3.18. The activation scores are scaled into the range of an 8-bit integer between — 128 and 127. The score of an A-marker is high if Score(A ) > 0. When the activation j scores for the prediction window are {H igh, H igh, Low}, or {H igh , Low, Low}, we assume that the previous expectation for a deletion was not correct, and adjust the prediction window as shown. When both an insertion and a deletion occur together with high scores like {H igh , H igh, H igh}, or {H igh , Low, H igh}, nei ther one can be ignored. In this case, we partition the prediction window and keep both possibilities, as shown in Figure 3.18. T hat is, we make two separate predic tion windows for the phonemes L h M and the phoneme N in the same phoneme sequence. In most cases, the inferior one will be eliminated after one or two more i input codes are processed. 3.8 S u m m ary This chapter described the parallel speech understanding system called PASS. Memory-based parsing and marker-passing paradigms are the underlying philos ophy to build the system. The algorithm is based on top-down prediction and t 56 Score o f A-Marker After Activation New Prediction Window* After Adjustment Phoneme L with P(-2) Phoneme M with P(-l) Phoneme N with P(0) Phoneme L Phoneme M Phoneme N High** High High PM) P(0) P(0) High High Low** P(-l) P(0) High Low High P(-D P(0) P(0) High Low Low PM) P(0) Low High High PM) P(0) Low High Low P(-D P(0) Low Low High P(0) Low Low Low P(0) * Prediction Window: P(-2) on Phoneme L, P(-l) on Phoneme M, P(0) on Phoneme N. 
** Score(A): Low if Score(A) < 0, High if Score(A) >= 0 i Figure 3.18: Expectation Adjustment based on the Activation Score by A-marker (sub-phoneme sequence: L-M-N). I bottom -up activation on the hierarchically organized knowledge base. We pre sented the complete working system and analyzed its parallelism. The marker- passing solutions for the speech-specific problems were provided. We also pre sented a score-based ambiguity resolution scheme. Finally, we explained the marker-passing techniques, such as triple prediction window and expectation ad justm ent, to improve performance. In Chapter 4, we will present a parallel ap proach for utilizing contextual knowledge to improve speech understanding. A new PASS system with contextual processing will be described. I 1 i 58 j C h ap ter 4 A d d in g C o n tex tu a l K n o w led g e to P A S S f i In this chapter, we present a parallel memory-based approach for utilizing contex tual knowledge to improve speech understanding. An ambiguity resolution scheme utilizing contextual knowledge is implemented using a marker-passing technique. We present an operational implementation of a new PASS system w ith contextual processing. 4.1 T h e N e e d to U se C o n te x tu a l K n o w led g e Problems in phonetic segmentation and lexical ambiguity resolution are not com- j pletely solvable by enhancements in natural language processing (NLP). There is an underlying problem of choosing the most appropriate hypothesis for grouping phonetic segments and choosing the correct word-sense from multiple hypotheses supplied by a low-level speech processing module. j Even with the help of syntactic and semantic knowledge used in many NLP parsing techniques, we suffer from the problem of ambiguities that are not possible with ordinary text inputs. This problem increases when the vocabulary of a speech understanding system enlarges and the variety of sentences th at are accepted by the system expands. Although multiple competing hypotheses can be ordered by scores obtained from low-level speech processing, the difference of scores be- | tween the competing hypotheses is often within the tolerance of the system ’s error , checking mechanism. In other words, syntactic and semantic constraints are not I sufficient to disambiguate continuous speech input, since an interpretation can be I 59 i ____________________________________________________________________________________________ totally legitimate syntactically and semantically, but can mean something drasti cally different from what has been input into the speech system, as well as being contextually inappropriate. In order to solve the problems in choosing an appropriate hypothesis for group ing phonetic segments, and for selecting the correct word-sense from multiple hypotheses, speech understanding systems need contextual knowledge, knowledge beyond the sentence level. The need for contextual knowledge in speech under standing systems is more urgent than in text input understanding systems. A combination of contextual factors influences the interpretation of an utterance. In fact, what is usually meant by “the context of an utterance” is precisely that set of constraints which together direct attention to the concepts of interest in the discourse in which the utterance occurs. Both the preceding discourse con text - the utterances th at have already occurred - and the situational context - the environment in which an utterance occurs - affect the interpretation of the utterance [27], [28]. 
There are only a few speech systems utilizing knowledge beyond the sentence level. Barnett [5] described a speech system which used a them atic memory. The system keeps track of previously recognized content words and predicts that they i I | are likely to reoccur. Tomabechi and Tomita [88] used the them atic memory j I idea in implementing a case-frame-based speech system. Both speech systems use ; an utterance to activate a context. This context is then transformed into word expectations which prime the speech systems for the next utterance. Biermann et al. [7] implemented a system that used a dialogue feature to correct errors made ! in speech recognition. It utilized previously recognized meanings (i.e., semantic structures) of sentences. Strategy knowledge was applied as a constraint in the voice chess application of Hearsay-I [75]. The principles of using a user model, task semantics and situational semantics were applied as constraints. Young et : al. [101] implemented the MINDS system, which exploits knowledge about users’ domain knowledge problem solving strategy, their goals and focus, as well as the ' general structure of a dialogue. Although various forms of contextual knowledge ! were utilized, they all belong to the classes of the preceding discourse context and 1 the situational context. 60 4.2 A p p ly in g C o n tex tu a l K n o w led g e to M em o ry -b a sed S p eech U n d ersta n d in g As discussed before, it is crucial to utilize the contextual knowledge to improve spoken language understanding. Both the preceding discourse context and the situational context are directly applicable to memory-based speech understanding. In this section, we describe how to embed the contextual knowledge into memory, and how to utilize the embedded contextual knowledge. 4.2.1 Em bedding C ontextual K now ledge into K now ledge B ase As explained in Chapter 3, a CSI is generated as a result of understanding an utterance, and stored in the CSI memory1. The preceding discourse context is acquired from the previously generated CSIs stored in the CSI memory. As shown in Figure 3.5, each CSI is dynamically connected to the corresponding CS using a CS instance link at the end of parsing. The information contained in CSIs ' provides the preceding discourse context to the CS memory through CS instance links for subsequent utterances. ; The situational context can be obtained by analyzing a given domain. For example, in the ATC domain, the knowledge about flight goals, possible sequence of actions, availability of runways, and others provide the situational context. Two contextual constraint links: the contextual preference (CP) link and the contextual I inhibition (Cl) link are used to embed the situational context into the CS memory. A contextually plausible set of concepts is connected by CP links while a set of concepts contradicting each other is connected by Cl links. In this definition, j concepts include both CS root nodes indicating events or states, and CS element 1 nodes. An overview of embedding contextual knowledge into CS & CSI memories is depicted in Figure 4.1, where for simplicity, only event level constraints are shown j with CS instance, CP and Cl links. (CS element level constraints will be explained | 1 I n o u r a p p r o a c h , t h e k n o w l e d g e b a s e c o n s i s t s o f t h e C S m e m o r y a n d C S I m e m o r y . 61 later.) The preceding discourse context is being added to the knowledge base, as more instances are generated as parsing results. 
A detailed example of embedded contextual knowledge is shown in Figure 4.2. I For example, d e c re a se -sp e e d -e v e n t and d e sce n d -an d -m a in tain -e v en t are connected by a bidirectional CP link, because these two events can fol low each other in a sequence of utterances, but d e c re a se -sp e e d -e v e n t and in c re a s e -sp e e d -e v e n t are connected by a bidirectional Cl link, because these two events are not likely to happen in sequence. 4.2.2 U tilization of Em bedded C ontextual K now ledge ! j To utilize the embedded contextual knowledge effectively in resolving the am- | . . . . . I I biguities, we introduce the idea of activeness. The activeness of a concept is ! generally defined as a degree of attention to the concept, paid by the listener, in a discourse [29], [11]. In our approach, the activeness of a concept is measured based on the contextual knowledge and indicates the possibility that the concept is spoken again in the subsequent utterances. Activeness provides an efficient way of utilizing both the preceding discourse context and the situational context I dynamically, as explained below. When a CSI is generated as a result of understanding an utterance, the CSI is ! connected to the corresponding CS, and utilized as preceding discourse context. Since new preceding discourse context is obtained, activeness is propagated from the CSI to the corresponding CS through CS instance link. Then the activeness is further propagated to other concepts in the CS memory through CP and Cl links, ! which indicate the situational context. During the propagation, the activeness of a concept is strengthened if the concept is referenced via CP link, while the activeness of a concept is weakened if the concept is referenced via Cl link. After processing a sequence of utterances, each concept in CS memory will accumulate its own activeness information. I i i 62 CSX MEMORY descend & maintain-event#9 iecrease-speed-event janding-event. j [escend & maintain-event fly-evenQ^ I tum-right-event H u- : CS Instance Link : Contextual Preference (CP) Link O : Contextual Inhibition (Cl) Link Figure 4.1: An overview of CS & CSI memories with embedded contextual knowl edge. 63 unit acid range#10, : CSI Memory hundreds id2 ,idl five#2 zero#7 six#8 cs CP link maintain event- istance CS M em ory: first last next .next/ and descend decrease" speed ^event C l link Cl link first last nexL .next. range 1 knots speed decrease. r e n t first last lasl aircraift\f56?< .next increase speed knots Figure 4.2: A detailed example of embedding contextual knowledge into the knowl edge base. i 64 4.2.3 M ultiple C ontexts and A ctiveness Marker As more CSIs are generated as parsing results, the preceding discourse context grows accordingly. However, when there exist multiple contexts in the preced ing utterances, we cannot uniformly apply the preceding discourse context to all subsequent utterances. As an example, suppose that the following utterances are spoken in sequence. • Uttr 1: TW A se v e n n in e t e e n , f l y h e a d in g t h r e e two z e r o . • Uttr 2: D e lt a f i f t y f o u r , d escen d and m a in ta in f i v e th o u sa n d f e e t . • Uttr 3: TW A se v e n n in e t e e n , tu r n r i g h t h ea d in g t h r e e f o u r s e v e n . • Uttr 4- D e lt a f i f t y f o u r , d e c r e a s e sp eed t o one t h r e e z e r o k n o t s . In this example, there are two different contexts: one with TW A se v e n n in e te e n , and the other with D e lt a f i f t y fo u r . 
Therefore, the contextual information from Uttr 1 is selectively applied to Uttr 3, while the information from Uttr 2 is applied to Uttr 4.

The activeness information is temporary knowledge dynamically generated as a result of parsing, while the CS memory is permanent knowledge. When multiple contexts exist, we need to maintain different sets of activeness information on the CS memory, and to bring up the set of information relevant to the current context.

The idea of activeness can be efficiently implemented using the marker-passing technique. In particular, activeness can be implemented by a marker carrying a value. The value of an activeness marker indicates the strength of activeness, and is increased or decreased by the link weights assigned to CP/CI links. An activeness marker is assigned for each context. Thus, on the CS memory, each concept node may contain multiple activeness markers when several contexts exist in a sequence of utterances. As soon as a new context is identified during the parsing process, the corresponding activeness marker is dynamically activated on the CS memory.

4.3 Ambiguity Resolution Based on Contextual Knowledge

Activeness markers can propagate the contextual information to the CS memory as explained before. In this section, we will explain how to utilize the activeness markers to resolve the ambiguities in speech understanding.

4.3.1 The PASS System with Contextual Processing

The overall speech understanding process including contextual processing is shown in Figure 4.3. By repeatedly applying word prediction, word activation, and ambiguity resolution operations, a CSI is generated from an input utterance, as shown in the flow chart. The functional modules added for contextual processing to the original PASS system are indicated by thick boxes in the flow chart, and are explained below.

Activeness Marker Selection

When a new concept word is identified after resolving ambiguities, the system checks whether a new context is detected. For example, in the ATC domain, the existence of a new context can be determined by obtaining a complete aircraft name like Delta fifty four. When a new context is detected, the corresponding activeness marker is dynamically activated on the CS memory.

Scoring

The strength of activeness is determined by the value contained in an activeness marker. The marker value works as a score (the activeness score), which is added to the word activation score obtained through the low-level phoneme sequence processing. We use the following markers to support the contextual processing efficiently:

• CA-markers stand for Contextual Activeness markers. CA-markers indicate concept nodes containing activeness information.
• WA-markers stand for Word Activation markers. WA-markers indicate concept nodes activated by the low-level phoneme sequence processing.
• SA-markers stand for Sentence Activation markers. SA-markers indicate competing hypotheses (CS candidates, or target sentence candidates) currently being evaluated.

Using the above markers, the scoring process can be formulated as follows:

Score(SA-marker) = PreviousScore(SA-marker) + Score(WA-marker) + Score(CA-marker)

By applying the above formula in parallel, CS candidates obtain new activation scores reflecting contextual information. An example of the scoring process will be presented in the following subsection, and a small sketch of the score update follows below.
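The following fragment is a minimal sketch of the scoring rule applied to every competing hypothesis. On SNAP this update happens in parallel across processing elements; here a sequential loop stands in, and the parallel-array layout is our own simplification. The numeric values are taken from the example in the next subsection.

    #include <stdio.h>

    #define NUM_CANDIDATES 2

    /* Sketch of Score(SA) = PreviousScore(SA) + Score(WA) + Score(CA),
       applied to each competing hypothesis. */
    int main(void) {
        const char *cand[NUM_CANDIDATES] = { "decrease-speed-event",
                                             "increase-speed-event" };
        int sa[NUM_CANDIDATES] = { 658, 658 };  /* previous SA scores      */
        int wa[NUM_CANDIDATES] = {  82,  85 };  /* word activation scores  */
        int ca[NUM_CANDIDATES] = {  30, -30 };  /* contextual activeness   */

        for (int i = 0; i < NUM_CANDIDATES; i++) {
            sa[i] += wa[i] + ca[i];             /* the scoring formula     */
            printf("%-24s SA = %d\n", cand[i], sa[i]);
        }
        return 0;                               /* prints 770 and 713      */
    }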
Activeness Marker Propagation

When a complete CSI is generated at the end of parsing, the corresponding activeness scores are updated by propagating CA-markers. Propagation of CA-markers is performed in parallel for all competing candidates along the following links:

• Step 1: the CS instance link.
• Step 2: CP and CI links; activeness scores are adjusted based on the link weights.
• Step 3: combinations of first, next, and r-isa (reverse is-a) links, down to the CS element nodes.

[Figure 4.3: System flow chart. The loop proceeds: new utterance, word prediction, word activation, scoring, ambiguity resolution, activeness marker selection; at the end of the utterance: concept sequence instantiation, then activeness marker propagation.]

4.3.2 An Example of Ambiguity Resolution Using Contextual Knowledge

To show how the contextual knowledge helps to resolve competing candidates, let us consider the following example from the ATC domain. Figure 4.4 shows the overall working mechanism. Suppose tiger-616 is approaching the airport for landing and the following utterances were already processed:

• Uttr 5: Tiger six sixteen, decrease speed to one five zero knots.
• Uttr 6: Tiger six sixteen, descend and maintain five thousand feet.

As shown in Figure 4.4, CSIs such as decrease-speed-event#6 and descend-and-maintain-event#9 were generated as parsing results. At the end of each successful parsing, an activeness marker is propagated to the CS memory. For example, from decrease-speed-event#6, an activeness marker is propagated to the following concepts step by step:

• Step 1: decrease-speed-event via the CS instance link.
• Step 2: descend-and-maintain-event via a CP link, increase-speed-event via a CI link.
• Step 3: the corresponding CS element nodes via first, next and r-isa links from the above CS root nodes.

As explained previously, an activeness marker carries a marker value to indicate the strength of activeness. In this example, for simplicity of explanation, the initial marker value of each activeness marker is set to 0; the link weights of CP and CI links are uniformly set to 10 and -10, respectively.2 The marker value is increased by 10 when the marker moves through a CP link, and decreased by 10 when a CI link is involved. To indicate the accumulated strength of activeness, the marker value is added to the previous value of the activeness marker for each concept. The accumulated activeness scores for the current context, before processing Uttr 5 and after processing Uttr 5 and 6, are shown in Figure 4.5. For this score calculation, the CP and CI links shown in Figure 4.1 were assumed. For the example concept nodes such as decrease, increase and descend, we selected one important CS element node from each event shown in Figure 4.1. Now, this activeness information can be applied to the following utterance:

• Uttr 7: Tiger six sixteen, decrease speed to one one three knots.

2. In the actual implementation, the link weights were adjusted to various values through experiments.
[Figure 4.4: An example of ambiguity resolution using contextual knowledge. The input phonetic codes for Uttr 7 (721 297 357 1023 238 763 976 880 ...) activate word candidates with WA-markers; the tiger-616 hypothesis carries an SA score of 658, and the competing concept nodes on the decrease and increase paths carry CA scores of 30 and -30, respectively.]

    Concept Node   Accumulated Activeness Scores
                   Before Uttr 5        After Uttr 5               After Uttr 6
                   (landing-event#4)    (decrease-speed-event#6)   (descend-and-maintain-event#9)
    land                10                   20                         30
    take-off           -20                  -20                        -30
    decrease            10                   20                         30
    increase           -10                  -20                        -30
    descend             10                   20                         30
    climb              -10                  -20                        -30
    fly                -10                  -10                        -10
    turn-right           0                    0                          0
    turn-left            0                    0                          0

    Figure 4.5: Accumulated activeness scores for Uttr 5 & 6.

With Uttr 7, increase was often activated instead of decrease because of mis-recognition in the low-level speech processing. Although the two hypotheses are apparently different, the difference between the competing hypotheses, as indicated by scoring, is often within the tolerance of the system's error checking mechanism. As a result, a wrong hypothesis can also be activated with a considerable score. Moreover, when a filled pause like uh or uhm is inserted in front of the target word, or background noise is mixed together with the target word, the wrong hypothesis can be even stronger than the right one because of the increased difficulty in phonetic segmentation.

In this example, the contextual knowledge can be effectively applied to resolve the ambiguity. As shown in Figure 4.4, the CS tiger-616 is activated with an SA score of 658 by evaluating the front part of the input phonetic codes. Because the aircraft name is the same as Uttr 6's, the same activeness marker used for Uttr 6 can be used again for Uttr 7. The activation of the CS tiger-616 further triggers the prediction of the concept nodes decrease, increase and descend in parallel through is-a and next links. Out of these three concept node candidates, decrease and increase are activated with WA scores of 82 and 85 by the low-level phoneme sequence processing. As shown in Figures 4.4 and 4.5, the CA scores for decrease and increase are 30 and -30, respectively. Although increase has a better word activation score than decrease, the activeness score of decrease is much stronger than increase's. Therefore, by applying the scoring formula Score(SA) = PreviousScore(SA) + Score(WA) + Score(CA), sentence activation scores of 658 + 82 + 30 = 770 and 658 + 85 - 30 = 713 are obtained for decrease-speed-event and increase-speed-event, respectively. Therefore, decrease-speed-event remains the superior candidate, even with the poorer word activation score. A sketch of the activeness accumulation appears below.
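To make the propagation concrete, the following fragment sketches how CA-markers might accumulate activeness over CP/CI links. The event set is taken from Figure 4.5, but the link topology and the choice of prior utterances are illustrative assumptions, so the printed scores will not reproduce Figure 4.5 exactly.

    #include <stdio.h>
    #include <string.h>

    #define N_EVENTS 9

    static const char *events[N_EVENTS] = {
        "land", "take-off", "decrease", "increase", "descend",
        "climb", "fly", "turn-right", "turn-left"
    };

    /* link_w[i][j]: +1 if events i and j are joined by a CP link,
       -1 for a CI link, 0 for no link (assumed topology). */
    static int link_w[N_EVENTS][N_EVENTS];

    static int idx(const char *name) {
        for (int i = 0; i < N_EVENTS; i++)
            if (strcmp(events[i], name) == 0) return i;
        return -1;
    }

    static void connect(const char *a, const char *b, int sign) {
        link_w[idx(a)][idx(b)] = sign;
        link_w[idx(b)][idx(a)] = sign;
    }

    int main(void) {
        int activeness[N_EVENTS] = {0};

        /* Assumed CP (+1) and CI (-1) links among the events. */
        connect("decrease", "descend", +1);
        connect("decrease", "land", +1);
        connect("descend", "land", +1);
        connect("decrease", "increase", -1);
        connect("descend", "climb", -1);
        connect("land", "take-off", -1);

        /* Each parsed utterance propagates a CA-marker from its event node
           through CP/CI links, adding the +10/-10 link weight. */
        const char *parsed[] = { "land", "decrease", "descend" };
        for (int u = 0; u < 3; u++) {
            int src = idx(parsed[u]);
            activeness[src] += 10;               /* via the CS instance link */
            for (int j = 0; j < N_EVENTS; j++)
                activeness[j] += 10 * link_w[src][j];
        }
        for (int i = 0; i < N_EVENTS; i++)
            printf("%-10s %4d\n", events[i], activeness[i]);
        return 0;
    }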
4.3.3 Applying Contextual Constraints to the CS Element Level

As explained before, the contextual constraints can also be applied at the CS element level. By domain analysis, a set of individual concepts in the CS memory can be connected by CP links if the set elements are contextually relevant, so that an activation of one concept causes subsequent activations of the other concepts in the set for incoming utterances. Similarly, a set of concepts in the CS memory can be connected by CI links if the set elements are mutually exclusive, so that an activation of one concept prevents the other concepts in the set from being activated. The set elements can be either contained in the same CS or spread over multiple CSs.

The activeness control mechanism combines both the CS and CS element level constraints. Concept instances in the CSI memory and corresponding concepts in the CS memory are dynamically connected at the end of parsing using concept instance links. An activeness marker is propagated through concept instance, CP and CI links. Activeness scores propagated via both the concept instance link and the CS instance link are added together to reflect the combined CS and CS element level constraints. As an example, consider the following utterances:

• Uttr 8: American nine twelve, two miles from the marker, runway twenty seven approach.
• Uttr 9: American nine twelve, cleared ils (instrument landing system), runway twenty seven approach.

As a result of parsing Uttr 8, approach-state#2 is generated in the CSI memory, as shown in Figure 4.6, where approach-state#2 is connected to approach-state in the CS memory via a CS instance link, and 27#4 is connected to 27 via a concept instance link. 4 right, 26 and 27 are specific concepts of runway. Once a specific runway is assigned to an aircraft, the same runway name is supposed to be mentioned repeatedly all the way through the landing process of the aircraft. Thus, the sibling concepts under runway are connected to each other using CI links, as shown in Figure 4.6.

[Figure 4.6: An example of applying contextual constraints to the CS element level. The CSI american-912#7 (with elements such as 2 miles#6 and runway 27#4) is connected to the CS memory via CS instance and concept instance links; decrease-speed-event and descend-and-maintain-event are connected to approach-state by CP links, and the sibling runway concepts 4 right, 26 and 27 are connected by CI links under the is-a hierarchy.]

In this example, decrease-speed-event and descend-and-maintain-event are connected with approach-state using CP links indicating the CS level constraints. Suppose that CSIs of decrease-speed-event and descend-and-maintain-event for american-912 were generated previously, and activeness markers were propagated to approach-state via CS instance and CP links. Also, suppose that the activeness markers were further propagated down to the CS elements of approach-state. Then, the sibling concepts under runway keep equal activeness scores for american-912.

When Uttr 8 is processed, new activeness markers arrive at the concept node 27 from both approach-state#2 and 27#4 through CS and concept instance links. When the new activeness marker from the CSI element node 27#4 arrives at the concept node 27 via the concept instance link, the activeness marker is further propagated to the other sibling concept nodes via CI links. At the end of the propagation, 27 obtains a better activeness score than the other sibling concepts, because the negative weights contained in the CI links weakened the strength of their activeness markers. Therefore, the next sentence, Uttr 9, has less chance of being mis-recognized with a wrong runway name other than 27.

4.4 Summary

This chapter presented a parallel approach for utilizing the contextual knowledge to improve speech understanding. To resolve the ambiguities between competing candidates not cleared with syntax and semantics, we added the contextual knowledge into the knowledge base. The preceding discourse context and the situational context were embedded into the hierarchical knowledge base as contextual constraints. The activeness of competing concepts is indicated by marker values based on the contextual constraints. An ambiguity resolution scheme utilizing the activeness information was implemented using the memory-based parsing and parallel marker-passing paradigms. The new PASS algorithm utilizing the contextual knowledge has been implemented on the SNAP-1 prototype and is operational. In Chapter 5, we will describe the various experiments performed on the SNAP-1 prototype, and report the performance of the PASS system.

Chapter 5

Experimental Results

In the previous chapters, we have described the PASS system, which integrates speech and natural language understanding using marker-passing techniques. We have also explained how to improve speech understanding with the aid of contextual knowledge. The PASS algorithms have been implemented on the SNAP-1 prototype and are operational. In this chapter, we present the performance of the PASS system on SNAP. In particular, we will analyze the following aspects: instruction parallelism, recognition accuracy, speed-up, scale-up, and input size performance.

5.1 Experimental Environment

The PASS program was initially developed on the SNAP-1 simulator [57], [46] using the SNAP instruction set. Then, the program was implemented on the SNAP-1 prototype, a real parallel computer. Speech input codes to the PASS system were collected using the Phonetic Engine.

5.1.1 SNAP-1 Simulator

The simulator was developed to simulate the SNAP instruction set, so that SNAP application programs may be developed and executed on sequential machines. The simulator code is optimized for execution on a sequential machine, so that when one is comparing the running time of a program on SNAP to the running time on a sequential machine, a realistic speed-up can be reported.
To resolve the ambiguities between compet ing candidates not cleared with syntax and semantics, we added the contextual knowledge into the knowledge base. The preceding discourse context and the situ ational context were embedded into the hierarchical knowledge base as contextual constraints. The activeness of competing concepts is indicated by marker values based on the contextual constraints. An ambiguity resolution scheme utilizing the I activeness information was implemented using the memory-based parsing and par- I allel marker-passing paradigms. The new PASS algorithm utilizing the contextual | knowledge has been implemented on the SNAP-1 prototype and is operational, j In Chapter 5, we will describe the various experiments performed on the SNAP-1 | prototype, and report the performance of the PASS system. C h a p ter 5 i E x p erim en ta l R esu lts i In the previous chapters, we have described the PASS system, which integrates speech and natural language understanding using marker-passing techniques. We have also explained how to improve speech understanding with the aid of contex tual knowledge. The PASS algorithms have been implemented on the SNAP-1 ! prototype and is operational. In this chapter, we present the performance of the PASS system on SNAP. In particular, we will analyze the following aspects: instruction parallelism, recognition accuracy, speed-up, scale-up, and input size performance. ' ; i i t 5.1 E x p erim en ta l E n viron m en t j The PASS program was initially developed on the SNAP-1 simulator [57], [46] using the SNAP instruction set. Then, the program was implemented on the SNAP-1 prototype, a real parallel computer. Speech input codes to the PASS system were collected using the Phonetic Engine. 5.1.1 SN A P -1 Sim ulator ' The simulator was developed to simulate the SNAP instruction set, so th at SNAP application programs may be developed and executed on sequential machines. The simulator code is optimized for execution on a sequential machine, so that when one is comparing the running time of a program on SNAP to the running tim e on I 1 I a sequential machine, a realistic speed-up can be reported. SNAP application programs are written in the C programming language, and are linked with the object codes of the SNAP instruction set. The SNAP devel opment library provides the programmer with a set of functions that aid in the development and debugging of application programs. 5.1.2 Speech Front-end The Phonetic Engine produces 1619 phonetic codes, th at must be time aligned by the SNAP-1 prototype to produce the 49 phonemes needed for recognition of spoken English. The speech input codes are delivered at 19.2K BAUD via an RS-232 interface. The user speaks into a headset connected to the Phonetic Engine, while de pressing a foot switch to delineate the start and end of each sentence [85]. A Motorola DSP within the Phonetic Engine processes the spoken sentences to de liver the phonetic codes. The Phonetic Engine hardware also includes a local memory and hard disk for storing its generic male and female speaker models. 5.1.3 Im plem entation of PASS on S N A P Figure 5.1 shows the actual implementation of the PASS system on the SNAP-1 prototype. The combined system consists of the SNAP-1 prototype, the Pho netic Engine, and a SUN 4/280 (host computer). The speech front-end and the SUN host are connected to the SNAP controller via an RS-232 and a VME bus, respectively. 
5.1.3 Implementation of PASS on SNAP

Figure 5.1 shows the actual implementation of the PASS system on the SNAP-1 prototype. The combined system consists of the SNAP-1 prototype, the Phonetic Engine, and a SUN 4/280 (host computer). The speech front-end and the SUN host are connected to the SNAP controller via an RS-232 line and a VME bus, respectively.

[Figure 5.1: The PASS system implemented on the SNAP-1 prototype. The Phonetic Engine (speech front-end) delivers phonetic codes over RS-232, and the SUN 4/280 host holds the PASS program; both connect to the SNAP controller (via a VME bus for the host), which drives the SNAP processor array holding the knowledge base.]

As introduced in Chapter 2, the SNAP-1 prototype is a parallel array processor designed for semantic network processing with a reasoning mechanism based on marker-passing [23]. The SNAP-1 prototype has an array controller and a multiprocessor array consisting of 144 Texas Instruments TMS320C30 DSP microprocessors, which act as Processing Elements (PEs). The array is organized as 32 tightly-coupled clusters of 4 to 5 PEs each. Each cluster manages 1,024 semantic network nodes and 10,240 links.

The SNAP instruction set is implemented directly in the array, and executed in parallel. The semantic network knowledge base is distributed over the memory of the PEs in the array. The PASS program was written and compiled in the SNAP-C environment on the SUN host, and downloaded to the controller for execution. The serial portion of the program code is executed in the controller. The controller broadcasts SNAP instructions to be executed in the array. This provides simultaneous access to many PEs to search for nodes, markers or relations; perform marker propagation operations; and collect data. When each PE executes marker-passing instructions, marker propagations occur asynchronously without intervention of the central controller. This allows many different markers to travel simultaneously through the network. Parallelism in the PASS algorithm is supported by SNAP's capability of dealing with many markers concurrently.

During the time of the experiments, 16 of the 32 clusters were available. Thus, all the experimental results reported in this chapter were measured on a 16-cluster SNAP-1 configuration operating at 25 MHz.

5.1.4 ATC Domain

We have analyzed the execution of PASS using the ATC domain with various quantities of concept and phoneme sequences. The ATC domain was originally developed by SSI to train air traffic controllers. The basic ATC domain consists of 1,357 semantic network nodes with 5,834 links. For the scale-up experiment, the knowledge base size was increased by inserting additional concept and phoneme sequences. The increments of concept and phoneme sequences were added up to a full knowledge base configuration of 9,033 nodes and 20,840 links.

To measure the performance of PASS, we tested 60 different continuously-uttered sentences regarding the ATC domain. Each sentence contained from 8 to 16 words and was spoken by four untrained speakers. Examples of the test sentences are shown in Appendix C.

5.2 Analysis of the SNAP Program

Before we provide the performance analysis of the PASS system, we explain the parallelism exploited in the SNAP implementation, and the characteristics of the PASS algorithm in terms of instruction frequency and execution time.

5.2.1 Parallelism in SNAP Program

In the SNAP implementation, parallelism in the algorithm is supported by SNAP's capability of dealing with markers distributedly. Specifically, two types of parallelism are exploited by SNAP instructions: 1) intra-instruction parallelism, and 2) inter-instruction parallelism [23]. For a better explanation, a simple SNAP program is shown in Figure 5.2. The program sends out markers to execute an intersection search for related concepts.

    Classification  SNAP Instruction                           Description
    search          1. search_node(C1, M1, V1);                Select concepts C1 and C2 using
                    2. search_node(C2, M1, V1);                marker M1, and concept C3 using
                    3. search_node(C3, M2, V2);                marker M2. Values V1 and V2 are
                                                               assigned.
    propagate       4. propagate(M1, M3, COMB(R1, R2), ADD);   Propagate markers M3 & M4 from all
                    5. propagate(M2, M4, COMB(R3, R4), ADD);   concepts which currently contain M1
                                                               & M2, respectively, using propagation
                                                               rules COMB(R1, R2) & COMB(R3, R4).
    boolean         6. and_marker(M3, M4, M3, MAX);            Set M3 on concepts with both M3
                                                               and M4.
    collect         7. collect_value(M3);                      Collect the names and values of
                                                               concepts with M3.

    Figure 5.2: A simple example of a SNAP program.
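Read as a host program, the Figure 5.2 fragment might look as follows. The call names and argument orders are taken from the figure, but the stubs below only print a trace, standing in for the real SNAP development library, whose exact types and headers may differ.

    #include <stdio.h>

    typedef int Concept, Marker, Rule;

    static void search_node(Concept c, Marker m, int v) {
        printf("search_node(C%d, M%d, V%d)\n", c, m, v);
    }
    static void propagate(Marker from, Marker to, Rule rule, const char *comb) {
        printf("propagate(M%d, M%d, rule%d, %s)\n", from, to, rule, comb);
    }
    static void and_marker(Marker a, Marker b, Marker dest, const char *comb) {
        printf("and_marker(M%d, M%d, M%d, %s)\n", a, b, dest, comb);
    }
    static void collect_value(Marker m) { printf("collect_value(M%d)\n", m); }

    int main(void) {
        /* 1-3: place M1 on concepts C1, C2 and M2 on C3. */
        search_node(1, 1, 1);
        search_node(2, 1, 1);
        search_node(3, 2, 2);
        /* 4-5: independent markers M3, M4, so the propagations may
           overlap (inter-instruction parallelism). */
        propagate(1, 3, 12, "ADD");   /* COMB(R1, R2) */
        propagate(2, 4, 34, "ADD");   /* COMB(R3, R4) */
        /* 6: keep M3 only where both M3 and M4 arrived (intersection). */
        and_marker(3, 4, 3, "MAX");
        /* 7: gather names/values of intersection nodes into the controller. */
        collect_value(3);
        return 0;
    }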
Intra-instruction parallelism is derived from the concurrent management of multiple nodes distributed on the semantic network by markers, or the concurrent transmission of markers from those nodes within a single propagation instruction. Examples of the former case are shown in instructions 1, 2, 3, and 6: search and boolean operations are executed in constant time by distributed markers. An example of the latter case is shown in instruction 4: when instruction 4 is executed, nodes C1 and C2 are activated in parallel, and their corresponding marker propagations through the combination of relations R1 and R2 are also executed in parallel.

Inter-instruction parallelism is derived from the overlapping of consecutive propagation instructions, which is possible because SNAP can handle multiple markers concurrently. This requires a complete understanding of the data dependencies between the different markers used by the propagation instructions. For example, inter-instruction parallelism exists between instructions 4 and 5. This parallelism is possible because there are no dependencies between the markers used in instructions 4 and 5. Instruction 7 is a collect operation, which also exploits intra-instruction parallelism during the activation process in the SNAP array, but collecting the activated nodes into the SNAP controller is sequential in nature. Thus, the effect of utilizing intra-instruction parallelism is not significant in this case.

We have measured the intra- and inter-instruction parallelism in the PASS algorithm. The example sentence used for the test is: "Tiger six sixteen, increase speed to one five zero knots". A total of 30,223 SNAP instructions were used to generate an interpretation from the given speech codes. The degree of inter-instruction parallelism is determined by the algorithm, while the degree of intra-instruction parallelism grows as the size of the knowledge base is increased. The average value of inter-instruction parallelism was 2.8 and the maximum value was 5. Run-time values of intra-instruction parallelism were consistently less than 100 for a small knowledge base of 1.4K nodes and 6K links, but the degree of intra-instruction parallelism scales as the knowledge base grows.

5.2.2 Instruction Frequency and Execution Time

For a better understanding of the characteristics of the PASS algorithm, we studied the number and type of instructions required to process the typical target sentence shown in the previous subsection. Figure 5.3 shows the ratio of each class of SNAP instructions in terms of frequency and execution time.
As shown in Figure 5.3, the boolean and set/clear marker operations take small portions of the total execution time, while they dominate in instruction frequency. This is because they are executed in a constant amount of time, exploiting intra-instruction parallelism, as explained before. On the other hand, the propagate operation dominates in execution time, while it is not significant in instruction frequency. Although the propagate operations utilize intra- and inter-instruction parallelism in the SNAP implementation, the time spent on these operations is not constant, but proportional to the lengths of their critical paths, which must be visited in sequence during the propagation. (A more detailed analysis of the critical path will be provided in the following section.) Operations like search and collect are insignificant in both instruction frequency and execution time.

[Figure 5.3: Instruction frequency and execution time. A bar chart compares the share of instruction frequency against the share of instruction execution time for each instruction class: create, delete, marker-create, marker-delete, boolean, search, propagate, set-marker, clear-marker, and collect.]

5.3 Performance Analysis

In this section, we present the performance of the original PASS system. The improvements with contextual processing will be discussed in the following section. Based on the experimental results on the SNAP-1 prototype, we analyze recognition accuracy, speed-up, scale-up, instruction execution, and input size performance.

5.3.1 Recognition Accuracy

As shown in Figure 5.4, a sentence recognition rate of about 80% was obtained, with performance decreasing slightly as the sentences became longer and more complex. Although the sentence recognition rate was only 72% for sentences with 16 words, these longer sentences are encountered less frequently than shorter ones, which tend to occur more often. Furthermore, Figure 5.4 shows a semantic accuracy1 of about 90%, which indicates that the majority of sentence-level failures had nearly correct semantic meaning representations. The underlying semantics of the input sentence could be determined successfully even for long sentences.

[Figure 5.4: Accuracy vs. sentence length. Semantic accuracy and sentence recognition rate are plotted against target sentence length, from 8 to 16 words.]

By analyzing partially generated concept sequence instances, we observed that some of the instances were very close to correct results. By adjusting threshold values in the parallel speech understanding algorithm, we could obtain correct results in some cases. Therefore, when the system fails to obtain a complete concept sequence instance as a result of the recognition process, it is possible to repeat the recognition process with dynamically adjusted threshold values; a sketch of such a retry loop follows below.

1. Semantic accuracy is the percentage of sentences with a semantically correct meaning. Even though one or two words may be incorrect, if the speaker's intent is successfully determined from the meaning representation in the CSI memory, then it is considered semantically correct.
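The fragment below is a minimal sketch of this fail-soft retry strategy. The recognize() function and the threshold schedule (start at 90, relax by 10) are hypothetical stand-ins for one pass of the PASS recognition process and its actual decision-point thresholds.

    #include <stdio.h>

    /* Hypothetical stand-in for one pass of the recognition process:
       returns 1 if a complete concept sequence instance was produced
       under the given activation threshold. */
    static int recognize(const int *codes, int n, int threshold) {
        (void)codes; (void)n;
        return threshold <= 70;   /* toy behavior for illustration */
    }

    int main(void) {
        int codes[] = { 721, 297, 357, 1023 };   /* sample phonetic codes */
        int n = 4;
        /* If no complete CSI is obtained, retry with relaxed thresholds. */
        for (int threshold = 90; threshold >= 50; threshold -= 10) {
            if (recognize(codes, n, threshold)) {
                printf("complete CSI obtained at threshold %d\n", threshold);
                return 0;
            }
            printf("no complete CSI at threshold %d, retrying\n", threshold);
        }
        printf("recognition failed\n");
        return 1;
    }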
[Figure 5.5: Definition of execution time and response time. The timeline runs from "speaker begins sentence" through "speaker finishes sentence" to "sentence/meaning recognized"; input time spans the first interval, response time the second, and execution time spans both.]

5.3.2 Response Time and Scale-up

We have measured response time as more knowledge is used. The knowledge base size was increased by inserting additional concept and phoneme sequences. For each configuration, results for program running time are reported according to the definitions in Figure 5.5. Execution time indicates the time elapsed from when the speaker first begins the sentence. Since the speech codes are generated by the Phonetic Engine as the sentence is spoken, PASS begins execution immediately when the first code is generated. This means that the input and processing are overlapped. Thus, the response time observed by the user is only the time required to construct the meaning representation and generate an output sentence after the last input is received.

Figure 5.6 shows response time as a function of the total number of semantic network nodes. The dotted line is for a 16-cluster SNAP-1 configuration operating at 25 MHz. Response time for the basic ATC domain was 3.7 seconds, with an input time of about 5 seconds. Thus, near real-time performance can be obtained, while extracting a meaning representation and generating a sentence output from untrained continuous-speech input, when using a knowledge base of this size. When more nodes were added, response time increased linearly with a small slope. Response time ranged from 3.7 seconds for 1.4K nodes to 23 seconds for 9K nodes.

[Figure 5.6: Response time as a function of knowledge base size, with fitted lines t = 0.04N - 12.9 for the uniprocessor and t = 0.0024N + 0.8 for SNAP.]

The solid line in Figure 5.6 is for the identical algorithm on a single TMS320C30 processor at 25 MHz. Response time also increases linearly, but the user must wait over 30 seconds for a response, even when using the basic ATC domain. The lines shown have been fitted to the measured data with slopes computed as 0.0024 and 0.04 for parallel and serial execution, respectively. Thus, while both curves increase linearly, the proportionality constant for the uniprocessor is 17 times greater, and a 15-fold speed-up is obtained from the parallel implementation for 9K nodes.
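The fitted lines make the reported figures easy to check; the following arithmetic is ours, using the slopes and intercepts quoted above (t in seconds, N in nodes):

    At N = 9,000 nodes:
        SNAP:         t = 0.0024 x 9000 + 0.8 = 22.4 s    (reported: 23 seconds)
        uniprocessor: t = 0.04 x 9000 - 12.9  = 347.1 s
        speed-up:     347.1 / 22.4 = 15.5                 (reported: 15-fold)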
Consider an efficient parallel implementation for a knowledge base which grows hierarchically. The critical path corresponds to the maximum depth from the root to the leaves of the hierarchy. Thus performance ! approaching logarithmic tim e can be obtained up to the number of processors available. Figure 5.8 shows an example of the logarithmic performance in ex ecuting inferencing algorithms on a tree-shaped semantic network which grows hierarchically [18]. The experiment was performed on the CM-2 and SUN-4. The j execution tim e on the sequential machine increased exponentially. 1 4 1 k 4 » 4 k 1 « k k 4 4 4 4 k k k S I N 4 ; * 1 » • 4 1 • 4 4 1 C M 87 o 10 4 8 10 12 14 16 2 6 Array size [clusters] Figure 5.9: Response tim e vs. array size (KB size: 1.4K). However, a linguistic knowledge representation typically introduces new nodes at predetermined levels for the concept and phoneme sequences. Therefore, al- j though the knowledge base is organized hierarchically, it maintains a relatively ; fixed depth while its breadth increases. The length of the critical path is roughly fixed, and performance approaching constant tim e can be obtained in term s of knowledge base size on a parallel machine with sufficient resources. For increases in knowledge base size which are large relative to the processing resources avail- ■ able, execution tim e increases at a constant rate so near linear performance will occur on an efficient parallel machine. In PASS, the addition of new nodes does not significantly change the knowledge base depth. Since meaningful increments in knowledge base size exceed the number of processors in the SNAP-1 array, ; ; linear performance is obtained. j j ; i I ( i I 6000 5500 o 5000 S, 4500 4000 - 3500 3000 5000 6000 7000 8000 9000 1000 2000 3000 4000 Knowledge base size [nodes] Figure 5.10: Increase in number of propagate instructions. 5.3.3 C om ponents of E xecution T im e To understand the execution characteristics of the algorithm, we also studied the i components of execution time required to process a typical target sentence on ! SNAP-1 as shown in Figure 5.7. The dashed line is for all 26 types of SNAP instructions, including marker-propagation. The dotted line shows that the ma- , jority of processing tim e is spent in the propagation phase. Only a small portion of the code is serial and cannot be executed as SNAP array instructions. The serial portion is about 10% for small knowledge bases and less than 4% percent for larger knowledge bases. Since the reasoning mechanisms are based on marker- propagation, the serial processing time does not depend heavily on the size of the knowledge base. However, as shown in Figure 5.10, there is some increase in the total num ber of propagations required. This occurs because more irrelevant candidates become activated, which must be removed by propagating cancel markers during j the multiple hypotheses resolution phase. Since large knowledge bases will add candidates which are not relevant, the number of propagations is not expected to 89 19 ■ i i i i i i i 17 - 15 n - F o. 13 9 M 1 1 ytr" s '* 9 p f D Response 7 - - A A ------- *------- Execution ** i i i i i i i i 5 1000 2000 3000 4000 5000 6000 7000 8000 9000 Knowledge base size [nodes] Figure 5.11 Speed-up with increasing knowledge base size. exceed much more than 5000. Most other operations remained relatively constant l with processing dominated by marker set/clear (12,000 instructions)and boolean marker operations (11,000 instructions). 
5.3.4 Processor Speed-up

Figure 5.9 shows the effect of varying the number of processors while the size of the knowledge base is held constant. Since the capacity of a single cluster is limited to 1,024 nodes, the basic ATC domain was used with 4 or more clusters. For each configuration, execution time was measured for different sentences and the average processing time was calculated. In general, the performance improves as the number of processors is increased. The improvement levels off when a large number of clusters are used. This is mainly because each cluster is then only partially occupied and no longer fully utilized.2

2. With the 1.4K basic ATC domain, the average parallelism ranges from 8 to 20, while 40 PEs are available in a 16-cluster configuration. The other 32 PEs are dedicated to program control, communication, etc. For details of the cluster design, see [23].

[Figure 5.9: Response time vs. array size (knowledge base size: 1.4K nodes), for array sizes from 2 to 16 clusters.]

However, if a proportionally larger knowledge base is used, then it is possible to take advantage of the parallelism, and higher speed-ups can be obtained. This effect is shown in Figure 5.11 for a 16-cluster configuration: the speed-up over a single TMS320C30 for both response time and execution time increases as the knowledge base grows.

[Figure 5.11: Speed-up with increasing knowledge base size, for both response time and execution time.]

5.3.5 Input Size Performance

The effect of target sentence length on response time is shown in Figure 5.12 for the basic ATC domain. Although each speaker generates a varying number of speech codes, the performance is roughly proportional to the number of phonemes in the target sentence. Each sentence is an average of 11 words, or about 50 phonemes, long, and takes about 3 to 5 seconds until the meaning representation and output sentence are generated. To further increase the speed performance for larger knowledge bases, the full 32-cluster configuration can be used when it becomes available.

[Figure 5.12: Response time vs. target sentence length (43 to 55 phonemes) on 16 clusters.]

5.4 Performance of PASS with Contextual Processing

In Chapter 4, we presented a new PASS system with contextual processing. Experiments were performed on SNAP-1 with the new PASS system to measure the improvements obtained by utilizing the contextual knowledge.

5.4.1 Accuracy

As shown in Figure 5.13, with contextual processing, a sentence recognition accuracy of about 86% was obtained, an improvement of about 6% compared with the system not utilizing the contextual knowledge. Figure 5.14 shows a semantic accuracy of about 92%, an improvement of about 2% from adding the contextual knowledge. The semantic accuracy was not improved as much as the sentence recognition accuracy. This is because many of the sentence-level failures made by the system without the contextual knowledge already had correct semantic meaning representations, so the corrections made by the contextual processing did not enhance the semantic accuracy much.

5.4.2 Response Time

Experiments have been performed with a 16-cluster SNAP-1 configuration operating at 25 MHz. For this configuration, we measured response times for all the utterances used to measure the accuracy. Figure 5.15 shows response time as a function of target sentence length (number of phonemes) with and without contextual processing.
In both cases, the performance is roughly proportional to the number of phonemes in the target sentence, although each speaker generates a varying number of speech codes. Each sentence is an average of 11 words, or about 50 phonemes, long, and the input time is 5 seconds on average.

[Figure 5.13: Sentence recognition accuracy vs. target sentence length (8 to 16 words), with and without contextual processing.]

[Figure 5.14: Semantic accuracy vs. target sentence length (8 to 16 words), with and without contextual processing.]

[Figure 5.15: Response time vs. target sentence length (43 to 55 phonemes) on 16 clusters, with and without contextual processing.]

As shown in Figure 5.15, with the contextual processing, the average response time for the ATC domain was about 3.9 seconds, while the average response time without the contextual processing was about 3.7 seconds. The parallel implementation introduced a slight increase in response time to utilize the contextual knowledge. The increase was mostly caused by the propagations of activeness markers and the score computations. However, the increase in execution time is almost negligible considering the improvement in recognition accuracy.

5.5 Summary

We implemented the PASS system on the SNAP-1 prototype and performed various experiments to measure the performance of the system. We explained the parallelism exploited in the SNAP implementation, and the characteristics of the PASS algorithm in terms of instruction frequency and execution time. Based on the experimental results on the SNAP-1 prototype, we analyzed recognition accuracy, speed-up, scale-up, instruction execution, and input size performance. Results on the SNAP-1 prototype show an 80% sentence recognition accuracy and a 90% semantic accuracy for the ATC domain. For limited domains, near real-time performance has been obtained, with a speed-up of up to 15-fold over a sequential implementation. We also implemented the new PASS system with contextual processing on the SNAP-1 prototype. Experiments were performed to measure the improvements obtained by utilizing the contextual knowledge. Through the experiments on the SNAP-1 prototype, about 6% improvement in sentence recognition accuracy was observed, while the overhead in execution time for the contextual processing was insignificant. The results obtained demonstrate the benefits of the integrated parallel model.

Chapter 6

Conclusion

In this thesis we have developed a massively parallel computational model for the efficient integration of speech and natural language understanding. This chapter summarizes the main results of the thesis and discusses directions for future research.

6.1 Results of Dissertation

Parallel Model for Integrated Speech and Natural Language Understanding

The integration of speech and natural language understanding is a key issue in speech understanding to improve the recognition rate. However, the integration involving multiple layers of knowledge sources, such as phonetic, lexical, syntactic, semantic, and contextual layers, is accompanied by a substantial computational overhead.
Therefore, an integrated system working in a uniprocessor environment will face a scalability problem as the knowledge base size is increased.

A parallel computational model based on the memory-based parsing and marker-passing paradigms was designed to address the scalability problem. Memory-based parsing techniques that employ marker-passing as an inferencing mechanism are well suited to parallel processing. The parallel model also adopts a hierarchical knowledge representation scheme to support the integration of the multiple layers of knowledge sources efficiently. Tight interaction between the hierarchically organized layers by marker-passing is the foundation on which a memory-based speech understanding system is built.

Development of an Integrated Parallel Speech Understanding System

Based on the above computational model, we have developed an integrated system called PASS for speech understanding, from low-level speech input up through meaning representation and sentence generation. The PASS algorithm is based on a combination of top-down prediction and bottom-up activation. Top-down prediction locates candidates to be evaluated next. Bottom-up activation locates possible sets of phonemes from the given input codes. After phonemes are processed and word hypotheses are formed, linguistic analysis is performed based on the syntactic and semantic constraints embedded in the knowledge base.

In this thesis, we presented the complete working system and analyzed its parallelism. The development of the PASS algorithms focused on techniques exploiting the parallelism to increase processing speed and tractable domain size. A score-based ambiguity resolution scheme was devised for ambiguity resolution. We also developed special marker-passing techniques, such as the triple prediction window and expectation adjustment, to improve performance. The meaning representations which are generated as parsing results can be applied not only to generate an output sentence, but also to provide information for applications such as high-level inferencing and speech translation.

We have demonstrated the feasibility of applying parallel processing to speech understanding by implementing the PASS system on the SNAP-1 parallel AI prototype machine. In the PASS algorithm, most of the available parallelism is captured by passing approximately 30 different types of markers in parallel through the semantic network knowledge base while performing parsing operations. In the SNAP implementation, parallelism in the algorithm is supported by SNAP's capability of dealing with markers distributedly.

Marker-passing Solutions to Speech-specific Problems

Automated understanding of natural continuous speech presents some unique problems, in addition to those already present in written natural language, such as insertion, deletion and substitution, and word boundary detection. We have introduced the alignment scoring model as a foundation to handle the speech-specific problems. We have developed efficient marker-passing solutions for the speech-specific problems based on the alignment scoring model. The marker-passing techniques handle the complicated phoneme sequence alignment involving insertion, deletion, substitution, and ambiguous word boundaries.

Ambiguity Resolution Using Contextual Knowledge

We have developed a parallel approach for utilizing contextual knowledge to improve speech understanding.
To resolve the ambiguities between competing hypotheses not cleared by syntax and semantics, we added the contextual knowledge into the knowledge base. The preceding discourse context and the situational context were embedded into the hierarchical knowledge base as contextual constraints. The activeness of competing concepts is indicated by marker values based on the contextual constraints. An ambiguity resolution scheme utilizing the activeness information was implemented using the memory-based parsing and parallel marker-passing paradigms. The new PASS algorithm utilizing the contextual knowledge has been implemented on the SNAP-1 prototype. Through the experiments on the SNAP-1 prototype, a considerable improvement in sentence recognition accuracy was observed, while the overhead in execution time for the contextual processing was insignificant.

Performance: Accuracy, Speed-up, and Scale-up

We have performed various experiments to measure the performance of the PASS system on the SNAP-1 prototype. The PASS system with contextual processing shows an 86% sentence recognition accuracy and a 92% semantic accuracy for the ATC domain. For limited domains, near real-time performance has been obtained, with a speed-up of up to 15-fold over a sequential implementation. Response time increased linearly with a small slope, ranging from 3.7 seconds for 1.4K nodes to 23 seconds for 9K nodes. The experimental results demonstrate the benefits of the parallel computational model for the integration of speech and natural language understanding.

6.2 Future Research

Although we have achieved fairly good performance with the PASS system, we can still improve the system performance in both recognition accuracy and speed. In this section, possible future work to improve the PASS system is presented.

Dynamic Adjustment of Thresholds

When the system fails to recognize a sentence, there still remain partially processed candidates. By analyzing partially generated concept sequence instances, we observed that some of the instances were very close to correct results. By adjusting threshold values in the PASS algorithm, we could obtain correct results in some cases. Therefore, when the system fails to obtain a complete concept sequence instance as a result of parsing, it is possible to repeat the recognition process with dynamically adjusted threshold values.

For this purpose, an extensive analysis should be performed first to come up with sets of dynamically adjustable threshold values for the various decision points. For better dynamic decisions among different configurations, we can utilize the information provided by the partially generated concept sequence instances from the previous run. However, this fail-soft technique will increase the processing time.

Semantically-driven Allocation

When we store the knowledge base in the SNAP array, we currently use a distributed allocation scheme. In this scheme, we do not consider any semantic pattern to reduce communication distance. As discussed in the previous chapter, just increasing the number of clusters does not improve the speed performance much. One reason is that the average communication distance is increased with the distributed allocation scheme.

If we allocate semantic nodes from the same concept sequence (or phoneme sequence) to the same processor's memory, the communication distance will be decreased considerably. This distributes the available parallelism evenly while reducing the communication overhead associated with marker-propagation. The semantically-driven allocation scheme may require the system's ability to dynamically understand the overall knowledge base structure distributed over the parallel memories; a sketch of such an allocation policy follows below.
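The following fragment sketches one possible allocation policy of this kind: nodes carrying the same sequence id are kept in the same cluster, while distinct sequences are spread across clusters. The seq_id field and the round-robin placement are illustrative assumptions, not the SNAP loader's actual behavior.

    #include <stdio.h>

    #define N_CLUSTERS 16
    #define MAX_SEQS 256          /* assumed maximum number of sequences */

    typedef struct {
        const char *name;
        int seq_id;    /* id of the concept/phoneme sequence the node belongs to */
        int cluster;   /* assigned by allocate() */
    } Node;

    static void allocate(Node nodes[], int n) {
        int next_cluster = 0;
        int seq_to_cluster[MAX_SEQS];
        for (int i = 0; i < MAX_SEQS; i++) seq_to_cluster[i] = -1;

        for (int i = 0; i < n; i++) {
            int s = nodes[i].seq_id;
            if (seq_to_cluster[s] < 0)   /* first node of this sequence:   */
                seq_to_cluster[s] = next_cluster++ % N_CLUSTERS;  /* spread */
            nodes[i].cluster = seq_to_cluster[s];  /* keep sequence together */
        }
    }

    int main(void) {
        Node nodes[] = {
            {"decrease", 0, -1}, {"speed", 0, -1}, {"knots", 0, -1},
            {"descend", 1, -1}, {"maintain", 1, -1}, {"feet", 1, -1},
        };
        allocate(nodes, 6);
        for (int i = 0; i < 6; i++)
            printf("%-10s seq %d -> cluster %d\n",
                   nodes[i].name, nodes[i].seq_id, nodes[i].cluster);
        return 0;
    }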
PASS on General-purpose Machines

Although we have implemented the PASS system on the SNAP-1 prototype, the PASS algorithm was designed and developed independently of a specific target machine. The thesis work can be further extended by implementing the system on general-purpose massively parallel computers. With the significantly improved performance and easier accessibility of general-purpose massively parallel computers like the CM-5 and MP-2, it may be worthwhile to provide an easily accessible parallel computational platform for speech understanding research.

Extension of Scope down to the Acoustic Level

Currently, the PASS system utilizes integrated parallelism from phonetic, lexical, syntactic, semantic, and contextual knowledge sources. It is desirable to extend the scope of the research down to the acoustic level, and to provide a tightly-coupled parallel speech system from low-level speech analysis to high-level contextual processing. Although this is a promising approach towards real-time speech understanding with real-world domains, considerable research effort will be necessary before any viable system can be developed.

More Difficult Speech Problems

There are some other cases of speech problems we did not handle in the current PASS system. Because of hurried pronunciation or co-articulation effects, speech systems may occasionally fail to recognize the correct word completely. The missed words are usually short, unstressed function words rather than longer content words. This omission is not handled by standard natural language processing techniques. However, new techniques for processing typed, but grammatically imperfect, input may be adaptable for this purpose [32].

Natural human speech is also more likely to be ungrammatical. Once spoken, words cannot be easily retracted, while typed utterances can be corrected if the user notices the error in time. Thus, fail-soft techniques need to be developed to really understand human speech.

6.3 Final Remarks

Since users are accustomed to using speech in dealing with other humans, their expectations are fairly high. The level of knowledge which a human brings to understanding speech is ultimately the level of knowledge which a user will expect from a speech interface with a computer. However, currently, simulation of such complete knowledge is feasible only by choosing limited domains. One of the problems speech systems have to deal with in expanding their knowledge bases for broad and complex domains is the scalability problem. This thesis work was motivated by a desire to provide a new perspective on handling the scalability problem. Throughout the design and development of the PASS system based on the memory-based parsing and marker-passing paradigms, we have demonstrated the feasibility of applying parallel processing to speech understanding. It is our hope that this thesis has provided some insight into real-time speech understanding on broad, complex domains.

Appendix A

Detailed Flow Chart of the PASS Algorithm

[Figure A.1: Flow chart of the PASS algorithm. The chart contains the steps: initial prediction, get an input code, activation of phoneme candidates, expectation adjustment, collision tests with oversegmentation and undersegmentation detection and resolution, word boundary detection, word activation, multiple hypotheses resolution, next phoneme prediction, next word prediction, and meaning representation construction.]
Appendix B

Example of the X, Y and Z Matrices

B.1 Example of the X Matrix

X[code, phoneme] = score, where 0 ≤ code ≤ 1643, 0 ≤ phoneme ≤ 48, and -128 ≤ score ≤ 127.

Phoneme Map: 9 ?< A A' A" a E E' e I' I O O' U U' y w 1 r d! f H s s! t! v z z! m n n; c c< k k< p p< t t< b b< d d< g g< J J< d" q

X[0, 0 to 48]: 43 15 -40 -100 -46 -21 -84 -91 -53 -113 -59 -10 -115 -4 -96 -76 -26 -95 -36 36 -1 -36 -84 -34 14 -26 -68 -15 6 -103 -44 -31 -74 43 -30 27 -50 43 -75 8 8 -38 -29 11 39 -15 -21 -49 67
X[1, 0 to 48]: 48 4 -9 -83 -66 -54 -86 -93 -73 -97 -79 -11 -117 -6 -80 -78 -62 -78 -38 93 -9 -37 -86 -36 -38 22 -52 2 -26 -50 -28 -33 -76 -13 -68 30 -52 20 -77 20 31 1 1 5 -9 15 1 -23 -3
X[2, 0 to 48]: 20 45 15 -69 111 75 -50 -95 -102 -53 -95 -51 -21 -126 -15 -108 -88 -82 -106 -108 50 60 -29 -59 -46 35 34 -51 -26 -9 -54 -56 -24 -85 22 -6 -4 -43 7 -87 -26 8 -20 -27 -18 3 -27 -32 -42
X[3, 0 to 48]: 13 57 71 -62 -103 -68 -85 -88 -95 -14 -117 -81 -13 -118 -8 -82 -62 -27 -35 -82 4 15 12 -88 -38 -21 7 -72 -18 -79 -52 -48 -35 -78 38 -23 19 -6 -9 -10 -29 -24 -52 42 -11 5 -19 -24 -52 24

Figure B.1: The X Matrix.

B.2 Example of the Y Matrix

Y[code, #phoneme] = score, where 0 ≤ code ≤ 1643, 1 ≤ #phoneme ≤ 2, and -128 ≤ score ≤ 127.

Y[ 0 to  9, 2]: -22 -7 -40 -34 -13 5 8 -16 9 -6
Y[10 to 19, 2]: -12 -17 8 -25 10 0 5 -10 6 9
Y[20 to 29, 2]: 22 40 57 26 12 21 20 42 24 35
Y[30 to 39, 2]: 41 42 59 53 63 35 48 -12 19 5
Y[40 to 49, 2]: 3 5 30 23 33 36 33 35 -35 -45
Y[50 to 59, 2]: -1 5 -9 42 11 33 53 -11 28 -7
Y[60 to 69, 2]: -36 -24 9 -12 12 -25 -44 12 -4 27
Y[70 to 79, 2]: 28 40 43 66 31 54 29 40 38 -17
Y[80 to 89, 2]: -2 -11 -33 -14 24 4 24 -16 -8 11
Y[90 to 99, 2]: 4 3 30 16 44 18 47 26 55 33

Figure B.2: The Y Matrix.

B.3 Example of the Z Matrix

Z[phoneme, #code] = score, where 0 ≤ phoneme ≤ 48, 1 ≤ #code ≤ 10, and -128 ≤ score ≤ 127.

Z[ 0, 1 to 10]: 6 -28 10 -32 20 -35 22 -28 20 -10
Z[ 1, 1 to 10]: -3 5 14 -2 13 5 11 24 39 2
Z[ 2, 1 to 10]: 4 -15 -5 -12 5 7 -15 28 3 -44
Z[ 3, 1 to 10]: -6 17 8 12 3 2 -2 22 -46 -85
Z[ 4, 1 to 10]: 0 -4 5 1 -1 8 -13 -25 -102 -128
Z[ 5, 1 to 10]: 4 -11 -18 1 -14 -3 7 0 -48 -85
Z[ 6, 1 to 10]: -2 12 -13 7 -22 -4 2 -97 127 -128
Z[ 7, 1 to 10]: -13 32 26 11 14 -2 28 21 -8 -118
Z[ 8, 1 to 10]: 7 -22 -21 -24 -19 -29 -26 84 -45 -96
Z[ 9, 1 to 10]: -13 31 18 26 11 14 13 -2 6 28
Z[10, 1 to 10]: 5 -12 -17 -15 -6 -18 -18 5 -88 -128

Figure B.3: The Z Matrix.
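As an illustration of how the X matrix might be consulted, the following fragment scores a candidate phoneme against a run of input codes by summing the per-code entries. The tiny table is excerpted from Figure B.1, and the plain summation rule is our own simplification of the alignment scoring model.

    #include <stdio.h>

    #define N_CODES 4      /* only the four sample rows from Figure B.1 */
    #define N_PHONEMES 49

    /* First five columns of the Figure B.1 sample, remaining entries zero;
       the real table holds all 1644 x 49 scores. */
    static signed char X[N_CODES][N_PHONEMES] = {
        { 43, 15, -40, -100,  -46 },
        { 48,  4,  -9,  -83,  -66 },
        { 20, 45,  15,  -69,  111 },
        { 13, 57,  71,  -62, -103 },
    };

    /* Score a candidate phoneme against a sequence of input codes by
       summing X[code][phoneme]. */
    static int phoneme_score(const int *codes, int n, int phoneme) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += X[codes[i]][phoneme];
        return s;
    }

    int main(void) {
        int codes[] = { 0, 1, 2, 3 };
        for (int p = 0; p < 5; p++)
            printf("phoneme %d: score %d\n", p, phoneme_score(codes, 4, p));
        return 0;
    }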
Appendix C

Example Test Sentences for the ATC Domain

• Continental seven ten, turn right heading two nine zero.
• American nine twelve, descend and maintain seven thousand.
• Northwest six fifteen, increase speed to one seven zero.
• TWA seven nineteen, climb and maintain three thousand.
• United nine twenty seven, turn right heading zero two zero.
• Bar harbor seven seventeen, fly heading two seven zero.
• Gull air five fourteen, decrease speed to two five eight.
• American nine twelve, turn left heading three two zero.
• Tiger six eighteen heavy, climb and maintain nine thousand.
• TWA seven nineteen heavy, turn right heading zero two zero.
• Delta fifty four thirteen, cleared ils approach runway two seven.
• United nine twenty seven heavy, fly heading one seven zero.
• Eastern eight eleven, cleared ils runway six five right approach.
• United nine twenty seven, cleared ils approach runway three eight left.
• Tiger six eighteen, one eight miles from the approach fix cleared ils runway two seven approach.