Speech and Language Understanding in the Sigma Cognitive Architecture

Himanshu Joshi

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2018

Table of Contents

1 Introduction
1.1 Cognitively Plausible Speech and Language Processing
1.2 Summary of Contributions
1.3 Reader's guide to the dissertation
2 Related Work
2.1 Cognitive Architectures and Language, Perceptual processing
2.1.1 Cognitive architectures
2.1.2 Integration of Speech and Language in Cognitive Architectures
2.1.2.1 HPSA77
2.1.2.2 Soar
2.1.2.3 Perceptual processing in other cognitive architectures and frameworks
2.2 Conversational agents
2.2.1 Conversational Agents
2.2.1.1 Properties of human conversation
2.2.1.2 Conversational agent architecture
2.2.2 Speech Recognition
2.2.3 Natural Language Understanding
2.2.4 Syntax Processing
2.2.4.1 Sum Product Networks
2.2.5 Integration of speech, language and cognition
3 Introduction to the Sigma Cognitive Architecture
3.1 Introduction to Sigma
3.1.1 Sigma desiderata and hypothesis
3.1.2 Sigma's view of cognitive architecture
3.1.3 Cognitive layer
3.1.4 Graphical layer
3.1.5 Problem Spaces and Control Structure in Sigma
3.1.6 Programming in Sigma
3.1.7 Cognitive Idioms
4 Speech and Language in the Sigma Cognitive Architecture
4.1 Isolated word recognition
4.1.1 Discussion
4.2 Connected word recognition
4.2.1 Discussion
4.3 Continuous Phone Recognition
4.3.1 Results on TIMIT
4.3.2 Discussion
4.3.3 Continuous Word Recognition
4.4 Dialogue Agent
4.4.1 Discussion
4.5 Grammar Parsing using SPNs
4.5.1 SPNs and Graphical Models
4.5.2 Converting SPNs to Sigma Conditionals
4.5.3 Parsing in Sigma with SPNs
4.5.4 Discussion
5 Summary and Future Work
5.1 Summary
5.2 Sigma's Desiderata and Spoken Language Processing
5.2.1 Functional Elegance
5.2.2 Grand Unification
5.2.3 Sufficient Efficiency
5.3 Impact Beyond Sigma
5.4 Open Issues and Future work
5.4.1 Full integration of Continuous Speech
5.4.2 Learning
5.4.3 Syntax Processing
5.4.4 Sum Product Networks and Implications on Graphical Architecture
6 References

List of Tables

Table 1: Results from the isolated word recognition work.
Table 2: Summary results from the phone recognition task.
Table 3: Average cognitive cycle times as a function of the amount of future constraint incorporated.

Acknowledgements

This work was possible due to the support of many people. First and foremost, I would like to thank my advisor Paul Rosenbloom. Paul, you provided me the tools necessary for becoming a scientist. I admire your tremendous ability to think clearly about the pros and cons of decisions big and small. Your scientific style is rare and serves as a continuous source of inspiration. Thank you for reading several drafts of papers and helping me become a better writer. But, above all, thanks to you, I fell in love with AI and cognitive architectures. I would also like to thank Volkan Ustun, whose guidance and friendship helped me a lot along the way. Volkan, you created the right environment for me to thrive and take on new challenges. Your ability to see nuance is admirable. I owe gratitude to Paul Bogdan and Stefan Scherer for being on my committee and helping me with their suggestions and questions.
I want to acknowledge that I would not be here today were it not for my parents and my teachers. In particular, Rajeev Pansare helped me discover the wonderful world of programming; Michael Crowley encouraged me to pursue my doctorate and was always on my side; Kenji Sagae helped me tremendously throughout my PhD journey by helping me find a path forward. My parents and my teachers instilled in me values of perseverance and determination, which, when mixed with a sense of curiosity inherited from my grandmother, made this possible. Finally, none of this would have been possible without the love and support of my wife. Gayatri, thank you for always being there for me.

Thanks are owed to my sponsor, the U.S. Army. Statements and opinions expressed do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.

Abstract

Cognitive architectures model fixed structures underlying intelligence and seek to heed the original goal of AI – a working implementation of a full cognitive system in aid of creating synthetic agents with human capabilities. Sigma is a cognitive architecture developed with the immediate aim of supporting the real-time needs of intelligent agents, robots and virtual humans. In Sigma, this requirement manifests as a system whose development is guided heuristically by knowledge about human cognition, with the ultimate desire to explain human intelligence at an appropriate level of abstraction. Spoken language processing is an important cognitive capability, and yet one not addressed by existing cognitive architectures. This is indicative of the mixed – symbolic and probabilistic – nature of the speech problem. Sigma, guided in its development by a core set of desiderata that are an evolution of the desiderata implicit in Newell's Unified Theories of Cognition, presents a unique opportunity to attempt the integration of spoken language understanding in a cognitive architecture. Such an attempt is an exercise in pushing cognitive architectures beyond what they are currently capable of, taking a first step towards enabling an architecturally based theory of spoken language understanding – deconstructed in terms of the interplay between the various cognitive and sub-cognitive capabilities that play an important role in the comprehension process.

This dissertation investigates the issues involved in integrating incremental speech and language processing with cognition, in aid of spoken language understanding, guided by the desiderata driving Sigma's development. The space of possibilities this integration enables is explored, and a suitable spoken language understanding task is chosen to evaluate the key properties of the theory of spoken language understanding developed in Sigma. The speech signal obtained from an external speech front end is combined with linguistic knowledge in the form of phonetic, lexical and semantic knowledge sources. The linguistic input is converted into meaning using a Natural Language Understanding (NLU) scheme implemented on top of the architecture. In addition to phonetic, lexical and semantic processing, language processing involves a syntactic component. Probabilistic context free grammar parsing is an important form of grammar processing that has not been possible to realize in cognitive architectures, and it poses a challenge to Sigma's grounding in graphical models.
Sigma is shown to be able to perform syntactic processing via Sum Product Networks (SPNs), a new kind of deep architecture that allows efficient, tractable and exact inference in a wide class of problems, including grammar parsing. It is shown that Sigma's cognitive language is sufficient to specify any arbitrary valid SPN, with the tractability and exactness expected of them. This demonstrates Sigma's ability to efficiently specify a wide range of problems. The implications of this are discussed, along with the Sigma mechanisms that allow for specifying SPNs. This leads to a novel relationship between neural networks and SPNs in the context of Sigma.

1 Introduction

A cognitive architecture is a hypothesis about the fixed structures underlying intelligent behavior. Cognitive architectures support an important goal of AI – to understand and create synthetic agents with human capabilities (Langley, Laird, & Rogers, 2009). Integration across a wide range of capabilities is a key requirement for cognitive architectures (Rosenbloom, 2015). Spoken language understanding is an important cognitive capability, and yet one not addressed by existing cognitive architectures – an indication of the mixed (symbolic and probabilistic) and hybrid (discrete and continuous) nature of the speech problem. Traditional symbolic architectures such as Soar (Laird, 2012) interface to sub-cognitive modules outside of the core architecture for perceptual processing. Connectionist approaches (Sun, Merrill, & Peterson, 2001), (O'Reilly, 1998), (O'Reilly, Hazy, & Herd, 2012) do a good job of processing sub-symbolic input but do not have the symbolic capabilities to induce a breadth of intelligent capabilities (Langley, Laird, & Rogers, 2009). Even hybrid architectures such as CLARION (Sun, 2006) or SAL (Lebiere, O'Reilly, Jilk, Taatgen, & Anderson, 2008) have not tackled the speech problem.

Sigma is a new breed of cognitive architecture that aims to ultimately explain human cognition in terms of the capabilities it integrates and the interactions amongst them. In pursuit of this goal, Sigma is guided by the following four desiderata:

1. Grand Unification: aiming to integrate both symbolic and key sub-symbolic (perceptual) capabilities.
2. Functional Elegance: aiming to derive cognitive and sub-cognitive capabilities from a single, theoretically elegant base or core.
3. Sufficient Efficiency: aiming to execute fast enough for real-time applications.
4. Generic Cognition: aiming to integrate both natural and artificial intelligence.

These desiderata motivate Sigma's blending of graphical models (Koller & Friedman, 2009) with cognitive architectures, yielding a broadly capable and theoretically elegant base – referred to as the graphical layer – which supports a broad cognitive layer on top. The cognitive layer provides a language for adding skills and knowledge (Rosenbloom, 2009) on top of the architecture. Together, the cognitive and graphical layers have been shown to support a wide variety of capabilities – perception and decision making (Chen, et al., 2011), reinforcement learning (Rosenbloom, 2012), episodic memory and learning (Rosenbloom, 2014), etc. – as demanded by the goal of grand unification. This dissertation takes a first step towards a computational model supporting various capabilities – and their partial integration – in service of speech and language processing, by presenting a computational model of speech processing developed with the following goals derived from Sigma's desiderata:
1. Breadth: Cover the speech phenomena at a broad level, from acoustic signal to phonetic processing, syntactic processing, semantic understanding and choosing the next response. Each capability is integrated on top of the architecture (as knowledge, instead of as a capability in the architecture), and a partial integration of them all is demonstrated.
2. Architecturally based: The proposed model is built upon an independently developed model of cognition.
3. Functionality: A working system that implements – albeit partially, in some cases – various capabilities, and a form of their integration, using the cognitively constrained model of language processing discussed in this dissertation.
4. Supraarchitectural integration: Each capability is integrated on top of the architecture as knowledge, with no specific mechanism added to the cognitive architecture in service of them.

1.1 Cognitively Plausible Speech and Language Processing

In addition to Sigma's desiderata, this work was motivated in part and guided by Newell's analysis of HARPY – a narrow AI speech system – and its mapping onto the HPSA production architecture (Newell, 1978). The following summary presents important properties of cognitively plausible speech processing:

1. Incremental processing: Speech is processed in a natural, forward fashion, with each word or utterance being understood as it is heard (Christiansen & Chater, 2016). Most narrow AI speech processing systems process the utterance multiple times with different acoustic and language models, contravening cognitive plausibility. Some incremental systems involve a limited Viterbi search with audio input of 200-300 ms, obtain a limited recognition network for that portion of the utterance, and construct the recognition network for the entire utterance incrementally, to obtain real-time performance. Due to the backtracking required in the Viterbi search, this is not cognitively plausible. The model presented in this work selects each word incrementally by committing to it within 250 ms of the word being heard, as the audio is being processed.

2. Bounded working memory: Human speech processing operates with limited working memory, whereas HARPY and most modern speech recognition systems (both deep learning based and traditional HMM/DBN based) represent hundreds of thousands of states in a recognition network. This network (Mohri & Riley, 2008), (Mohri, 1997) is precompiled, with the associated parameters obtained during training. This poses two problems during performance: (i) the size of this network necessitates a mechanism – such as beam search – to prune the search space, as each new observation creates an exponential increase in the portion of the network active during recognition; and (ii) the addition of new knowledge (such as new words) requires regenerating the entire network. Newell argued this is not cognitively plausible (Newell, 1978). The model presented here does not use a recognition network and is instead based in graphical models. The language model proposed in this work is a fusion of symbolic problem solving and sub-symbolic speech processing, where the discourse model constrains the set of possible words in a cognitively plausible fashion by basing it on the state of the current conversation.
3. No language-specific capability: Human speech processing entails dynamically combining multiple knowledge sources, with no language-specific ability enabling this fusion. The model presented in this work proposes processing, and demonstrates partially combining, phonetic, lexical, syntactic and semantic knowledge sources using the same cognitive mechanisms that underlie other processing. This is a first for cognitive architectures.

Newell's mapping of HARPY onto HPSA resulted in a set of promissory notes demonstrating that cognitive architectures were not ready to tackle speech processing. Sigma's basis in graphical models and support for quantitative metadata (Laird, Lebiere, & Rosenbloom, 2017) holds promise towards fulfilling these promissory notes. This work proposes a cognitively constrained language model and presents various individual capabilities in service of realizing the model. This is followed by a partial integration of the individual capabilities in service of a discourse agent.

In addition to acoustic processing, semantic understanding, and dialogue management, grammar is an important aspect of language processing. Previous work on language in cognitive architectures has focused on grammar parsing and learning as the central problem in language processing (Lewis, 1993). However, these theories have been non-probabilistic in nature. Jurafsky and Martin (2009) propose that parsing complexity can be explained by "unexpected (low probability, high entropy) structures", noting the importance of statistical, probabilistic parsing models. This work considers grammar parsing as a subset of the language faculties and explores how probabilistic parsing can be integrated onto the architecture. This dissertation also demonstrates – albeit without fusing it with other capabilities – Sigma's suitability for parsing probabilistic context free grammars (PCFGs) in the form of a supraarchitectural capability. PCFG parsing is a difficult task – even with Sigma's grounding in graphical models – due to the loopy and exponential nature of inference implied by the parse chart. To overcome this problem, this work explores the use of sum product networks (SPNs) in Sigma. SPNs are a form of deep architecture that are designed to be efficient and tractable and have yielded results comparable to the state of the art in various domains. An algorithm to convert any SPN to a Sigma program is presented. The accompanying proof establishes that any SPN can be converted into an equivalent Sigma program in time and space linear in the number of edges of the SPN. It is shown analytically and empirically that inference in the resulting graph is also linear in the size of the underlying SPN. The novelty here is not merely in the ability of the Sigma cognitive language to represent any valid SPN, but in the reuse of the mechanisms exercised in achieving the efficient inference implied by SPNs. In particular, the mechanisms added in support of rule-based processing and neural network processing are key to supporting the efficient exact inference implied by SPNs. This represents a new idiom – a stylized knowledge fragment that can be reused – at the graphical architecture level, allowing tasks that are exponential in graphical models to potentially be performed efficiently in Sigma – linear in the number of edges of the SPN. The theoretical implications of mechanism reuse across rule-based processing, neural networks, and SPNs are discussed, along with the exciting possibilities they entail for Sigma.
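To make the flavor of this linearity concrete, the following sketch shows bottom-up evaluation of a tiny SPN. It is illustrative only – the classes and the example network are invented for this sketch and represent neither Sigma's encoding nor any particular SPN library – but it makes visible why inference is linear in the number of edges: each edge is traversed exactly once per query.

# A minimal sum-product network (SPN) evaluated bottom-up.
# All names are invented for illustration; not Sigma's representation.

class Leaf:
    """Indicator for a binary variable taking a given value."""
    def __init__(self, var, value):
        self.var, self.value = var, value
    def eval(self, assignment):
        # 1.0 if consistent with the evidence, else 0.0.
        return 1.0 if assignment[self.var] == self.value else 0.0

class Sum:
    """Weighted mixture over children (children must share scope)."""
    def __init__(self, children, weights):
        self.children, self.weights = children, weights
    def eval(self, assignment):
        return sum(w * c.eval(assignment)
                   for w, c in zip(self.weights, self.children))

class Product:
    """Factorization over children with disjoint scopes."""
    def __init__(self, children):
        self.children = children
    def eval(self, assignment):
        p = 1.0
        for c in self.children:
            p *= c.eval(assignment)
        return p

# A valid (complete and decomposable) SPN over two binary variables.
# Each eval touches every edge exactly once, hence linear-time inference.
x1_t, x1_f = Leaf("X1", True), Leaf("X1", False)
x2_t, x2_f = Leaf("X2", True), Leaf("X2", False)
spn = Sum(
    [Product([Sum([x1_t, x1_f], [0.9, 0.1]), Sum([x2_t, x2_f], [0.3, 0.7])]),
     Product([Sum([x1_t, x1_f], [0.2, 0.8]), Sum([x2_t, x2_f], [0.6, 0.4])])],
    [0.5, 0.5])

print(spn.eval({"X1": True, "X2": False}))  # joint probability: 0.355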
1.2 Summary of Contributions

1. A novel mapping and partial integration of speech, language and cognition in a cognitive architecture.
i. Mapping the speech and language processing capabilities onto Sigma's organization of memories, selection mechanisms and cognitive processing cycle. This includes speech recognition, grammar parsing, a simple form of language understanding and a discourse agent.
ii. This integration presents a unique opportunity, in the context of cognitive architectures, to provide one potential characterization of the interplay between the three capabilities – how speech informs language understanding and how the task constrains language, with the fusion brought about by cognition.
iii. Sigma's grounding in factor graphs (Kschischang, Frey, & Loeliger, 2001) is motivated by its desire to perform symbolic and subsymbolic processing in a uniform manner. This thesis serves as a milestone to help guide Sigma's progress towards its desiderata by pushing the envelope on functional elegance and grand unification.
iv. Demonstration of speech and language processing in a manner that is both uniform with cognition and uses the same knowledge structures – a goal set by Allen Newell over thirty years ago when he proposed to map the Harpy speech understanding system (Lowerre & Reddy, 1976) – an early variant of Hidden Markov Models – onto a cognitive architecture based on the HPSA77 production system (Newell, 1978).

2. A novel construction of a suitably chosen discourse capability based on probabilistic natural language understanding (NLU).
i. This work presents a novel mapping of a discourse task onto Sigma's processing model, staking a position regarding the mix of reactive, deliberative and reflective processing involved in coupling language processing with symbolic problem solving.
ii. This is then extended by integrating incremental speech processing that can potentially exploit context or react to other modalities. Currently, this integration is demonstrated only for a portion of the dialogue task and not the whole task. This serves both to extend the speech capability and to partially evaluate its usefulness in building higher-order supraarchitectural capabilities.

3. A new graphical idiom exploring the relationship between rule-based processing, neural networks and Sum Product Networks (SPNs) – a new deep architecture for efficient, tractable inference.
i. Sigma's cognitive and graphical architectures are able to represent and process valid and consistent SPNs – via the same mechanisms that enable neural networks and rule-based processing. This relationship between rule-based processing, neural networks and SPNs is discussed in the context of the cognitive architecture.
ii. This work presents an algorithm to encode any SPN as an equivalent Sigma program, demonstrating a way to exploit the exact, efficient inference properties of SPNs using a factor graph substrate. It is then proven that this algorithm can convert any valid SPN into its equivalent Sigma model in time and space linear in the size of the SPN, and that the complexity of inference is linear in the size of the underlying SPN.
iii. This is then evaluated by demonstrating a PCFG parsing task. The SPN corresponding to a simple toy PCFG is converted to its Sigma equivalent to perform PCFG parsing, and it is shown that the inference is linear in the size of the underlying SPN.

1.3 Reader's guide to the dissertation

This thesis begins with a discussion of the various capabilities required for spoken language processing, including speech recognition, grammar parsing, and semantic understanding of spoken utterances.
This is followed by a discussion of cognitive architectures and Sigma. Subsequently, a reference language model inspired by (Lehman, Newell & Lewis, 1995) and (Jackendoff, 2011) is proposed. Section 2 reviews each speech and language capability developed individually, followed by an attempt towards a partial integration of them. The challenge posed by probabilistic grammar parsing is discussed next, along with Sigma's ability to support exact parsing. It is shown, theoretically and empirically (with one example), that Sigma is capable of specifying any arbitrary SPN while retaining the desirable inference properties of SPNs. Subsequently, Section 4.1 discusses the results from the previous sections and evaluates their contributions using Sigma's desiderata. Section 4.2 discusses future work.

2 Related Work

This thesis explores an open and important problem in cognitive architectures: the integration of spoken language understanding in a cognitive architecture. Speech recognition and natural language understanding are mature fields and, given the tremendous amount of previous work, it is impossible to provide a complete review; instead, the primary purpose of this chapter is to present the material relevant to this thesis. This chapter serves to: (a) motivate the choice of speech recognition and conversation capabilities for supraarchitectural integration; (b) review previous work in speech, language, syntax parsing and discourse on which this thesis builds; and (c) motivate the challenges faced by cognitive architectures and potentially by Sigma. The literature review is divided into two sections: the first discusses cognitive architectures and previous approaches towards language and perceptual processing in cognitive architectures; the second discusses speech processing, natural language understanding and dialogue systems in the context of virtual humans (Rickel & Johnson, 1999) or embodied conversational systems (Cassell, 2000).

2.1 Cognitive Architectures and Language, Perceptual processing

This section provides background on cognitive architectures and motivates the problem of integrating speech recognition in cognitive architectures, followed by the issues and opportunities it presents. Finally, there is a discussion of previous attempts towards integrating language and perceptual processing in the context of cognitive architectures, especially speech recognition and language understanding.

2.1.1 Cognitive architectures

A cognitive system is an integrated computational model of human behavior and consists of a cognitive architecture supporting a set of behaviors. This follows from Newell's view of the human mind as a knowledge system (Newell, 1990) – analogous to that of a computer system – which can be deconstructed in terms of the fixed architecture and the programs that run on top of it (Lehman, Laird, & Rosenbloom, 1996). Cognitive architectures model the infrastructure underlying intelligent behaviors and are concerned with those aspects of an agent that stay constant over time and across different application domains; these include (Langley, Laird, & Rogers, 2009):
i. short- and long-term memories that store knowledge about beliefs, goals, etc.;
ii. the representation and organization of elements in those memories, particularly how higher-order structures result from these; and
iii. processes – including decision making, planning, learning, etc. – that operate on these structures and learn them.
Various cognitive architectures take different approaches towards these issues, and thus different cognitive architectures can be compared or contrasted along these axes. An excellent summary of these can be found in (Langley, Laird, & Rogers, 2009). This work uses these axes to focus the analysis of cognitive architectures on the issues related to speech, language and discourse, and on the architecture's ability to support them. Cognitive architectures aim towards supporting a breadth of capabilities across a diverse set of tasks and domains in the context of a fixed infrastructure, thereby offering a view of behavior at the systems level, i.e., at the level of the general mechanisms that enable all these diverse capabilities. This position was advocated in (Newell, 1973) as well as in (Langley, Laird, & Rogers, 2009) and (Rosenbloom, 2009), and it guides the research to focus on the issues involved in integrating many findings into a single theoretical framework (Langley, Laird, & Rogers, 2009). The need for cognitive architectures to support interaction with other agents and human users via natural language understanding is well accepted and discussed in (Langley, Laird, & Rogers, 2009), (Newell, 1978). Previous works in cognitive architectures, such as NL-Soar (Lewis, 1993) or ACT-R (Ball, 2011), have focused on developing theories of natural language comprehension in the context of the particular architecture, focusing largely on linguistic syntax, i.e., the structure of language, attempting to explain various linguistic phenomena. Syntax is one aspect of the language processing model proposed in this work. The focus of this work is on spoken language understanding, with a form of acoustic signal serving as the linguistic input to the system – a capability that cognitive architectures have previously not integrated directly on top of the architecture. Finally, it is important to note that incorporating novel capabilities into cognitive architectures and exploring the issues that lie therein is considered an "obvious area for improvement" in the space of cognitive architectures (Langley, Laird, & Rogers, 2009).

2.1.2 Integration of Speech and Language in Cognitive Architectures

This section discusses perceptual capabilities in the context of cognitive architectures and previous work towards the integration of speech and language in cognitive architectures. The HPSA77 and Soar cognitive architectures are discussed first, both for being important antecedents to the Sigma cognitive architecture and for their place in the space of cognitive architectures. Next, the hybrid cognitive architecture SAL (Jilk, Lebiere, O'Reilly, & Anderson, 2008) and robotics architectures are discussed and analyzed in terms of their support for speech recognition.

2.1.2.1 HPSA77

The Human Problem Solving Architecture 1977 (HPSA or HPSA77) (Newell, 1978) was an early attempt at specifying a cognitive architecture in terms of a production system. Only the relevant aspects of the HPSA77 architecture and its assumptions are summarized here, with interested readers referred to (Newell, 1978). The top-level architecture is shown in Figure 1 and can be analyzed in terms of the three dimensions presented in section 2.1.1.
The long-term and short-term memories – referred to as the production memory (PM) and working memory (WM) respectively – hold knowledge elements in the form of productions (for PM) or data (for WM) and correspond, respectively, to what is generally known versus what is known specifically about the current situation. Each production consists of an 'if' condition with a set of instructions or actions to execute when the particular condition is satisfied. The conditions and actions can contain variables – which take arbitrary values – or be variable free, specifying the exact values required for the production to be satisfied. Data enters the working memory via the perceptual system and also as a side effect of actions that are executed. A production is satisfied, i.e., considered for execution, when all its conditions find matching elements in the working memory.

[Figure 1: HPSA77 architecture showing various memories and mechanisms. Production memory represents what is generally known in terms of productions, such as P: C1 && C2 -> A1, A2 or (Probe =X)(Digit =X) -> (Say "FOUND"), and working memory contains symbolic data – such as (Digit 5), (Digit 6), (Probe 5) – about what is known about the current situation, fed by perception and feeding the motor system.]

The behavior of the system can be broadly specified by the 'Recognize-Act' cycle that: (i) matches all productions from the production memory against data in the working memory to instantiate them, (ii) selects which production to execute and (iii) finally executes the corresponding actions. The HPSA77 architecture can be broadly specified by: (i) the arrangement of the PM and WM memories – in this case, interaction between the PM and the perceptual system occurs strictly through the WM; (ii) the knowledge they contain – the productions for PM or data for WM; and (iii) finally the processes that operate on them, for example the recognize-act cycle. Many other details of the architecture need to be specified further, such as, for example, the process – referred to as the 'conflict resolution' mechanism – that selects a particular production to execute during a particular recognize-act cycle. This is accomplished by considering evidence from psychology and making suitable assumptions to limit or generalize each mechanism. Further details, including the particular assumptions HPSA77 makes, can be found in (Newell, 1978). It is pertinent to note that HPSA77 did not specify any mechanism for learning either the productions in the PM or the processes that operate on them.
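The recognize-act cycle can be made concrete with a minimal sketch. The rule format and the trivial conflict-resolution policy below are invented for illustration – HPSA77's actual assumptions are considerably more constrained – and the single production mirrors the (Probe =X)(Digit =X) -> (Say "FOUND") example of Figure 1.

# A minimal recognize-act cycle in the style of a production system.
# Rule format and conflict resolution are invented for this sketch.

working_memory = {("Digit", 5), ("Digit", 6), ("Probe", 5)}

def probe_matches_digit(wm):
    # Condition with a variable X: (Probe =X) and (Digit =X) must co-occur.
    probes = {v for (t, v) in wm if t == "Probe"}
    digits = {v for (t, v) in wm if t == "Digit"}
    return list(probes & digits)

productions = [
    # (name, match function, action applied to one chosen instantiation)
    ("found-probe", probe_matches_digit, lambda x: print(f"FOUND {x}")),
]

def recognize_act(wm, rules, cycles=1):
    for _ in range(cycles):
        # Recognize: match all productions against working memory in parallel.
        conflict_set = [(name, act, inst)
                        for name, match, act in rules
                        for inst in match(wm)]
        if not conflict_set:
            break
        # Resolve: select a single instantiation (here, simply the first).
        name, act, inst = conflict_set[0]
        # Act: execute serially, one production instantiation per cycle.
        act(inst)

recognize_act(working_memory, productions)  # prints "FOUND 5"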
This sub-section briefly discusses the relevant aspects of the important processes and assumptions underlying HPSA77 in the broader context of speech recognition and language processing, with the finer details of HPSA77 in the context of speech and language discussed in subsequent sections, as the need arises. The most important aspects of HPSA77, in the context of this thesis, are: (i) the data elements – the productions in the PM and their data in the WM – are symbolic in nature, and the various processes that act on them are themselves symbolic; (ii) the productions in the long-term memory interact with the perceptual system strictly through the working memory. The perceptual system's input to cognition is sensory data – for example, speech features – which is symbolized in the working memory. An important observation immediately follows: this property places no requirement on the form of the speech features, and thus only specifies that the knowledge in the long-term memory cannot be directly involved in the process that obtains these features for symbolization; (iii) human processing follows a 'recognize in parallel, act serially' paradigm. This distinction maps to Shiffrin and Schneider's (1977) theory of automatic versus controlled processing. HPSA77 assumes productions without variables can execute asynchronously, concurrently and without constraint every recognize-act cycle, but only one instantiation of a production with a variable can be executed per recognize-act cycle. Any capability requiring perceptual input must be mapped onto this combination of processing, based on what is known about the capability. This mapping determines the feasibility of the capability in terms of the architectural mechanisms and the performance it yields, and thus determines whether the supraarchitectural capability can be routinely deployed as part of the cognitive system. In fact, (Rosenbloom, 2015) notes that the nature of this mapping is key to achieving routine deployment of capabilities. Subsequent sections discuss how this thesis relies on the properties of lexical access (Marslen-Wilson, 1987), (McClelland & Elman, 1986), (Frauenfelder & Tyler, 1987), the time scale of cognition (Newell, 1990), and their mapping onto Sigma's memory structures, the organization of long-term and short-term memories, the cognitive decision cycle, etc. This also motivates evaluating the feasibility of routinely deploying the speech recognition capability together with a discourse capability.

Newell mapped the Harpy speech recognition system (Lowerre, 1976) onto the HPSA production system (Newell, 1978), resulting in a set of promissory notes. Most of these concerned the mixed and hybrid nature of the speech problem and the lack of a suitable path to realize Harpy's use of the Itakura metric for matching the acoustic signal with phonetic templates – the equivalent of the acoustic function (discussed in section 2.2.2). Other promissory notes concerned the difficulty with the selection of linguistic units – words, in this case – to be accomplished via comparing and matching productions in the working memory. The difficulty arises out of how the production selection mechanism of HPSA should handle the closeness of real-valued numbers that arise out of the comparison of acoustic data with phonetic templates. Both these difficulties arise out of a lack of support for probabilistic representations of state information in the WM and for the ability to select a WM element via probabilistic matching.
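The following sketch illustrates the kind of computation at issue: matching an acoustic frame against phonetic templates by real-valued closeness. Euclidean distance stands in here for Harpy's Itakura metric, and the feature vectors are invented; the point is that selection requires comparing continuous scores, which HPSA77's symbolic matcher could not express.

# Nearest-template phone selection by real-valued distance.
# Euclidean distance is a stand-in for the Itakura metric;
# the templates and the input frame are invented for illustration.

import math

phone_templates = {  # hypothetical feature vectors per phone
    "ae": [0.8, 0.2, 0.1],
    "k":  [0.1, 0.9, 0.3],
    "t":  [0.2, 0.1, 0.9],
}

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_phone(frame):
    # Selecting a winner requires comparing continuous scores, not
    # testing symbolic equality of working-memory elements.
    return min(phone_templates,
               key=lambda p: distance(frame, phone_templates[p]))

print(best_phone([0.75, 0.25, 0.15]))  # -> "ae"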
Soar can be thought of as a general-purpose problem solver that works on achieving goals – specified via problem spaces – with the architecture deciding how to proceed in this space by selecting a particular action to apply. This selection is made using available knowledge associated with the current problem space. 18 Figure 2 shows the top-level Soar block diagram. Similar to the PM in HPSA77, the Long Term Memory (LTM) in Soar stores what is generally known except that in Soar, the LTM is further specialized into procedural, semantic and episodic memories. These store knowledge about how and when to do something, facts about the world and states or sequences of states respectively. Akin to HPSA77, Soar WM stores what is known about the current situation. The ‘recognize-act’ cycle of HPSA maps on to the more elaborate ‘decision’ cycle in Soar and consists of: input, elaboration, decision, application and output. Instead of selecting a production to apply, the Soar decision cycle selects an appropriate operator or action to apply – thereby retaining the ‘perceive in parallel, act serially’ edict followed by HPSA77. Soar further proposes a Body Cognition Long Term Memory (Symbolic) Working Memory (Symbolic) (Digit 5) (Digit 6) Perceptual STM Action Procedural Semantic Episodic Perception Perceptual LTM Mental imagery RL Appraisals Chunking Semantic Episodic Decision procedure Figure 2 The Soar cognitive architecture. The perceptual memories are outside of the symbolic working memory (WM) and require external modules to introduce symbolized perception into the WM. 19 three-level processing model, based on PSCM and its decision cycle (Laird, The Soar Cognitive Architecture, 2012): (i) reactive processing: corresponding to system 1 or habitual thinking, specifying what happens inside each decision cycle – aggregating perceptual input with LTM and the data in WM to select the next operator to be applied and obtain its output; (ii) deliberative processing: corresponding to system 2 or controlled processing, specifying what happens across a sequence of decision cycles; and, (iii) reflective processing: involves metalevel thinking and involves halting normal processing – via impasses – to bring more knowledge into play to find a way to proceed. Soar uses this mechanism to implement chunking – a form of procedural learning (Rosenbloom, Newell, & Laird, 1991) – that is able to learn procedural knowledge structures whenever an impasse is resolved, thus alleviating a future possible impasse in similar situations. Soar suffers from the same issues that HPSA77 does, although Soar allows specifying numeric preferences over which operator is selected for execution. This can be extended to support probabilistic reasoning via knowledge encoded as operators and rules that operate on them (Laird, 2012). Soar requires perception to be introduced into the WM via a perception module that symbolizes perceptual input.. For example (Laird, Kinkade, Mohan, & Xu, 2012) introduce symbolized input in Soar’s WM. Even the imagery memory in Soar, which stores knowledge associated with symbolic and perceptual aspects of mental imagery processes, requires special 20 interface with the WM that retains its symbolic character (Laird, 2012). The WM retains its symbolic character due to the processing cycle’s symbolic nature. 
This thesis seeks to demonstrate that Sigma's unique blending of graphical and cognitive layers is able to represent and process the acoustic signal and the corresponding processes in a manner that is uniform with cognition. Thus, for each decision cycle, the WM contents are updated by combining perceptual input – processed by a suitable perceptual function, which corresponds to the acoustic function discussed in section 2.2.2 – with state information, represented as probabilities, to yield a posterior probabilistic distribution over the state. The cognitive architecture also supports a language that allows extracting arbitrary values from the WM.

2.1.2.2.1 NL-Soar

NL-Soar is a theory of sentence comprehension developed within the Soar architecture (Lewis, 1993). This work shares with NL-Soar its goal of developing a theory of language processing that is architecturally based and aims to cover a broad range of linguistic phenomena, but over a wider space of capabilities. Whereas NL-Soar was focused on incorporating a range of linguistic phenomena and exploring the role that structure (i.e., syntax) played in conjunction with other cognitive abilities, this work aims to integrate speech, language and discourse abilities on top of a cognitive architecture that is developed with support for quantitative metadata from the ground up. With Sigma's basis in factor graphs, probabilistic grammar parsing is not assumed to be straightforward. As shown in section 2.2.4, this is because inference in the underlying factor graph is exponential in nature and approximate, due to the loopy nature of the graph. The focus on syntax in this thesis is limited to probabilistic parsing in a purely reactive fashion, where the entire sentence is processed in a single cognitive cycle. Although not described in this dissertation, Sigma has previously been used to process grammar in a deductive fashion using a shift reduce (SR) parser (Shieber, Schabes, & Pereira, 1995).
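For concreteness, a bare-bones shift-reduce parser over a toy grammar is sketched below, illustrating the deductive style of parsing referred to above. The grammar, lexicon and sentence are invented, and this is not the Sigma implementation.

# A bare-bones shift-reduce parser over a toy context-free grammar.
# Greedy reduction suffices for this grammar; real parsers need search.

grammar = {            # RHS pair -> LHS
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
    ("NP", "VP"): "S",
}
lexicon = {"the": "Det", "dog": "N", "chased": "V", "cat": "N"}

def parse(words):
    stack, buffer = [], [lexicon[w] for w in words]
    while buffer or len(stack) > 1:
        if len(stack) >= 2 and tuple(stack[-2:]) in grammar:
            stack[-2:] = [grammar[tuple(stack[-2:])]]   # reduce
        elif buffer:
            stack.append(buffer.pop(0))                 # shift
        else:
            return None                                 # stuck: no parse
    return stack[0] if stack else None

print(parse("the dog chased the cat".split()))  # -> "S"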
This attempt to 2 https://www.unrealtournament.com 22 implement a hybrid scheme provided further evidence of the feasibility of combining connectionist approaches with symbolic architectures and led to an initial exposition of the problems involved in achieving such a fusion: (i) symbol grounding – the conversion of perceptual features to relevant symbols (Newell & Simon, 1976) – and (ii) top down control of visual module and its effects on visual attention. The symbol grounding problem (Newell & Simon, 1976), (Nilsson, 2007) involves bridging the distinction between the representation subsystem (able to efficiently, tractably represent perceptual information) and the control subsystem (allowing for efficient reasoning, sequencing, reflective processing) – respectively (Jilk, Lebiere, O’Reilly, & Anderson, 2008). This grounding was handled by narrow interfaces provided by SAL between the representation subsystem implemented by Leabra and the control subsystem implemented by ACT-R. Sigma takes an approach guided by functional elegance towards bridging this divide via the mixed and hybrid nature of its cognitive decision cycle. Additionally, SAL did not implement speech understanding although it is conceivable to assume it would be possible by representing the speech decoder rules in the procedural memory – implemented as ACT-R module – and priming Leabra’s aural memory to recognize aural cues – such as phonemes, syllables etc. – to achieve a speech recognition capability. The current SAL implementation is top down, with the control of the visual module via a directed visual search, whereas a more natural realization would be priming Leabra’s visual module with cues related to objects and having bottom-up input from Leabra informing ACT-R of a possible object in the visual field to let ACT-R direct top-down attention and more resources for further inspection (Jilk, Lebiere, O’Reilly, & Anderson, 2008). This highlights an 23 important aspect of capability integration in cognitive architectures. Sigma has demonstrated ability to handle the dichotomy of top down versus bottom up effect on cognition exemplified above in (Rosenbloom, Gratch, & Ustun, 2015). This thesis seeks to build on this work in visual search to find the appropriate divide between habitual (reactive), deliberate (controlled) and reflective (meta) processing for speech understanding. The robotics research community has traditionally investigated issues involved in combining symbolic and sub-symbolic aspects. An important distinction between robotics architectures and cognitive architectures is that the latter focus on modeling human behavior explicitly whereas the architectural mechanisms, organization of memories and representations of knowledge in robotics architectures are influenced by concerns regarding real time and constrained resource usage requirements. Classical approaches like the 3T architecture (Gat, 1998) broadly organize processing into three layers: reactive, sequencing and reflective tiers analogous to Soar’s three level processing model discussed in section 2.1.2.2. However, a major difference in these two processing models is that in the Soar (and Sigma) three layer processing model, the reactive layer forms the inner loop of the deliberative layer which in turn forms the inner loop of the reflective level (in line with functional elegance an important desiderata) whereas the 3T architecture places no such constraint on the three tiers. 
The three tiers are partitioned to provide real-time response to sensors and actuators, sequencing of operations to yield complex behaviors, and planning to handle the search aspects involved in generating intelligent behavior. Indeed, this layering allows a wide variety of disparate reactive algorithms, used to handle the uncertainty associated with sensors, to be sequenced and controlled by various algorithms at the upper layers. One consequence of this layering is that any interaction between these layers occurs across well-defined, narrow interfaces (Gat, 1998), with disparate capabilities implemented as modules. This thesis takes the position that the three-layer cognitive processing model developed in Soar and later reused in Sigma (refer to section 3.1.3) is a more elegant approach towards developing a uniform theory of cognition and is in line with what Newell advocated in (Newell, 1973) when he presented the case for building a unified platform for experimentation. This thesis is unique in its attempt to take the first step towards a theory of speech recognition and understanding using the three-layer cognitive processing model of Sigma.

More recently, robotics architectures have been attempting to unify high-level planning and goal-oriented behavior with perceptual uncertainty and motor control using graphical model based probabilistic inference. Toussaint, Plath, Lang, and Jetchev (2010) integrate a form of vision – using the approximate nearest neighbor algorithm (ANN, http://www3.cs.stonybrook.edu/~algorith/implement/ANN/implement.shtml) to search for extracted vision features (SURF points) – with motion and trajectory control and high-level planning expressed in Dynamic Bayesian Networks (Murphy, 2002). This first attempt does not fully integrate perception, instead relying on the approximate nearest neighbor algorithm to search for objects in the visual field. This thesis seeks to demonstrate that Sigma's graphical and cognitive layers together can support the speech recognition capability, uniformly with cognition. Commercially available robotics architectures such as iCub (https://en.wikipedia.org/wiki/ICub) or ROS (http://wiki.ros.org) typically layer their processing according to concerns of real time and available resources. Thus, the reactive layers are located locally on the robot, whereas the planning layer, which involves resource-intensive search, is located on a more powerful computer. Speech recognition is typically integrated as a separate module using an off-the-shelf speech recognizer, and communication between the speech module and the high-level planning and control layers occurs via well-defined, narrow interfaces.

2.2 Conversational agents

Embodied conversational agents (Cassell, 2000) integrate speech recognition, natural language understanding and discourse management in a virtual human (Rickel & Johnson, 1999). Here, the goal is human-like behavior, and such systems are used in a variety of real-world tasks, such as training naval officers (Campbell, et al., 2011), complex multi-party negotiations (Traum, Marsella, Gratch, Lee, & Hartholt, 2008), (DeVault & Traum, 2012), and serving as museum guides (Traum, et al., 2012). Speech recognition, language understanding and discourse management systems have a rich history, and only the ideas relevant to this work are presented. Conversational agents are reviewed first, followed by speech recognition and language understanding. This order serves to motivate the supraarchitectural integration of speech and language with a discourse management capability to form a conversational agent.
2.2.1 Conversational Agents

This section presents material on conversational agents and is divided into two subsections. The construction of a conversational agent relies on important properties of human conversation, and these are presented first; they serve as the basis for the decisions made about which specific ideas from the automatic speech recognition (ASR) and natural language (NL) domains to build on. Next, a reference architecture for conversational agents is presented and its various components are delineated, along with a discussion of the interaction between the speech recognition and language understanding components. The speech recognition and natural language understanding components are discussed further in the following sections, followed by a discussion of previous attempts to integrate them.

2.2.1.1 Properties of human conversation

Human conversation is a complex, highly interactive process characterized by turn taking, where each agent takes a turn to speak (Jurafsky & Martin, 2008). These turns are characterized by little or no delay (Sacks, Schegloff, & Jefferson, 1974) and may involve providing feedback on the original speaker's utterance via backchannels (Yngve, 1970), such as nodding, or other overlap behavior. Overlap behavior may manifest itself as a collaborative completion of the original speaker's utterance (Goodwin, 1979) or other grounding behavior (Clark & Schaefer, 1987). These observations point to an incremental interpretation of speech understanding within human listeners (Marslen-Wilson & Welsh, 1978), (Marslen-Wilson, 1987); being able to perform incremental understanding of speaker utterances, in the context of discourse systems, enables rapid response (Skantze & Schlangen, 2009) and informs the discourse agent's ability to provide early or overlapping feedback (Allwood, 1995).

An utterance in a conversation is modeled as an action performed by the speaker, referred to as a 'dialogue act' (Wittgenstein, 1953). More recently, (Schlangen, 2005) notes the non-sentential nature of utterances – i.e., containing disfluencies and repairs, and being shorter than full sentences – proposing that the unit of analysis be an 'intentional unit' such as a dialogue act, further supporting the modeling of user utterances as acts. These acts depend on the context in which they occur and influence 'what' is currently understood commonly or jointly by the participants (Jurafsky & Martin, 2008). Agents engaged in conversation use evidence, such as nods or continuer words like 'uh-huh' via the backchannel, to gain closure (Jurafsky & Martin, 2008), ensuring they cooperate on turn-taking, avoid unnecessary overlap and reach mutual understanding. This points to incremental understanding of partial utterances, followed by an action – whether to interrupt or provide feedback – in human listeners. (DeVault, Sagae, & Traum, 2011) further note that beyond incremental understanding, certain properties of the utterance, such as its semantic content, are necessary to provide some responsive behaviors, including evocative function (Allwood, 1995), perlocutionary effect (Sadek, 1991), grounding by demonstration (Clark & Schaefer, 1987), incremental grounding (Schlangen, 2005), etc.
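The following sketch illustrates, in the simplest possible terms, why incremental interpretation enables such behaviors: a partial semantic frame is available after every word, so feedback can be produced mid-utterance. Keyword spotting stands in for a real incremental NLU model, and the cue table and utterance are invented.

# Illustrative incremental-understanding loop: the partial frame is
# updated word by word, making early feedback possible. Invented cues.

cues = {  # hypothetical word -> (slot, value) associations
    "ship": ("topic", "vessel"),
    "leave": ("act", "departure"),
    "tomorrow": ("time", "tomorrow"),
}

def incremental_understand(words):
    frame = {}
    for i, word in enumerate(words, start=1):
        if word in cues:
            slot, value = cues[word]
            frame[slot] = value
        # The partial frame exists mid-utterance, which is what would
        # let an agent backchannel, complete, or interrupt early.
        print(f"after word {i} ({word!r}): {frame}")
    return frame

incremental_understand("does the ship leave tomorrow".split())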
2.2.1.2 Conversational agent architecture

This section discusses the general architecture of dialogue systems and establishes the requirements for the speech recognition and language understanding capabilities, in the context of the properties of human conversation from the previous section. Figure 3 shows the major components of a dialogue system, specified according to functional blocks.

[Figure 3: A modular approach to a dialogue system (Jurafsky & Martin, 2008), comprising speech recognition, language understanding, dialogue manager, task manager, language synthesis and speech synthesis modules. Input from each module along the chain activates the next module, triggering processing. Information is shared in only one direction.]

The architecture can be understood based on the functional concerns specified by (Schlangen, 2005), (Jurafsky & Martin, 2008): (i) what was said, (ii) what to say next and (iii) how to say it. Broadly, the speech recognition and language understanding units convert linguistic input into a meaning representation – covering 'what was said' – and the language generation and speech synthesis units convert meaning to speech – covering 'how to say it'. The dialogue manager controls this process with input from the task manager and is responsible for 'what to say next'. This thesis focuses on the recognition and understanding processes, and hence the discussion of the language synthesis and speech synthesis blocks is limited. The architecture denotes a 'batch-processing' mode – also known as a pipe-and-filter system (Shaw, 1996) – where each block operates only when it has data to process, until eventually the system generates an output. Turn-taking behavior entails responding to user utterances in an overlapping manner, as noted previously. An immediate consequence of the batch-processing mode of the architecture is that the overall latency of the system – the delay between the end of the human utterance and the response – is dependent on the frequency of interaction between individual components. Lowering the latency and improving the real-time performance of the system is important to make the conversation realistic (Morbini, et al., 2013) and serves to establish rapport.

The speech recognition component converts audio input from a microphone into a text transcription. Many aspects of the speech recognition system are tailored towards the particular requirements of both the system architecture and the dialogue task. For example, to reduce the latency of the system, a speech recognizer may output partial or incremental utterance information (Skantze & Schlangen, 2009), (Morbini, et al., 2013) in real time. Constraints from the structure and nature of human language can be used to restrict the space of hypotheses the speech recognition engine must explore (Chelba, 2000), (Schuler, Wu, & Schwartz, 2009). The nature of these approaches is covered in section 2.2.2. The NLU engine converts speech input into meaning. As with speech recognition, the NLU component impacts the latency of the system, and its design involves similar tradeoffs in terms of constraining the space of meaning hypotheses versus the generality of the approach and its application across domains (Jurafsky & Martin, 2008). The nature of these tradeoffs and the various NLU approaches are covered in section 2.2.3.
The synthesis component can be realized using a hidden Markov model (HMM) based approach (Pieraccini, Levin, & Lee, 1991) and the text can be synthesized to generate speech using a voice-XML 6 engine. The dialogue manager is responsible for the development and structure of the dialogue – thereby handling both ‘what’ and ‘when’ of the next response. The content of the next response is largely influenced by the state of the dialogue and the intended structure of the agent-human interaction. The dialogue manager can be characterized as a state model (Schlangen, 2005), (Jurafsky & Martin, 2008) with transitions occurring between states based on linguistic input, its interpretation and the intended structure of the dialogue. The structure of the dialogue is characterized by which agent can talk when, formally referred to as the ‘initiative’ (Walker & Whittaker, 1990). Four major types of dialogue managers are summarized by (Jurafsky & Martin, 2008) in literature: (i) finite state automata, (ii) frame based, (iii) Markov decision process based and (iv) plan based architectures. These approaches vary based on the kind of information to be represented and how state transitions must be specified (Schlangen, 2005). Systems that control the structure of the dialogue tightly, such as a question answer prompting system, are called single initiative system and modeled by finite state automata (Jurafsky & Martin, 2008), (Morbini, et al., 2013). In such a setting, the response depends on the state the system is in and the user must answer the exact question asked. A more natural interaction, where the conversation control shifts between users, requires handling a variety of user inputs in each state or hierarchically partitioning the state space (Jurafsky & 6 http://www.voicexml.org 31 Martin, 2008). This can be obviated by using a frame and slots representation where the current state can be built over dialogue acts as each slot is filled by turn taking. This system shares initiative, allowing user input to dictate which frame or slot in a frame the response is mapped on. These systems (Seneff & Polifroni, 2000) can use production rules to switch control between different frames (Jurafsky & Martin, 2008). The information-state architecture (Traum & Larsson, 2003) for dialogue management is a more general approach towards dialogue management and explicates various concerns in modeling the state of the dialogue process (Jurafsky & Martin, 2008). The choice of conversational actions that depend the information-state of the dialogue and the agents involved can be extended to choosing these actions based on optimizing rewards or costs based on the current information-state (Jurafsky & Martin, 2008). This gives rise to POMDP (Selfridge, Arizmendi, Heeman, & Williams, 2012) based approaches. Finally, planning based approaches cast the problem of interpreting and generating a sentence as a planning problem. This idea builds on how plan-based models can help interpret speech acts (Perrault & Allen, 1980) and generate speech acts (Cohen & Perrault, 1979). The TRIPS agent helps planning with emergency management, planning where and how to supply ambulances or personnel in a simulated emergency situation and the planning algorithms underlying these tasks can be used to generate and interpret sentences (Allen, Ferguson, & Stent, 2001). 
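A minimal frame-and-slots dialogue manager in the mixed-initiative style described above might look as follows; the slot names, the toy interpreter and the sample utterances are invented for this sketch.

# A minimal frame-and-slots dialogue manager: user input may fill any
# open slot, and the system asks about whichever slot is still empty.

frame = {"origin": None, "destination": None, "date": None}

def nlu(utterance):
    """Toy interpreter: returns (slot, value) pairs found in the input."""
    found = []
    words = utterance.lower().split()
    if "from" in words:
        found.append(("origin", words[words.index("from") + 1]))
    if "to" in words:
        found.append(("destination", words[words.index("to") + 1]))
    if "on" in words:
        found.append(("date", words[words.index("on") + 1]))
    return found

def next_action(frame):
    open_slots = [s for s, v in frame.items() if v is None]
    return f"ask({open_slots[0]})" if open_slots else f"confirm({frame})"

for utterance in ["I want a flight from boston", "to denver on tuesday"]:
    for slot, value in nlu(utterance):
        frame[slot] = value   # user input dictates which slot is filled
    print(next_action(frame))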
Plan-based models can be used for mixed-initiative dialogue tasks; they require semantic interpretation of linguistic inputs – such as frames or frames with slots – and work in conjunction with the task model to generate the next best dialogue action given the goals of the task.

2.2.2 Speech Recognition

This section reviews core concepts in speech recognition. The speech recognizer converts the raw acoustic signal into a transcription containing a string of words in the language. Speech recognition systems can be parameterized along the following axes (Jurafsky & Martin, 2008): (i) the size of the vocabulary – small, medium or large; (ii) the fluency of the recognized speech – isolated words or continuous fluent speech; (iii) channel noise characteristics – lab-quality speech versus, say, speech recorded in a noisy car; and (iv) speaker class characteristics – standard or accented speech. This thesis focuses on a medium vocabulary task, with models trained on large vocabulary speech recorded in laboratory conditions. The analysis below follows the axes used to analyze cognitive architectures in 2.1.1, focusing on the knowledge sources, the representation of structures and the processes operating on these structures.

Speech recognizers are predicated on the Ur-theory of phonology: that spoken words are composed of smaller units of speech (Jurafsky & Martin, 2008). Each such unit of speech – phone, sub-phone or syllable – can be modeled using a Hidden Markov Model (HMM) (Rabiner L. R., 1989), (Jelinek, 1997), (Young, et al., 1997). HMMs are generative models, where an acoustic unit is assumed to have generated a sequence of observations, i.e. lower order units – for example, each phone is modeled as having produced a sequence of acoustic observations, or each word is modeled as a sequence of phones. Higher order units, such as words or utterances, can be composed from individual HMMs spliced together, yielding a higher order HMM. A detailed description of the individual components of HMMs can be found in (Rabiner L. R., 1989). Very briefly, the process responsible for producing the observations is factored into a state space where the state transition function models probabilistic inter-state transitions and the observation function governs the generation of observations in each state. Inference – retrieving the sequence of states that produced the observation sequence – is possible via either the forward algorithm (Rabiner L. R., 1989) or the Viterbi algorithm (Rabiner L. R., 1989). Sigma supports both versions of inference via the graphical model layer (Kschischang, Frey, & Loeliger, 2001), (Rosenbloom, 2009).

The noisy channel model of speech recognition views the acoustic signal as a noisy version of the text string, modeling how the channel modulates the source in order to recover the string (Jelinek, 1997), (Young, et al., 1997), (Jurafsky & Martin, 2008). The classic speech equation specifies the speech recognition problem:

    $W_{1 \ldots M} = \underset{w_{1 \ldots M}}{\operatorname{argmax}} \; P(w_{1 \ldots M} \mid O_{1 \ldots T})$    (1)

Using Bayes' law, this can be rewritten as:

    $W_{1 \ldots M} = \underset{w_{1 \ldots M}}{\operatorname{argmax}} \; P(O_{1 \ldots T} \mid w_{1 \ldots M}) \, P(w_{1 \ldots M})$    (2)

The $P(w_{1 \ldots M})$ term represents the prior probability of the sequence of words and is modeled by the statistical language model function (Jelinek, 1997), (Young, et al., 1997), (Jurafsky & Martin, 2008). The $P(O_{1 \ldots T} \mid w_{1 \ldots M})$ term represents the likelihood of observing a sequence of acoustic observations given a particular string of words and is modeled by the acoustic function (Jelinek, 1997), (Young, et al., 1997), (Jurafsky & Martin, 2008).
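As a toy illustration of equation (2), the following sketch scores two candidate transcriptions by multiplying an acoustic likelihood by a language-model prior and taking the argmax; all probabilities here are invented for illustration.

```python
# Toy instantiation of equation (2): each candidate transcription is scored
# by its acoustic likelihood P(O|w) times its language-model prior P(w),
# and the argmax is taken. All numbers are invented.

candidates = {
    ("recognize", "speech"):         {"acoustic": 1e-9, "lm": 4e-4},
    ("wreck", "a", "nice", "beach"): {"acoustic": 3e-9, "lm": 2e-6},
}

best = max(candidates,
           key=lambda w: candidates[w]["acoustic"] * candidates[w]["lm"])
print(best)  # -> ('recognize', 'speech'): the LM prior outweighs the
             #    slightly better acoustic match of the alternative
```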
Together, the language model and the acoustic model can be operationalized in a graph to compute the maximal-probability string corresponding to the observations. Using the HMM representation, equation (2) can be formulated as:

    $W_{1 \ldots M} = \underset{W_{1 \ldots M}}{\operatorname{argmax}} \; \sum_{s_{1 \ldots T}} P(O_{1 \ldots T}, s_{1 \ldots T} \mid W_{1 \ldots M}) \, P(W_{1 \ldots M})$    (3)

This can be approximated using the Viterbi approximation:

    $W_{1 \ldots M} \approx \underset{W_{1 \ldots M}}{\operatorname{argmax}} \; P(W_{1 \ldots M}) \, \max_{s_{1 \ldots T}} P(O_{1 \ldots T}, s_{1 \ldots T} \mid W_{1 \ldots M})$    (4)

It is important to note that while equation (2) is written in terms of one hidden Markov process, the speech recognition problem, as previously stated, involves composing individual units of speech into a hierarchical model of an utterance; equation (2) thus really represents a set of Markov processes. During recognition, the recognition process – or decoder – finds the transcription $W_{1 \ldots M}$ by working on a representation of this hierarchical model.

Figure 4 depicts the components of a typical speech recognition system, with a deconstruction of the main processes and the various knowledge sources they use.

[Figure 4: A prototypical speech recognition system broken down into the knowledge sources it uses (top: the acoustic models P(O|phone), the phonetic lexicon, e.g. cat: k ae t, and the language model, e.g. P("cat"|"the"), along with the filterbank weights and feature templates) and its processes: the sub-cognitive acoustic processor, which produces perceptual features (MFCC/VQ codes), and the cognitive speech decoder.]

Specifically, the left side of the figure shows the acoustic processor, which is responsible for extracting perceptual features represented in a format suitable for processing by the speech decoder. Speech is a non-stationary process, i.e. a stochastic process whose parameters change over time (Juang & Rabiner, 1991). However, it is assumed to be stationary over a short – typically 10 ms – interval, and this segmentation of the input signal is the first function performed in the acoustic processor. The segmented signal is transformed to the frequency domain – a function performed in the human cochlea (Denes & Pinson, 1993) – and is filtered there. The weights of the filterbank are set according to the Mel scale (Jurafsky & Martin, 2008), (Young, et al., 1997). The acoustic processor then extracts features from the filterbank output via a discrete cosine transformation. The output of the acoustic processor is the acoustic signal represented as a vector of real numbers corresponding to the extracted features, which facilitate discrimination between phones (Jurafsky & Martin, 2008). The filterbank weights are knowledge used by this process. A Vector Quantizer (VQ) (Gray, 1984) can be used to discretize this signal, and VQ codes are the acoustic signal representation used in this work. An advantage of using VQ codes for representing the speech signal is that acoustic scoring is very fast (Jurafsky & Martin, 2008). Although (Jurafsky & Martin, 2008) and (Young, et al., 1997) note that quantization leads to a performance degradation, (Jelinek, 1997) describes techniques that can mitigate this degradation, which involve building higher order feature labels from the VQ codes. One such technique is demonstrated in (Bahl, De Souza, Gopalakrishnan, Nahamoo, & Picheny, 1994), where the system obtained performance comparable to contemporary systems that used real-valued representations of continuous signals.
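The acoustic front end just described can be sketched in a few lines of numpy. This is a simplified illustration, not the thesis's actual front end: the parameter values below (25 ms frames with a 10 ms step, 26 mel filters, 13 cepstral coefficients) are common defaults rather than the exact settings used in this work, and a production pipeline would add refinements such as pre-emphasis and liftering.

```python
import numpy as np

def mel(f):
    """Hz -> mel, the perceptual frequency scale used for the filterbank."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, frame_len, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((frame_len + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, frame_len // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_frames(signal, sr=16000, frame_ms=25, step_ms=10,
                n_filters=26, n_ceps=13):
    n, step = int(sr * frame_ms / 1000), int(sr * step_ms / 1000)
    fb, window = mel_filterbank(n_filters, n, sr), np.hamming(n)
    k = np.arange(n_filters)
    # DCT-II basis: decorrelates the log filterbank energies
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1)
                 / (2 * n_filters))
    ceps = []
    for start in range(0, len(signal) - n + 1, step):
        frame = signal[start:start + n] * window       # stationarity per frame
        power = np.abs(np.fft.rfft(frame)) ** 2        # frequency domain
        ceps.append(dct @ np.log(fb @ power + 1e-10))  # mel filter + log + DCT
    return np.array(ceps)
# A VQ front end would additionally map each cepstral vector to the index of
# its nearest codebook centroid, yielding discrete labels like those used here.
```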
This thesis does not explore these VQ-refinement techniques, as a more natural path forward for this research would involve working on the raw acoustic signal and performing the frequency-domain transformation, along with feature extraction, in Sigma's graphical layer.

The speech decoder works on the acoustic signal obtained from the acoustic front end and combines it, via the acoustic function, with various knowledge sources that are composed into a decoding graph to obtain the transcription (Kanthak, Ney, Riley, & Mohri, 2002), (Aubert, 2002). The acoustic function gives the likelihood of the acoustic signal being generated by a particular phone and is treated as a match score. The decoding graph is composed of HMM states, each state associated with a sound or phone template corresponding to a word in the vocabulary (Young, et al., 1997). Each state encodes the understanding that a sequence of words has been heard, together with the expectation over what words will follow the present word. The major tasks performed by the decoder include (Aubert, 2002):

• generating hypothetical word sequences, using the language model;
• scoring active hypotheses using the acoustic model;
• pruning the search space to make the decoding task more manageable; and,
• creating "back-pointers" required by the Viterbi algorithm (Rabiner, 1989) to retrieve the most likely state sequence, and thus the corresponding sequence of words.

As noted previously, the decoder works on the decoding graph that results from composing the knowledge sources shown in Figure 4. This graph is obtained by combining a probabilistic language model, which accounts for what words can follow a particular word – with appropriate smoothing (Kneser & Ney, 1995), (Gale & Sampson, 1995) for the purposes of generalization – with a phonetic lexicon that specifies how each word sounds and an HMM representation for each sound. (Aubert, 2002) suggests a classification of decoding approaches along the following axes: (i) static or dynamic expansion of the decoding graph, (ii) whether the search strategy is synchronous with respect to speech segmentation and (iii) the structure of the search space.

Expanding the whole search space prior to decoding has the advantage that optimizations can be performed offline, before performance time (Mohri M., 1997). This involves composing all the knowledge sources – the lexicon, the language model and the acoustic models – into a giant graph, followed by several steps of optimization that exploit structural sparseness and redundancies in the graph (Mohri M., 1997), (Mohri & Riley, 2008). Typically, these decoders make multiple passes over the decoding graph, using more accurate models with each successive pass (Jurafsky & Martin, 2008). This approach has been shown to achieve lower word error rates and better performance, in terms of decoding latency, than approaches that construct the utterance representation dynamically (Kanthak, Ney, Riley, & Mohri, 2002). Constructing a portion of the decoding graph dynamically has the advantage of being able to combine various knowledge sources on the fly. These approaches (Bahl, Jelinek, & Mercer, 1983) have resulted in one-pass decoders – albeit with the use of Viterbi backtracking – and require less working memory, as they store only the relevant portions of the search space. This work instead represents the search space using graphical models and processes the utterance incrementally, committing to words or phones as they are heard.
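As a concrete reference point for the decoder operations listed above – scoring active hypotheses and creating back-pointers to retrieve the best state sequence – the following is a minimal sketch of the classic batch Viterbi decoder for a single discrete-observation HMM. Unlike the incremental scheme used in this work, it backtracks only after the whole observation sequence has been seen.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden state sequence for a discrete-observation HMM.
    log_pi: (S,) initial log probabilities; log_A: (S, S) transitions;
    log_B: (S, V) emissions; obs: sequence of observation indices
    (e.g., VQ codes). Returns the best path and its log probability."""
    S, T = log_A.shape[0], len(obs)
    delta = log_pi + log_B[:, obs[0]]
    back = np.zeros((T, S), dtype=int)       # the decoder's back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: best prefix ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta.argmax())]             # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(delta.max())
```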
Explicitly representing the decoding graph – a hierarchical composition of various knowledge sources – can be accomplished with graphical models (Jordan, 2004). All the knowledge sources required for speech recognition can themselves be represented as HMMs, and their composition can be expressed as a hierarchical HMM (HHMM) (Murphy & Paskin, 2002). Exploiting the sequential nature of speech processing, dynamic graphical models can structure common processing across a set of time steps into a common template. This gives rise to Dynamic Bayesian networks (DBNs) (Murphy, 2002). DBNs have been shown to successfully model the system of probabilistic distributions represented by equation (2) (Bilmes & Bartels, 2005). The Graphical Models Toolkit (GMTK) (Bilmes & Zweig, 2002) provides a general library of algorithms for representing and learning the structures required for ASR based on DBNs.

DBNs are more general than HMMs and can represent hierarchical structures efficiently (Murphy, 2002). In particular, DBNs are as powerful as a context free language with a bounded stack and can be used to model relationships between linguistic classes, spectral features and lexical pronunciations. Thus, for example, DBNs can be used with a probabilistic context free grammar to further constrain the search space of the decoder. This approach was explored by (Chelba, 2000) and (Schuler, Wu, & Schwartz, 2009) and shown to reduce the perplexity – the branching factor, i.e. the number of words that can follow a particular word in a language – and improve ASR performance. To construct a continuous word speech recognizer, GMTK extends the notion of DBNs in the following ways to facilitate their application to speech and language processing tasks (Bilmes & Zweig, 2002):

• forward and backward directed time links to specify the flow of constraint from the past and the future;
• network slices specifying multiple frames;
• the ability to switch parents for any node in the DBN;
• relaxation of the first order Markov assumption to model dependencies across speech frames; and,
• special multi-frame structures to perform special operations at the beginning and end of an utterance.

The first two bullets deal with using constraint from the future – in conjunction with state information carried from the past – i.e. relaxing the strong independence assumption implicit in standard HMMs and DBNs. Sigma's use of factor graphs allows bidirectional processing and should support obtaining such constraint, for example by maintaining ambiguity over a buffer of limited length. Sigma's cognitive language of conditionals and condacts has been used to implement the equivalent of the switching-parent functionality, as shall be seen in section 4.2. In fact, factor graphs are more general than DBNs and allow for bidirectional constraint and information propagation both top-down and left-right, in terms of the state space hierarchy and time respectively (Kschischang, Frey, & Loeliger, 2001).

ASR performance is typically measured in terms of the Word Error Rate (WER) metric, which counts the mistakes – deleted, inserted and incorrect words – the recognizer committed in the final transcription. The WER metric is appropriate for a standalone ASR that aims to be deployed as a separate module. This thesis evaluates the ASR capability in the context of a virtual agent, by deploying it together with discourse management and language understanding capabilities to realize a conversational agent.
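A minimal sketch of the WER computation, via the standard dynamic-programming edit distance over words (counting substitutions, insertions and deletions against a reference transcription):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) divided by
    the reference length, via the standard edit-distance dynamic program."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("nine one one", "nine on one"))  # -> 0.333... (one substitution)
```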
2.2.3 Natural Language Understanding

The Natural Language Understanding (NLU) unit is responsible for mapping linguistic input into a meaning and thus contributes to the 'what was said' question in the context of dialogue systems. Meaning representation languages capture the syntax and semantics of these representations. Example approaches include logic-based methods (Woods, 1973), (Woods, 1979), (Steedman & Baldridge, 2011) and recursive transition networks (Issar & Ward, 1993). A probabilistic parser – such as (Pieraccini, Levin, & Lee, 1991) – can be used to parse linguistic input into a suitable semantic representation. However, traditional approaches typically work on whole sentences and may not be suitable for dialogue tasks, given the incremental and unsegmented nature of continuous speech. Approaches that work on smaller units, such as words or partial utterances, are desired for incremental understanding of spoken language.

Lexical semantics – the mapping of a word to a possible sense or meaning – is used in language understanding. Distributed representations, which model each word as a point in Euclidean space with similarity of meaning between two words modeled as the distance between these points, have their roots in psychology (Osgood, Suci, & Tannenbaum, 1957). These approaches model the meaning of a word by considering the frequency of words occurring in its context. The CSLM toolkit (Schwenk, 2005) casts these continuous space vectors into n-gram language models and has shown performance improvements of up to 1% when compared with Kneser-Ney smoothing (Kneser & Ney, 1995). Sigma's realization of these distributed representations for words has been shown to outperform the state of the art on certain tasks (Ustun, Rosenbloom, Sagae, & Demski, 2014). Word sense disambiguation (WSD) tasks extract features – such as bag of words, part of speech type, etc. – for each word by examining its use in various contexts. A classifier, such as a naïve Bayes or decision list classifier (Yarowsky, 2000), can be trained on these features to classify a word into its potential meanings, given the contextual features. This approach can be extended to extract features relevant to a frame – where each frame represents the meaning of a partial utterance – as was employed in (DeVault, Sagae, & Traum, 2011). Here, the set of possible utterances was mapped into a set of possible meanings, represented as integral frames. A classifier was trained on both partial and complete utterances so that it could be used in an incremental system. The incrementality of the NLU is used to predict points of maximal understanding in the speaker's utterance, allowing the system to initiate overlap behavior. An important issue is how to measure the performance of such an NLU, as classifier accuracy is not always the best indicator of NLU quality: (Morbini, et al., 2013) and (DeVault, Sagae, & Traum, 2011) suggest using f1, precision and recall metrics instead.

2.2.4 Syntax Processing

Parsing of probabilistic context free grammars (PCFGs) (Jurafsky & Martin, 2008) defines an important class of problems in computational linguistics. Formally, a context free grammar (CFG) consists of a set of rules, called productions, expressed over a set of non-terminal symbols – including a special start symbol – and terminal symbols. A PCFG extends the notion of a CFG by assigning probabilities to each rule; a toy example is sketched below.
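For concreteness, the following sketch writes down a toy PCFG in Chomsky Normal Form as plain Python dictionaries; the grammar itself is invented for illustration.

```python
# A toy PCFG in Chomsky Normal Form, following the definition above: each
# rule A -> B C or A -> word carries a probability, and the probabilities
# of all rules sharing a left-hand side sum to 1. The grammar is invented.

binary_rules = {            # A -> B C : P
    ("S", "NP", "VP"): 1.0,
    ("NP", "Det", "N"): 1.0,
    ("VP", "V", "NP"): 1.0,
}
lexical_rules = {           # A -> word : P
    ("Det", "the"): 1.0,
    ("N", "dog"): 0.5, ("N", "cat"): 0.5,
    ("V", "chased"): 1.0,
}
```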
A context free language is one that consists of all possible sentences derivable from a particular context free grammar. Intuitively, the non-terminals model consecutive words as a group – also referred to as a constituent – and evidence exists that constituency plays a role in human language processing (Jurafsky & Martin, 2008). Due to its importance in language processing, and its potential architectural importance in this context, parsing is used here to demonstrate the tractable use of SPNs in Sigma.

The problem of parsing consists of assigning structure to a sentence from the language described by the PCFG. This is typically accomplished by the well-known CKY algorithm from computational linguistics. The CKY algorithm considers an entire sentence as input and generates the set of possible parse trees in a structure called a parse chart. The parse chart encodes all valid parse trees that can give rise to the input sentence. Naively representing the parse chart as a graphical model and using the belief propagation algorithm yields inference that is exponential in the size of the largest clique. The size of the largest clique – where a clique is a set of vertices that are all adjacent to one another – provides a measure of how connected the graph is and is used to characterize the efficiency of inference. Additionally, inference is not exact here, due to the presence of loops in the graph (Figure 5(c)). A similar approach was used in (Pynadath & Wellman, 1998), where PCFGs were mapped onto graphical models by representing the parse chart as a Bayesian network. To avoid the exponential and approximate nature of inference, more recently (Naradowsky, Vieira, & Smith, 2012) introduced a special factor, CKYTree, that encapsulates the dynamic programming necessary to perform cubic time parsing. This special purpose factor requires a modification to the belief propagation message passing schedule. Our work here also extends the message passing algorithm, but in a way that requires no special purpose factors.

2.2.4.1 Sum Product Networks

Sum-product networks (SPNs) are a newer type of graph-based (actually tree-based in this case) computational model that represents complex functions via sums and products (Poon & Domingos, 2011). Inference in SPNs is guaranteed to be exact and linear in the size of the network. SPNs can encode any graphical model or factor graph in which inference is tractable. However, the converse is not true; i.e. a function that can be represented as an SPN does not necessarily lend itself to tractable, non-exponential inference when expressed as a factor graph – the kind of graphical model used in Sigma, as shall be seen in Section 3. In fact, several classes of problems exist that can be solved tractably and exactly as SPNs, but that lose their exactness and tractability when represented as factor graphs (Demski, 2015; Gens & Domingos, 2013).

[Figure 5: (a) A PCFG specified in Chomsky Normal Form. (b) The corresponding SPN, specifying precisely the sums and products necessary for the inside-outside algorithm, with all invalid trees unrepresented. (c) The corresponding factor graph version, with each variable node representing a distribution over a set of non-terminals and factor nodes encoding the grammar rules along with their associated probabilities.]
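To ground the discussion, the following is a minimal sketch of probabilistic CKY over the kind of toy grammar shown earlier; it computes the probability of the best parse in cubic time – the dynamic program that the CKYTree factor mentioned above encapsulates. The toy grammar is repeated so the sketch is self-contained.

```python
from collections import defaultdict

# Toy CNF grammar, as in the earlier sketch (probabilities are invented).
binary_rules = {("S", "NP", "VP"): 1.0, ("NP", "Det", "N"): 1.0,
                ("VP", "V", "NP"): 1.0}
lexical_rules = {("Det", "the"): 1.0, ("N", "dog"): 0.5, ("N", "cat"): 0.5,
                 ("V", "chased"): 1.0}

def cky(words):
    """chart[(i, j)][A] holds the probability of the best subtree in which
    non-terminal A spans words[i:j]; filling the chart is cubic in the
    sentence length."""
    n, chart = len(words), defaultdict(dict)
    for i, w in enumerate(words):                         # lexical entries
        for (A, word), p in lexical_rules.items():
            if word == w:
                chart[(i, i + 1)][A] = max(chart[(i, i + 1)].get(A, 0.0), p)
    for span in range(2, n + 1):                          # longer constituents
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                     # split point
                for (A, B, C), p in binary_rules.items():
                    cand = (p * chart[(i, k)].get(B, 0.0)
                              * chart[(k, j)].get(C, 0.0))
                    if cand > chart[(i, j)].get(A, 0.0):
                        chart[(i, j)][A] = cand
    return chart[(0, n)].get("S", 0.0)                    # best full parse

print(cky("the dog chased the cat".split()))  # -> 0.25 (= 0.5 * 0.5)
```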
PCFGs provide one key example: the SPN representation over a sentence of bounded length encodes the well-known inside-outside algorithm for PCFG inference (inside) and learning (outside) (Poon & Domingos, 2011). An SPN encoding the inside-outside algorithm for a toy language is shown in Figure 5(b). The SPN is cubic in the size of the sentence. Due to the absence of loops, inference is tractable and exact. To understand why inference is efficient, note that the SPN encodes only the sums and products that are necessary in the context of the grammar. Friesen and Domingos (2016) have further generalized earlier results on SPNs to show that inference can remain tractable for a larger class of problems – characterized by summing over semirings – that "includes satisfiability, constraint satisfaction, optimization, integration, and others."

2.2.5 Integration of speech, language and cognition

The USC Institute for Creative Technologies (ICT) has several real-world tasks involving conversational agents operating under various constraints and situations; a good survey can be found in (Morbini, et al., 2013), and an overview of prototypes is available at http://ict.usc.edu/prototypes/all/. The majority of these systems use the architecture specified in Figure 3. The ASR component is typically deployed with a task-adapted language model, thereby improving ASR performance (DeVault & Traum, 2012). The output of the ASR – typically the 1-best text string – is handled by the NLU. NLU schemes such as the NPC Editor (Leuski & Traum, 2011) use the concept of 'relevance models' (Lavrenko, Choquette, & Croft, 2002) to convert the 1-best output of the ASR into an appropriate response in a question-answer system (Leuski & Traum, 2011). The NLU scheme of (DeVault, Sagae, & Traum, 2011) is applied in more complex tasks, such as the SASO-EN task (Traum, Marsella, Gratch, Lee, & Hartholt, 2008), which involves complex multi-party negotiation to achieve a desired outcome in a particular scenario. Both of the above schemes work on the 1-best ASR output, represented as a string of words produced after the entire utterance has been processed by the ASR. The NLU approach has been extended in (DeVault, Sagae, & Traum, 2011) to avoid having to wait for the entire utterance before the NLU can start processing, by allowing the NLU to work on partial utterance information. This thesis follows that method to achieve incrementality at the NLU level, with slight modifications: an incremental naïve-Bayes classifier is used to maintain a distribution over utterances. The proposed architecture tightly couples the NLU with the ASR, so a more continuous state over the space of utterances, as a function of the input utterance, becomes possible. The latter feature can be thought of as passing rich lattices – composed of words along with phonetic and potentially spectral features – to the NLU, instead of just the 1-best output, incrementally at the rate of each cognitive cycle (~50 ms). The relevance-model-based NLU of (Leuski & Traum, 2011) was extended in (Wang, Artstein, Leuski, & Traum, 2011) to handle phonetic information (derived from a phonetic lexicon, such as the CMU lexicon, http://www.speech.cs.cmu.edu/cgi-bin/cmudict) in addition to lexical tokens, leading to an improvement of 5-7% in NLU performance.
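The incremental naïve-Bayes classification mentioned above can be sketched as follows; the frames, training utterances and smoothing constant are invented for illustration, and a real system would use richer features than raw words.

```python
import math
from collections import defaultdict

class IncrementalFrameClassifier:
    """Minimal naive-Bayes sketch of mapping (partial) utterances to meaning
    frames, in the spirit of (DeVault, Sagae, & Traum, 2011): training on
    word features lets the classifier maintain a distribution over frames
    as each new word arrives. All names here are illustrative."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # Laplace smoothing constant
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.frame_counts = defaultdict(int)
        self.vocab = set()

    def train(self, utterance, frame):
        self.frame_counts[frame] += 1
        for w in utterance.split():
            self.word_counts[frame][w] += 1
            self.vocab.add(w)

    def posterior(self, words):
        """Naive-Bayes posterior over frames given the words heard so far;
        can be re-evaluated after every new word."""
        total = sum(self.frame_counts.values())
        scores = {}
        for f, c in self.frame_counts.items():
            n = sum(self.word_counts[f].values())
            s = math.log(c / total)
            for w in words:
                s += math.log((self.word_counts[f][w] + self.alpha)
                              / (n + self.alpha * len(self.vocab)))
            scores[f] = s
        m = max(scores.values())              # normalize out of log domain
        exp = {f: math.exp(s - m) for f, s in scores.items()}
        z = sum(exp.values())
        return {f: v / z for f, v in exp.items()}

nlu = IncrementalFrameClassifier()
nlu.train("we need medical supplies", "request-supplies")
nlu.train("move the clinic downtown", "move-clinic")
print(nlu.posterior(["we", "need"]))  # shifts toward request-supplies
```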
Previously, (Alshawi, 2003) and (Huang & Cox, 2006) explored using phonetic information directly – without lexical information or lexical features – as input to an NLU in the call routing application. These approaches show promise for improving language understanding given more information at the phonetic level. In the above examples, such as SASO-EN (DeVault & Traum, 2012) or INOTS (Campbell, et al., 2011), the language model used by the ASR was adapted to the task at hand. The larger question these examples raise is the impact of the ASR and the NLU on each other. An increase in the performance of the ASR – measured in terms of word error rate – results in an increase in the performance of the NLU – measured in terms of f1, precision and recall (Morbini, et al., 2013). The NLU approach of (DeVault, Sagae, & Traum, 2011) shows that the NLU can be made robust to ASR performance. However, an important question remains: can ASR performance be improved by information from the NLU, where the ASR uses general, non-task-specific acoustic and language models?

The referential semantic language model (RSLM) approach of (Schuler, Wu, & Schwartz, 2009) integrates phonological, syntactic and referential semantic information in a single language model. The focus there is on developing a referential semantic language model based on a hierarchical hidden Markov model (HHMM) (Murphy & Paskin, 2002), incorporating semantic information corresponding to real-world referents into the hidden states and transitions of the HHMM. This is accomplished by casting referential semantics in terms of shift and reduce operations over referent entities corresponding to a world model; the shift and reduce operations correspond to a sequence of hierarchical state transitions of the HHMM. The resulting model is as expressive as a probabilistic context free grammar (PCFG) with a limited stack. The RSLM is tested using isolated utterances corresponding to the task at hand, with acoustic models trained on the TIMIT corpus and the parser's language model developed dynamically from the task grammar. Phonological state information is obtained via an acoustic model trained using a Recurrent Neural Network (Robinson, 1994).

3 Introduction to the Sigma Cognitive Architecture

This section serves to briefly introduce Sigma and the proposed language model, demonstrating how each capability is realized in a supraarchitectural fashion, followed by the partial integration of these capabilities in a discourse task. The challenge of incorporating grammar parsing in Sigma is then examined. Sum Product Networks (SPNs) are described and their suitability for handling grammar parsing is discussed.

3.1 Introduction to Sigma

Sigma is a cognitive system based on the Sigma cognitive architecture. As discussed previously, the desire with Sigma is to develop a fully working system whose development is guided heuristically by what is known about human cognition. Eventually, Sigma aims to explain human intelligence at an abstract level, in terms of cognitive capabilities. Sigma thus sacrifices short-term gains from matching human behavior or obtaining the best performance on a particular narrow task, and instead focuses on the integration of diverse functionality, with the hope that the integrated capabilities, together with their interactions – a cognitive synergy achieved via their promotion or inhibition – shall help explain human intelligence. Sigma is influenced by Soar and follows its development model.
However, Sigma diverges from the Soar model, both in the kinds of assumptions it makes about cognition – particularly sub-symbolic cognition – and in its goals. This section first provides a very brief overview of the goals guiding Sigma's development and further motivates this thesis in the context of Sigma's desiderata and its ability to live up to them. This is followed by a discussion of the actual architecture, the capabilities it provides and the main mechanisms it entails.

3.1.1 Sigma desiderata and hypothesis

The Sigma system is developed according to a quartet of desiderata that mix long-standing goals from cognitive architectures with additional goals that are not traditional in cognitive architectures. These desiderata go beyond the notion of 'understanding and replicating all of human-level intelligence' to both evaluate and in turn guide research on Sigma. The desiderata guiding Sigma's development are as follows:

i. Grand unification: captures the desire to go beyond what is possible from a purely cognitive perspective and identify those pieces that are missing from purely cognitive theories, covering for example all the time scales in human cognition (Newell, 1990). Another way to think about this is that the aim is to unify the full arc from perception and attention to cognition and emotion, and finally down to intention and action. This places issues of embodiment – speech being an embodied capability of human intelligence – in the foreground, alongside traditional cognitive concerns. Grand unification motivates Sigma's choice of mixed and hybrid processing and representations: being able to represent and process functions that are continuous or discrete, and symbolic or probabilistic.

ii. Generic cognition: concerns the desire to provide solutions to problems from both the natural (cognitive) and artificial (AI) sciences, and in turn to mix ideas from both. On the natural side, this desideratum raises the questions of characterizing the different properties of each band from the unified theories of cognition (Newell, 1990), the time scales within each band, the time scales different human functionalities fall within, and whether a cognitive system can provide comparable layers and a comparable distribution of functionalities across layers. Sigma's basis in factor graphs – which model several algorithms from AI – combined with its focus on the time scales of cognition is an example of blending ideas from the natural and artificial sciences.

iii. Functional elegance: concerns the ability to support the various capabilities entailed by grand unification using the same, theoretically elegant core. Functional elegance brings a principle of parsimony into the design of the architecture, encouraging uniformity in how disparate capabilities are implemented and thus also potentially enabling tight coupling among them. Sigma's position on functional elegance is somewhat unique in the space of cognitive architectures, with AIXI (Hutter, 2001) advocating a stronger form of elegance and most others not aspiring to any notion of functional elegance. Arguments can be made both for and against staking out this position as a fundamental desideratum of the architecture (Rosenbloom, 2013). This desideratum manifests itself on several levels in Sigma, particularly at the graphical layer.
Graphical models (Jordan & Sejnowski, 2001) generally provide an efficient way of computing with complex multivariate functions by decomposing them into products of smaller, more local functions. Graphical models provide a functionally elegant base on which to build a cognitive layer (Rosenbloom, 2009), and the choice of factor graphs as the model makes it particularly expedient in achieving this aim. Factor graphs are more general than Bayesian networks, Markov random fields and several other graphical models (Kschischang, Frey, & Loeliger, 2001). This work serves as an important milestone in Sigma's thesis of being able to achieve grand unification via graphical models by: (i) integrating and demonstrating a novel spoken language processing capability that informs a dialogue task capability, where both capabilities and their coupling occur uniformly with cognition, including a mapping of this capability onto the cognitive mechanisms present in the architecture; and (ii) showing that the broad class of problems characterized by Sum Product Networks can be specified in Sigma and solved in an efficient and exact manner, retaining the desirable inference properties of SPNs. Sigma's ability to support these capabilities from a uniform base is an important demonstration of functional elegance.

iv. Sufficient efficiency: concerns the ability to execute fast enough for anticipated uses, by implementing the cognitive system in an efficient manner. Sufficient efficiency focuses development on how well the overall system matches anticipated needs, rather than on locally optimizing to achieve very fast execution in narrower contexts. Most work on sufficient efficiency in Sigma has been directed at the graphical layer. Sigma's ability to specify any arbitrary SPN bears on sufficient efficiency.

Sigma's view of fulfilling the four desiderata above manifests itself in the form of the graphical architecture hypothesis: the key to achieving these desiderata is to blend lessons from cognitive architectures and graphical models. This work aspires both to demonstrate the efficacy of the graphical architecture hypothesis – applied towards deconstructing and integrating a capability never before integrated in cognitive architectures – and to contribute towards the desiderata. In the remaining sub-sections, Sigma is reviewed very briefly, followed by where this work fits in the overall context of Sigma's development.

3.1.2 Sigma's view of cognitive architecture

Sigma is best understood via an analogy to computer systems. Figure 6 shows this analogy in pictorial form.

[Figure 6: The Sigma cognitive system viewed in analogy to a computer system. Figure from (Rosenbloom, Demski, & Ustun, 2016).]

The cognitive and graphical architecture layers together form the operationalization of the graphical architecture hypothesis, i.e. a blending of cognitive architectures and graphical models. The cognitive architecture layer corresponds to the lessons learned from Soar, ACT-R and other cognitive architectures. The graphical model layer builds on methods and techniques from graphical models (Jordan, 2004), (Jordan & Sejnowski, 2001), (Koller & Friedman, 2009). This work pushes the envelope on the graphical architecture hypothesis by integrating a speech processing capability at the supraarchitectural level – i.e. at the level of knowledge and skills. At this level, structuring knowledge and skills in a way that makes them useful in deploying other, complex capabilities is an important concern.
This concern manifests itself in finding cognitive idioms: structural constructs that correspond to software libraries, services and design patterns in the traditional sense of software systems. For example, the specialization of long-term memory into procedural and declarative idioms occurs at the level of knowledge rather than at the architectural level. In part, this work seeks to discover the cognitive idioms involved in speech recognition and understanding. Another way this work pushes the envelope on the graphical architecture hypothesis is by considering probabilistic grammar parsing: traditionally, graphical models require exponential time for PCFG inference, and that inference is still approximate.

3.1.3 Cognitive layer

Figure 8 shows the Sigma cognitive architecture in terms of the more traditional view of cognitive architectures seen previously in section 2.1.1.

[Figure 8: The Sigma cognitive architecture viewed as a traditional cognitive architecture. Figure from (Rosenbloom, Demski, & Ustun, 2016).]

All the memories shown in the figure contain functions, with a distinction made between the perceptual buffer (PB), the WM and the LTM. The contents of the WM and the PB correspond, respectively, to what is known about the current situation and what is currently perceived, whereas the contents of the LTM correspond to what is generally known and embody patterns over multiple functions, requiring an elaborate retrieval process. In the context of this work, for example, fragments of probabilistic networks, such as HMMs, are one way to structure an LTM fragment. However, Sigma's LTM is not limited to probabilistic network fragments (nor to HMMs as the choice of such fragments): it can also take the form of rules, similar to those seen for HPSA and Soar. The PB can be further specialized into the external PB – that which is perceived from the external environment – and the internal PB, corresponding to any input introduced as a result of proprioception.

The core of the cognitive layer is the cognitive cycle, corresponding to the decision and recognize-act cycles of the Soar and HPSA architectures. The cognitive cycle, shown in Figure 7, is deconstructed in terms of the various operations it performs, arranged according to two major phases (elaboration and adaptation) and two minor phases (input and output).

[Figure 7: The Sigma cognitive cycle broken into two minor phases (input, output) and two major phases (elaboration, adaptation). Figure from (Rosenbloom, Demski, & Ustun, 2016).]

The cognitive cycle is intended to map onto the 50 ms cognitive cycle in humans (Laird, Lebiere, & Rosenbloom, 2017) and onto the corresponding cycles found in other architectures. The input and output phases perform only the very basic transduction necessary to interface with the outside world, because the principles of grand unification and functional elegance dictate that most perceptual and motor processing happen uniformly, within the two major phases of the cognitive cycle. The two major phases correspond roughly to elaboration – understanding what is known about the current situation – and adaptation – deciding how to act and adapt to the current situation.
The elaboration phase combines what is generally known by the agent (via retrieval from the LTM) with what is specifically captured by perception in this cycle (obtained from the PB), incorporating both with what was known about the current situation from the previous cognitive cycle (as resident in the WM). The adaptation phase depends on information obtained from the elaboration phase. The elaboration phase in Sigma corresponds roughly to the elaboration phase in Soar, but the kinds of knowledge representation it can process have been expanded to include representations that are mixed and can thus maintain probabilistic state information – a key limitation in both Soar and HPSA that prevented an effective, functionally elegant integration of a speech processing capability. The phase is designed to be mostly monotonic and consists of updating distributions in parallel and repeatedly until no further state updates are possible. The adaptation phase consists of making choices and changes to both the WM and the LTM. Modifications to the WM involve deciding what operator, or value from a predicate, to choose; any meta-level reflective processing that occurs; and changes to affective state. Changes to the LTM consist of learning, which can take the form of learning the distributions used for retrieving functions from the LTM as well as, potentially, structure learning, although structure learning is not the focus of this thesis. If a selection cannot be made, based on operator preferences, an impasse occurs, resulting in meta-level reflective processing.

Sigma's cognitive cycle can generate behavior within a single cycle, across a sequence of cycles – via selection and application of operators in each successive cycle – or via reflective processing. This is analogous to the processing model discussed for Soar in section 2.1.2.2, with a single Sigma cognitive cycle giving rise to a reactive capability, a sequence of cognitive cycles forming the deliberative capability, and impasses yielding reflective processing. Each cognitive cycle enables selection of operators and their application to yield updated state in the WM, with a sequence of cycles being potentially unlimited computationally. For example, given sufficient knowledge, a sequence of cognitive cycles will be able to learn a map in SLAM. However, if knowledge is insufficient, an impasse occurs and triggers reflective processing, which breaks out of the current problem space to allow access to additional information in service of resolving the impasse. A task can thus be solved in a reactive, deliberative or reflective manner, with reactive processing being much faster, deliberative processing being non-monotonic and potentially unbounded computationally, and reflective processing being more flexible by bringing more knowledge to bear. This work takes the initial step of mapping speech recognition and understanding, in service of a discourse capability, onto this tri-level processing model, finding an initial mapping of the speech understanding task in terms of reactive (habitual), deliberative (search or sequential) and potentially reflective (meta-level) processing.

3.1.4 Graphical layer

The graphical layer embodies the second half of the graphical architecture hypothesis. Operations and representations at the cognitive layer are cast as operations on graphical models – particularly, factor graphs.
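Before the detailed discussion of factor graphs below, a minimal numeric illustration of the core idea: a joint distribution is decomposed into a product of local factors, and a marginal is computed by passing summarized messages along links rather than enumerating the full joint. The factor tables here are invented for illustration.

```python
import numpy as np

# Chain-structured factor graph: the joint p(x1, x2, x3) is decomposed into
# local factors f1(x1, x2) * f2(x2, x3); the marginal over x3 is computed by
# summarizing (summing out) variables one link at a time.

f1 = np.array([[0.3, 0.7],
               [0.9, 0.1]])   # f1[x1, x2]
f2 = np.array([[0.5, 0.5],
               [0.2, 0.8]])   # f2[x2, x3]

m_f1_x2 = f1.sum(axis=0)                        # message from f1 to x2: sum out x1
m_f2_x3 = (m_f1_x2[:, None] * f2).sum(axis=0)   # multiply in, sum out x2
marginal_x3 = m_f2_x3 / m_f2_x3.sum()

# brute-force check over the full joint gives the same answer
joint = f1[:, :, None] * f2[None, :, :]
assert np.allclose(marginal_x3, joint.sum(axis=(0, 1)) / joint.sum())
print(marginal_x3)
```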
Factor graphs evolved out of coding theory, where they have, for example, proven to yield effective performance in turbo codes (Kschischang, Frey, & Loeliger, 2001). Factor graphs can represent complex multivariate distributions and subsume both Bayesian and Markov networks (Koller & Friedman, 2009). Factor graphs are themselves undirected, potentially loopy, bipartite graphs composed of variable nodes that represent variables and factor nodes that represent functions over those variables. Factor graphs can represent both probabilistic and arbitrary functions and are the most expressive form of graphical model known. Figure 9 shows the mapping of the cognitive layer onto the graphical layer.

[Figure 9: The cognitive cycle deconstructed at the graphical layer, in terms of operations on graphical models. Figure from (Rosenbloom, Demski, & Ustun, 2016).]

Processing at the graphical layer is divided into two phases. The graph solution phase uses an approach based on the sum-product algorithm (Rosenbloom, Demski, & Ustun, 2016), i.e. the belief propagation algorithm (Koller & Friedman, 2009). Sigma's implementation of summary product is serial, based on the asynchronous variant of belief propagation. The belief propagation algorithm works by computing marginals – via integration, in Sigma – from distributions and sending these as messages along links corresponding to particular variables. At variable nodes, incoming beliefs are multiplied – via computing an inner product without the final sum – to yield an updated distribution corresponding to the particular variable. Message passing continues until quiescence criteria are met, i.e. either no messages remain to be sent out, or the message sent out on a link does not differ significantly from the previous message sent on that link. The graph modification phase then updates the distributions corresponding to the WM, the LTM and the internal PB.

3.1.5 Problem Spaces and Control Structure in Sigma

As stated in Section 2.1.2, human behavior is goal oriented, and the Sigma cognitive architecture structures cognitive behavior via the Problem Space Computational Model (PSCM), in which goals are achieved by searching within problem spaces (Yost & Newell, 1989). Problem spaces are stated in terms of states and operators. The cognitive cycle described in the previous section supports problem spaces by enabling a functionally elegant tri-level control structure, consisting of: (i) a reactive capability, based on a single cognitive cycle, which serves as the inner loop of (ii) a deliberative capability, based on a sequence of cognitive cycles, which itself serves as the inner loop of (iii) a reflective capability, consisting of impasses in decision making plus processing at the metalevel. Activities such as recognizing a fragment of speech or executing a set of rules can occur reactively in one cognitive cycle; the computation in one cognitive cycle is largely monotonic. A sequence of cognitive cycles supports knowledge-driven – or algorithmic – behavior: cognitive cycles enable intelligent selection of operators and their application to yield new states, and allowing arbitrarily long sequences of cognitive cycles to select and apply operators within a problem space gives the ability to perform potentially unlimited, and largely non-monotonic, computation within a single problem space. The reflective capability in turn uses this deliberative capability as its inner loop.
In particular, when an impasse occurs, sub-goals are automatically generated, along with accompanying metalevels in which deliberative processing – in either the same or different problem spaces – can be used reflectively to yield knowledge that can resolve the impasse. In other words, reflection can be thought of as 'reasoning about reasoning.' At the base level, reasoning is about the world, in the problem space corresponding to the task. When the agent cannot proceed, an impasse is generated, and a sub-goal is automatically created to resolve it. Normal processing at the base level stops and subsequent processing is in service of resolving the impasse. If another impasse occurs while resolving the first, a second metalevel is created with another sub-goal, creating a hierarchy of impasses. When information that resolves any impasse in this hierarchy is received, that impasse and any higher impasses are resolved. Reflection and metalevels are thus initiated whenever an impasse at some level makes progress at that level impossible. This work demonstrates the fusion of an impasse-based discourse agent with sub-symbolic deliberative reasoning, supported via the same cognitive cycle that supports the other forms of reasoning and the memories.

3.1.6 Programming in Sigma

Sigma's cognitive architecture defines a language of conditionals, predicates and functions. Predicates provide relational data structures for cognitive processing that join together typed arguments representing objects, entities or concepts. Each predicate induces a region of the system's working memory (WM) for temporary storage of the state of the system, and may also induce a region of the system's perceptual buffer for input from the outside world. The types themselves can be discrete – symbolic or probabilistic – or continuous, thus allowing predicates to conjointly handle richly structured representations that are both mixed (symbolic + probabilistic) and hybrid (discrete + continuous). By way of example, the acoustic observation in speech recognition – a spectral label – can be represented by the discrete numeric type observation(value:[0:63]) if there are 64 such labels; the HMM state can be represented by the discrete non-numeric (i.e., symbolic) type phone(value:{p1 p2 p3}) if there are 3 such states (as is the norm); and a recognizer's vocabulary can be represented by the symbolic type word(value:{one two three four five six seven eight nine zero}) for a ten-word recognizer. A predicate for specifying the acoustic observation can then be defined as Observation(spectral-label:observation), which has one argument of type observation.

Relationships between predicates are specified using conditionals and functions. Conditionals represent generalized knowledge fragments in Sigma's long-term memory (LTM), blending concepts from rule-based systems and probabilistic networks (Rosenbloom, 2013). They are built from predicate patterns – conditions, actions and condacts – plus optional functions. Conditions and actions are analogous in their behavior to the respective parts of rules and provide the forward momentum characteristic of procedural memory: conditions match to evidence in working memory and actions generate proposed changes to working memory. Condacts support bidirectional processing – both matching to working memory and suggesting changes to it – as needed in general probabilistic reasoning.
A conditional that creates a knowledge fragment in Sigma's long-term memory (LTM) is shown in Figure 10.

    CONDITIONAL Acoustic-Perception
       Conditions: Observation(spectral-label:ot)
       Condacts: Predict-Current-Phone(phone:p)
       Function(ot,p): 0.01<L1,P1>, 0.02<L2,P1> …

[Figure 10: Conditional for the perception function; that is, the conditional probability distribution for the spectral labels given the phone. The observation ot is underlined to mark that it is a child (or conditioned) variable in the distribution.]

Functions specify relationships among variables. They can represent joint or conditional probability distributions or arbitrary multivariate functions. Sigma's functions in general can range over arbitrary non-negative numbers. However, in the case of probability distributions the range is restricted to [0,1], and for purely symbolic functions it is limited to 0 (false) and 1 (true). The function in Figure 10 is the acoustic function for an HMM, which specifies the conditional probability distribution P(Observation | Phone). Functions can be provided as part of conditionals, as in Figure 10, or as part of predicates.

3.1.7 Cognitive Idioms

Cognitive idioms are stylized forms of knowledge that can be assigned meaning; they correspond to the idea of software libraries or idioms from software engineering. Cognitive idioms can occur at various granularities: for example, both an HMM and a simple cognitive rule can be considered cognitive idioms. The notion of procedural and declarative memories also occurs as idioms in Sigma. Cognitive idioms can be implemented as a single conditional or as a combination of them. The availability of a broad range of cognitive idioms depends on the generality and expressibility of the Sigma cognitive language and the ability to integrate across several knowledge sources above the architecture.

This thesis is an important exercise in exploring the cognitive idioms involved in speech processing. To understand the form such an analysis can take, consider by way of example that HMMs and DBNs are forms of cognitive idioms. An HMM idiom results from the combination of a probabilistic transition function and a perceptual memory idiom (as shown in Figure 10); the resulting HMM idiom can be considered a combination of procedural and declarative memories. A fully expanded trellis, or an unrolled DBN, can be considered a reactive idiom, as the entire trellis can be processed in one cognitive cycle. A single-slice trellis with a rolling window – where the state information, combined with perception for the current time slice, is reused as a prior in the next time slice – is a form of deliberative HMM idiom, as the result is obtained only after a sequence of cognitive cycles. This deliberative HMM idiom is supported in Sigma in a supraarchitectural fashion and can support both isolated word recognition and SLAM (Joshi, Rosenbloom, & Ustun, 2014). Other cognitive idioms, for continuous speech processing and syntax parsing, are described in subsequent sections.

4 Speech and Language in the Sigma Cognitive Architecture

The Sigma cognitive system is shown in terms of its layered design. The unique nature of this system derives from the fact that the speech recognition, language understanding and discourse components are specified entirely in terms of knowledge on top of the architecture. This is the first time a speech recognition capability has been specified in a cognitive architecture and integrated entirely as knowledge on top of the architecture.
Figure 11 shows the conceptual diagram in terms of Sigma's layers.

[Figure 11: The proposed system is a supraarchitectural integration of the speech recognition, language understanding and discourse management capabilities. Figure based on a similar figure from (Rosenbloom, Demski, & Ustun, 2016).]

Together, the three components are specified as cognitive idioms, with the supraarchitectural integration of these capabilities yielding a partial conversational agent. Because all knowledge is specified as predicates, conditionals and functions, a tight coupling between these capabilities results, which allows for bidirectional flow of information, with each capability potentially aiding the others. The next section discusses the system in further detail, followed by what was done to realize it. To begin with, an isolated word recognition capability is built, demonstrating Sigma's ability to process an acoustic signal and decode a set of HMMs. This is then extended to connected digit recognition, where audio from the previous capability is spliced together to create an utterance corresponding to a sequence of digits spoken in succession, requiring a sequence of HMMs to be decoded. Subsequently, continuous phone recognition demonstrates Sigma's ability to decode continuous fluent audio with silence present between words; this is then extended to continuous word recognition based on phones. Then the INOTS (Campbell, et al., 2011) discourse agent is synthesized in terms of Sigma's tri-level processing model, with the continuous word recognition capability fused to the discourse agent. Finally, the challenges posed by Sigma's basis in graphical models for incorporating PCFG parsing are analyzed, and Sigma's ability to specify any arbitrary SPN is proven.

4.1 Isolated word recognition

The isolated word recognition task implemented in Sigma uses the TI46 (Linguistic Data Consortium, 1991) corpus. The task is simple yet involves real-world complexity: a digit is spoken in isolation and the task is to recognize the digit. Each digit is represented with a single HMM, and thus there are ten HMMs to choose from at the end of the utterance. The acoustic signal is represented as a vector quantized (VQ) number (section 2.2.2), corresponding to the mel-frequency cepstral coefficient (MFCC) representation of a 10 ms audio frame (Young, et al., 1997); the MFCC representation of the audio signal models the processing of the human ear. Figure 12 shows the block diagram of the architecture for this task.

[Figure 12: Block diagram for an isolated word recognizer. At the end of the utterance, the most likely word is chosen, yielding an isolated word capability.]

The task was speaker dependent – audio from only one speaker was used for both training and testing – but this was repeated for two speakers. The full details of this system can be found in (Joshi, Rosenbloom, & Ustun, 2014). The corresponding factor graph, for three time-slices, is shown in Figure 13. The acoustic observations are denoted by O, with the time index shown as a subscript. The state nodes maintain a joint distribution over the hidden state of each digit, for all digits, and hence include the w subscript in addition to the time subscript.
To process a sequence of arbitrary length, the three-slice network shown in Figure 13 can be expressed as a sliding window, specifying processing for only one observation. Specified in this way, the HMM represents a deliberative idiom (whereas a fully unrolled trellis would be a reactive idiom) over a sequence of decision cycles. This deliberative HMM idiom is shown in Figure 14. At each decision cycle, the observation is combined with current state information and Sigma’s decision cycle performs the argmax by summarizing out the state from each word’s HMM to extract the most likely word. To specify this graph Figure 13 Factor graph yielding the isolated word recognizer. Figure 14 Isolated word recognizer using a sliding window. The trellis, or unrolled graph, is collapsed into one processing stage, with the (prior) state (Sw,p ) from the previous time-slice and the (current) state (Sw,c ) from the current time-slice. The decision portion of the graph remains the same. The dotted arrow from the current state to the previous one shows automatic (architectural) copying of the current state to the previous state after message passing has reached quiescence. 68 in Sigma, the predicates corresponding to each quantity must be specified. The observation predicate can be specified by first indicating its type observation(value:[0:63]) and then specifying the predicate definition as Observation(spectral-label:observation) which has a single argument of type observation. Similarly, the state of each HMM can be specified as a symbolic type, phone(value:{p1 p2 p3}) and the current state predicate as Predict-Current-Phone(phone:phone). The required structures can be specified using three conditionals, one to combine state information from the previous time slice with the state transition function to form the state information for the current slice, another conditional to combine the perception with this current state information – via the acoustic function – and finally the third conditional to perform the 𝑎𝑟𝑔𝑚𝑎𝑥 operation. The three conditionals are shown in Figure 15, Figure 16, and Figure 17 respectively. CONDITIONAL Acoustic-Transition Conditions: Previous-Phone(word:w p phone:p p ) Condacts: Predict-Current-Phone(word:w c phone:p c ) Function(p p ,w p ,p c, w c ): 0.95<P 1 ,ONE,P 1 ,ONE>, … Figure 16 Conditional for the transition function; that is, the conditional probability distribution of the current state (and word) given the previous ones. The function is jointly normalized over the current phone and word. CONDITIONAL Acoustic-Perception Conditions: Observation (spectral-label:o c ) Condacts: Predict-Current-Phone (word:w c phone:p c ) Function(o c ,p c ,w c ): 0.01<L 1, P 1, ONE>,0.02<L 2, P 1, ONE>,… Figure 15 Conditional for the perception function; that is, the conditional probability distribution for the spectral labels given the phone (and word). The function is normalized over the word. 69 The work on isolated word recognition (Joshi, Rosenbloom, & Ustun, 2014) goes further and reuses idioms developed in other tasks to automatically generate the first two conditionals shown above, by reusing the perception modeling idiom introduced in SLAM (Chen, et al., 2011) and the action modeling idiom generated in RL (Rosenbloom, 2012). This shows that an HMM is effectively a combination of the two idioms developed in service of SLAM and RL, reused in the context of speech. 
Thus, the only knowledge required to perform isolated word recognition is the specification of the predicate definitions and the form of the predicate functions, followed by the selector conditional. The architecture generates the rest of the knowledge automatically and can learn the acoustic and transition functions to use in the word recognition task. Learning the acoustic and transition functions was accomplished using Sigma's gradient-descent learning mechanism (Rosenbloom, Demski, Han, & Ustun, 2013), and the performance was compared with that obtained using acoustic and transition functions learned by HTK. It is important to note that this learning was accomplished using the incremental structure shown in Figure 14.

CONDITIONAL Word-Selector
  Conditions: Predict-Current-Phone(word:wc)
  Action: Word-Selected(word:wS)
  MAP: true

Figure 17 Conditional to select the most likely word from the current state across all word HMMs. The MAP clause specifies that the phone is to be summarized out from the Predict-Current-Phone predicate via max rather than sum.

The results of these experiments are summarized in Table 1, with more details found in (Joshi, Rosenbloom, & Ustun, 2014). Sigma is able to perform this simple speech processing task using both the HTK-learned parameters and parameters learned by Sigma. Furthermore, when the structures are autogenerated, using perception and action modeling, Sigma is still able to perform the task with the same efficacy.

Table 1 Results from the isolated word recognition work.

Configuration                                                              | Speaker 1 accuracy (%) | Speaker 2 accuracy (%)
HMM parameters trained with HTK + hand-coded conditionals                  | 98.75                  | 96.25
Sigma-trained state and perception functions + hand-coded conditionals,
maintaining the HTK data split                                             | 98.75                  | 99.37
Sigma-trained state and perception functions + hand-coded conditionals,
with 5-fold cross validation 9                                             | 99.6                   | 100
Sigma-trained state and perception functions + automatically generated
predicates and graph, with 5-fold cross validation 9                       | 99.2                   | 100

9 The order of the training examples was randomized for this configuration.

4.1.1 Discussion

This task takes a step towards a full speech processing capability. No new architectural mechanisms were needed for processing the HMMs or for learning their parameters. Furthermore, most of the structure of the HMMs themselves could be derived automatically using slightly extended mechanisms previously introduced in the context of reinforcement learning and SLAM. In addition to defining the appropriate types and data structures required for perception and working memory, the only long-term knowledge that was explicitly added was the forward-movement pre-constraint on the transition function plus a conditional to select the correct word from the current state. The (pre-constrained) transition function and perception functions were also learned automatically via the gradient learning mechanism already present in Sigma.

To understand that this is an incremental structure, note from Figure 14 that only the forward message from the previous time slice is combined – via the transition function – with the perception from the current time slice. The current best word is extracted using the conditional in Figure 17 at each time slice and is made available in the word predicate. This can be used, in a deliberative fashion, to inform any upstream language capability, as will be seen in subsequent sections.

4.2 Connected word recognition

The connected word recognition task tests isolated words spoken in sequence.
This task extends the isolated word recognition task to recognize an utterance consisting of a string of digits spoken in quick succession. This is not continuous, fluent speech (thus there are no cross-word effects, and there is intervening silence between words); it is meant to test the performance of segmentation of a string of digits when the performance on isolated words is ~99%, i.e., when the acoustic function is good. To test this capability, a sequence of one hundred digits was randomly generated and the corresponding audio files were spliced together to generate a sequence of connected digits. The expected output was a transcription of these hundred digits. The WER was measured using the minimum edit distance algorithm specified in (Jurafsky & Martin, 2008), implemented in our lab. The average WER over five runs was ~18%, where each run used a freshly generated random string of one hundred digits.

Figure 18 Sigma factor graph for connected word recognition.

Figure 18 shows the model that processes this concatenated audio. The model can be thought of as recognizing a language of spectral labels over the spectral alphabet $o_0 : o_{63}$ and reducing it to the corresponding sequence of digits $[0:9]$. Each digit is modeled as a sequence of three hidden states $p_1$, $p_2$ and $p_3$, specifying an HMM for each digit. This model extends the deliberative HMM idiom introduced in 3.1.7. For the first time slice, the prior reflects the unigram distribution over the span of digits. For subsequent time slices, as before, the observations are introduced in the perceptual predicate $O_c$ and combined with state information from the previous time slice after processing through the perceptual function $P_w$. The segmentation decision is made by comparing the most likely phone – obtained by summing out the digit in the $S_{w,c}$ predicate – for the past two time slices. This extraction function is shown in the $E_p$ predicate. The segmentation decision tests for when the previous most likely phone was $p_3$ for any digit and the next most likely phone for any digit is $p_1$. The digit corresponding to the previous time slice, when $p_3$ was most likely, is extracted in the $E_D$ factor node.

When a segmentation decision is made, the state information for the next frame should correspond to the bigram probability for digit transitions, i.e. $P(\text{next digit} \mid \text{extracted digit})$. Otherwise, the state information from the previous time slice should be carried over to this time slice after multiplying by the state transition function. This is the switching-parent functionality implemented in GMTK by extending DBNs (Bilmes & Zweig, 2002). In Sigma, this functionality is achieved by the probabilistic OR node that is triggered by the digit extraction factor node. The probabilistic OR node is not standard in factor graphs and was originally used in Sigma to combine the results of a predicate participating in multiple actions. Here, it is shown that it also helps model the switching-parent functionality of GMTK (Bilmes & Zweig, 2002).

4.2.1 Discussion

The main purpose of this task was to test Sigma's ability to perform an incremental segmentation of an acoustic sequence. The acoustic stream is segmented into a string of phones, where each phone is represented by a sequence of hidden states. This hierarchical segmentation requires extending the sliding window mechanism and specifying rules to extract the best phone in an incremental fashion. This capability is used in the next section in aid of continuous phone recognition.
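The incremental segmentation test just described can be sketched procedurally as follows; this is a minimal illustration with assumed array shapes (Sigma realizes it with predicates and rules rather than procedural code). When a digit is extracted, the next slice's prior would then be re-seeded from the digit bigram, which is the switching-parent behavior noted above.

import numpy as np

def detect_digit_boundary(prev_joint, curr_joint):
    """Segmentation rule from the connected-digit task.
    prev_joint, curr_joint: (D, 3) joints over digits and the three hidden
    phone states (p1, p2, p3) for the previous and current time slices.
    Returns the completed digit, or None if no boundary is detected."""
    P1, P3 = 0, 2                                   # indices of states p1 and p3
    prev_phone = int(np.argmax(prev_joint.sum(axis=0)))  # sum out the digit
    curr_phone = int(np.argmax(curr_joint.sum(axis=0)))
    if prev_phone == P3 and curr_phone == P1:       # end of one digit, start of next
        return int(np.argmax(prev_joint[:, P3]))    # extract the completed digit
    return None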
4.3 Continuous Phone Recognition

The most important lacuna in the isolated word recognition work is that the audio for the task was not continuous in nature: leading and trailing silence was manually removed. The connected digits task extended the isolated word task to test Sigma's ability to perform hierarchical segmentation, but the audio used there was not natural, fluent audio. The continuous phone recognition task works on continuous, fluent audio, where recognition must account for cross-word effects and potential silences between words. This requires extensions to: (i) the cognitive idiom presented earlier, and (ii) the acoustic signal representation. These are discussed next and a new cognitive idiom is identified. This idiom is reused in the continuous word task and subsequently in the dialogue agent task.

As noted previously, each phone is modeled as a three-state HMM, with the states of the HMM corresponding to the beginning, middle and end of the phone. An HMM consists of a Markov chain that models the evolution of the speech code over time while producing acoustic observations. Each phone is considered to be a finite state automaton (Figure 19), as modeled by the states of the HMM. As each phone automaton progresses, it passes through the beginning, middle and finally the end of the phone while producing acoustic observations. The relationship between the states of the HMM and the acoustic observations is governed by the perception function. The progression of the state sequence, i.e. which state can follow the current state, is governed by the transition function. Together, the perception and transition functions combine to yield the joint probability of the sequence of states and acoustic observations. The prior function of an HMM specifies the initial distribution over its states. Typically, each phone starts at the beginning, i.e. state s0, but the HMM construct allows for more complex phone models.

Figure 19 A finite state automaton model of a phone. The states correspond to the beginning, middle and end of a phone. Each state can stochastically transition to itself or the next state, to account for the variable duration of phones.

When a phone ends, the phone-to-phone transition – or phone bigram – function (Figure 20) governs the distribution over the phones that can follow it. Each spoken utterance, modeled as a sequence of HMMs, can itself be considered a higher-order HMM, where the intra-HMM transitions are ignored and only the inter-HMM transitions – i.e. those that happen from state s2 of one phone to state s0 of the next phone (Figure 20) – are recorded to reconstruct the utterance. Whereas the work on isolated word recognition in Sigma used one HMM per utterance, with only one word per utterance, this work focuses on a more natural form of continuous utterance, where the number and order of the phones is not pre-determined. The only restriction here is that each utterance must end in the final state of some HMM, i.e. s2 for that HMM. The conditionals presented previously (Figure 15, Figure 16) model the entire duration of the acoustic signal as one HMM. These are extended in this section to model the acoustic signal as a sequential composition of HMMs, where each three-state HMM still models a phone. Next, a choice must still be made regarding the representation of the acoustic signal.
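The left-to-right structure of the phone automaton in Figure 19 can be made concrete with a small sketch; the self-loop probability here is illustrative, not a trained value.

import numpy as np

stay = 0.6  # illustrative self-loop probability, not a trained parameter
# Rows are the current state (s0, s1, s2); columns the next state. Each state
# either repeats (self-loop, modeling variable phone duration) or advances.
A = np.array([[stay, 1 - stay, 0.0],
              [0.0,  stay,     1 - stay],
              [0.0,  0.0,      stay]])
# The missing 1 - stay mass in the last row is the inter-phone transition out
# of s2; it is distributed over successor phones by the phone bigram:
# P(s2 of ph -> s0 of ph') = (1 - stay) * bigram[ph, ph']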
The source audio is treated as a sequence of acoustic signals, with each such chunk of acoustic signal converted to a corresponding block of parameter vectors that the recognizer can use as spectral features. As stated previously, this process is meant to model cochlear processing of the speech signal along with cortical feature extraction. The source acoustic signal is first processed using a traditional DSP front- end and converted into the frequency domain after splitting the signal into successive overlapping windows. The amount of source signal used is 25ms and there is a 15ms overlap between successive windows, thereby generating a new data frame for every 10ms of the source signal. This frequency-domain signal is then filtered using a bank Figure 20: A sequence of two phones. The continuous phone recognition task consists of identifying an arbitrarily long phone sequence given a spoken utterance. Typically, the conditional probability of the next phone given the previous phone – i.e. the phone bigram – is used to approximate the joint probability of the sequence of phones. The transition leaving state s2 and entering state s0 is an inter- phone transition while the rest are intra-phone transitions. 77 of twelve mel-filters to obtain a vector of twelve filtered signal values. The scale and center frequencies of the filters are designed to correspond to the processing in the human cochlea (Denes & Pinson, 1993). The output – the mel-spectrum – is then brought back into the time domain to obtain better phonetic discrimination. This vector of processed audio is the mel-frequency cepstral coefficient (MFCC) feature. Each such vector of MFCC values is one feature and is converted to a spectral label via a clustering algorithm from HTK’s HQuant tool (Young, et al., 1997). This spectral label suffices to yield good performance in isolated word recognition but may face severe performance degradation in the face of speaker independent, phonetically varied data (as opposed to single speaker data from previous task), as is the case for the current task. To mitigate this performance degradation, the original twelve-dimension MFCC vector is augmented with delta and double-delta feature vectors, where the delta vector represents the change in the MFCC from the previous frame to the current frame and the double-delta vector represents the change in MFCC from two frames ago. This yields a 36-dimension vector for each 10ms acoustic frame. The energy from each vector is also appended, yielding a 39-dimension feature vector that consists of four feature types: the MFCC, the delta MFCC, the double-delta MFCC, and the corresponding energy in Figure 21: Structure to process a multidimensional observation vector. One acoustic perception function is needed per dimension. The information from all dimensions is combined multiplicatively. Prior on the state is not shown and it is obtained via the combination of the state information from previous time slice and the state transition function. 78 each of these vectors. Each feature type can be independently clustered to obtain a spectral label using HQuant. This yields a four-dimensional vector of spectral labels for each 10ms acoustic frame for use in the continuous phone recognition task. The distribution over the four dimensions of this feature vector are combined multiplicatively, in a naïve-Bayes fashion as shown in Figure 21, to yield a perceptual distribution over the states of the HMM. 
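A minimal sketch of the delta/double-delta augmentation described above follows; it uses simple frame differences and a squared-sum energy as stand-ins (HTK's actual regression and log-energy formulas differ slightly), so it illustrates the shape of the 39-dimensional feature rather than the exact front-end.

import numpy as np

def add_deltas_and_energy(mfcc):
    """Augment a (T, 12) MFCC sequence to the (T, 39) feature described above.
    Simplified: frame differences stand in for HTK's regression formula."""
    delta = np.diff(mfcc, axis=0, prepend=mfcc[:1])       # change from previous frame
    shifted2 = np.vstack([mfcc[:1], mfcc[:1], mfcc])[:len(mfcc)]
    ddelta = mfcc - shifted2                              # change from two frames ago
    feats = np.hstack([mfcc, delta, ddelta])              # (T, 36)
    # One energy value per feature type (an assumed squared-sum form here).
    energy = np.stack([(b ** 2).sum(axis=1) for b in (mfcc, delta, ddelta)], axis=1)
    return np.hstack([feats, energy])                     # (T, 39), four feature types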
Combining information across multiple dimensions in this fashion embodies the assumption that each dimension is independent of the others, and corresponds to using a diagonal covariance matrix in a traditional acoustic front-end based on Gaussian mixture models (Young, et al., 1997).

The phone recognition problem can now be specified mathematically: given a sequence of observations $\{o_1, o_2, \ldots, o_T\}$, obtain a sequence of M phone labels $\{ph_1, ph_2, \ldots, ph_M\}$. Each observation $o_t$ is a vector of four-dimensional features. Note that the length of the observation sequence and the length of the phone sequence can differ, and the length of each utterance varies arbitrarily. Given a set of HMMs, one for each phone from the alphabet $P$, and the set of all possible strings of phones, $P^*$, the phone recognizer can be conceptualized using Bayes' theorem as finding the sequence of phones $Ph$ that maximizes the product of two factors:

$Ph = \arg\max_{Ph \in P^*} P(O \mid Ph) \cdot P(Ph)$ (1)

The factor $P(O \mid Ph)$ is the probability of the observation string $O$ – a sequence of four-dimensional spectral labels – being generated by the sequence $Ph$ of phones. This is obtained from the perception function of the HMMs. As discussed above, the perception function used here is a naïve-Bayes classifier that uses the four dimensions of the feature vector for each 10ms frame to yield a distribution over the states of the HMM. The Sigma conditional in Figure 7 showed how the front-end is programmed when the observation is one-dimensional. Extending the front-end to handle four-dimensional input involves specifying four such conditionals, one per dimension of the observation. In this fashion, the acoustic front-end can be set up for one time slice, where the acoustic signal is processed using HMM models obtained via HTK's training to generate a distribution over the states of phones.

The second factor $P(Ph)$ represents the prior joint probability of the string of phones and is commonly referred to as the language model. For a phone string of length M, the prior is given by $P(Ph) = P(Ph_{1 \ldots M}) = P(Ph_1 \wedge Ph_2 \wedge \ldots \wedge Ph_{M-1} \wedge Ph_M)$. This is an M-dimensional function for a string of length M. However, this function is too unwieldy to compute or specify for arbitrarily long phone sequences. To alleviate this problem, a phone bigram function – a conditional probability function that assigns a probability to the next phone given the previous phone – is used. When multiplied together, these conditional probabilities and the prior on the first phone approximate $P(Ph_{1 \ldots M})$:

$P(Ph_{1 \ldots M}) \approx P(Ph_1) \cdot \prod_{i=2}^{M} P(Ph_i \mid Ph_{i-1})$ (2)

Figure 22: A bigram language model with special start and end states. The <s> and </s> states are both mapped to the /sil/ phone in the work here. The probabilities on the arcs are the conditional probabilities of the possible next phones occurring given the current phone.

The $P(Ph_i \mid Ph_{i-1})$ term specifies the conditional probability of observing a particular phone after having observed the previous phone. This reduces the size of the factor to $|P|^2$, where $|P|$ is the number of phones in the phone alphabet. Because unseen bigrams would otherwise have a probability of 0, smoothing must be applied to generalize this function to account for bigrams that might be observed during performance. This ensures there is a potential path between each phone pair, albeit at a lower probability for unseen phone bigrams.
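A minimal sketch of estimating such a smoothed phone bigram follows; simple add-alpha smoothing is shown for illustration, whereas the actual language model and smoothing in this work are produced by HTK's HBuild tool.

from collections import Counter

def bigram_lm(transcripts, phones, alpha=1.0):
    """Estimate a smoothed phone bigram P(next | prev) from transcripts.
    transcripts: list of phone-label sequences; phones: the phone alphabet.
    Add-alpha smoothing guarantees a nonzero path between every phone pair."""
    counts, totals = Counter(), Counter()
    for seq in transcripts:
        for prev, nxt in zip(seq, seq[1:]):
            counts[(prev, nxt)] += 1
            totals[prev] += 1
    V = len(phones)
    return {(p, q): (counts[(p, q)] + alpha) / (totals[p] + alpha * V)
            for p in phones for q in phones}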
An example of a language of phones is shown in Figure 22 in the form of a finite state acceptor (FSA) (Mohri & Riley, 2008). The nodes represent the phones in the alphabet $P$ and the arcs represent the probability of a transition from one phone to the next. Each state of the language defined by the FSA in Figure 22 is modeled individually as an HMM. This language model and the subsequent smoothing are both generated by HTK's HBuild tool (Young, et al., 1997) from the utterance transcripts provided with the dataset used in this work.

Now that the acoustic front end and the phone language to be recognized have been specified, together with HMM models for the individual phones, the task of phone recognition can be thought of as generating a transcript by following the most likely path through the state space obtained by composing the language FSA in terms of the states of the individual HMMs and expanding it for the length of the utterance. This is the inference task in this context and is typically performed by the Viterbi algorithm. The standard formulation of the Viterbi algorithm requires both the availability of the entire utterance prior to inference and the existence of a data structure that records the most likely state that resulted in the current state, for each time slice. At the end of the utterance, once the most likely final state is determined, the data structure is traversed to obtain the most likely state sequence corresponding to the observation string. This typically takes the form of a speech decoder augmented by control structures – referred to as 'tokens' in the speech literature (Young, Russell, & Thornton, 1989; Ran & Wang, 2008) – that record whenever the final state s2 of a phone transitions to the state s0 of another phone. Token passing is used in several speech recognition platforms, for example Kaldi (Povey, et al., 2011) and HTK (Young, et al., 1997). Each token accumulates a summary of the states it has visited in sequence. At the end of the utterance, the best token – corresponding to the most likely path – is selected and its history is parsed to obtain the desired transcript.

However, it has long been known that human speech processing is online: humans incrementally segment an utterance into words and assign these words an interpretation as they hear them (Marslen-Wilson & Welsh, 1978; Marslen-Wilson, 1987), although with delays of up to 250ms. Online versions of the token passing algorithm still suffer from state-space explosion, where a large number of tokens exist in the recognition network and beam search must be applied to limit the number of active tokens at any time. To achieve online, incremental segmentation and extraction, Sigma uses a summary of the past, in the form of a predicate that stores the previous state, plus a window of limited constraint from the future (as explained below) and a simple rule that detects transitions between phones. Because only a limited constraint from the future is used, the results will likely not be as accurate as full Viterbi, where the whole utterance is processed before any transcript is obtained, but it may be possible to come close.

Figure 13 showed the factor graph for a single HMM, unrolled for three time slices, with a one-dimensional observation for the sake of simplicity. This graph can be modified, by parameterizing the transition and acoustic factors, to handle multiple HMMs.
The perception function now represents a concatenated perception function, obtained by merging the perception functions of all of the HMMs. The transition function can be obtained by first merging the intra-state transition functions for each HMM and then adding in the inter-HMM transitions, by introducing a transition from the final state s2 of each phone's HMM to the beginning state s0 of every other phone, as shown in Figure 20. The transition into state s0 from state s2 of the previous phone is an inter-phone transition; transitions that do not involve leaving state s2 or entering state s0 are intra-phone transitions. The likelihood of an inter-phone transition is obtained from the phone bigram function specified by equation (2). This requires that the transition function be specified in terms of transitions from the previous phone and state to the current phone and state. The conditional that combines state information from the previous time slice with the transition function is shown in Figure 23.

CONDITIONAL Combine_Prior_State
  Condacts: phone_state_current(phone:pc fsa_state:sc)
            phone_state_previous(phone:pp fsa_state:sp)
            state_trans_func(phone:pp fsa_state:sp phone:pc fsa_state:sc)

Figure 23: An LTM fragment illustrating how the state information from the previous time slice, available in the phone_state_previous predicate, is processed via the transition function in the state_trans_func predicate and made available for the current time slice in the phone_state_current predicate. The current phone pc and current state sc are underlined to indicate that they are the conditioned variables.

The modified factor graph, including the parameterized perception and transition functions, is shown in Figure 24, fully unrolled for three time slices. Additionally, all messages required to calculate the posterior at node $S_{2,p}$ are explicitly shown. The posterior distribution at each node $j$ is based on the product of the forward message $\mu_{i \to j}$, the backward message $\mu_{k \to j}$ and the information from the acoustic perception factor $\mu_{o_j \to j}$. As the utterance proceeds from left to right, the forward messages do not change; only the backward messages do, based on the new observations. This forward computation can be rolled into a single time-slice structure that records a summary of the utterance heard thus far. The graph structure that achieves this is shown in Figure 25.

Figure 24: The factor graph for the HMM from Figure 13 as extended to account for all phones. This is accomplished by parameterizing the factor functions and variables with the phone (p). The messages needed for calculating the posterior at $S_{2,p}$ are also shown.

Predicate $F_P$ maintains the state of the utterance up to the previous time slice, acting as a prior for the current time slice. A new observation enters the graph via $O_t$ to be combined with the existing state information; that is, with the transitioned prior state information. This processing all happens as part of the elaboration phase. During the subsequent adaptation phase, this result is slid back into $F_P$ via a diachronic architectural mechanism so as to act as the prior for the next time slice.
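How the combined transition function might be assembled can be sketched procedurally as follows; the array shapes are assumptions for illustration, and in Sigma the result is simply the function of a single predicate.

import numpy as np

def merged_transition(intra, bigram):
    """Build P(ph_c, s_c | ph_p, s_p) over all phones.
    intra: (N, 3, 3) per-phone left-to-right transition matrices;
    bigram: (N, N) smoothed phone bigram P(ph_c | ph_p)."""
    N = intra.shape[0]
    T = np.zeros((N, 3, N, 3))
    for p in range(N):
        T[p, :, p, :] = intra[p]                # intra-phone transitions
        exit_mass = 1.0 - intra[p, 2].sum()     # probability of leaving s2
        T[p, 2, :, 0] += exit_mass * bigram[p]  # inter-phone: s2 -> successors' s0
    return T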
This forward-going 'sliding window' mechanism, including the copying at the end of the cognitive cycle, can be implemented by: (1) declaring a closed-world predicate – a signal to Sigma that unknown values in the predicate are assumed to be false (0) and that the contents of its segment of WM should be latched across cognitive cycles; (2) turning on diachronic processing; and (3) specifying the corresponding LTM fragments. The decision node E computes the argmax from equation (1), in a local fashion, using information from the previous and current time slices. The rules that detect a phone transition and extract the completed phone into the Ph predicate are similar to the one shown in Figure 17.

Figure 25: Forward-processing template structure using a sliding window of one observation. The unrolled graph from Figure 13 is collapsed into one processing stage, with the prior (Fp) combined with the current observation (Ot) to obtain the posterior for the utterance up to time slice t. This posterior is slid back into Fp as part of the adaptation phase – shown by the dotted line – to act as the prior for the next time slice. Also shown is the decision node E, which performs the selection of the best phone using state information, based on rules similar to the one in Figure 17.

Keeping in mind that there is a lag of about 250ms in segmentation during human parsing (Marslen-Wilson, 1987), a limited amount of constraint from the future can be used, in the form of backward messages, for human-like processing. Figure 26 shows how the trellis from Figure 25 can be extended to account for information from a limited future. Using the template from Figure 24, the constraint from the past – as obtained via the F*N_P predicate – is combined with a message from the acoustic perception factor and the backward messages from the future portion of the graph. The future portion of the graph consists of a pair of perception and state predicates for each time slice, which generates the backward message from it. The number of time slices that are added determines how much future constraint is incorporated into the processing. The decision node E works as before, but uses state information from the current time slice (St,p) and the future (St+1,p). This is an extension of the sliding window described earlier, and it works similarly except over a wider window. A new observation is introduced every cognitive cycle, in the final time slice of the future portion of the graph. This triggers a chain of backward messages along the horizontal length of the graph, which is combined with the forward information to obtain an approximate posterior. The quality of the posterior obtained at the St,p predicate is a function of the amount of constraint used from the future; that is, of how many time slices are explicitly represented in the graph. The next section evaluates the effect this has on the results.

Figure 26 The continuous phone recognizer graph using four time slices of constraint from the future: Ot+1 to Ot+4 (40ms). For the sake of clarity, only one dimension is shown for the acoustic perception factors at each time slice.

This overall approach to phone segmentation and recognition differs from the classical formulation of the Viterbi algorithm, where a complete backwards pass occurs once the entire utterance has been processed.
The approach described here instead uses a limited constraint from the future combined with a summary of the past, and extracts the transcripts in a completely online, incremental manner.

4.3.1 Results on TIMIT

For these experiments, the TIMIT database (Garofolo, et al., 1990; Zue, Seneff, & Glass, 1990) was used, which contains a total of 6300 utterances amounting to ~5.4 hours of speech. Despite its age, this corpus is still widely used to test new speech recognition architectures and models, and it has several desirable properties for this task: tractable size, phonetic variation, and the compositional aspects of hierarchical sequential classification tasks. The speech community has developed a standard procedure – a recipe – that was used to perform all of the experiments. Using a standard database with a standard procedure allows ready comparison with other approaches. To ensure the training and test sets do not contain sentences by the same speaker, the recipe uses 3696 sentences for training and 192 for testing. In accordance with the standard way of using TIMIT (Lee & Hon, 1989), the set of 61 TIMIT phone labels is reduced to a smaller set of 48 phones for training. In addition to the 48 phones, the glottal stop 'q' is retained for training but ignored for testing, whereas it was removed entirely by (Lee & Hon, 1989).10 Furthermore, for evaluation of decoded transcripts, (Lee & Hon, 1989) suggest reducing the number of phones from 48 to 39 by treating, for example, the voiced closure {vcl} the same as {sil}. While such a mapping is based on rules obtained from phonetics (Lopes & Perdigao, 2011), the purpose here was not to explore the effects of these rules but to explore the feasibility of Sigma recognizing, in an online and incremental manner, a continuous stream of audio in terms of its constituent phones.

10 Removing a phone entirely seems to create blanks or holes in the transcript when using HTK's HLEd tool. This workaround is suggested by GMTK's TIMIT recipe (Bilmes & Zweig, 2002).

All of the preprocessing of the audio and the training of the parameters for these experiments were performed by tools from HTK. First, preprocessing computes the MFCC features and generates the spectral label codebook (via HTK's HQuant tool). The spectral codebook was then used to convert the TIMIT test and training sets into corresponding spectral labels using HTK's HCopy tool. The parameters required for this task include the HMM transition and perception models and the parameters of the inter-phone language model. The parameters were introduced into the corresponding knowledge structures in Sigma and the system was run on all of the test cases in the standard TIMIT recipe (Lopes & Perdigao, 2011).

The performance of the phone recognizer is determined using the Levenshtein distance (Lopes & Perdigao, 2011) between the output transcript and the reference transcript. This metric accounts for the three kinds of discrepancies that can occur when comparing two sequences: deletion, where a phone is present in the reference transcript but not in the recognizer's output; insertion, where a phone not present in the reference transcript is inserted by the recognizer in its output; and substitution, where the wrong phone is recognized. HTK's HResult tool is used to compare Sigma's output with the reference transcript to compute the performance metrics.
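A minimal sketch of this Levenshtein-based scoring follows, computing the hit, deletion, substitution and insertion counts that underlie the Correct and Accuracy metrics reported in the next subsection; it is a simplified stand-in for HTK's HResults, not its implementation.

def align_counts(ref, hyp):
    """Edit-distance alignment of reference and hypothesis label sequences.
    Returns (H, D, S, I): hits, deletions, substitutions, insertions."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimal D + S + I aligning ref[:i] with hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Trace back along an optimal path, counting each error type.
    H = D = S = I = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                H += 1
            else:
                S += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            D += 1; i -= 1
        else:
            I += 1; j -= 1
    return H, D, S, I

Given these counts and N = len(ref), Correct = H/N and Accuracy = (H - I)/N, matching the column definitions of Table 2 below.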
Experiments were run using the HTK decoder, based on the Viterbi algorithm, and in Sigma with varying amounts of future constraint, ranging from 0 time slices up to 20. The results can be seen in Table 2. The top row shows baseline results using HTK's Viterbi decoder. There are three main results. First, Sigma's online and incremental recognition and segmentation of the acoustic signal into a corresponding sequence of phones improves monotonically with more future constraint. Second, with sufficient future constraint – that is still bound by the amount of lag in human parsing – it is comparable in accuracy to the results from HTK's Viterbi decoder, which operates on the full utterance. Third, Sigma is far from sufficiently efficient for this task.

Table 2 Summary results from the phone recognition task. All the parameters were trained using the HTK tool suite. The top row shows results from HTK's Viterbi decoder. The other rows show results from Sigma's incremental decoding approach, using limited constraint from the future. (N = total labels, H = correctly identified, D = deleted, S = substituted, I = inserted.)

Recognizer                              |    N |    H |    D |    S |   I | Correct H/N (%) | Accuracy (H-I)/N (%)
HTK decoder using Viterbi               | 6785 | 3816 |  954 | 2015 | 387 |           56.24 |                50.54
Sigma incremental, no future time slices| 6785 | 2850 | 2385 | 1550 |  99 |           42.00 |                40.55
Sigma incremental, 6 future time slices | 6785 | 3731 | 1067 | 1987 | 363 |           54.99 |                49.64
Sigma incremental, 12 future time slices| 6785 | 3782 | 1041 | 1962 | 315 |           55.74 |                51.10
Sigma incremental, 20 future time slices| 6785 | 3801 | 1039 | 1945 | 317 |           56.02 |                51.35

Table 3 lists the average cognitive cycle time for each of the Sigma configurations from Table 2. As expected, the cognitive cycle time increases with the number of future time slices used, ranging from a factor of 5 too slow to nearly a factor of 400, when considering the 50ms requirement placed on the cognitive cycle.

Table 3 Average cognitive cycle times as a function of the amount of future constraint incorporated.

Configuration                            | Cognitive cycle time (ms)
Sigma incremental, no future information |   250
Sigma incremental, 6 future time slices  |  5811
Sigma incremental, 12 future time slices | 14119
Sigma incremental, 20 future time slices | 19110

4.3.2 Discussion

A speaker-independent continuous phone recognition task was implemented on top of the Sigma cognitive architecture. The top-level result is that recognition is performed in an online and incremental manner, with a single forward pass through the utterance that uses limited constraint from the future. This is in line with what is understood about spoken language processing in humans. All of the knowledge required to perform online phone recognition is specified supraarchitecturally. The accuracy of the resulting system is comparable to that of a narrow AI system that uses the same parameters but assumes the entire utterance is available at the start of decoding; however, the amount of time required for this processing in Sigma is significantly greater.

This work builds on the isolated and connected word tasks by extending the acoustic perception function to handle a multi-dimensional representation of the acoustic signal and extending the transition function to represent both intra-phone and inter-phone transition parameters. The phone recognizer reuses the diachronic processing mechanism already available in Sigma to generate portions of the graph automatically. Experiments show that the performance improves as the amount of information used from the future increases.
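To make the windowed decoding evaluated above concrete, here is a minimal sketch of one cycle of the limited-lookahead scheme, under assumed array shapes; Sigma realizes the same computation via message passing over the graph of Figure 26 rather than explicit loops, and the four naive-Bayes observation streams would be folded into the per-state observation function B.

import numpy as np

def lookahead_posterior(forward, future_obs, T, B):
    """Approximate posterior for the current time slice using a summary of
    the past plus K future observations (instead of a full Viterbi pass).
    forward: (S,) forward summary through the current slice; future_obs:
    the next K observation labels; T: (S, S) flattened transition function
    over all (phone, state) pairs; B: (S, L) observation function."""
    beta = np.ones(T.shape[0])
    for o in reversed(future_obs):
        beta = T @ (B[:, o] * beta)   # one backward step per future slice
    post = forward * beta             # combine past summary and future constraint
    return post / post.sum()

The phone-transition rule then compares the most likely states of successive slices of this posterior to decide whether a phone has completed, exactly as in the connected-digit segmentation sketch earlier.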
Because the acoustic signal is introduced into the graph via perceptual predicates, and is subsequently combined – via LTM fragments in the form of conditionals – with the state of the agent within the same cognitive cycle that processes all other forms of knowledge in Sigma, this work is an important demonstration of functional elegance in action while taking a step towards grand unification. It does, however, fall significantly short on sufficient efficiency.

While the isolated word work showed how Sigma's cognitive cycle can process an acoustic signal to recognize one word (Joshi, Rosenbloom, & Ustun, 2014), this work shows how this same cognitive cycle can recognize and segment a stream of continuous audio in terms of its constituent phones. This raises the interesting question of whether Sigma's cognitive cycle can blend previously demonstrated cognitive processing with the sub-cognitive processing discussed here. This would serve as an important example of a broader form of unity and uniformity supported by the cognitive cycle.

Figure 27 The continuous word recognizer idiom, extended from the continuous phone recognizer. This is a monophone-based recognizer, with the top predicate maintaining a distribution over words and the phonetic position in the word. Each word is incrementally extracted in the W predicate, triggered via a phone transition.

4.3.3 Continuous Word Recognition

The continuous word recognition capability is realized by extending the cognitive idiom presented in the previous section, as shown in Figure 27. The observations at the word level are phones, and the phonetic lexicon acts as the perception function. The lexicon specifies how each word in the system sounds by specifying the phones and their positions in the word. The predicates corresponding to the nodes at the top of the trellis contain a joint distribution over words and the phone position within the word. This is analogous to the phone and FSA state below in the trellis, where the observations are spectral labels and the phone state predicate maintains a joint distribution over the phone and the FSA state, with the added complexity that not all words have the same number of phones, whereas each phone FSA has three states. A word transition is detected in a fashion similar to a phone transition, as described in the previous subsection. Here, the word transition is triggered when a phone transition is detected and the completed phone is the ending phone of the most likely word. This extraction is similar to the phone extraction discussed previously; the rule presented in Figure 17, modified accordingly, is shown in Figure 28. This recognizer was tested on two sentences from the TIMIT corpus, using audio from the previous section, as a proof of concept. In both cases, the full sentence transcript was produced by the word recognizer.

4.4 Dialogue Agent

The dialogue agent chosen to demonstrate Sigma's suitability is a simple, turn-based agent. The INOTS agent (Campbell, et al., 2011) is a real-world task that was developed at USC's ICT laboratory in aid of imparting leadership training to junior Naval officers. The control structure governing this agent is a simple discourse model based on a directed acyclic graph of states and utterances, as shown in Figure 29. In each L-state (listen state), the agent waits for the human officer to speak. The utterance is mapped onto a set of allowed utterances.
Based on what was said, the agent moves to a T-state (talk state), and based on which (L-state, utterance) combination caused the transition to this particular T-state, the agent selects a response to utter. This model can be implemented as a simple naïve-Bayes classifier. The word recognizer from the previous section can be coupled to this classifier to synthesize a speech-based dialogue agent. The nature of this mapping is a fusion of Sigma's symbolic and sub-symbolic capabilities and consists of: (1) deliberative movement through a discourse problem space composed of operators for listening and talking utterances, (2) a deliberative, incremental naïve-Bayes bag-of-words classifier as a simple natural language understanding (NLU) unit, and (3) incremental, deliberative speech processing informed via the acoustic signal representation described earlier.

CONDITIONAL Extract_Word
  Conditions: word_select_prev(word:w1 phone_posn:pn)
              word_select(word:w2 fsa_state:p0)
              phone_extracted(phone:ph)
              lexicon_endings(phone:ph phone_posn:pn)
  Actions: word_extracted(word:w1)

Figure 28 Conditional to extract the most likely word incrementally. When a phone transition occurs and the phone is the end of the word as specified in the phonetic lexicon, the word is extracted in the word_extracted predicate. This corresponds to the 'word transition' nodes in the grammar networks of HTK and Kaldi.

Figure 29 The INOTS discourse task uses a finite state automaton to describe a turn-based dialogue agent.

The agent moves from a T-state to an L-state and vice versa until it enters an exit state, at which point the conversation ends and Sigma halts processing. In each state the agent must select an operator to apply to move to the next state. Each operator corresponds to an utterance that the agent is listening to or utters. When more than one operator is possible in the current state, a 'tie' impasse occurs. The tie here is between the utterance operators and indicates that insufficient information is available to choose between them. This halts normal processing at this level, and the agent creates a subgoal to resolve the impasse by entering a meta-level state. In this state, the agent processes audio using the continuous word recognizer described in the previous section. Each utterance operator from the level below ('below' here corresponds to the level in the impasse hierarchy) induces a distribution over the possible words that can be heard. This top-down influence acts as a prior on the words possible given the current state in the conversation, changing their likelihood as a function of the state of the conversation. Speech is processed in a deliberative fashion and informs the discourse-level NLU classifier using the recognizer from Figure 27. The words extracted in this fashion are used to inform the utterance classifier. At the end of the utterance, a special frame denotes the end of speech. This causes the impasse to end; the operator corresponding to the utterance that was most likely heard is chosen and passed to the base level, where the agent applies it. Application of the utterance operator causes the agent to move to the next state, which is a T-state. In this T-state, the agent selects the operator corresponding to the response it utters. This causes the agent to move to the next L-state, and another impasse ensues in which the agent listens to speech input to determine how
to proceed. This is shown in Figure 30.

Figure 30 Impasse-based dialogue agent. A tie impasse results at the base level when the agent cannot select among the utterance operators. This triggers creation of the next level, where speech input is processed incrementally using the cognitive idiom from the previous section. At the beginning of each impasse, each utterance operator has a utility associated with it, which translates into a prior over the words likely at this point in the conversation. When all speech is processed, the most likely utterance is returned to the base level.

The INOTS task specifies each conversation prompt. These prompts were recorded and processed using HTK as described previously. The prompts were used to build the bag-of-words classifier at the NLU level. One L-state transitioning to two different T-states was implemented.

4.4.1 Discussion

This fusion of sub-symbolic speech processing with symbolic problem solving demonstrates the cognitively plausible language model described previously and partially obeys the properties of language processing presented in Section 1:

• Incremental processing: Each word is incrementally chosen. However, each utterance is selected only after all acoustic input is processed. This is a limitation that can be addressed in future work.

• Bidirectional influence: Speech informs from the bottom, and the INOTS task model constrains the language model from above by making certain words more likely than others.

• Fusion of symbolic and sub-symbolic capabilities: The dialogue agent utilizes Sigma's symbolic capabilities – impasses, operators and utilities – to decide what to say next based on what is heard. The speech and language processing use Sigma's sub-symbolic capabilities. This unique blending of symbolic and sub-symbolic capabilities is supported by the same cognitive cycle.

• Dynamic combination of multiple capabilities: The cognitive cycle combines audio input in the form of spectral labels, linguistic input in terms of the phonetic lexicon, and simple language understanding in terms of the naïve-Bayes classifier. Unlike traditional narrow ASR systems, where the entire decoding graph is precompiled and scored using beam search or some other heuristic, linguistic knowledge here is fused dynamically with symbolic problem solving; no combined large graph of knowledge exists in the system.

Albeit very simple, the proof-of-concept discourse model presented in this section satisfies several important desiderata presented in Section 1. However, there are several important limitations, such as the INOTS discourse model being an FSA, the NLU being a simple bag-of-words classifier, and the response being generated only after the entire speech input is processed. Another important limitation is that the acoustic signal representation consists of discrete vector-quantized labels. Other limitations are discussed in the next section.

4.5 Grammar Parsing using SPNs

The spoken language processing and discourse model presented earlier did not use syntax while processing speech and language. As described earlier in Section 2.2.4, constituency parsing is an important form of language processing and is a difficult problem for graphical models to solve exactly. In this section we prove that Sigma's cognitive language is able to specify any valid SPN while retaining the desirable inference properties of SPNs in the graphs that result.
This is accomplished by first considering how inference works in factor graphs, which form the inner loop of the elaboration phase of the cognitive cycle, and in SPNs. An algorithm to convert an SPN into an equivalent set of Sigma conditionals is then presented. The effectiveness of this algorithm in producing Sigma conditionals that retain the desirable inference properties of SPNs is proven by showing that for the smallest valid SPN (and the corresponding Sigma model): (1) the posteriors calculated in the corresponding sum and product nodes in Sigma's working memory are the same as in the original SPN, (2) the messages incoming at these nodes are the same as in the original SPN, and (3) there are no cycles in the resulting Sigma graph. These properties can then be generalized over any valid decomposable SPN recursively created using the definition of SPNs presented by Gens and Domingos (2013).

4.5.1 SPNs and Graphical Models

We borrow notation from Darwiche (2009) and Koller and Friedman (2009) to describe SPNs and graphical models, respectively. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a graphical model with variable nodes in set $\mathcal{V}$ and edges in edge set $\mathcal{E}$. Each variable node is a discrete11 random variable $X$ that takes on values $x_i$ in its domain $\Delta(X)$. Figure 31 shows an example of a Bayesian network, the corresponding factor graph representation, and the SPN corresponding to them.

11 We consider discrete random variables for the sake of simplicity; however, these concepts can be applied to continuous variables with suitable modifications.

An indicator variable typically takes on the value 1 if the supporting variable takes on the corresponding value. Here, we extend the definition of indicator variables as done in Gens and Domingos (2013). We define an indicator variable $\mathcal{I}(X_i)$ for every $x_i \in \Delta(X)$. $\mathcal{I}(X_i)$ takes on the value 1 if the corresponding variable $X$ is observed in evidence $E$ and takes the value $x_i$, or if $X$ is not observed as part of $E$:

$\mathcal{I}(X_i) = \begin{cases} 1, & \text{if } X \in E \text{ and } E \text{ indicates } X = x_i, \text{ or } X \notin E \\ 0, & \text{otherwise} \end{cases}$ (1)

The networks shown in Figure 31 encode the distribution:

$P(X, Y) = \sum_{X,Y} P(X)\, P(Y \mid X)$ (2)

Here, $P(X)$ is the prior on $X$ and $P(Y \mid X)$ is the conditional probability distribution of $Y$ given $X$. The factor graph is a bipartite representation $\mathcal{G} = (\mathcal{V}, \mathcal{F})$, where the variable nodes $\mathcal{V}$ correspond to variables and the factor nodes $\mathcal{F}$ encode functions over variables. The prior and the conditional distribution from Eq. 2 are shown as local factors, with the joint distribution being the global function represented by the graph. Inference is carried out via message passing, with Figure 31 showing the messages sent over the links from variable nodes. In particular, the factor graph in Figure 31 encodes the following distribution:

$P(X, Y) = \sum_{x_i \in \Delta(X),\, y_j \in \Delta(Y)} \mathcal{I}(X_i)\, \mathcal{I}(Y_j)\, P(X)\, P(Y \mid X)$ (3)

The value of the factor graph is determined by the evidence provided, as applied via the definition of the indicator variables, to perform inference via message passing. Here we present a brief description of the message passing algorithm using our notation and interpretation. The messages incident on variable nodes come from factor nodes and represent beliefs over the domains of the respective variables. Messages over different links at variable nodes are combined via multiplication. At the variable nodes, we provide evidence by multiplying in the respective indicator variables.
If the variable is observed as part of the evidence, then this reduces the message to a non-zero value only for the value observed. The calculation performed at the variable nodes is a pointwise product over the variable's domain, and an update message is generated for the other links. The outgoing message from a variable node to a factor node thus includes a product of all incoming messages except the one from that particular factor node:

$\mu_{x \to f}(x_i) = \prod_{f' \in \mathrm{ne}(X) \setminus \{f\}} \mathcal{I}(X_i)\, \mu_{f' \to x}(x_i), \quad x_i \in \Delta(X)$ (4)

where $\mathrm{ne}(X)$ denotes the factor nodes that are neighbors of $X$. The factor nodes receive messages from the variable nodes that are their neighbors and multiply these messages together along with their local functions. Outgoing messages to variable nodes are then generated by summing out the other variables not in the message:

$\mu_{f \to x}(x_i) = \sum_{y} F(x, y) \prod_{y \in \mathrm{ne}(f) \setminus \{x\}} \mu_{y \to f}(y_j), \quad x_i \in \Delta(X)$ (5)

This has the effect of selecting a particular state from each of the local factors and multiplying them to obtain the global function, consistent with any provided evidence. For the variables that are not observed, this has the effect of yielding their marginals at the respective variable nodes, given the evidence (Kschischang, Frey, & Loeliger, 2001). It is important to note that the number of calculations at the factor nodes is exponential in the number of variables in the largest factor and their domain size; in particular, it is $O(m^n)$, where $m$ is the size of the largest variable domain and $n$ is the number of variables. It is also important to note that if certain states in the domains of the variables participating in a factor do not actually participate, the factor graph has no way of specifying that. One way to enable this would be to push the sums inside the products and only perform those operations that are needed for efficient, compact inference, not performing unneeded products and sums. An example of this will be seen later in the domain of PCFG parsing, where the grammar itself specifies which sums and products are necessary, forgoing those that are not needed.

Figure 31 A Bayesian network shown in standard notation (a) and the corresponding factor graph (b). Each variable here is Boolean. The factor graph version shows the messages that are to be exchanged along each link. The corresponding SPN is shown in (c).

The factor graph shown in Figure 31(b) is a tree, but the overall procedure can be generalized to graphs with cycles. Inference may not be exact in graphs with cycles, and the cost is typically proportional to the 'treewidth' of the graph – a measure of the connectedness of the graph. In such cases, the inference is approximate as well as exponential in nature. SPNs avoid these problems by being able to specify only the operations that are needed to compute the marginal efficiently. This is achieved via the use of a network polynomial (Darwiche, 2009) and the postulation of hidden variables to express the network polynomial efficiently.

An SPN is a tree rooted in a sum node, encoding a partition function over a probability distribution. More formally, we use the definition of a decomposable SPN from Gens and Domingos (2013) because it helps show how to compose SPNs from basic elements. Consider a set of variables $\mathcal{X} = \{X_1, X_2, \ldots, X_N\}$ and let $\mathcal{X}_K$ be a partition such that $\mathcal{X}_K = X_1 \cup X_2 \cup \ldots \cup X_K$. An SPN $S(X)$ is recursively defined and constructed by (repeated) application of the following rules:

1. An indicator variable $\mathcal{I}(X_i^j)$ is an SPN $S(\{X_i\})$.
The previous definition of indicator variables, presented in Eq. 1, applies.

2. A product $\prod_{k=1}^{K} S_k(X_k)$ is an SPN, with the SPNs $\{S_k(X_k)\}_{k=1}^{K}$ as factors.

3. A weighted sum $\sum_{k=1}^{K} w_k S_k(X)$ is an SPN, where the $\{w_k\}$ are non-negative weights that combine the SPNs $S_k(X)$ and that sum to 1 for probability distributions.

The above definition also implies a structure of alternating sum and product nodes, with the topmost root node being a sum node. SPNs are evaluated in two distinct passes. There is a bottom-up pass for evaluating the probability of particular evidence, as applied via the indicator variables. To obtain the most probable explanation (MPE) assignment of unobserved variables, a top-down pass is also needed, selecting the most likely branch at each sum node. The top-down and bottom-up messages are calculated via the application of differentiation in arithmetic circuits. Interested readers may refer to Darwiche (2009) for a detailed explanation of how these messages are obtained. Here we present just a high-level overview of them, in the interest of showing how they can be computed in Sigma.

Figure 32 shows the messages exchanged between child and parent nodes, in both directions. In the bottom-up direction, the message going from a sum node $S_i^{\oplus}$ to its parent node is simply the weighted addition of its child nodes, and similarly the message going from a product node $S_i^{\otimes}$ to its parent node is simply the product of its children. These outgoing messages are the values of the SPNs rooted in those nodes. In the top-down direction, the value of a product node is simply the weighted addition of its parents' values:

$S_i^{\otimes} = \sum_{k \in pa(i)} w_{ki}\, \partial S(X) / \partial S_k(X)$ (6)

whereas the value of a sum node is:

$S_i^{\oplus} = \sum_{k \in pa(i)} w_{ki}\, \partial S(X) / \partial S_k(X) \prod_{l \in Ch(pa(i)) - i} S_l(X)$ (7)

Figure 32 Messages computed by SPNs in the bottom-up pass (a) and in the top-down pass (b), from equations 6 and 7 respectively.

It is important to note that, by specifying computations directly in a computational trellis, SPNs provide a compact representation of all of the operations needed to perform exact inference.

4.5.2 Converting SPNs to Sigma Conditionals

As discussed previously, long-term knowledge is specified in Sigma using the language of conditionals. In this section we describe an algorithm to translate an SPN into an equivalent Sigma model. We focus on valid, decomposable SPNs only, because we are interested in translating only such SPNs into Sigma. Learning a valid, decomposable SPN from a dense SPN is left for future work.

Algorithm 1: Translate an SPN to a Sigma model
Input: An SPN $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with $\mathcal{V} = \{S^{\oplus}, S^{\otimes}, \mathcal{I}(\mathcal{X}_i)\}$.
Output: A Sigma model with associated conditionals.

1. Declare a set of perception predicates perceive($\mathcal{X}_i$) for the $\mathcal{I}(\mathcal{X}_i)$ indicator nodes.
2. Declare two sets of predicates {sum_alpha($S^{\oplus}$)} and {sum_beta($S^{\oplus}$)} for the bottom-up and top-down values of SPN $S^{\oplus}$ nodes.
3. Declare sets of predicates {prod_alpha($S^{\otimes}$)} and {prod_beta($S^{\otimes}$)} for the bottom-up and top-down values of SPN $S^{\otimes}$ nodes.
4. Declare a set of predicates {sum_gamma($S^{\oplus}$)} for the posteriors.
5. In the bottom-up direction, from indicators to root, for each element of sum_alpha($S_i^{\oplus}$):
   i. For each product child node $S_j^{\otimes}$ of $S_i^{\oplus}$, create a conditional such that:
      a. The predicate in the condition is prod_alpha($S_j^{\otimes}$).
      b. There is an action for the predicate sum_alpha($S_i^{\oplus}$).
      c. There is a function in the conditional that corresponds to the weight $w_{ij}$.
6. In the bottom-up direction, from indicators to root, for each element of prod_alpha($S_i^{\otimes}$):
   i. Create a conditional such that:
      a. The predicates in the conditions are sum_alpha($S_j^{\oplus}$), where each $S_j^{\oplus}$ is a child of $S_i^{\otimes}$.
      b. There is an action for the predicate prod_alpha($S_i^{\otimes}$).
7. In the top-down direction, from root to indicators, for each element sum_beta($S_i^{\oplus}$):
   i. For each parent node $S_j^{\otimes}$ of $S_i^{\oplus}$, create a conditional such that:
      a. The predicates in the conditions are sum_alpha($S_c^{\oplus}$), where each $S_c^{\oplus}$ is a child of $S_j^{\otimes}$; exclude the sum_alpha($S_i^{\oplus}$) predicate from the conditions.
      b. Add the predicate prod_beta($S_j^{\otimes}$) to the conditions.
      c. There is an action for the predicate sum_beta($S_i^{\oplus}$).
8. In the top-down direction, from root to indicators, for each element prod_beta($S_i^{\otimes}$):
   i. For each parent node $S_j^{\oplus}$ of $S_i^{\otimes}$, create a conditional such that:
      a. The predicate in the condition is sum_beta($S_j^{\oplus}$).
      b. There is an action for the predicate prod_beta($S_i^{\otimes}$).
      c. There is a function in the conditional that corresponds to the weight $w_{ji}$.
9. Initiate the bottom-up pass by providing evidence via Sigma's perception mechanism for the indicator variables, in accordance with their definition.
10. Initiate the top-down pass by providing evidence of "1" for the root node sum_beta($S_{root}^{\oplus}$).

The number of predicates required in Sigma is at most three times the number of nodes in the original SPN. This is because we compute the values of each node in two separate passes, based on separate trees for the bottom-up and top-down computations, and then combine them via a third set of predicates. Deconstructing the graph in this fashion breaks the loops that would otherwise occur among the bottom-up and top-down messages along individual links; an example of this will be seen in the next section. Furthermore, the messages computed by Sigma are a superset of the messages computed by the SPN, but scale by a constant factor, as shall also be seen in the next section. Finally, selection of the most likely path – as required by some algorithms, such as Viterbi (Rabiner, 1989) – can be performed via the selection/decision process provided by Sigma's cognitive cycle.

Proposition 1: Algorithm 1 produces a set of conditionals that computes the same posteriors as the sum and product nodes in the original SPN.

Proof: Recall from the earlier discussion of Sigma's graphical architecture that messages are combined via product for predicates that are in conditions, and via addition when a predicate appears in the actions of multiple conditionals. Consider the SPN messages shown in Figure 32. It can be established, due to the conditionals created in steps 5 and 6, that the messages generated towards the sum_alpha($S_i^{\oplus}$) and prod_alpha($S_i^{\otimes}$) predicates are the same as those shown in Figure 32(a). Similarly, the messages generated towards the predicates sum_beta($S_i^{\oplus}$) and prod_beta($S_i^{\otimes}$), using the conditionals from steps 7 and 8, are those shown in Figure 32(b). Since Algorithm 1 can thus generate Sigma conditionals to calculate the messages for basic valid SPNs, by induction all decomposable SPN messages – i.e. messages exchanged in decomposable SPNs recursively created using the SPN definition from Section 4.5.1 – can be generated by conditionals from the steps in Algorithm 1. ∎

Proposition 2: Algorithm 1 induces a Sigma graph that is tractable.
Proposition 2: Algorithm 1 induces a Sigma graph that is tractable.

Proof: The compilation process yields a set of predicates and conditionals that compile into a graph composed of bottom-up and top-down trees that are isomorphic to the corresponding SPN trees, up to the addition of a constant factor of additional nodes and messages. The sum-product algorithm respects this tree structure, computing the same results in the same manner as the SPN models. ∎

4.5.3 Parsing in Sigma with SPNs

Figure 33 shows a Sigma conditional generated by Step 5 of Algorithm 1, in the bottom-up direction ('inside' in the context of PCFGs). The conditions correspond to the RHS of the grammar rule S -> A B of the grammar shown in Figure 5(a), and are generated according to Step 5.i.a. The action represents the LHS of the rule and is generated by Step 5.i.b. The function corresponds to the probability of the rule and is generated by Step 5.i.c.

   CONDITIONAL Bottom-Up-S_AB
      Conditions: Left(non-terminal:A)
                  Right(non-terminal:B)
      Actions: Head(non-terminal:S)
      Function: 0.6

Figure 33: Conditional for grammar rule S -> A B with probability 0.6. Head corresponds to the root sum node in Figure 5(b). The conditional corresponds to the leftmost of the four links in the bottom-up direction toward the root sum node. The function represents the weight on the link between the sum and product nodes.

Figure 34 shows the analogous conditional in the top-down direction ('outside' in the context of PCFGs), as generated by Step 7 of the algorithm.¹²

   CONDITIONAL Top-Down-S_AB
      Conditions: Head(non-terminal:S)
                  Left(non-terminal:A)
      Actions: Right(non-terminal:B)
      Function: 0.6

Figure 34: Conditional for the downward message from the root node to an intermediate sum node, corresponding to grammar rule S -> A B with probability 0.6. Head corresponds to the root sum node in Figure 5(b). The downward message to an intermediate node also includes the bottom-up message from its sibling, here from Left. The conditional sends the downward message to the right child of the rule.

¹² The Sigma-SPN parser code for a set of sample sentences from the grammar of Figure 5(a) is available at https://bitbucket.org/hima_cogarch/acs_spn/.

It is straightforward to see that these conditionals produce a graph that, when operated upon by Sigma's sum-product algorithm, yields inference with the desired properties of the underlying SPN. Exactness of inference is established by Proposition 1 and tractability by Proposition 2, both from Section 4.5.2. To explore these claims empirically, Algorithm 1 was implemented and applied to the PCFG in Figure 5(a), with the resulting graphs then solved in Sigma. SPNs corresponding to sentence lengths varying from three up to fifteen words were generated (see the construction sketch at the end of this section). Figure 35 shows the number of messages exchanged in the Sigma SPN model as a function of the number of links in the underlying SPN. The number of messages exchanged in the Sigma SPN graph is linear (R² of 0.99) in the size of the underlying SPN, as opposed to the exponential growth that would be expected for a pure factor graph (Naradowsky, Vieira, & Smith, 2012).

Figure 35: The number of messages exchanged in the Sigma graph is linear in the size of the underlying SPN, in terms of the number of links in the SPN tree.
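The following is a hedged sketch (reusing the Leaf, Sum and Product classes from the earlier sketch; the grammar encoding and all names are assumptions for illustration, not the actual experimental code) of how an SPN over an n-word sentence can be generated from a PCFG in Chomsky normal form: one sum node per (symbol, span) and one product node per (rule, split point), so that the bottom-up pass computes the inside probability of the sentence.

```python
# Sketch: build an SPN for parsing a fixed-length sentence under a CNF PCFG.
# grammar: {(A, B, C): p} for rules A -> B C; lexicon: {(A, word): p}.

def build_spn(grammar, lexicon, words, start='S'):
    n = len(words)
    cells = {}                                  # (symbol, i, j) -> Sum node

    def cell(sym, i, j):
        if (sym, i, j) in cells:
            return cells[(sym, i, j)]
        children, weights = [], []
        if j == i + 1:                          # one-word span: lexical rule
            children.append(Leaf(lexicon.get((sym, words[i]), 0.0)))
            weights.append(1.0)
        else:
            for (a, b, c), p in grammar.items():
                if a != sym:
                    continue
                for k in range(i + 1, j):       # one product node per split
                    children.append(Product(cell(b, i, k), cell(c, k, j)))
                    weights.append(p)
        cells[(sym, i, j)] = Sum(children, weights)
        return cells[(sym, i, j)]

    return cell(start, 0, n)                    # root sum node, value = inside prob
```

Because cells are shared across parents, this structure is a DAG rather than a tree; the bottom-up pass above still applies as-is, but a top-down pass over it would need to accumulate derivatives across multiple parents rather than use the simple tree recursion of the earlier sketch.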
Although Sigma’s graphical layer is based on traditional graphical models plus a variant of the standard sum-product solution algorithm, an early extension to Sigma that enabled unidirectional message passing in service of mapping rule-like behavior onto its graphs, and which was later leveraged to also map neural networks onto them, has been proven capable of enabling Sigma to encode any valid SPN while retaining their exactness and tractability. As part of this, an algorithm has been presented that maps SPNs on to Sigma’s conditionals. The exactness of this algorithm has been proved theoretically by showing Sigma SPNs calculate the same posteriors as the underlying SPN, and its tractability has been proved by showing that it is within a constant factor of the SPN. This latter result was then empirically demonstrated by showing that the number of messages exchanged in the Sigma SPN is linear in the size of the underlying SPN. These results indicate that Sigma should be able to solve a variety of additional major cognitive problems tractably within the cognitive cycle that would not be possible with pure graphical models, bearing on Sigma’s sufficient efficiency desideratum. They also bear on functional elegance by leveraging the same unidirectional extension to Sigma’s graphical models that underlies rule match and 110 neural networks. The results here also depend on one other extension that was made for rule-like behavior – factor nodes that sum their input – but this was used for convenience rather than as a logical necessity. As discussed previously in section 4.2.1, this disjunctive factor node was added to the standard factor graph approach to combine results across multiple rules. This mechanism is reused in this context to perform sum as a matter of convenience although the same effect can be obtained by using multiple dimensions and summarizing across them. 111 5 Summary and Future Work This section summarizes the main contributions presented in this work and then discusses their impact in terms of Sigma’s desiderata and impact beyond Sigma. This is followed by open issues and future work. 5.1 Summary This dissertation took the first step towards a computational model supporting various capabilities – and their partial integration – in service of speech and language processing by presenting a computational model of speech processing that is developed to be broad, architecturally based, implemented in terms of constituent functionality via knowledge on top of the architecture. Various capabilities required in service of spoken language processing were discussed in terms of the processes and knowledge structures required to realize them. In particular, isolated word recognition was discussed followed by connected word recognition. This was extended to continuous phone and word recognition. A simple turn-based dialogue agent was described followed by the integration of this dialogue agent with spoken language processing is discussed. It is shown that Sigma’s cognitive cycle is able to support all these capabilities – and their integration – in a supraarchitectural fashion. Several graphical idioms – based in extended factor graphs – required for spoken language processing were identified. Additionally, grammar parsing – an important language processing capability – is discussed along with the challenge it poses to Sigma’s basis in factor graphs. SPNs – a new kind of deep architecture – provide one 112 possible solution towards challenges posed by grammar parsing. 
It is shown that Sigma’s cognitive language is able to specify any arbitrary SPN due to an extension added at the graphical layer in service of rule-based processing. This same extension allows neural networks to be specified in Sigma. This serves to demonstrate that Sigma’s cognitive language is more general than recent attempts to mix graphical models with SPNs. Additionally, this also shows one way to mix graphical models and SPNs. 5.2 Sigma’s Desiderata and Spoken Language Processing This section describes the impact of the present work in terms of Sigma’s desiderata. 5.2.1 Functional Elegance Sigma is shown to be able to achieve spoken language processing, symbolic problem solving for a discourse task and their fusion using the same core set of mechanisms that underlie other forms of cognitive processing and memories. This is an important demonstration of Sigma’s ability to support functional elegance. The SPN work shows that the same extension to Sigma’s summary product algorithm that enables rules and neural networks also enables SPNs. 5.2.2 Grand Unification The novel construction of spoken language processing is a milestone in grand unification. 113 5.2.3 Sufficient Efficiency As seen previously, Sigma’s progress on sufficient efficiency is tested. There is ground to cover in terms of executing fast enough for real time task execution. Table 2 listed the cognitive cycle times for continuous phone recognition. It is slower than expected for real time task execution. The most important implication of this is the inability of Sigma’s conversational agent to hold realtime conversations. The SPN work also has impact on sufficient efficiency by showing that Sigma enables tractable solution to a variety of major cognitive problems within the cognitive cycle that would be exponential otherwise in pure factor graphs. Another interesting direction would be evaluating more generally whether SPNs all by themselves, or some suitable generalization of them, might provide the long-sought solution to limiting the cognitive cycle to tractable, and even potentially bounded, inference while remaining sufficiently expressive to support generic cognition. Such an approach would be akin to, but hopefully more successful due to being more expressive than, earlier efforts in Soar to restrict the expressiveness of rules in order to guarantee tractable match (Tambe, Newell & Rosenbloom, 1990). 5.3 Impact Beyond Sigma The speech recognition work fulfils Newell’s promissory notes from his mapping of HARPY onto HPSA and signal the progress made by cognitive architectures. Beyond cognitive architectures, the SPN results yield a novel approach to combining the efficiency of SPNs with the generality of graphical models. Although other approaches to such a combination have recently been explored, such as Expression Graphs (Demski, 2015) and Sum Product Graphical Models (Desana & Schnorr, 2017), none 114 of these alternatives also extends to include rules and neural networks showing Sigma’s graphical layer to be more general. 5.4 Open Issues and Future work This section identifies several avenues to extend the work presented in this dissertation. 5.4.1 Full integration of Continuous Speech The most obvious next step for this work is to extend the current model to perform word-based speech recognition for a full discourse task; i.e., decode the string of words from the sequence of acoustic observations with a larger vocabulary. 
5.4.2 Learning

Beyond learning the perception and transition functions in the isolated word task, learning was not explored in this work. Several additional forms of learning can be explored, such as learning the acoustic and transition functions in the continuous word task, and the NLU classifier probabilities. Other, more sophisticated forms of learning, such as impasse-based automatization, could be used to learn SPNs. This would be an example of a reflective capability being used to learn reactive knowledge in the form of SPNs.

5.4.3 Syntax Processing

The SPN-based PCFG parsing model described in this dissertation is purely reactive in nature. This is undesirable because the approach requires the cognitive cycle to perform too much work during a single reactive cycle, and it is not incremental. A suitable incremental approach based on PCFGs, or on the more expressive Tree Adjoining Grammars (TAGs) (Joshi & Schabes, 1997), could be explored. TAGs are more expressive than PCFGs and belong to the class of mildly context-sensitive languages. They are weakly equivalent to the embodied construction grammars recently explored in Soar (Lindes & Laird, 2017).

5.4.4 Sum Product Networks and Implications for the Graphical Architecture

An interesting question the SPN work raises for the future is whether SPNs by themselves, or a suitable generalization of them, might provide a solution to limiting the cognitive cycle to tractable and potentially bounded inference, while remaining sufficiently expressive to support generic cognition (Tambe, Newell & Rosenbloom, 1990). Additional avenues for future work include the learning of SPNs, and dynamic and online SPNs for sentence processing. Applications of SPNs towards efficient acoustic modeling, language modeling, and the like may also be explored.

6 References

Allen, J., Ferguson, G., & Stent, A. (2001). An architecture for more realistic conversational systems. Proceedings of the 6th International Conference on Intelligent User Interfaces, 1-8.

Allwood, J. (1995). An activity based approach to pragmatics. Technical Report (GPTL) 75, Gothenburg Papers in Theoretical Linguistics.
Alshawi, H. (2003). Effective utterance classification with unsupervised phonotactic models. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 1, 1-7.

Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111(4), 1036-1060.

Aubert, X. L. (2002). An overview of decoding techniques for large vocabulary continuous speech recognition. Computer Speech & Language, 16(1), 89-114.

Audacity Team. (2013, October). Audacity (Version 2.0.5) [Computer software].

Bahl, L. R., De Souza, P. V., Gopalakrishnan, P. S., Nahamoo, D., & Picheny, M. A. (1994). Robust methods for using context-dependent features and models in a continuous speech recognizer. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1, 533.

Bahl, L. R., Jelinek, F., & Mercer, R. L. (1983). A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, (2), 179-190.

Ball, J. (2011). A pseudo-deterministic model of human language processing. Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 495-500.

Ball, J., Heiberg, A., & Silber, R. (2007). Toward a large-scale model of language comprehension in ACT-R 6. Proceedings of the 8th International Conference on Cognitive Modeling.

Bilmes, J., & Bartels, C. (2005). Graphical model architectures for speech recognition. IEEE Signal Processing Magazine, 22(5), 89-100.

Bilmes, J., & Zweig, G. (2002). The graphical models toolkit: An open source software system for speech and time-series processing. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 4, IV-3916.

Campbell, J. C., Hays, M. J., Core, M., Birch, M., Bosack, M., & Clark, R. E. (2011). Interpersonal and leadership skills: Using virtual humans to teach new officers. Proceedings of the Interservice/Industry Training, Simulation, and Education Conference, 11358.

Cassell, J. (2000). Embodied conversational agents. (J. Sullivan, S. Prevost, & E. Churchill, Eds.) Cambridge, MA: MIT Press.

Chelba, C. (2000). Structured language modeling. Computer Speech & Language, 14(4), 283-332.

Chen, J., Demski, A., Han, T., Morency, L.-P., Pynadath, D., Rafidi, N., & Rosenbloom, P. S. (2011). Fusing symbolic and decision-theoretic problem solving + perception in a graphical cognitive architecture. Proceedings of the 2nd International Conference on Biologically Inspired Cognitive Architectures, 64-72.

Christiansen, M. H., & Chater, N. (2016). The Now-or-Never bottleneck: A fundamental constraint on language. Behavioral and Brain Sciences, 39.

Clark, H. H., & Schaefer, E. F. (1987). Collaborating on contributions to conversation. Language and Cognitive Processes, 2, 1-23.

Clark, H. H., & Wilkes-Gibbs, D. (1986). Referring as a collaborative process. Cognition, 22(1), 1-39.

Cohen, P. R., & Perrault, C. R. (1979). Elements of a plan-based theory of speech acts. Cognitive Science, 3(3), 177-212.

Darwiche, A. (2009). Modeling and reasoning with Bayesian networks. Cambridge University Press.

Demski, A. (2015). Expression graphs. Proceedings of the International Conference on Artificial General Intelligence (pp. 241-250). Springer, Cham.

Denes, P. B., & Pinson, E. (1993). The speech chain: The physics and biology of spoken language. Macmillan.

Deng, L. (2011). An overview of deep-structured learning for information processing. Proceedings of the Asia-Pacific Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC).

Deng, L., Seltzer, M. L., Yu, D., Acero, A., Mohamed, A. R., & Hinton, G. E. (2010). Binary coding of speech spectrograms using a deep auto-encoder. Interspeech, 1692-1695.

Desana, M., & Schnorr, C. (2017). Sum-product graphical models. arXiv preprint, https://arxiv.org/abs/1708.06438

DeVault, D., & Traum, D. (2012). Incremental speech understanding in a multi-party virtual human dialogue system. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstration Session, 25-28.

DeVault, D., Sagae, K., & Traum, D. (2011). Incremental interpretation and prediction of utterance meaning for interactive dialogue. Dialogue & Discourse, 2(1), 143-170.

Frauenfelder, U. H., & Tyler, L. K. (1987). The process of spoken word recognition: An introduction. Cognition, 25(1), 1-20.

Gale, W., & Sampson, G. (1995). Good-Turing smoothing without tears. Journal of Quantitative Linguistics, 2(3), 217-237.

Garofolo, J. (2007, May 30). WSJ0 Complete. Retrieved July 20, 2015, from Linguistic Data Consortium: https://catalog.ldc.upenn.edu/LDC93S6A

Gat, E. (1998). On three-layer architectures. In Artificial Intelligence and Mobile Robots, 195-210.

Goertzel, B. (2014). Artificial general intelligence: Concept, state of the art, and future prospects. Journal of Artificial General Intelligence, 5(1), 1-48.

Gold, B., Morgan, N., & Ellis, D. (2011). Speech and audio signal processing: Processing and perception of speech and music. Hoboken, NJ: John Wiley & Sons.

Goodwin, C. (1979). The interactive construction of a sentence in natural conversation. In Everyday Language: Studies in Ethnomethodology, 97-121.

Gray, R. M. (1984). Vector quantization. IEEE ASSP Magazine, 1(2), 4-29.

Hemphill, C. T., Godfrey, J. J., & Doddington, G. R. (1990). The ATIS spoken language systems pilot corpus. Proceedings of the DARPA Speech and Natural Language Workshop, 96-101.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., ... Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97.

Huang, Q., & Cox, S. (2006). Task-independent call-routing. Speech Communication, 48(3), 374-389.

Hutter, M. (2001). Universal artificial intelligence: Sequential decisions based on algorithmic probability. In Machine Learning: ECML, 226-238.

Issar, S., & Ward, W. (1993). CMU's robust spoken language understanding system. Proceedings of Eurospeech '93.

Jelinek, F. (1997). Statistical methods for speech recognition. Cambridge, MA: MIT Press.

Jilk, D. J., Lebiere, C., O'Reilly, R. C., & Anderson, J. R. (2008). SAL: An explicitly pluralistic cognitive architecture. Journal of Experimental and Theoretical Artificial Intelligence, 20(3), 197-218.

Jordan, M. I. (2004). Graphical models. Statistical Science, 140-155.

Jordan, M. I., & Sejnowski, T. J. (2001). Graphical models: Foundations of neural computation. Cambridge, MA: MIT Press.

Joshi, A. K., & Schabes, Y. (1997). Tree-adjoining grammars. In Handbook of Formal Languages (pp. 69-123). Berlin, Heidelberg: Springer.

Joshi, H., Rosenbloom, P. S., & Ustun, V. (2014). Isolated word recognition in the Sigma cognitive architecture. Biologically Inspired Cognitive Architectures, 10, 1-9.

Juang, B., & Rabiner, L. (1991). Hidden Markov models for speech recognition. Technometrics, 33(3), 251-272.

Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, speech recognition, and computational linguistics. Upper Saddle River, NJ: Prentice Hall.

Kanthak, S., Ney, H., Riley, M., & Mohri, M. (2002). A comparison of two LVR search optimization techniques. Proceedings of INTERSPEECH.

Kneser, R., & Ney, H. (1995). Improved backing-off for m-gram language modeling. Proceedings of ICASSP-95, 1, 181-184.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1137-1145.

Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. Cambridge, MA: MIT Press.

Kschischang, F. R., Frey, B. J., & Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498-519.

Laird, J. E. (2012). The Soar cognitive architecture. Cambridge, MA: MIT Press.

Laird, J. E., Lebiere, C., & Rosenbloom, P. S. (2017). A standard model of the mind: Toward a common computational framework across artificial intelligence, cognitive science, neuroscience, and robotics. AI Magazine, 38(4).

Laird, J., Kinkade, K., Mohan, S., & Xu, J. (2012). Cognitive robotics using the Soar cognitive architecture. Cognitive Robotics: AAAI Technical Report WS-12-06.

Langley, P., Laird, J. E., & Rogers, S. (2009). Cognitive architectures: Research issues and challenges. Cognitive Systems Research, 10(2), 141-160.

Lavrenko, V., Choquette, M., & Croft, W. B. (2002). Cross-lingual relevance models. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 175-182.

Lebiere, C., O'Reilly, R. C., Jilk, D. J., Taatgen, N., & Anderson, J. R. (2008). The SAL integrated cognitive architecture. AAAI Fall Symposium: Biologically Inspired Cognitive Architectures, 98-104.

Lee, K., & Hon, H. (1989). Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11), 1642-1648.

Lehman, J. F., Laird, J. E., & Rosenbloom, P. S. (1996). A gentle introduction to Soar, an architecture for human cognition. Invitation to Cognitive Science, 4, 212-249.

Leuski, A., & Traum, D. (2011). NPCEditor: Creating virtual human dialogue using information retrieval techniques. AI Magazine, 32(2), 42-56.

Lewis, R. L. (1993). An architecturally-based theory of human sentence comprehension. Proceedings of the 15th Annual Conference of the Cognitive Science Society, 108-113.

Linguistic Data Consortium. (1991). TI 46 Word speech database, speaker-dependent isolated word corpus. Retrieved January 10, 2013, from https://catalog.ldc.upenn.edu/LDC93S9

Lopes, C., & Perdigao, F. (2011). Phone recognition on the TIMIT database. In Speech Technologies (pp. 285-302). INTECH Open Access Publisher.

Lowerre, B. (1976). The HARPY speech recognition system. Doctoral dissertation, Carnegie Mellon University.

Lowerre, B., & Reddy, R. (1976). The HARPY speech recognition system: Performance with large vocabularies. The Journal of the Acoustical Society of America, 60(S1), S10-S11.

Marslen-Wilson, W. D. (1987). Functional parallelism in spoken word-recognition. Cognition, 25(2), 71-102.

Marslen-Wilson, W. D., & Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10(1), 29-63.

McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18(1), 1-86.

Mohri, M. (1997). Finite-state transducers in language and speech processing. Computational Linguistics, 23(2), 269-311.

Mohri, M., Pereira, F., & Riley, M. (2008). Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing, 559-584.

Morbini, F., Audhkhasi, K., Sagae, K., Artstein, R., Can, D., Georgiou, P., ... Traum, D. (2013). Which ASR should I choose for my dialogue system? Proceedings of SIGDIAL.

Murphy, K. P. (2002). Dynamic Bayesian networks: Representation, inference and learning. Doctoral dissertation, University of California, Berkeley.

Murphy, K. P., & Paskin, M. A. (2002). Linear-time inference in hierarchical HMMs. Advances in Neural Information Processing Systems, 2, 833-840.

Naradowsky, J., Vieira, T., & Smith, D. (2012). Grammarless parsing for joint inference. Proceedings of COLING, 1995-2010.

Newell, A. (1973). You can't play 20 questions with nature and win: Projective comments on the papers of this symposium. In W. G. Chase (Ed.), Visual Information Processing.

Newell, A. (1978). Harpy, production systems and human cognition. In R. Cole (Ed.), Perception and Production of Fluent Speech, 299-380.

Newell, A. (1990). Unified theories of cognition. Cambridge, MA: Harvard University Press.

Newell, A., & Simon, H. A. (1976). Computer science as empirical inquiry: Symbols and search. Communications of the ACM, 19(3), 113-126.

Nilsson, N. J. (2007). The physical symbol system hypothesis: Status and prospects. Springer Berlin Heidelberg.

O'Reilly, R. C. (1998). Six principles for biologically based computational models of cortical cognition. Trends in Cognitive Sciences, 2(11), 455-462.

O'Reilly, R. C., Hazy, T. E., & Herd, S. A. (2012). The Leabra cognitive architecture: How to play 20 principles with nature and win. The Oxford Handbook of Cognitive Science.

Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning. University of Illinois Press.

Paetzel, M., Racca, D. N., & DeVault, D. (2014). A multimodal corpus of rapid dialogue games. Proceedings of the Language Resources and Evaluation Conference (LREC).

Perrault, C. R., & Allen, J. F. (1980). A plan-based analysis of indirect speech acts. Computational Linguistics, 6(3-4), 167-182.

Pieraccini, R., Levin, E., & Lee, C. H. (1991). Stochastic representation of conceptual structure in the ATIS task. Proceedings of the DARPA Speech and Natural Language Workshop, 121-124.

Plátek, O., & Jurčíček, F. (2014). Free on-line speech recogniser based on Kaldi ASR toolkit producing word posterior lattices. Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 108.

Poon, H., & Domingos, P. (2011). Sum-product networks: A new deep architecture. IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 689-690.

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... Vesely, K. (2011). The Kaldi speech recognition toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.

Pynadath, D., & Wellman, M. P. (1998). Generalized queries on probabilistic context-free grammars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), 65-77.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.

Rickel, J., & Johnson, W. L. (1999). Virtual humans for team training in virtual reality. Proceedings of the Ninth International Conference on Artificial Intelligence in Education, 578-585.

Robinson, A. J. (1994). An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5(2), 298-305.

Rosenbloom, P. S. (2009a). A graphical rethinking of the cognitive inner loop. Proceedings of the IJCAI International Workshop on Graphical Structures for Knowledge Representation and Reasoning.

Rosenbloom, P. S. (2009b). Towards uniform implementation of architectural diversity. Proceedings of the AAAI Fall Symposium on Multi-Representational Architectures for Human-Level Intelligence, 32-33.

Rosenbloom, P. S. (2010). Combining procedural and declarative knowledge in a graphical architecture. Proceedings of the 10th International Conference on Cognitive Modeling, 205-210.

Rosenbloom, P. S. (2011a). From memory to problem solving: Mechanism reuse in a graphical cognitive architecture. Proceedings of the 4th Conference on Artificial General Intelligence, 143-152.

Rosenbloom, P. S. (2011b). Mental imagery in a graphical cognitive architecture. Proceedings of the 2nd Annual International Conference on Biologically Inspired Cognitive Architectures, 314-323.

Rosenbloom, P. S. (2012a). Deconstructing reinforcement learning in Sigma. Proceedings of the 5th Conference on Artificial General Intelligence, 262-271.

Rosenbloom, P. S. (2012b). Graphical models for integrated intelligent robot architectures. Proceedings of the AAAI Spring Symposium on Designing Intelligent Robots: Reintegrating AI.

Rosenbloom, P. S. (2012c). Towards a 50 msec cognitive cycle in a graphical architecture. Proceedings of the 11th International Conference on Cognitive Modeling, 305-310.

Rosenbloom, P. S. (2013). The Sigma cognitive architecture and system. AISB Quarterly, 136, 4-13.

Rosenbloom, P. S. (2014). Deconstructing episodic memory and learning in Sigma. Proceedings of the 36th Annual Conference of the Cognitive Science Society.

Rosenbloom, P. S. (2015). Supraarchitectural capability integration: From Soar to Sigma. Proceedings of the 13th International Conference on Cognitive Modeling.

Rosenbloom, P. S., Demski, A., Han, T., & Ustun, V. (2013). Learning via gradient descent in Sigma. Proceedings of the 12th International Conference on Cognitive Modeling, 35-40.

Rosenbloom, P. S., Gratch, J., & Ustun, V. (2015). Towards emotion in Sigma: From appraisal to attention. Proceedings of the 8th Conference on Artificial General Intelligence.

Rosenbloom, P. S., Newell, A., & Laird, J. E. (1991). Towards the knowledge level in Soar: The role of the architecture in the use of knowledge. Architectures for Intelligence: The Twenty-Second Carnegie Mellon Symposium on Cognition, 75-111.

Russell, S. J., & Norvig, P. (2010). Artificial intelligence: A modern approach. Upper Saddle River, NJ: Prentice Hall.

Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. Proceedings of the International Joint Conference on Artificial Intelligence, 1146-1152.

Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 696-735.

Sadek, M. D. (1991). Dialogue acts are rational plans. Proceedings of the ESCA/ETR Workshop on Multi-Modal Dialogue.

Schlangen, D. (2005). Modelling dialogue: Challenges and approaches. Künstliche Intelligenz, 3.

Schuler, W., Wu, S., & Schwartz, L. (2009). A framework for fast incremental interpretation during speech decoding. Computational Linguistics, 35(3), 313-343.

Schwenk, H. (2005). Continuous space language models. Computer Speech & Language, 21(3), 492-518.

Selfridge, E. O., Arizmendi, I., Heeman, P. A., & Williams, J. D. (2012). Integrating incremental speech recognition and POMDP-based dialogue systems. Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 275-279.

Seneff, S., & Polifroni, J. (2000). Dialogue management in the Mercury flight reservation system. ANLP/NAACL Workshop on Conversational Systems.

Shaw, M. (1996). Software architecture: Perspectives on an emerging discipline. Prentice-Hall.

Shiffrin, R. M., & Schneider, W. (1977). Controlled and automatic human information processing: II. Perceptual learning, automatic attending and a general theory. Psychological Review, 84(2), 127-190.

Skantze, G., & Schlangen, D. (2009). Incremental dialogue processing in a micro-domain. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 745-753.

Steedman, M., & Baldridge, J. (2011). Combinatory categorial grammar. In Non-Transformational Syntax: Formal and Explicit Models of Grammar.

Sun, R. (2006). The CLARION cognitive architecture: Extending cognitive modeling to social simulation. In R. Sun (Ed.), Cognition and Multi-Agent Interaction (pp. 79-99). New York: Cambridge University Press.

Sun, R., Merrill, E., & Peterson, T. (2001). From implicit skills to explicit knowledge: A bottom-up model of skill learning. Cognitive Science, 25(2), 203-244.

Sun, R., Peterson, T., & Sessions, C. (2002). Beyond simple rule extraction: Acquiring planning knowledge from neural networks. Neural Nets WIRN Vietri-01, 288-300. Springer.

Tambe, M., Newell, A., & Rosenbloom, P. S. (1990). The problem of expensive chunks and its solution by restricting expressiveness. Machine Learning, 5(3), 299-348.

Toussaint, M., Plath, N., Lang, T., & Jetchev, N. (2010). Integrated motor control, planning, grasping and high-level reasoning in a blocks world using probabilistic inference. IEEE International Conference on Robotics and Automation (ICRA), 385-391.

Traum, D. R., & Larsson, S. (2003). The information state approach to dialogue management. In Current and New Directions in Discourse and Dialogue.

Traum, D., Aggarwal, P., Artstein, R., Foutz, S., Gerten, J., Katsamanis, A., ... Swartout, W. (2012). Ada and Grace: Direct interaction with museum visitors. Intelligent Virtual Agents, 245-251. Springer Berlin Heidelberg.

Traum, D., Marsella, S. C., Gratch, J., Lee, J., & Hartholt, A. (2008). Multi-party, multi-issue, multi-strategy negotiation for multi-modal virtual agents. Intelligent Virtual Agents, 117-130.

Ustun, V., Rosenbloom, P. S., Sagae, K., & Demski, A. (2014). Distributed vector representations of words in the Sigma cognitive architecture. Artificial General Intelligence, 196-207.

Vertanen, K. (2006). Baseline WSJ acoustic models for HTK and Sphinx: Training recipes and recognition experiments. Cavendish Laboratory, University of Cambridge.

Walker, M., & Whittaker, S. (1990). Mixed initiative in dialogue: An investigation into discourse segmentation. Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, 70-78.

Wang, W. Y., Artstein, R., Leuski, A., & Traum, D. R. (2011). Improving spoken dialogue understanding using phonetic mixture models. FLAIRS Conference.

Weide, R. (2005). The Carnegie Mellon pronouncing dictionary [cmudict 0.6]. Carnegie Mellon University. Retrieved July 20, 2015, from http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Wittgenstein, L. (1953). Philosophische Untersuchungen [Philosophical Investigations] (G. E. M. Anscombe, Trans.). Blackwell.

Woods, W. A. (1973). Progress in natural language understanding: An application to lunar geology. Proceedings of the National Computer Conference and Exposition, 441-450.

Woods, W. A. (1979). Semantics for a question-answering system. Garland Publishing.

Yarowsky, D. (2000). Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34(1-2), 179-186.

Yngve, V. H. (1970). On getting a word in edgewise. CLS-70, 567-577.

Yost, G., & Newell, A. (1989). A problem space approach to expert system specification. Proceedings of IJCAI, 621-627.

Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., & Woodland, P. (1997). The HTK book (for HTK version 3.4). Cambridge: Cambridge University Engineering Department.

Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., & Woodland, P. (1997). The HTK book. Cambridge: Entropic Cambridge Research Laboratory.
Abstract
Cognitive architectures model the fixed structures underlying intelligence and seek to heed the original goal of AI: a working implementation of a full cognitive system, in aid of creating synthetic agents with human capabilities. Sigma is a cognitive architecture developed with the immediate aim of supporting the real-time needs of intelligent agents, robots and virtual humans. In Sigma, this requirement manifests as a system whose development is guided heuristically by knowledge about human cognition, with the ultimate desire of explaining human intelligence at an appropriate level of abstraction.

Spoken language processing is an important cognitive capability, and yet one not addressed by existing cognitive architectures. This is indicative of the mixed – symbolic and probabilistic – nature of the speech problem. Sigma, guided in its development by a core set of desiderata that are an evolution of the desiderata implicit in Newell's Unified Theories of Cognition, presents a unique opportunity to attempt the integration of spoken language understanding into a cognitive architecture. Such an attempt is an exercise in pushing cognitive architectures beyond what they are currently capable of, taking a first step towards enabling an architecturally based theory of spoken language understanding – deconstructed in terms of the interplay among the various cognitive and sub-cognitive capabilities that play important roles in the comprehension process.

This dissertation investigates the issues involved in the integration of incremental speech and language processing with cognition, in aid of spoken language understanding, guided by the desiderata driving Sigma's development. The space of possibilities this integration enables is explored, and a suitable spoken language understanding task is chosen to evaluate the key properties of the theory of spoken language understanding developed in Sigma. The speech signal obtained from an external speech front end is combined with linguistic knowledge in the form of phonetic, lexical and semantic knowledge sources. The linguistic input is converted into meaning using a natural language understanding (NLU) scheme implemented on top of the architecture.

In addition to phonetic, lexical and semantic processing, language processing involves a syntactic component. Probabilistic context-free grammar parsing is an important form of grammar processing that has not previously been realized in cognitive architectures, and it poses a challenge to Sigma's grounding in graphical models. Sigma is shown to be able to perform syntactic processing via Sum-Product Networks (SPNs), a new kind of deep architecture that allows efficient, tractable and exact inference in a wide class of problems, including grammar parsing. It is shown that Sigma's cognitive language is sufficient to specify any arbitrary valid SPN, with the tractability and exactness expected of them, demonstrating Sigma's ability to efficiently specify a wide range of problems. The implications of this are discussed, along with the Sigma mechanisms that allow for specifying SPNs. This leads to a novel relationship between neural networks and SPNs in the context of Sigma.