ENABLING OPEN DOMAIN INTERACTIVE STORYTELLING USING A DATA-DRIVEN CASE-BASED APPROACH

by

Reid Swanson

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2010

Copyright 2010 Reid Swanson

Dedication

To my parents who greatly facilitated my return to graduate school and have supported me throughout. To Lee for enduring the seemingly endless process, the many long nights and the minimal amount of free time I’ve had. To Tim for all the Java help that I would have been lost without. To Christina for all her help editing my documents. And to Andrew for his patience, guidance and support. Without all of you I couldn’t have made it.

Table of Contents

Dedication
List of Tables
List of Figures
List of Algorithms
Abstract

Chapter 1 Introduction
  1.1 Specific Areas of Impact
  1.2 Research Agendas & Roadblocks
  1.3 General Overview

Chapter 2 Related Work
  2.1 Interactive Storytelling
    2.1.1 Collaborative Fiction
    2.1.2 Story Generation Systems & Interactive Narrative
  2.2 Case-based Reasoning
    2.2.1 CBR General Approach
    2.2.2 Problems With Traditional CBR
    2.2.3 Cases
  2.3 Coherence Modeling
    2.3.1 Computational Models of Coherence

Chapter 3 Sentence Boundary Detection
  3.1 Chapter Introduction
  3.2 HTML Example
  3.3 Task Definition
  3.4 Data Preparation
    3.4.1 Preprocessing & Clean Up
  3.5 Feature Sets for Sentence Boundary Detection
  3.6 Chapter Results
  3.7 Chapter Conclusions

Chapter 4 Corpus Creation
  4.1 Chapter Introduction
  4.2 Previous Automated Story Collection
  4.3 Story Collection
    4.3.1 Corpus Selection & Annotation
    4.3.2 Data Preparation & Features Investigated
    4.3.3 Chapter Experiments & Results
  4.4 Chapter Summary

Chapter 5 Retrieval
  5.1 Information Retrieval
  5.2 A First Generation Model
  5.3 Chapter Evaluation
    5.3.1 Crowd Sourcing With Mechanical Turk
    5.3.2 Story Authoring HIT
    5.3.3 Story Rating HIT
  5.4 Chapter Experimental Results
    5.4.1 The Author’s Own Assessment
    5.4.2 Independent Rater Assessment
    5.4.3 Qualitative Analysis
  5.5 Chapter Summary

Chapter 6 Reranking
  6.1 Story Analysis
  6.2 An Enhanced Retrieval Model
    6.2.1 Reranking
    6.2.2 Story Modeling Features
  6.3 Offline Experiments
  6.4 Online Experiments
    6.4.1 Author Ratings
    6.4.2 User Ratings
  6.5 Chapter Summary

Chapter 7 Adaptation
  7.1 Adaptation Based Generation
  7.2 Chapter Experiments
    7.2.1 Authoring Results
    7.2.2 Story Rating Results
  7.3 Chapter Summary

Chapter 8 Conclusions
  8.1 Overall Summary of the Results
  8.2 Future Directions
  8.3 Summary of Contribution

Bibliography

Appendices
  Appendix A User Comments
    A.1 The comments
  Appendix B Story Examples
    B.1 Retrieval-Only
    B.2 Reranking
    B.3 Adaptation
  Appendix C System Architecture
    C.1 Backend
      C.1.1 Database Schema
      C.1.2 Client Server Architecture

List of Tables

3.1 The number of times each character class occurred as a boundary marker (bnd) and the total number of occurrences (tot) for each dataset.
3.2 Sentence delimiting feature sets 1 & 2
3.3 Sentence delimiting Feature Set 3
3.4 The best performing feature sets for each classification algorithm.
4.1 Performance using the standard Perceptron with the default parameters.
4.2 (a) Performance of the standard Perceptron after finding parameters that maximize the F1-score on the training data (10-fold cross-validation) and test data. (b) The associated parameters for each model.
4.3 Performance using the Confidence-Weighted linear classifier with the default parameters.
5.1 Author rating results.
5.2 Story authoring statistics.
5.3 Independent story rating results.
6.1 Performance using the standard Perceptron with the default parameters.
6.2 Performance using the enhanced Perceptron.
6.3 The most indicative features for feature set 15.
6.4 Author rating results.
6.5 Story authoring statistics.
6.6 Independent story rating results.
7.1 Author rating results.
7.2 Story authoring statistics.
7.3 Independent story rating results.

List of Figures

2.1 Basic case-based reasoning flow diagram
2.2 An entity grid for the short story: John went to the store. He bought some milk. The milk turned out to be sour.
3.1 HTML example
3.2 Rendered HTML example
3.3 Learning Curves
4.1 A partial number-grid for the document: I drove to the beach this morning. I found $20 in the sand. You never know what you’ll find.
4.2 Confidence-Weighted linear classifier learning curve. Extended to show the projected performance if the training and development data is used for training.
5.1 User interface for the writing component.
5.2 Story authoring HIT design
5.3 Story rating HIT design
5.4 (a) The left column shows an entire user story. The right column looks at part of the weblog story used as a proxy for generating the highlighted sentence. The words highlighted in dark-gray show the overlap leading to the highest similarity score. (b) Simple narrative arc of the user’s story.
5.5 An example story using the bigram model. The matching sentences from each weblog story are presented next to the corresponding user sentence and the overlapping bigram phrases are highlighted.
6.1 Story analysis graph.
6.2 Six ways of modeling a story.
6.3 Sentence LDA model.
7.1 The first four steps of the adaptation process. (a) Shows the potential replacements along with their relative frequency. The arrows indicate the subset of items chosen by sampling. (b) Illustrates how verb agreement is fixed using a dictionary. (c) Shows how the coreference between pronouns is resolved.
8.1 The (log) percentage of stories with a given length for each model.
8.2 Comparison of authoring time per second for each of the models.
8.3 Summary of the subjective ratings by the authors and independent raters.
8.4 The independent user ratings for the 50 most prolific users of the system.

List of Algorithms

1 Simple IR based generation algorithm
2 Reranking based generation algorithm
3 Perceptron ranker training algorithm
4 Adaptation based generation algorithm

Abstract

Digital interactive storytelling (DIS) is a compelling new medium for expressing and communicating ideas that tries to transform a normally passive experience into an active engagement in the creative process. Despite the enormous potential this medium holds, the cost and complexity of authoring compelling stories primarily driven by user actions are prohibitive in many DIS systems. While the graphical capabilities and physical interaction with these systems have advanced at a lightning pace, the ability for open interaction in complex domains remains extremely constrained.

This thesis advances text-based DIS by introducing a new architecture that allows a seemingly infinite number of complex and branching storylines to be pursued by the user in any domain they choose. The approach uses case-based reasoning methods that leverage stories describing real world events and activities that ordinary people publish to their weblogs every day. The base generation algorithm uses information retrieval techniques to find similar sentences in a corpus of over 1.5 million stories. The weblog stories containing these sentences are used as a proxy for the user’s unfolding composition, and the next sentence from the identified weblog story is used as the system’s contribution. To further improve the quality of computer responses, the candidates are reranked using a richer set of linguistic features and, finally, portions of the text are adapted to better conform to the discourse of the user’s story. Each of these components is evaluated by crowd-sourcing hundreds of users, thousands of generated stories and tens of thousands of user ratings.

Moving to this type of data-driven, case-based architecture allows the narrative variation in DIS to scale massively, limited only by the number of stories that can be collected from the Web. The breadth and depth of human experiences captured by this approach pushes us closer to free narrative interaction in the full breadth of complex social domains.

Chapter 1 Introduction

The promise of interactive storytelling is the ability to communicate all the richness and complexity of a traditionally crafted narrative while being driven primarily by the reader’s beliefs, desires and intentions. While the reader in a more conventional storytelling medium, such as a movie, is a mere passive consumer of the narrative, she is a functioning agent in the development of an interactive narrative. The practical implications of delivering on this pledge are enormous. There is hardly an application domain that would not benefit, from education, training and entertainment to personal health and therapy. However, despite the tremendous potential and over 40 years of intensive research, the ultimate vision of Holodeck-style environments remains a distant prospect.
Although it is certainly true that we currently lack the technology for ultra-realistic 3D holographic virtual environments, the primary obstacle facing the adoption of interactive storytelling systems is not a graphical environment problem. In fact, the advancement of graphical capabilities has far surpassed just about any other area of interactive storytelling. In a relatively short period of time, we have gone from the entirely text-based adventure games of the 1970s [34] to 3D virtual worlds that integrate all of our sensations to create completely immersive environments [19, 25, 29, 91, 128]. However, in spite of these impressive achievements, the stories cultivated through the interplay with these systems remain much closer to a movie directed by an external author than a narrative shaped by the desires of the interacting participant.

The reason many of the existing systems do not come to life in a dynamic new way is that, in spite of the exciting new prospects envisioned with modern technology, the development of these systems has been remarkably similar to that of traditional media. A movie, for example, typically has a primary author and a small team of additional writers who refine and shape the screenplay into a single coherent narrative. A standard interactive narrative is developed in a remarkably similar manner. Usually, a small team of writers starts with a single idea, but instead breaks it into multiple narrative threads that essentially define several unique and independent stories. Using specialized tools, an effort is made to identify places in these narratives where the user can move from one thread to another without breaking the coherence of the entire story. Although this does allow for some involvement by the user, in the end they always remain on the tracks laid down by the original authors. It is rare to get a true feeling of autonomy in a world created almost entirely by other authors.

It should be fairly clear at this point why achieving autonomy with this approach is so difficult. A traditional novel or movie takes months or years to author, and yet it is only the manifestation of one narrative thread. An interactive narrative of the same magnitude requires a huge amount of additional effort both to author the alternative variations and to ensure that these possibilities are coherent given the events preceding them. It is simply not feasible to hire enough writers to author every possible situation that is interesting to a participant. This implies that solving the content authoring problem is essentially the same as eliminating the restrictions on user participation in an interactive storytelling environment.

This thesis proposes an alternative, data-driven method that releases the user from the predetermined tracks and allows them to participate in a virtually unrestricted manner that still manages to maintain a coherent narrative discourse. In the following chapters I will discuss how a large collection of personal narratives written by ordinary people can be obtained by taking advantage of the rise of weblogs and social media. It will be shown why these types of narratives are of interest to interactive narrative systems and how these stories can be used as the knowledge base for a completely open domain, turn-based interactive storytelling system.
The remainder of this chapter will discuss two areas that would be particularly impacted by a solution to the limited interactivity problem, the roadblocks that need to be overcome, and a basic overview of how I will solve the problem.

1.1 Specific Areas of Impact

Stories are an integral part of human life and social interaction. We use them to communicate our daily experiences with others, as educational tools and as a means of entertainment. They are so important, and seem to fit so naturally with the way we think, that some believe they are a basic unit of human memory and reasoning [116]. As Connelly & Clandinin [32] remark, “...the study of narrative is the study of the ways humans experience the world.” These deep connections between stories and the human mind have often been exploited in education and in the learning sciences.

The role stories play in human cognition has deep implications for a person’s ability to learn about the world. Learning abstract concepts and relations through generalizations and propositional statements is not always an effective teaching strategy. Instead, the use of stories can help relate these abstract concepts to real-life experiences. By having to work through the connections to one’s own personal experiences, deeper and more meaningful understandings can be achieved [2]. Brown [16] further argues that the focus of education has changed in recent years from “learning about” (propositional knowledge) to “learning-to-be” (learning to learn). Addressing these issues, McDrury & Alterio [82] propose an entire teaching strategy based on a five-stage storytelling approach (story finding, story telling, story expanding, story processing and story reconstructing) that mirrors Moon’s [90] learning hierarchy of understanding. Digital storytelling is also becoming an increasingly used tool for educational purposes at various levels. The use of computers in higher education institutions has been shown to facilitate collaboration and to teach people new ways of expressing themselves through text, images, sound and other new media [71]. An automated system that could understand and adapt to the learning needs of a person through storytelling and story creation would be an invaluable asset to all kinds of educational pursuits.

In addition to their utility in practical applications such as training and education, stories also play a vital role in both our culture and the mundane aspects of our daily lives. For example, stories play a prominent role in nearly all forms of traditional and digital entertainment media. Nowhere is this more true than in the electronic gaming industry, which has exploded over the last decade. Not long ago, video games were widely regarded as a hobby only for small segments of society and younger generations. However, data obtained by the Entertainment Software Association (data from http://www.theesa.com/facts/) paints a dramatically different picture. As of 2008, 65% of American households play some form of video game, the average gamer is 35 years old and 40% of the players are now female. In 2007 the gaming industry grew at a rate of 28.4%, faster than both the movie industry (1.8%; data from http://www.mpaa.org) and the recording industry (-11.8%; data from http://www.riaa.com). The total revenue of electronic entertainment ($9.5B) is now also on par with these other major industries ($9.62B for the Motion Picture Association of America and $10.4B for the Recording Industry Association of America).
Of these sales, at least 35% of the titles are driven to some degree by narrative content (18.8% RPG, 11.6% shooters and 5% adventure), as opposed to strategy or puzzle-type games. The ability to solve the content authoring problem is a major step toward realizing the ultimate dream of interactive storytelling systems, and the consequences would provide tremendous benefits to some of the most important aspects of human culture. Achieving success is not easy, and the following section will discuss the primary research areas of focus and the major obstacles that need to be overcome.

1.2 Research Agendas & Roadblocks

Textual interactive storytelling games, such as the one proposed in this thesis, occupy an interesting position in the entertainment landscape. Although they are often scorned by game developers and literary experts alike, Montfort [89] clearly articulates the importance of textual interactive fiction on several levels. For researchers, interactive storytelling systems provide a platform for social scientific study and a tool for computational linguistics analysis. For users, they offer a way to improve language skills and reading comprehension in a less painful, and hopefully fun, environment. The use of stories in textual interactive narrative games provides a unique source of entertainment and education that requires the user to actively engage in mental retrieval and synthesis processes not found in other word games such as crossword puzzles.

Although digital interactive storytelling has many important applications, creating interactive narrative systems that provide more than a superficial depth or breadth to the virtual world is an extraordinarily difficult task. The relatively recent introduction of the medium certainly plays a major part in this difficulty. People have been writing and telling stories for thousands of years [114], but narratives that are partially driven by audience participation have only become mainstream in the last 20-30 years. The skills needed to author successful interactive narratives are not well understood, and the overwhelming majority of writing instruction is still geared toward more traditional static narratives that are consumable in print, film and television form. Unlike traditional media, the interactive story designer must be aware of, and plan for, all types of expected, and unexpected, interactions with the characters and objects in the world. This is an inherently much more complex problem that poses new challenges in maintaining coherence and control over the unfolding story.

Prevalent contemporary approaches to interactive storytelling model their worlds in formal representations and use planning or custom inference engines to drive the characters and plot forward. This tends to compound the problem, as there is a tremendous gap between the technical abilities required to author standard narrative content versus those needed to work with logical primitives and operators. To author a successful story one must possess the skills of a creative literary master and the technical knowledge of the data structures and computational mechanisms used for a particular interactive story engine. Unfortunately, the people who have the type of technical expertise necessary are generally not the same people who are best at authoring meaningful content.
Some attempts have been made to create user-friendly authoring tools for less technically inclined authors [84]; however, the level of technical ability required and the depth of content possible are still prohibitive in comparison to non-interactive writing.

A natural result of a complex authoring environment and a poorly understood development process is a soaring production cost. Façade [80] is a highly regarded interactive narrative system that situates the user at an uncomfortable dinner party engagement where the hosting couple is going through relationship problems. The story takes place in their apartment and has four distinct plot lines that are playable for about 20 minutes each. This modest one-room environment and three-character drama took a two-person team over two years to author and required hundreds of thousands of lines of code in a special-purpose behavior modeling language. Despite this ground-breaking and innovative effort, it is still fraught with many technical and substantive issues, such as poor natural language understanding and a user’s feeling of having no control [86]. Madame Bovary on the Holodeck [25] is another milestone system that attempts not only to create a 360-degree, fully immersive environment, but also to add extra depth to the characters by enhancing their actions and plans through modeling the characters’ emotional states. Despite the extremely limited range of emotional state and depth to the characters, Cavazza, one of the prime architects of this system, recently noted that each minute of game-play cost over a million dollars to produce (personal communication, AAAI Spring Symposium on Intelligent Narrative Technologies II, 2009).

The primary drawback of piecemeal story authoring in disjointed formal representations is the exponential cost associated with scaling the number of entities and relationships in the environment. Depending on the goals and vision of a system, this development expenditure may be acceptable. For example, the average cost of a Hollywood movie in 2007 was $106.6 million (data from http://news.bbc.co.uk/2/hi/entertainment/3564377.stm) and the average PlayStation 3 game was almost $15 million (data from http://www.theesa.com/facts/), both of which have been shown to be viable media formats for many years. However, even with an entertainment-sized wallet, the standard piecemeal authoring approach will not begin to address the range of activities people enjoy writing about in a fictional narrative adventure. In order to produce virtual environments that can react appropriately to this wide range of behaviors, a new methodology for generating content and driving interactive narrative systems is needed.

Most interactive narrative systems today represent objects, actions and other virtual entities in a propositional or other formal representation that can be manipulated through automated planning or specialized inferencing algorithms. In these frameworks, stories can be generated statically or interactively through goal specifications in the narrative development or by simulating character motivations. With a rich enough expression language and high-quality planning algorithms, state-of-the-art systems are capable of generating compelling, almost human-quality static narratives [19]. There is only one catch. The stories generated are heavily restricted by domain, setting, character development and length.
While the interactive narrative community acknowledges these problems and makes small incremental progress in some of these areas, the main push seems to be in extending current technologies that enable more dramaturgically complex stories. The focus on algorithmic improvements to well-understood approaches is, at least in part, due to the interdisciplinary divide between the engineering and literary components of the problem. Typically, the system engineers view their role as designing the architecture only, and the content is an afterthought left as someone else’s problem. However, in almost all cases there is no community of content authors waiting to develop stories with the system. At best, a few professional writers are hired as part of the initial project funding who end up creating the only scenarios the system will ever play.

The primary aim of this thesis is to develop a new data-driven approach where the story content is an integral part of the architecture from the beginning. While the ultimate success of a highly structured, yet completely open domain storytelling system will remain an open challenge for many years, this thesis will present a new text-based prototype application, Say Anything. This system will be capable of handling a wider variety of input than any other known contemporary interactive storytelling system.

A completely interactive, fully immersive narrative environment requires high proficiency in many areas. In graphical environments the efficiency of the rendering engine and the physical modeling of the world are daunting tasks in themselves. However, even when restricted to purely textual environments, many of the truly difficult problems remain. The system must be able to interpret the input, usually in some form of natural language, which requires an intricate level of natural language understanding. In order to produce an appropriate response to the action and resulting change of state, the system must also be able to reason with the data that it is given. To actually produce a reply that the user can interpret, the system needs a way of transforming its internal representation into language that a human can read. However, natural language is not understood simply as a sequence of propositional statements or utterances. A further discourse structure must be imposed to provide the necessary interpretation of the sequential trace of the generated output. Although it is easy to conceptually compartmentalize these facets, in reality they are tightly coupled together, requiring many aspects of linguistics, cognitive functions, and world and commonsense knowledge to work together in a highly dependent manner.

Beyond the technical understanding of a story, narratives also have a creative and artistic component. A good story is not merely a coherent ordering of sentences that conveys a point. Stories tend to have higher levels of structure as well. At a minimum, they usually follow a story arc, consisting of a beginning, development, climax and resolution, first recognized by Aristotle [3]. In addition to these simple constructs, Propp, Greimas, Barthes and others have further developed theories of narrative structure, many of which are summarized by Cavazza [26]. Besides structural constraints, narratives also employ many other devices, such as suspense and mystery, that can shape the emotional state of the reader.
Although many have tried to formalize the creative process [104, 109, 112, 131], it is not a straightforward task and is open to much debate and criticism. However, a genuinely authentic automatic storytelling system should have some procedure for dealing with these amorphous and difficult-to-define constructs.

1.3 General Overview

Enabling an interactive narrative system that can handle all of these difficult issues in an open domain setting is a very challenging task. At a minimum, a large repository of world and commonsense knowledge has long been posited as necessary, along with efficient machinery to reason with it. While this basic premise is not debated here, the nature and composition of the knowledge base as it pertains to previous narrative generation solutions is challenged.

The success of many previous narrative generation systems lies in their ability to specify narrative attributes and goals in great detail, which enables generic or specialized planning algorithms to formulate coherent stories based on the user’s input to the system. The tight integration between the content specification and planning formalisms has enriched these story worlds in many ways, such as producing more emotionally realistic characters modeled with their own beliefs, desires and intentions [108]. Unfortunately, the mechanism that leads to this success is also their Achilles heel. By coupling the authoring process so tightly with the inferencing process, it becomes increasingly difficult to scale the solutions to larger and more complex domains. Although many of the innovations in these efforts have produced compelling results, the story worlds in which they are evaluated are usually very small and highly limited.

Say Anything, the system described in this thesis, has a different model of game play than other interactive narrative systems, as well as slightly different goals defining success. Say Anything is a text-based system that takes turns writing sentences of a story with a human user. Its objective is to generate sentences that continue the user’s story in an interesting and coherent direction. Here, the goal is not to produce lengthy stories that adhere to a preset plot line or that successfully execute the desired narrative devices specified in advance. Instead, the system should be capable of generating short, compelling stories or vignettes on any topic the user decides to pursue. For example, the following is a highly rated story written with the new approach that illustrates the target quality the system aims to achieve.

The weather broke, so we sailed out of the harbor. As Victoria grew nearer, the waves grew larger and we furled some foresail and turned to run. We sailed at about 9 knots with good trim, but the storm eventually caught up with us. With its big open cockpit and heavy nose, I didn’t like its chances in the kind of sea you get out there almost continuously that time of year. Sure enough the boat was completely inadequate, and we were tossed into the cold ocean. Everyone in our group of seven tourists – five locals and a Japanese couple – was pretty excited about the experience. The Japanese couple were the ones that saved us though, with their expert swimming abilities. As far as that goes it was just the four of us. The last tourist was lost at sea, never to be found. Drowned or murdered, the bloated, stinking bodies that turn up by the hundreds will look much the same. Such is the way with storms like that!
To achieve these goals, the system described in this thesis breaks from standard interactive narrative techniques and methods. Instead, we follow a philosophy inspired by Case-Based Reasoning [110]. Although CBR will be discussed in more detail in section 2.2, the central ideology is that people primarily reason from past experience and not from a predefined set of rules. Reasoning is a process of retrieving similar past cases and adapting the already known solution to fit the current problem. One of its main advantageous properties is a separation between the knowledge a system contains and the processes by which it reasons about the problem. This framework, therefore, allows a clear division of labor between the content author and the system designer.

On its face, this is not much of an improvement over previous approaches, because it still requires human content authors in the loop. However, by choosing a convenient format for the case library, we can exploit the story-writing skills everyone on the planet already possesses. The most convenient format for people is of course the natural language they already speak. Although working with natural language poses many serious hurdles from a computational perspective, the abundance of regular stories that are available would free us from the content authoring problem and allow us to focus entirely on the underlying architecture and algorithms of the storytelling system. Chapter 4 will show that the Web hosts a vast supply of personal stories written every day and will demonstrate an effective way to find and collect them. If we accept the hypothesis that the activities and events that people describe in their stories about personal experiences significantly overlap with the types of things mentioned in a fictional story, then this type of collection will be an excellent source of material for our system. Despite an imperfect overlap, this thesis will show that this type of data is sufficient and that, even when combined with relatively simple prediction techniques, surprisingly compelling results can be achieved.

The use of knowledge written in natural language by ordinary people is not a new concept. For example, projects like Open Mind Common Sense [121] have used large web-based communities to harvest tremendous amounts of commonsense knowledge contributed as English sentences. The work in this thesis takes a similar approach to knowledge acquisition, but instead uses a massive corpus of stories written by ordinary people. Stories provide many useful advantages as a source of commonsense knowledge over explicitly elicited collections. Like Open Mind, the knowledge is written in a language most comfortable to the user, i.e., their own, but with the benefit that many stories already exist and are immediately available on people’s weblogs. A secondary advantage is that there is less potential for bias, since the people writing on their weblogs are not actually aware they are contributing knowledge for a particular purpose. Using stories written by people also has the beneficial side effect of directly solving the natural language generation problem without having to resort to elaborate concept-to-text algorithms. Since the content is already natural language, it simply needs to be returned, albeit with some possible modifications. Finally, using real-world stories at least partially avoids the thorny issues of modeling narrative-theoretic constructs and other creative processes, by leveraging the natural writing abilities of ordinary people.
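To make this concrete, the following Python sketch illustrates the basic retrieval idea described above: the user’s latest sentence is matched against the sentences of a case library of stories, and the sentence that follows the best match is returned as the system’s next turn. This is a minimal illustration and not the thesis implementation; the toy corpus, the bag-of-words cosine similarity, and the names STORIES and next_sentence are invented stand-ins for the information retrieval machinery and the 1.5-million-story corpus described later.

    # Minimal sketch of retrieval-based next-sentence generation (illustration only,
    # not the thesis implementation). A tiny in-memory corpus and a bag-of-words
    # cosine similarity stand in for a real IR index over millions of weblog stories.
    import math
    from collections import Counter

    # Hypothetical case library: each case is a story stored as a list of sentences.
    STORIES = [
        ["We sailed out of the harbor at dawn.",
         "The storm caught up with us by noon.",
         "We limped back to port soaked and exhausted."],
        ["I drove to the beach this morning.",
         "I found twenty dollars in the sand.",
         "You never know what you will find."],
    ]

    def bag_of_words(sentence):
        return Counter(sentence.lower().split())

    def cosine(a, b):
        num = sum(a[w] * b[w] for w in a if w in b)
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def next_sentence(user_sentence):
        """Return the sentence that follows the most similar sentence in the corpus."""
        query = bag_of_words(user_sentence)
        best_score, best_reply = 0.0, None
        for story in STORIES:
            for i, sent in enumerate(story[:-1]):  # the last sentence has no continuation
                score = cosine(query, bag_of_words(sent))
                if score > best_score:
                    best_score, best_reply = score, story[i + 1]
        return best_reply

    print(next_sentence("The weather broke, so we sailed out of the harbor."))
    # -> "The storm caught up with us by noon."

In the full system the candidate continuations produced in this spirit are further reranked and adapted, as outlined in the abstract and in the later chapters on reranking and adaptation.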
The remainder of this thesis will describe the algorithms used to process the natural language stories that enable the system to generate coherent sentences for the user’s story, and also how a large enough corpus of stories was collected to make this approach possible. Before diving into the details, the next chapter rounds out the first part of the thesis with a review of some previous work in interactive storytelling and in other areas of research that contribute to the success of the system. Chapters 3 and 4, in the second part of the thesis, discuss the steps necessary to build the case library. Finally, the third part focuses on several different algorithms for generating user responses and the results of these different approaches based on a large user study consisting of hundreds of users and thousands of interactively written stories.

Chapter 2 Related Work

The range of disciplines that influence automatic textual story generation spans nearly the entire academic department list. A good story taps into human psychology, uses sociological observations and demands philosophical questioning to draw the reader in and form deep connections with their life experiences. Although prior narrative generation systems have tried to model these complex human capabilities, it is an extremely difficult task and the implementations often lack any breadth, depth or realism.

Broadly speaking, there are three primary topics that are critical to any text-based interactive storytelling application. First, it is important to understand its relation to previous work in the interactive storytelling community. Section 2.1 frames this work in relation to other interactive narrative genres, ranging from traditional media to more similar computer-based systems. The approaches described in this section provide a broad overview of the range of interactive storytelling systems that have been devised. This section will also highlight the insufficient mechanisms for content creation and integration that plague nearly all of the digital systems. Second, any automated interactive narrative system must have an underlying architecture for modeling and generating portions of the story. The solution proposed in this thesis is an analogical reasoning approach strongly influenced by the Case-Based Reasoning framework, which is discussed in section 2.2. Finally, the eventual artifact of an interaction with a text-based system, such as the one presented here, is essentially a narrative discourse. As such, it is important that this final product adhere to notions of grammaticality and discourse coherence. Section 2.3 will briefly introduce some linguistic theories of coherence and then dive into several computational models that have been developed based on these theories to automatically assess the quality of a discourse.

2.1 Interactive Storytelling

Storytelling is one of the oldest cultural traditions, originating long before written language was invented. It is so entrenched in our daily lives that we rarely have to define the essence of its nature. There are many reasons for telling a story, including education, manipulating opinions, garnering sympathy, or purely for entertainment. Typically, we understand it to be a process in which a single person communicates a series of events to a passive audience in order to convey some information for one of these purposes. However, over the course of history, several variations have challenged the standard notion described above.
The two most interesting in regard to this thesis are collaborative writing and interactive storytelling. Although computational systems and wide area networks (i.e., the Web) have popularized these genres, both have their roots in more traditional media. This section will examine these areas in more detail to show how Say Anything fits into a larger landscape of existing storytelling approaches.

2.1.1 Collaborative Fiction

Say Anything, the interactive narrative system described in this thesis, can be seen as a story prediction engine or generator, but also as a collaborator in a mutual writing process. Collaborative writing is an exercise in which two or more people work together to author a story. It is a particularly interesting case of interactive narrative, because it is one of the few exceptions to traditional print and visual media where there is no strong distinction between the observer and the participant in the storytelling. The creation of most literature in Western culture is perceived as a highly personal process with strong (single) authorship; however, the role of author in collaborative works is shared amongst many individuals and the ownership of the work is blurred. While each individual contributes portions of the narrative, the piece as a whole has a life distinct from any one of the creators.

Despite the bias towards ascribing ownership and single author assignment to literary works in Western culture, collaborative fiction is prevalent in many of its most celebrated and ancient works, including the Iliad and the Old Testament. Collaborative fiction has also emerged as an entirely new mechanism for authoring stories outside of the traditional literary domain. For example, Dungeons & Dragons [53] is an interesting application of collaborative fiction in which unique narratives emerge as a consequence of the players’ actions and the chance events occurring throughout the course of a game. The introduction of computers has allowed these types of narratives to be simulated independently, and at the same time the rise of the Internet has enabled enormous groups of people to engage in this type of process. However, the Internet’s ability to bridge large geographic and cultural divides has also spawned many other genres of collaborative fiction with different emergent properties. Wikinovels (http://bluwiki.com/go/Wikinovels), for example, has used the popular wiki idea to create entire works of literature through the contributions and edits of large pools of independent authors.

Theatrical improvisation is an interesting tangential collaborative exercise that can produce emergent narratives and is particularly relevant to the interactive narrative story generation introduced in this thesis. Improvisation is a theatrical genre that involves a group of actors who engage in spontaneous games that typically involve participation from the audience [125]. Given only limited information, for example a place, activity or persona, the actors work together to create a narrative by reading and reacting to the events and dialog that unfold. They are often humorous because of the difficulty in interpreting the context of the other actors’ actions and the outrageous attempts to smooth over the jarring discontinuities that frequently occur. It is exactly these breaks in coherence, however, that challenge both the participants and the audience to break from their ingrained habits and force creative reflections.
2.1.2 Story Generation Systems & Interactive Narrative

Story generation is the process of taking a set of narrative primitives (e.g., characters, actions, effects) and producing a coherent, and hopefully interesting, story. There is a vast array of both textual and graphical generation systems that come in both static and interactive flavors. While the dynamic between interactive and static narrative generation systems is quite different, many of the underlying techniques are essentially the same. In this section I will discuss the primary components of any generation system and the technologies used to implement them. However, despite this diverse space of possibilities, the field tends to focus heavily on only a few of these methods.

TALE-SPIN [83] is one of the first and most influential narrative generation systems. Prior to TALE-SPIN, the idea of a computer-generated narrative was simply to fill in a blank slot in a prewritten template story. TALE-SPIN was the first system to bring life to the characters and story by modeling low-level characteristics of the world itself. Eight character types, talking woodland creatures, and 46 locations, such as meadows and bridges, defined the characters and setting of any story produced by the program. Additionally, a dozen or so procedures were defined to enable character movement and the obtaining and exchanging of objects, as well as to specify more abstract entities such as having wants and plans. To create a story, a set of initial conditions was specified and 41 inference rules were used to derive the unfolding story. Since that time, many other text-based narrative story generation systems [19, 131] have tried to improve upon the richness of the world, the underlying reasoning mechanism and the quality of the natural language output.

Text adventure games, introduced by Adventure [34], are another interesting genre of interactive narrative. Here too, a virtual world is modeled with characters, locations, events and goals. However, text adventure games explicitly allow interaction during the development of the story. Typically, a predefined overall narrative goal has been specified by the system designer and the user tries to uncover it through problem solving and exploration. Montfort [88] gives a good overview of the theory and history behind these and other types of interactive fiction.

While text adventure games still have a small following, graphical interactive narrative generation systems that share many of the same goals have become more dominant. Games are the most visible and well-known interactive narrative systems. However, despite their widespread penetration into our culture and the vast catalogs of existing titles, there are actually relatively few games that highlight narrative aspects of the interaction. A growing community in the academic world, beginning with projects such as Oz [10], has begun to emphasize the human and narrative aspects of 3D virtual environments. Although other techniques discussed below are used, many of the successful systems, such as Façade [80] and Madame Bovary [25], have adopted some form of planning framework for driving narrative and character actions.

Regardless of the particular implementation, the automatic generation of a story can be decomposed into three primary components, not necessarily in order. The first specifies any narrative goals for the story and a means for achieving them.
This basically controls the plot of the story and any additional narrative elements that should be applied to the story, for example, specifying the sequential order of events to create desired narrative effects. The second specifies the goals of the characters in the narrative – their beliefs, desires and intentions – or even tries to model their emotional state of mind. Third, given the internal representation, a method must be chosen for outputting a trace of the story in a way interpretable by people, usually visually or textually.

There are many possible ways the narrative and character goals can be modeled in a computational system. Nakasone & Ishizuka [92] classify systems into eight categories based on the way they model these components. Rule-based systems try to match entities and events with a set of hand-crafted rules that transform the state and drive actions in the world. State Transition models take a graphical-model-based approach, representing the world using Bayesian networks, finite state machines or decision-theoretic graphs. Goal- or Planning-based systems represent events in the world as final goals and subgoals. A story is generated by fixing states in an initial condition and tracing the actions that lead to the success of achieving the target goal. Template-based systems draw from a library of prototypical stories and make perturbations to create a new story. Semantic Inference systems are also graph-based, but use a more semantically oriented network, typically events linked by Rhetorical Structure relations. However, the prototypical applications using this approach are multimedia presentation generators and are only narrative from a broad point of view. Emergent Narrative systems do not model narrative constructs explicitly but provide the means for the user to create stories of their own through the gameplay.

Different styles of narrative generation can be produced by emphasizing or omitting one of the three components mentioned above. For example, Cavazza et al. [23] describe a character-based narrative generation system using the planning paradigm that only plans and acts upon user-oriented goals. The stories in this system are allowed to emerge simply from these actions alone. Young [134], on the other hand, uses planning for both narrative and character-oriented goals. These paradigms are also fairly flexible, and nothing precludes using one technique for narrative organization and another for character goals, as in Swartout et al. [59].

Regardless of the approach, the majority of heralded interactive narrative systems suffer from a common problem. The issue manifests itself in the system’s architecture, but is really a symptom of a misguided pursuit of narrative algorithms over issues of content and knowledge acquisition. An undue emphasis is placed on evaluations derived from the quality of the narrative structure. This may seem like a strange criticism for research in understanding and simulating narrative; however, this emphasis has led to an obsession with trying to control every facet of the unfolding narrative. Although other methods have been mentioned above, the most popular approach has been the planning framework. These systems have demonstrated the ability to produce extremely high-quality narratives and even model sophisticated literary devices such as foreshadowing [4] and suspense [29]. However, the control that can be achieved with these techniques comes at a very high price.
The user’s experience in the narrative world is dictated by the system’s limitations. Although the planning algorithms driving these systems are not always complex, the input knowledge required to actually generate a set of stories is extraordinarily difficult to author. Every object, event and relationship in the world must be described in a relatively obtuse formal language, and extreme care must be given to ensure that inconsistent or incoherent events can never occur. This can be done in a small world, such as Callaway & Lester’s Little Red Riding Hood generator [19], but is virtually impossible for any domain much larger than that. The problem is twofold: the number of relationships that need to be managed grows exponentially with the number of objects in the world, and there are simply too many entities in the world to model by hand. Because of these problems, few, if any, current interactive narrative systems allow the user to explore the virtual world on their terms. Say Anything attempts to broaden the scope of interactive storytelling systems by empowering the user to do anything in the virtual world, while still maintaining an acceptable level of narrative coherence.

2.2 Case-based Reasoning

2.2.1 CBR General Approach

Case-based reasoning [110] is the central architectural philosophy underpinning this new interactive storytelling approach. It was developed as a more natural analog to human reasoning than the popular rule-based expert systems of the time. It was argued that humans reason from experience, not from a given set of axioms, and so a new framework for building general-purpose, artificially intelligent architectures based on this principle was born. The central tenet of case-based reasoning is to solve new problems by adapting the solutions of existing ones. Although there is no single algorithm or data structure that is specific to all case-based reasoners, there is a general template that most implementing examples follow. Figure 2.1 (reproduced from [49]) shows a schematic of the basic flow of a case-based solver.

Figure 2.1: Basic case-based reasoning flow diagram

There are three basic steps to the flow diagram in figure 2.1. First, a problem definition is input to the system. Second, a sufficiently similar problem is retrieved from a repository of previously solved problems. Third, the retrieved case is adapted to suit the particular needs and specifications of the original input. The process is simple enough, but as usual, the devil is in the details. The remainder of this section will elaborate on the intricacies needed to implement a system from the general formulation.

Like any other computational solver, a case-based reasoner must take a machine-readable description of a problem as its input. In general there is no set representation. One popular approach, developed at Yale and Northwestern by Roger Schank and his students, used graph structures that linked basic units, called Memory Organization Packages, together. Memory Organization Packages are an abstraction of classes of events, such as “going on a trip,” and the features that encapsulate them. These MOPs constitute the nodes of the graph and are connected by several link types. There are four basic types for all problems: abstraction, scene, exemplar and index, although more specialized link types are also permissible. Abstractions link a general type of event to a more specific one. Scenes link a sub-event to a larger ongoing event.
Exemplars link to prototypical MOPs, or the MOP from which the current node was derived. Indexes link to specialized MOP nodes that satisfy the meaning of an index.

The primary difficulty in case-based reasoning is not explicitly in the problem formulation, but in the ability to retrieve similar examples from a case library. This implicitly imposes restrictions on the final representation. Two problems are considered similar if their essential features are similar. The trick, of course, is determining exactly what these essential features are. From a functional point of view, Schank [115] initially argued that the cases should be indexed according to their analogical reasoning power. Essentially, this is a claim that memory organization is a reflection of its inferential usefulness, based on deep features such as relations between entities rather than shallow surface features. However, psychological studies have shown that shallow features are much more influential in the retrieval of cases from human memory, even though they are not as reliable in determining their reasoning efficacy [45]. The goals of understanding human psychological behavior are not always the same as those for implementing computational systems that behave in an intelligent way. Hammond, Seifert & Gray [55], for example, recognize these opposing objectives and offer a unified view for leveraging the beneficial qualities of both.

In standard case-based systems, the features are extracted in a pre-processing analysis phase and much of the work is usually performed by hand. The result is an augmented problem specification that contains the extracted features, called indexes, generally in the form of hierarchical or frame-based structures. Similarity is then measured according to the number of matching indexes between problem cases and the degree to which each dimension agrees.

Adaptation is the process of transforming a retrieved case to match the specific problem at hand. For example, CHEF [54] is a case-based reasoner that solves cooking problems. It might be asked, for instance, how to bake a lemon cake. In its case memory it may have many recipes for baking cakes but none specifically about lemon cakes; however, it may have one about strawberry cakes. In order to use the recipe about strawberry cakes it will have to make simple, and possibly some complex, changes to the original recipe. At the very least, lemons must be substituted for the strawberries. Additionally, the difference in acidity between the fruits may require different amounts of ingredients (e.g., sugar) for the recipes, affecting the baking time and possibly altering the order in which the ingredients are mixed together.

Typical CBR research distinguishes between two types of adaptation: structural and derivational. Structural adaptation uses a set of rules to directly transform a retrieved solution. This is the type of transformation seen in the previous cooking example, where substitutions and transpositions of the recipe are performed, and is the basic mechanism in the CHEF system [54]. Derivational adaptation, on the other hand, does not transform the proposed solution directly, but actually modifies the derivation by which the solution was generated and re-runs the process to recreate the solution. Although it is argued that derivational adaptation is more powerful [120], both are usually necessary in practice because not all solutions are generated in a suitable manner.
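As a rough illustration of the retrieve-and-adapt loop and of structural adaptation by substitution, the following Python sketch encodes cases with a handful of indexes, retrieves the case with the most matching index values, and rewrites its solution steps. The recipes, the index scheme and the single substitution rule are invented for this example and are far simpler than anything CHEF or the other systems above actually used.

    # Toy case-based reasoner: index-based retrieval plus structural adaptation.
    # All cases and rules are invented for illustration.
    CASES = [
        {"indexes": {"dish": "cake", "fruit": "strawberry"},
         "solution": ["mix flour, sugar and eggs", "add strawberry slices", "bake 35 minutes"]},
        {"indexes": {"dish": "soup", "vegetable": "leek"},
         "solution": ["chop the leek", "simmer 40 minutes"]},
    ]

    def similarity(query, case):
        """Count how many index dimensions of the query match the stored case."""
        return sum(1 for k, v in query.items() if case["indexes"].get(k) == v)

    def retrieve(query):
        return max(CASES, key=lambda case: similarity(query, case))

    def adapt(query, case):
        """Structural adaptation: substitute mismatched index values into the solution."""
        steps = list(case["solution"])
        for key, wanted in query.items():
            have = case["indexes"].get(key)
            if have and have != wanted:
                steps = [step.replace(have, wanted) for step in steps]  # e.g. strawberry -> lemon
        return steps

    query = {"dish": "cake", "fruit": "lemon"}
    print(adapt(query, retrieve(query)))
    # -> ['mix flour, sugar and eggs', 'add lemon slices', 'bake 35 minutes']

A derivational adapter would instead re-run whatever process produced the original recipe with the new ingredient, rather than editing the finished solution as this sketch does.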
2.2.2 Problems With Traditional CBR Case-based reasoning had a strong following and inspired a large number of projects in the 1980s and 90s. However, since that time, solutions to new problems are rarely sought in this framework. While some of the disfavor is probably cultural and cyclical, 24 there are several more fundamental reasons that also account for this downturn in adop- tion. The primary drawback of this approach is in scaling the solutions to new domains. Despite the enormous effort of dozens and dozens of theses devoted to creating reason- ing systems in a variety of domains, these works neither had the depth to address any one of the specified problems nor the breadth to tackle all the areas humans are capable of reasoning about. JUDGE [5], for example, was intended to model the reasoning process of a real criminal trial judge. However, in order for the system to achieve even modest results, the application domain had to be severely restricted. Only a limited number of case types (e.g., murder) and 16 actions were known (e.g., stabbing). Additionally, only a few real cases were input to the system with the majority being hand crafted to highlight the features and actions the system could reason about. 2.2.3 Cases Over the years, frequent debates have ensued over the necessity and proper role of many of the components illustrated in figure 2.1. The competing ideas were often tested and compared through systems built to solve a variety of problems. Riesbeck & Schanck [110] describe 11 systems alone, such as the CHEF system previously described. JUDGE is another early example of a CBR system that is discussed at length. A library of 55 criminal cases involving murder, manslaughter and assault were encoded into the system’s memory to reason from. The goal of the system was to assign an appro- priate sentence given a new description of a crime. It used features based on the severity of the crime, such as the gravity of injury to the victim. It also considered features such as the victim’s age and whether the aggressor had any moral justification for his actions. Using a set of heuristics and transformation rules it was able to modify an existing sen- tence to fit the unseen trial data. This is a particularly pointed example highlighting the 25 similarity of the approach with human behavior considering that many legal systems around the world, including the United States, place a large weight on precedent and case history. These early systems were intended to show the viability of case-based reasoning sys- tems as general purpose solvers and as a new type of expert system. In slightly later sys- tems during the 90s, a heavy emphasis was placed on narrative case-based systems and their role in education. It was argued that the best teachers were also great storytellers and that stories themselves were a highly effective teaching resource. To this end, a push for a large collection of stories illustrating problems, principles and facts that are known by educators should be developed as a knowledge base for computational systems to draw from in interactions with human students [117]. Using computers enables students to actively engage with the material they are learning and can provide a stimulating environment, without which grasping the material becomes difficult. Edelson [39], for example, develops a computational learning system designed to teach young children biology through the collaborative creation of new animal species. 
ASK systems [42] were another iteration of case-based systems that pushed story driven systems as an effective means for teaching. Although stories often contain the knowledge and insight of the people who tell them, this information can only be absorbed when put into a context amenable to the student’s train of thought. Through the process of repeated questioning and response, exemplified by Plato’s Socratic dialogs, these systems attempted to teach students in a way that was specific to each individual’s needs. ASK-Tom [118] is a prototypical example that allowed students to navigate a multimedia library by asking questions and following links suggested by the system. 26 Buzz [100] is a recent descendant employing the case-based principles described above and is similar to Say Anything in many respects. Buzz is a digital arts and enter- tainment project that uses virtual characters to read recent stories in an emotionally evocative manner to viewers of the exhibit. While many of the underlying principles are the same as earlier systems, it tries to take advantage of the new data resources available today by shifting from a perspective of heavy indexing and analysis to a large scale data and retrieval problem. This is done by using Internet weblogs, treated as a large case library, from which stories are retrieved using a commercial search engine seeded with controversial topics and the day’s most popular search terms. Although not as popular today, the broad ideas of the CBR framework are still implicitly used in many other systems. Recently, Hayes & Efros [57] proposed an inter- esting case-based approach to the task of scene completion that is highly analogous to the interactive story generation presented in this thesis. Given an image with some por- tion of it removed, the task is to fill in the missing region with a reasonable alternative. Not only must the region match in color and texture, but also it must match semantically with the surrounding photograph. To accomplish this task they use a gigantic corpus of images and a coarse to fine matching algorithm that requires no labeled training data. First, a fast unsupervised matching algorithm is run to identify images that are of roughly the same setting (e.g., mountain side, tall buildings, etc.). This is able to rule out over 99% of the images in the database. They then are able to run more feature specific and computationally intensive algorithms to find a highly accurate specific match. Scale is the key to their success as they note that increasing their image collection from 2 million to 70 million had a dramatic effect on the performance of their system. 27 2.3 Coherence Modeling In Chapter 1 we saw that stories are a powerful and flexible means for representing and storing complex information and experiences. Despite the complexity represented by a narrative, this structure is well suited to human cognitive abilities and we have no trouble manipulating them. The ability to author a good narrative is a complex task that involves everything from low level linguistic processing to high level planning skills. Like any native language speaker, the author of a story must possess a non-trivial command of the target language. At a basic level this must at least include knowledge of the lexicon and a familiarity of the syntactic rules that govern grammatical sentence constructions. 
Stories, as in other high level linguistic phenomena, are not isolated linguistic units that can be described simply by lexical and syntactic relations alone. Narrative constructions are a form of discourse that are composed of multiple clausal and sentential units. These discourses are not merely arbitrary sequences of these units but are ordered in a way that facilitates coherence and are usually used to convey a particular message. A large body of work has been devoted to analyzing the structure of discourse and has been shown to be useful in many computational systems. These theories range in scope from targeting local aspects of coherence [52], primarily concerned with the relationships between entities in nearby regions of text, to theories that try to account for interpretations of the entire document that require huge amounts of world knowledge and inferences to fill in information not explicitly mentioned in the text [61]. Two of the most common general discourse theories are centering theory [52] and rhetorical structure theory [74]. Centering theory is often referred to as an entity-based theory of coherence, because it is primarily concerned with the relationships between significant entities in a discourse. The propositions of this theory attempt to explain three 28 aspects of a discourse: the focus of attention, choice of referring expressions and the perceived coherence of an utterance. Centering theory claims that the more a discourse adheres to these tenets the more people will tend to find the discourse to be coherent. Rhetorical Structure Theory [74] is another theory of discourse that has found widespread use in linguistic analysis and computational models. It is a general model that makes no assumptions about an underlying semantic theory and can be used to analyze virtually any type of text with only a few exceptions, such as legal documents, contracts and some forms of artistic literature. It has been used for a variety of pur- poses including, studying contrastive rhetoric, as an analytic tool for characterizing text, analyzing narrative discourse, natural language generation and many other areas explored by Taboada & Mann [129]. Of particular interest to this thesis, Nakasone & Ishizuka [92] have proposed a general storytelling ontology that extends the typical temporal-event model to include other rhetorical relations as well. The primary goal of Rhetorical Structure Theory is to identify relevant spans of text and organize them into a hierarchical representation linked by various discourse relations. Although these theories have been directly used in computational models of coher- ence, for example [78], they are either underspecified for many practical applications they or suffer from low accuracy automated tools. However, these theories provide the basis for many other feasible approaches that are immediately useful for tasks requir- ing measures of coherence, such as this one. The following section will discuss several computational models that have been developed for exactly this purpose. 29 2.3.1 Computational Models of Coherence The coherence theories above provide a deep understanding and analysis of the attributes that underly the interpretation and coherence of a text. However, a fully automated anal- ysis of a text based on such a theory may not always be possible. Information or dis- course ordering has become a prominent task for developing and testing computational models of coherence. 
Although there are several variants of the task, the goal is to measure a model’s ability to correctly identify a coherent sentence ordering of a document. At least one reason for the rise in popularity of this task is that high quality models can be learned and evaluated with very little hand annotated training data. In this section I will describe a few of the most common variants of the task and their evaluation metrics and then briefly describe several notable approaches.

To assess the quality of a coherence model there are many ways to pose the problem as an information ordering task. Classification, or pairwise ranking, and Kendall’s tau are the two most common evaluation metrics, although several others will also be mentioned. Given a short document known to be coherent (i.e. written by a human) one could randomly shuffle the sentences and recast the ordering problem as a classification problem, where the goal is to correctly identify the original order. Alternatively, one could enumerate all possible orderings, or heuristically search the space, and evaluate the model based on the rank in which the correct ordering is assigned by the model.

Kendall’s tau, $\tau$, is another metric for evaluating an information ordering coherence model. Given a sentence ordering predicted by a coherence model, the score is a measure of how many swaps of adjacent sentences are needed to restore the original ordering and is given by the following formula:

$\tau = 1 - \frac{2S}{\binom{N}{2}}$

where $S$ is the number of adjacent transpositions needed to transform the proposed sentence ordering into the original and $N$ is the number of sentences in the document. Although not widely used currently, Gandhe [44] proposed a similar metric that has a higher correlation with human judgments. Soricut & Marcu [123] also report BLEU [103] scores, commonly used for evaluating machine translations, for their coherence models.

Although the sentence ordering problem has arisen in many areas over the years, such as natural language generation and text-to-text generation (e.g., document summarization), Lapata [68] first proposed an unsupervised probabilistic approach that addressed this problem in its own right. Her method makes an oft cited assumption for this task, stated by Marcu [78], that global coherence can be achieved by ensuring local coherence (a weaker, more accurate interpretation, also given in many of these approaches, is that local coherence is at least a necessary condition for achieving global coherence, but that it may not be sufficient). Lapata essentially takes a language modeling approach to the problem, such that the probability of a document order is given by:

$P(S_1, S_2, \ldots, S_N) = P(S_N \mid S_{N-1} \ldots S_1) \, P(S_{N-1} \mid S_{N-2} \ldots S_1) \cdots P(S_1)$

and is approximated by:

$P(S_1, S_2, \ldots, S_N) = \prod_{i=1}^{N} P(S_i \mid S_{i-1})$

where $S_i$ represents the $i$th sentence in the ordering and there are $N$ sentences in the document. Since sentences of any considerable length are rarely repeated exactly, Lapata devises a set of features used to approximate the value of a sentence. Each sentence in the model is represented by its verbs, nouns and a set of dependency triples (e.g., N:sub:V) extracted using MINIPAR [70]. The conditional probability of adjacent sentences was then reformulated as:

$P(S_i \mid S_{i-1}) = \prod_{(a_{\langle i,j \rangle},\, a_{\langle i-1,k \rangle}) \in S_i \times S_{i-1}} P(a_{\langle i,j \rangle} \mid a_{\langle i-1,k \rangle})$

where $a_{\langle i,j \rangle}$ is the $j$th feature of sentence $S_i$.
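A minimal sketch may make this formulation concrete. The code below is not Lapata’s implementation: it substitutes plain word tokens for the verb, noun and dependency-triple features, estimates each conditional by a simple relative frequency (the estimator described next) and floors zero counts instead of smoothing properly. The toy training document and test orderings are invented for illustration.

from collections import Counter
from itertools import product
import math

def sentence_features(sentence):
    # Crude stand-in for Lapata's noun/verb/dependency-triple features:
    # lowercase word tokens with surrounding punctuation stripped.
    return [w.strip(".,!?").lower() for w in sentence.split()]

def train_pair_counts(documents):
    # Count how often feature a' in sentence i-1 is followed by feature a in sentence i.
    pair_counts, context_counts = Counter(), Counter()
    for doc in documents:
        feats = [sentence_features(s) for s in doc]
        for prev, curr in zip(feats, feats[1:]):
            for a_prev, a_curr in product(prev, curr):
                pair_counts[(a_curr, a_prev)] += 1
                context_counts[a_prev] += 1
    return pair_counts, context_counts

def log_prob_order(order, pair_counts, context_counts, floor=1e-6):
    # log P(S_1..S_N) under the bigram approximation: sum over adjacent
    # sentence pairs of the per-feature conditional log probabilities.
    feats = [sentence_features(s) for s in order]
    score = 0.0
    for prev, curr in zip(feats, feats[1:]):
        for a_prev, a_curr in product(prev, curr):
            total = context_counts[a_prev]
            p = pair_counts[(a_curr, a_prev)] / total if total else 0.0
            score += math.log(max(p, floor))
    return score

# A degenerate one-document "corpus", just to exercise the functions.
train = [["John went to the store.", "He bought some milk.", "The milk was sour."]]
pc, cc = train_pair_counts(train)
original = ["John went to the store.", "He bought some milk.", "The milk was sour."]
shuffled = ["The milk was sour.", "John went to the store.", "He bought some milk."]
print(log_prob_order(original, pc, cc) > log_prob_order(shuffled, pc, cc))  # True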
Frequency counts over the features are used to estimate the individual conditional probabilities in the following way:

$P(a_{\langle i,j \rangle} \mid a_{\langle i-1,k \rangle}) = \frac{\mathrm{count}(a_{\langle i,j \rangle},\, a_{\langle i-1,k \rangle})}{\sum_{a_{\langle i,j \rangle}} \mathrm{count}(a_{\langle i,j \rangle},\, a_{\langle i-1,k \rangle})}$

Given a bag of randomly shuffled sentences, Lapata uses a greedy search algorithm to find the most likely ordering.

Barzilay & Lee [9] followed up the work by Lapata [68] by proposing a Hidden Markov Model method for estimating the coherence of a document. In their approach, each state of the HMM is used to represent the topic of a sentence while transitions between states model a change from one topic to another. The topic of a sentence is not usually given in the text explicitly and is not always easy to discover, automatically or otherwise. In this work, sentences are grouped together using a complete-link clustering algorithm over bigram features of the sentences with cosine similarity. Each cluster is then used to represent a single topic. Transitions between topic nodes are created whenever a sentence from one cluster is immediately followed by a sentence from another cluster. To estimate the state emission probabilities $p_{s_i}(w' \mid w)$ and the transition probabilities $p(s_j \mid s_i)$ the following formulas were used:

$p_{s_i}(w' \mid w) = \frac{f_{c_i}(w w') + \delta_1}{f_{c_i}(w) + \delta_1 |V|}$

$p(s_j \mid s_i) = \frac{D(c_i, c_j) + \delta_2}{D(c_i) + \delta_2 m}$

Here $f_{c_i}(\cdot)$ denotes the frequency of a particular word or bigram within the sentences of cluster $c_i$. $D(c_i, c_j)$ is the number of documents that have a sentence in cluster $c_i$ immediately preceding a sentence from cluster $c_j$, and $D(c_i)$ is the number of documents having a sentence from cluster $c_i$. The $\delta$s are smoothing parameters to prevent zero probability events, $m$ is the number of clusters and $|V|$ is the vocabulary size.

It is noted that the location of a sentence in a document also influences its relation to a topic and, despite a low amount of lexical overlap, contextual position may indicate a continuation of the topic rather than a change. To try to account for this type of contextual information a Viterbi-EM re-estimation technique is used. After building the first HMM model the sentences are re-clustered by assigning them to the state with the highest Viterbi decoding on the training data and then retraining the HMM parameters based on the new cluster information. This is continued until there are no changes in the clusters or until a fixed number of iterations is reached. To evaluate their model, all permutations of the 500 documents in their test set were ranked according to the score given by the trained model.

Barzilay & Lapata [8] propose a new way of modeling coherence, inspired by Centering Theory and other entity based models of coherence cited in their paper. Based on these theories, they hypothesize that constraints imposed by local coherence give rise to regular distributions of entities in a discourse and that the coherence of a text can be measured by tracing the way in which entities are introduced and discussed. There are three major relations that they consider crucial to an entity-based approach. First, they would like to identify co-referring entities within a document. To accomplish this they use a co-reference resolution algorithm developed by Ng & Cardie [95].
Because this algorithm was trained on a separate domain, a simple string matching approach was also applied. Second, they identify the grammatical/semantic role the entity plays within a sentence using a set of rules [7] to transform a Collins [30] style parse tree into a labeled dependency tree. Third, they consider the salience, or importance, of the entity within a sentence based on the frequency of the entity in the document. The entities in a discourse are first organized into a data structure they call an entity-grid. The rows of the grid represent each sentence of the discourse, the columns represent each entity and each cell contains the entity’s grammatical relation for that sentence (e.g., Subject, Object, Other, or Not present). See figure 2.2 for an example.

Figure 2.2: An entity grid for the short story: “John went to the store. He bought some milk. The milk turned out to be sour.”

(a) Entity grid:
        john   store   milk
   1     s      o       -
   2     s      -       o
   3     -      -       s

(b) Feature vector representation:
   ss     s-     o-     --     -o     os
   0.17   0.17   0.17   0.17   0.17   0.17

They make use of the entity-grid by extracting the implicit transition information into a feature vector representation. For example, features are created for sequences of grammatical relations occurring in consecutive sentences (e.g. SS, SO, SX and S- for entities that appear in consecutive sentences as Subject-Subject, Subject-Object, Subject-Other and Subject-Not Present). The value of each feature is computed as its relative frequency with respect to the other features. In this work, solving the information ordering problem is posed as a ranking problem. Training data instances are generated as ordered pairs of alternative sentence orderings of the same document ranked according to their relative coherence. One element is the original order while the other is a random permutation of those sentences. For training and testing they use a large margin approach based on Joachims’ [65] Support Vector Machine ranking algorithm, which imposes the following constraint:

$\vec{w} \cdot \left( \Phi(x_{ij}) - \Phi(x_{ik}) \right) > 0 \quad \forall\, i, j, k \ \text{such that}\ j > k$

where $\vec{w}$ is the vector of weights to be learned and $\Phi(x_{ij})$ and $\Phi(x_{ik})$ are the feature mappings of the $j$th and $k$th ranked permutations of the $i$th document. They show that this technique compares favorably with previous approaches on corpora from two separate domains. They demonstrate that since none of the features depend on lexical values, a model trained on one domain performs well, compared to the other methods, when applied to the other domain.

Foltz, Kintsch, & Landauer [43] propose a different, semantic, measure of coherence. Similar to the language and topic based models of Lapata [68] and Barzilay & Lee [9], the assumption is that coherent documents generally have a high degree of semantic overlap in nearby contexts (e.g. sentences, paragraphs, chapters, etc.). A Latent Semantic Analysis [35] approach was used to model the semantics of individual sentences. In this approach, a matrix was constructed such that each row represents a lexical item and each column represents a sentence. The term frequency-inverse document frequency score was used as the value in each cell. The dimensionality of the matrix was reduced using a singular value decomposition. The resulting, low dimensional, vector for each word is interpreted as its meaning. The meaning of a sentence is defined compositionally as a weighted average over all the word vectors comprising the sentence. They define the semantic similarity between two sentences as the cosine distance between them. To assign a coherence score to an entire document they simply average over all the consecutive pairwise sentence similarities.
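A stripped-down sketch of this semantic-overlap measure is given below. It assumes plain bag-of-words count vectors in place of the tf-idf weighted, SVD-reduced LSA vectors, so only the final step, averaging the cosine similarity of consecutive sentence pairs, is faithful to Foltz, Kintsch & Landauer; the example document is invented.

import math
from collections import Counter

def bow(sentence):
    # Bag-of-words counts; the real model uses tf-idf weighted,
    # SVD-reduced (LSA) word vectors, which this toy version does not attempt.
    return Counter(w.strip(".,!?").lower() for w in sentence.split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def document_coherence(sentences):
    # Mean similarity of consecutive sentence pairs.
    pairs = list(zip(sentences, sentences[1:]))
    return sum(cosine(bow(a), bow(b)) for a, b in pairs) / len(pairs)

doc = ["John went to the store.", "He bought some milk.",
       "The milk turned out to be sour."]
print(round(document_coherence(doc), 3))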
Although the goal of their paper was not originally geared towards information ordering, Barzilay & Lapata [8] re-implemented their model as a point of comparison. While this model performed reasonably well on the information ordering task, it was less favorable on other evaluation tasks, such as readability.

Soricut & Marcu [123] propose a new coherence model based on word co-occurrence, derived from the IBM machine translation models [17]. The generative interpretation is that the words in a sentence $i$ generate the words in sentence $i+1$ with some probability and that certain words will be much more likely than others. The probability of a document $D$ consisting of $n$ sentences under this model is given by:

$P(D) = \prod_{i=1}^{n-1} \prod_{j=1}^{|s_{i+1}|} \frac{1}{|s_i| + 1} \sum_{k=0}^{|s_i|} p\!\left(s_{i+1}^{j} \mid s_i^{k}\right)$

The probabilities $p(s_{i+1}^{j} \mid s_i^{k})$ are iteratively learned using the EM algorithm. Similarly they define an inverse IBM model in which the words of sentence $i+1$ generate the words of sentence $i$. In addition to proposing a new coherence model they also introduce a new compact graphical representation, specified by IDL-expressions, that is capable of specifying all possible orderings of a bag of discourse units. To combine several methods together they use a log-linear model and perform a discriminative training method [97] to find the appropriate parameters. Among others, they use an A* search to find the best ordering given the bag-of-sentences. By combining several of the previous approaches explained above they are able to show significant improvements on several of the evaluation criteria and datasets.

Elsner & Charniak [41] also propose several new features and a generative method for undertaking the information ordering task. Like Barzilay & Lapata [8], Elsner & Charniak consider pronoun coreference to be an important factor in the coherence of a discourse. However, the novelty of their approach relies on the notion of discourse-new and discourse-old entities. A discourse-new entity is an entity that has not previously been mentioned in the discourse, while a discourse-old one has. Similar to one of the models of Barzilay & Lapata, they treat any two NPs that share the same head as co-referring and model the probability of a document as $\prod_{np \in \mathrm{NPs}} P(L_{np} \mid np)$. Additionally, they also provide a different, probabilistic, model of pronoun coreference as:

$P(a_i, r_i \mid a_1 \ldots a_{i-1}) = P\!\left(a_i \mid h(a_i), m(a_i)\right) \, P_{\mathrm{gen}}(a_i, r_i) \, P_{\mathrm{num}}(a_i, r_i)$

Each of these models is able to outperform a random baseline, although neither in isolation performs better than Barzilay & Lapata’s entity grids. However, in combination, all three models are able to move beyond any of the individual models. Elsner, Austerweil & Charniak [40] also describe other probabilistic models based on entity-grids and propose a new HMM based model that is competitive with the other state-of-the-art systems.

This chapter has expounded on the three most important areas relating to this thesis. It has given an overview of interactive storytelling, the types of systems that have been developed previously, and the major problems that still face these types of systems. Section 2.2 provided the background for the data driven case-based solution advocated for in the remainder of this thesis, and section 2.3 discussed several ways in which the quality of textual discourse may be evaluated automatically. The next chapter begins the second part of this thesis.
It describes several preliminary processes that enable the acquisition of the large case-library of personal stories. This collection is used as the 37 foundation of story generation mechanism that will be described in the third part of the thesis. 38 Chapter 3 Sentence Boundary Detection 3.1 Chapter Introduction Sentence boundary detection (SBD) is an important first step in many natural language processing applications and can have a far-reaching effect on all subsequent processing in the pipeline. For example, part-of-speech taggers often use capitalization and word position (e.g. first word or not) as a feature to help detect proper nouns. This captures the intuition that the first word of a sentence is likely to be capitalized regardless of its part-of-speech, while most words that are capitalized in the middle of a sentence are proper nouns (or perhaps acronyms). One misidentified sentence boundary can lead to at least one spurious part-of-speech assignment and depending on the classification algorithm can trigger a chain reaction of subsequent errors. Although the consequences for these types of shallow tasks are problematic, they have a much more profound effect on deeper natural language processing tasks, such as syntactic parsing. A single error in a syntactic parse tree can dramatically change the implicit predicate argument structure and thematic roles of the agents in the sentence. This can completely alter the meaning and interpretation of the sentence from the speaker’s original intention and make the use of the resulting structure more damaging than helpful. There are at least three components of the storytelling system described in this thesis that require sentence boundary detection. First, sentences are the basic unit of discourse that the system engages with the user. This is manifested in the underlying architecture 39 of how the data is persisted and in the natural language generation mechanism interfac- ing with the user. Second, several of the features used for finding stories on the web, described in Chapter 4, rely on parsing and other techniques that make use of the senten- tial structure. Third, the methods described in chapters 5, 6 and 7 for finding plausible continuations of the user’s developing story all depend on well formed sentential units being identified. It is therefor critical for the success of this story generation architecture that a highly accurate method for sentence boundary detection is employed. Since sentence boundary detection is important to so many natural language applica- tions, it is unsurprising that this task has received substantial attention in the NLP com- munity. A variety of approaches have been pursued, including rule-based systems [1], supervised statistical learning [46, 102, 107] and unsupervised statistical learning [66]. Accuracy of state-of-the-art systems is impressive with reported results often achiev- ing over 98%. Despite this extremely high accuracy, there are several problems that limit the practical use of these methods for the work in this thesis. One concern is that many of these boundary detection systems have only been developed and tested on news and other well edited text. Less attention has been focused on unedited genres such as weblog text, which are of paramount importance for this work. This chapter will describe a new sentence boundary detection system that is specifically tailored to the type of open-domain web text that enables the data driven story generation architecture described in this thesis. 
There are several reasons why the performance of off-the-shelf systems is substantially lower on arbitrary web data. Although supervised systems are usually the best performing systems when enough training data is available, their performance often degrades significantly when applied to out-of-domain data. However, apart from supervised systems, there is another, more serious, systemic problem with many of these approaches that even affects the unsupervised approach taken by Punkt [66]. Nearly all of these systems recast the problem from an actual sentence boundary detection task to a simpler period (or {., ?, !}) disambiguation task. This works quite well for edited text, which tends to adhere to a set of conventions that mirror this assumption adequately. Unfortunately, it does not work nearly as well for unedited text, where other characters, such as closing brackets, quotation marks or no character at all, are also frequently used to demarcate the end of a sentence. These issues are compounded for web documents because HTML markup is often used in lieu of actual punctuation to visually indicate (e.g. with a new line) the end of a sentence. While it might be possible to augment existing systems with a set of rules to handle HTML tags, there are no clear semantics defined for HTML and virtually any formatting characteristics can be overloaded using cascading style sheets.

3.2 HTML Example

This section walks through an example that highlights several of the issues pertaining to web text that complicate standard sentence boundary detection algorithms. Figure 3.1 shows a snippet of HTML taken from an actual weblog and figure 3.2 illustrates how this markup would be rendered in a typical web browser.

Figure 3.1: HTML example (a fragment of weblog markup whose text spans are numbered 1-10 in the discussion below)

Figure 3.2: Rendered HTML example

The box labeled 2 in figure 3.2 contains a sentence without any closing punctuation, which could potentially be resolved using a simple rule to introduce a sentence break after every closing </P> tag. Box 3 contains a hyperlink, which is not really a sentence in the traditional sense, but it also does not really belong to either the sentence before or after. Using the surrounding HTML tags (i.e., <BR> and <P>), it seems safe to assume this is intended to be a stand-alone unit of text. Boxes 7-10 emphasize why simple rule-based approaches are likely to be insufficient. Although some HTML tags strongly imply a semantic interpretation, for example <P> for paragraph, they are strictly syntactic formatting elements. While it seems natural, as is often the case, for a <BR> tag to be used as an explicit, visual, sentence boundary marker, there is nothing that requires this. Even though there are more appropriate formatting options, <BR> is often used as a simple formatting tool, for instance, limiting the width of a line. To further complicate the problem, the way elements format text or other objects on the page can easily be changed for each individual instance of a tag using CSS styles. The <SPAN>s whose children are boxes 7, 9 and 10 are used to illustrate this concept.
The <SPAN>s over boxes 7 and 10 are somewhat ambiguous, but we can assume that they actually provide no formatting information at all and are merely containers used for logical (or illogical) grouping. The line break between 7 and 8 is caused by a <BR> element that would again cause an erroneous sentence break if one were to treat it as a boundary marker. The line break that occurs between 8 and 9, however, is the result of CSS styles applied behind the scenes. This example uses a <SPAN> as a blank-slate formatting tag, which is the primary purpose of <SPAN>s and <DIV>s. However, as previously mentioned, the formatting behavior of virtually any HTML tag can be modified using CSS styles.

Access to the complete style sheet information for a web page would be useful in determining the semantic intent of the author. However, it would still not be a general solution to the problem, because some ambiguities would remain, such as disambiguating <BR> tags. The larger problem, however, is that CSS styles are generally defined in external files that are not accessible to a visitor of the page.

3.3 Task Definition

The example in section 3.2 demonstrates that recasting the sentence boundary detection task as a punctuation disambiguation problem, in the same way as many of the previous systems, is inadequate. Although the system described in this chapter is not a radical departure from this type of approach, there are some significant differences. First, the set of potential sentence boundary characters now includes {:, ”, ’, ), }, ], >} in addition to the explicit sentence boundary markers (i.e., {., ?, !}). Second, to address formatting issues introduced by HTML markup, we also consider the boundary between any two spans of text whose parent nodes are sisters; for example, boxes 2 and 3 in figure 3.1, whose parents are <P> and <A> respectively. Similar to Splitta [46] and mxTerminator [107], we also treat the disambiguation process as a classification task. While closely related to this other work, these additions substantially broaden the scope of what can actually be detected.

3.4 Data Preparation

A new annotated corpus was created using the Spinn3r.com weblog corpus developed for the 2009 International Conference on Weblogs and Social Media dataset challenge [18]. This is the same corpus used for story identification and corpus creation and will be described in more detail in Chapter 4. The dataset was created by randomly extracting 1000 blog entries from the collection of 44 million in the Spinn3r.com corpus. The annotation was split into two separate files. One file contained the unaltered HTML of all the blog entries (as extracted from the description field of the Spinn3r item). The other file contained the text of each entry with all the HTML tags removed, including all code within script and style tags. This text file was then annotated by placing a newline after any of the candidate boundary markers that were judged to terminate a sentence. The following additional guidelines were also used to help disambiguate several problematic recurring cases. When in doubt, the visual layout was used as a guide, for example, in determining whether to break after a colon or not.
The generally accepted rule is to break if a multi-sentence block occurs; however, in the corpus a visual line break is often used even for a single sentence. In these cases a rule was adopted to always break on multi-sentence blocks (if it could be clearly determined this was the case) and also when visually indicated on the page. Other particularly troubling cases were song lyrics, poems, hymns and prayers. These passages have a disproportionately high occurrence of non-standard formatting and are often composed of incomplete sentences or phrases. Here too, the visual layout of the page was used as a guide, even if a legitimate sentence was split in two. Although there are no perfect alternatives, this is most likely the single greatest source of noise in the training data.

Two other cases occurred frequently enough to issue a specific guideline: creative uses of punctuation marks to convey emotion (i.e., smileys) and parentheticals. When a smiley was found in the middle of a sentence it was left as is. If one was found at the end of a sentence without any other punctuation mark, then a sentence boundary (newline) was annotated after the smiley. If a smiley occurred at the end of a sentence, but after a punctuation mark, the sentence was delimited on the standard character and the smiley was placed on its own line. Similarly, parentheticals within a sentence were left as is, even if they contained multi-sentence units. If a parenthetical occurred at the end of a sentence, after an explicit punctuation mark, then that parenthetical was also placed on a newline by itself.

The annotated data was then split into development, training and testing datasets. The development set consisted of 100 entries with 1444 sentence boundaries. The training set contained 750 blog entries with 10062 sentence boundaries and the testing set was comprised of 150 entries with 2387 sentence boundaries. See table 3.1 for more detailed statistics. In addition to the standard punctuation marks and HTML characters, several other character classes are also accounted for but are condensed into the Other category for space considerations. These include quotation marks and the various closing bracket types.

Table 3.1: The number of times each character class occurred as a boundary marker (bnd) and the total number of occurrences (tot) for each dataset.

Set      .             ?           !           :           Other         HTML           Total
         bnd    tot    bnd   tot   bnd   tot   bnd   tot   bnd    tot    bnd    tot     bnd     tot
Dev      994    1082   93    108   49    64    18    147   53     520    237    1125    1444    3046
Train    6656   7505   359   466   417   496   239   1090  596    3645   1795   9407    10062   21646
Test     1494   1682   66    85    106   127   35    248   211    1010   475    3152    2387    5291

3.4.1 Preprocessing & Clean Up

Web text poses several other problems besides the stylistic formatting issues described in sections 3.1 and 3.2. Character encoding issues are also a serious class of problems requiring attention because of the assumptions made by many current NLP applications. In general, text on the web is encoded using UTF-8, a multi-byte character encoding. However, many of the systems that implement common NLP tasks, such as part-of-speech tagging and syntactic parsing, require the input text to be encoded in US-ASCII, a single-byte character format.
Even tools built in languages with strong support for wide character encodings are often trained on datasets that only use single-byte characters, such as the Penn Treebank [79], which effectively limits these tools to dealing with single-byte ASCII representations. There are also several other theoretical reasons ASCII is preferable for English texts; for example, implementing efficient string pattern matching algorithms [93] and limiting data sparsity.

For the reasons stated above, and to ensure consistent output in a user’s story, a preprocessing step was applied that attempts to convert UTF-8 encoded web content into a plain ASCII encoded document that matches the original characters as closely as possible. This process reduces data sparsity issues for the sentence boundary detection training algorithm and produces text that is more suitable for many other NLP tools.

In Unicode, it is possible to represent the same character visually using several different underlying character code sequences. For example, U+212B (the angstrom sign Å), U+0041 followed by U+030A (the letter A plus a combining ring above), and U+00C5 (Å) all represent the same character visually on the page. These characters are all considered canonically equivalent. Similarly, there are also characters which are equivalent abstractly, but may convey different meaning by their appearance, such as the digit 1 and the subscript ₁. These characters are considered compatibly equivalent. In order to transform the Unicode text we must convert all the equivalent character sequences into a single representation. The Unicode standard defines several normalization functions that are designed for exactly this purpose and are implemented in the International Components for Unicode library (http://site.icu-project.org/). In particular we use the NFKC normal form, which first decomposes complex characters based on compatibility and then composes them by their canonical equivalence. The practical effect is that most accents, diacritics and other character modifiers are removed and ligatures are split into their component parts, replacing them with single-byte ASCII encoded characters.

The output of the normalization process is nearly sufficient; however, when the normalized text is converted, any character that was not decomposed into an atomic single-byte character is simply removed. There are several problematic cases where this is not acceptable, and these must be handled as special cases. The most common instances are quotation marks and special punctuation characters, such as ellipses. To handle these special cases, a table of 177 entries was created by hand to provide a customized translation from the Unicode character to a similar ASCII character. This step was applied before the normalization to minimize the number of characters deleted.
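A minimal sketch of this normalization step is shown below. It uses Python’s built-in unicodedata module rather than the ICU library, and the translation table holds only a handful of illustrative entries standing in for the 177-entry table described above.

import unicodedata

# Illustrative subset of the hand-built translation table; the full
# 177-entry table is not reproduced here.
TRANSLATIONS = {
    "\u2018": "'", "\u2019": "'",      # curly single quotes
    "\u201c": '"', "\u201d": '"',      # curly double quotes
    "\u2013": "-", "\u2014": "--",     # en / em dashes
    "\u2026": "...",                   # ellipsis
}

def to_ascii(text: str) -> str:
    """Translate known special characters, apply NFKC normalization,
    then drop anything that still falls outside single-byte ASCII."""
    text = text.translate(str.maketrans(TRANSLATIONS))
    text = unicodedata.normalize("NFKC", text)
    return text.encode("ascii", errors="ignore").decode("ascii")

print(to_ascii("Joel said \u201cvery few people do anything creative\u2026\u201d"))
# Joel said "very few people do anything creative..."

As in the procedure described above, the custom translations are applied before normalization so that characters with no reasonable decomposition are rewritten rather than silently deleted.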
3.5 Feature Sets for Sentence Boundary Detection

The supervised classification approach adopted in this chapter uses a standard feature vector representation for each training and testing instance. This section describes several different feature sets used to generate the feature vectors tested in the experiments.

Feature Set 1: As a starting point, the features in Splitta [46] were used. In Splitta, each example takes the form L.R, where L is the (one word) context to the left of the candidate period and R is the (one word) context to the right of the candidate period. Our system uses a similar, but generalized, pattern, LHR, such that L is still the left context. However, L is not necessarily only one word. Instead it is broken up into L0, L1, ..., Ln, where L0 indicates the token immediately preceding the candidate boundary marker and L1 ... Ln are the next n words to the left. H is a generalized sentence boundary hypothesis character that is not limited to a period. If the hypothesis character is one of the specified candidate characters listed in section 3.3, then H is simply that character. On the other hand, if the boundary is at the edge of an HTML span and the final character of the span is a punctuation character (as defined by the C function ispunct), then H is that character. Otherwise, if the final character is a digit, then H is set to 8, a special character used to indicate that the hypothesis character was a digit. Finally, if the final character is an uppercase letter then H is set to C, otherwise it is set to c. Similar to L, R is the right context, with R0 being any token immediately following the hypothesis character, without any intervening spaces. Like Splitta, all features are binary. Table 3.2a lists all the features used in this set.

Feature Set 2: Feature Set 1 contained textual features similar to those used in Splitta (Splitta features 5 and 6 were not included because they dramatically hurt performance). In addition to several new textual features, the second feature set introduces some HTML specific features that attempt to address tag formatting ambiguities. The first HTML feature considers the parent tag (e.g. <P>) of the current span of text, and the distance from the current hypothesis to the end of this span. Similar to Splitta, we use the log distance rounded to the nearest integer, denoted in our tables as log(.). Similarly, we also consider the tag of the parent’s right sibling and a combination of the two. We also use a small subtree of the HTML in the neighborhood of the candidate boundary character. The tree is built from the bottom up and starts with the span’s parent tag and the span’s right and left sibling tags. If both the right and left sibling tags are not empty, or we are at the root of the tree, then we stop. Otherwise we traverse one node up the tree and repeat. The complete set of features introduced in this set is enumerated in table 3.2b.

Table 3.2: Sentence delimiting feature sets 1 & 2

(a) Feature Set 1:
 1. H = h
 2. L0 = w_i
 3. R0 = w_j
 4. len(L0) = l
 5. len(R0) = r
 6. cap(R0)
 7. cap(L0)
 8. (L0, R0)
 9. (L0, cap(R0))
 10. (L0, R1 = w_{i+1})
 11. (L0, cap(R1))

(b) Feature Set 2:
 1. L1 = w_{i-1}
 2. R1 = w_{i+1}
 3. W = (L0, H, R0)
 4. (L0, H, cap(R0))
 5. (L0, H, R1)
 6. (L0, H, cap(R1))
 7. Tree
 8. (ParTag, log(.))
 9. (RightTag, log(.))

Feature Set 3: Feature Set 3 does not add any structurally new features, but continues by increasing the context and adding more detailed compound features. See Table 3.3 for a complete list of the new features added in addition to all the previous features from sets 1 and 2. As in Feature Set 2, log(/) denotes the log distance to the beginning of the span.

Feature Sets 4-7: The final feature sets take into account the sequence of capitalization surrounding the hypothesis boundary character. The sequence is built in the following way. Any words to the left of H, starting with L1, are represented by the character c if the word is lowercase. It is represented by an A if the entire word is uppercase and a C if only the first character is uppercase. If the word does not start with an alphabetic character then the actual character is used.
If there are no words at L1 or beyond, then the sequence begins with the character e. After handling the words prior to the hypothesis, the prefix L0, the hypothesis H, and the postfix R0 are appended to the sequence with the same character rules as before, with the following two exceptions. If the hypothesis character is the first character in a span, then an N is appended to the sequence; and if the hypothesis is immediately preceded by a space, then an s is appended to the sequence. The same rules apply for the right context R1 and above. The only difference between Feature Sets 4-7 is the amount of context that is used. Feature Set 4 looks at 2 words to the left and right (i.e. to L2 and R2). Each additional feature set increases the context by 1 word.

Table 3.3: Sentence delimiting Feature Set 3
 1. (Tree, log(.))
 2. (Tree, log(/), log(.))
 3. (cap(R0), log(.))
 4. (cap(R0), log(/))
 5. (cap(R0), log(/), log(.))
 6. (cap(R1), log(.))
 7. (cap(R1), log(/))
 8. (cap(R1), log(/), log(.))
 9. (H, R1)
 10. (L1, W)
 11. (R0, log(/))
 12. (R0, log(.))
 13. (R1, log(/))
 14. (R1, log(.))
 15. (L0, log(/))
 16. (L0, log(.))
 17. (L1, log(/))
 18. (L1, log(.))
 19. R2 = w_{i+2}
 20. (R1, R2)
 21. (L1, W, R1)

3.6 Chapter Results

The performance of each feature set was assessed using several classification algorithms on several standard evaluation criteria in order to determine the optimal feature set and classifier combination. The evaluation criteria for all the experiments are Precision, Recall, F-score (where F_0.5 indicates weighting the precision twice as much in the harmonic mean) and Cohen’s kappa between the computer generated hypotheses and the gold standard. Although the use of kappa is relatively non-standard, it provides a useful point of comparison to human levels of performance when an inter-rater agreement study has been performed. Although no inter-rater agreement study was performed for this work, agreement is assumed to be quite high (> 0.9), since more difficult tasks, such as part-of-speech tagging, are reported to have these levels of agreement.

The first set of experiments applied the Perceptron (P) algorithm [111]. This is a simple classifier that uses a linear decision boundary to separate positive and negative examples. During training, each example is considered one at a time. If an example is misclassified by the current linear decision boundary, then an update to the model is performed that moves the boundary toward correctly classifying this example. Each feature set was evaluated by performing a 10-fold cross-validation on the training data as well as by learning a model from the training data and applying that model to the testing data. Feature Set 1 has the worst performance on the test set with a precision of 0.926, a recall of 0.439 and an F_1-score of 0.745. There is a large jump in performance when the features introduced in Feature Set 2 are included (P: 0.913, R: 0.798, F_1: 0.851).
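For reference, a minimal sketch of the perceptron training loop described above is given below. The named binary features and toy examples are invented for illustration; this is not the implementation used in the experiments.

def perceptron_train(examples, epochs=10, lr=1.0):
    """Train a simple perceptron. `examples` is a list of
    (feature_set, label) pairs with labels in {+1, -1}; features are
    the names of the binary features that fire for the instance."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for features, y in examples:
            score = b + sum(w.get(f, 0.0) for f in features)
            if y * score <= 0:                      # misclassified: update
                for f in features:
                    w[f] = w.get(f, 0.0) + lr * y
                b += lr * y
    return w, b

def predict(model, features):
    w, b = model
    return 1 if b + sum(w.get(f, 0.0) for f in features) > 0 else -1

# Toy boundary-detection examples with hand-named binary features.
train = [
    ({"H=.", "cap(R0)"}, +1),        # period followed by a capitalized word
    ({"H=.", "L0=Mr"}, -1),          # abbreviation, not a boundary
    ({"H=</P>"}, +1),                # end of an HTML paragraph
    ({"H=,"}, -1),
]
model = perceptron_train(train)
print(predict(model, {"H=.", "cap(R0)"}))   # expected: +1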
One of the most popular linear classifiers used in natural language processing is the Support Vector Machine [15] (SVM). Unlike the standard Perceptron, which simply finds any decision boundary that separates the data, an SVM finds a decision boundary such that the margin between the two closest positive and negative examples (on opposite sides of the boundary) is maximized. In theory this leads to less over-fitting and better performance on unseen examples. In real-world data, one cannot guarantee the linear separability of the training examples, so soft margins are used instead. Soft margins introduce slack variables, and a regularization parameter C is used to adjust the trade-off between training errors and the maximum margin constraint. Unfortunately, it is difficult to know this parameter a priori and it is usually obtained empirically. Although it would be preferable to determine this coefficient through cross-validation, the training time needed for each pass was prohibitively long given the available computational resources. So instead, a simple line search was performed by training a model for each value of C in the range of 2^-21 to 2^21 using the training data and then testing on the development set. The model whose value of C performed the best on the development set was then applied to the test set (the optimal value of C on the development set was 2^3). For these experiments we used the SVM-light implementation [64].

Dredze et al. [37] offer another modification called a Confidence-Weighted linear classifier (CW) that keeps track of the uncertainty associated with each parameter and uses this information to adjust how much the model is updated on each misclassification. This classifier also does much better with fewer features, achieving an F-score above 0.9 by the 3rd feature set. We see an improvement over the standard Perceptron both in absolute performance and in the stability of the results between the training cross-validation and test data. This method also appears to do slightly better than SVMs, and has the advantage of not having any parameters to tune and being extremely efficient to train.

Finally, one other approach was investigated. Part of the reason the Perceptron performed poorly, in comparison to the other algorithms, is that there are few guarantees about the state of the decision boundary at the end of training. As mentioned above, the algorithm will terminate with any boundary that separates the data, even if it is precariously close to many examples of one class or another. When the data is not linearly separable there are even fewer guarantees. While it is not possible to train a Perceptron to find the maximal margin between the training examples, as in SVMs, it is a simple modification to impose a margin, which could lead to improved performance if chosen correctly. Similarly, it is also easy to modify the algorithm to support margins of different widths for each class, which Li et al. [69] show can significantly improve performance when the data is heavily biased toward one class or the other.
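A sketch of how the update rule changes when uneven margins are imposed, in the spirit of Li et al. [69], is given below. It reuses the feature representation from the earlier perceptron sketch, and the margin, learning rate and bias defaults are arbitrary placeholders rather than the tuned values reported later.

def paum_train(examples, epochs=10, lr=1.0, tau_pos=1.0, tau_neg=4.0, b=0.0):
    """Perceptron with uneven margins: an update fires whenever the signed
    score fails to clear the margin for that class, so one class can be
    given a wider margin than the other. The tau_pos/tau_neg values here
    are illustrative only."""
    w = {}
    for _ in range(epochs):
        for features, y in examples:
            score = b + sum(w.get(f, 0.0) for f in features)
            margin = tau_pos if y > 0 else tau_neg
            if y * score <= margin:                 # inside the margin: update
                for f in features:
                    w[f] = w.get(f, 0.0) + lr * y
                b += lr * y
    return w, b

Which class receives the wider margin, and how wide it should be, is exactly what the parameter search described next has to decide.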
Our implementation of the Perceptron (with uneven margins) also has two other parameters that need to be set: a learning rate and the initial bias b. Similar to the regularization parameter C in the soft-margin Support Vector Machines, these parameters are usually determined empirically. Since the algorithm is so efficient to train (at least 2 orders of magnitude faster than SVM), a relatively thorough search can be performed. To find the parameters, a hill-climbing-like grid search was performed. For each parameter an upper and lower bound was specified. Then for all combinations of parameters that included the upper bound, lower bound and mid point, a 10-fold cross-validation was performed using the training data. After all combinations were tried, the top n were kept. For each of these parameter sets new intervals were created by using the current parameter as the mid point and taking half the previous interval span as the new upper and lower bound. After each round, n was decreased, and the process terminated when the change in performance was below some threshold or a maximum number of rounds was reached.

Another advantage of this approach is that it allows us to search for parameters that maximize any criterion of our choosing. The other classifiers have already shown a large improvement in F_1-score over two state-of-the-art approaches. However, the improvement is entirely due to an increase in recall and, generally, as the recall improves in each of these systems, the precision decreases. For many applications requiring sentence boundaries, precision is actually the more important evaluation criterion. So, in addition to learning an (uneven margin) Perceptron model (P_F1) whose parameters maximized the F_1-score (parameters: 0.22 (learning rate), b: 196.88, -m: 128.13, +m: 106.25, i: 36), a model was learned (P_F0.5) that maximized the F_0.5-score (parameters: 0.25 (learning rate), b: 193.75, -m: 131.25, +m: 43.75, i: 34), which weights precision twice as much. Although the recall of the resulting classifier drops considerably, it is still 6-7% greater than the baseline systems and the precision is nearly 3.5% higher than any of the other systems.

Table 3.4 compares the best results of each of the models. The left hand side of the table indicates the mean values of each criterion for each fold of the cross-validation (except for SVM, where the left hand side indicates the result on the development set). The right hand side shows the performance of a trained model applied to the test data.

Table 3.4: The best performing feature sets for each classification algorithm.

                          Training (CV)          Testing
Algorithm   Feature Set   P      R      F1       P      R      F1     F0.5   kappa
P           6             0.950  0.924  0.936    0.933  0.851  0.890  0.915  0.804
SVM         5             0.909  0.894  0.902    0.907  0.907  0.907  0.907
CW          5             0.946  0.940  0.943    0.920  0.897  0.908  0.915  0.832
P_F1        5             0.950  0.951  0.950    0.901  0.913  0.907  0.903  0.826
P_F0.5      5             0.977  0.899  0.936    0.962  0.795  0.871  0.923  0.778
mxTerm                                           0.928  0.735  0.821  0.882
Splitta                                          0.910  0.721  0.805  0.865

Statistical significance was tested between mxTerminator and each of the models, except SVM, using the method described by Yeh [133]. Recall, F_1 and F_0.5 were significantly better (p < 1e-7) than mxTerminator for all of the models. There was no statistically significant difference in precision between the standard Perceptron and mxTerminator (p < 0.39), while mxTerminator’s precision was significantly better than that of the Confidence-Weighted classifier (p < 0.038) and P_F1 (p < 2e-5). However, the precision of P_F0.5 was significantly better than mxTerminator’s (p < 1e-7).

Figure 3.3 illustrates the learning curves of the standard (unoptimized) Perceptron and the Confidence-Weighted classifier.

Figure 3.3: Learning curves for the default Perceptron (a) and the Confidence-Weighted classifier (b), plotting average F-score on the test data against the percentage of training data used, for Feature Sets 1-4.

The shape of these graphs shows there is a large improvement from including the first few feature sets, but that only a small gain is achieved after about the 3rd. The curve is slightly steeper for the 4th feature set and, given more training data, it might be advantageous to include these features.
Although the performance using all the training data is similar between the two classifiers, these graphs highlight the instability of the Perceptron and leave doubts about how much data is actually needed to achieve adequate performance.

3.7 Chapter Conclusions

Sentence boundary detection is a critical first step in nearly all complex natural language processing applications, and especially in the story generation architecture proposed in this thesis. The vast majority of textual content is now produced and consumed on the web, and an ever larger share is being generated by ordinary people on their personal web pages and blogs. This chapter showed that existing SBD technologies, primarily developed for well edited text genres, suffer significant performance degradation when applied to web data. This loss of performance greatly inhibits the potential success of the proposed interactive story generation model by impairing the quality of the features used in Chapters 4 and 6 and also reducing the quality of the natural language presented to the user, which can break the user’s immersive experience. The approach described in this chapter is specifically targeted toward weblog text in order to maximize performance on the target genre of text used for the main storytelling application. Consequently, all of the criticisms of supervised classification approaches, such as performance penalties when changing genres, apply. However, the results show that these problems are not particularly dire for sentence boundary detection. The learning curves in section 3.6 show that only a very small amount of training data is needed to achieve high levels of accuracy. Also, since weblogs are simply web pages (Spinn3r.com uses a very liberal definition of weblog that includes advertisements, news articles and almost any other genre found on the web at large), the approach developed here should work nearly as well on more general web data. For optimal use on general web pages, this SBD approach would work best in combination with a content detection algorithm to remove page headers, navigation elements, advertisements and other spurious elements.

Chapter 4
Corpus Creation

4.1 Chapter Introduction

The backbone of this data-driven storytelling approach is, of course, the data itself. Therefore, its success ultimately hinges on the availability of a large scale corpus of narrative stories with sufficient coverage over a broad range of human activities, but also the depth to produce interesting variations for each new story generated. There are several possible sources for gathering stories, but one will end up suiting our needs better than the others. It is not surprising, because stories play such an important role in many aspects of our lives, from education to cultural histories, that many story collection efforts have been undertaken in the past. One of the most famous examples is the Federal Writers’ Project of the Works Progress Administration [73], which solicited thousands of stories from hundreds of authors as part of a broad economic stimulus package. Other collection efforts have also focused on gathering stories from ordinary people’s experiences, using a variety of techniques to educe the relevant narrative content. A skilled moderator, such as Ira Glass on This American Life, can prod, question and entice ordinary people into telling stories compelling enough that millions of people tune in to hear them every week.
An interesting academic example of this technique is the Story Listening Systems [22], which uses interaction with virtual characters to lure children into sharing their own stories in an effort to improve literacy. 57 The varying methods used in each of these story collection efforts allow for dif- ferent levels of control in shaping the content of the resulting corpora. Whereas the Federal Writer’s Project prescribed only very loose restrictions on the types of stories permitted, a directed story collection strategy, such as This American Life, can pick and choose the genre and story topics carefully, creating a highly focused and domain spe- cific assortment. Although these and other types of manual aggregation strategies offer several advantages in their ability to control the composition of the collection, they all share a common fatal flaw. The WPA is one of the largest manual assemblages ever created, consisting of thousands of stories commissioned as part of the New Deal and thousands more assembled from existing documents spanning from 1889-1942 1 . How- ever, this collection took hundreds of writers and historians over 4 years to produce and was heavily criticized at the time for the extraordinary cost of the program. Even pop- ular, profit making endeavors are generally only capable of producing a handful of new stories per week. The cost in terms of labor and time to manually create the sufficiently large and diverse collection of stories required for this thesis is simply not feasible. All of the strategies discussed so far are relics of the pre-Internet era, where the collection process essentially required being actively engaged in the solicitation of new content. Although millions of personal stories have probably existed for many decades among the millions of personal video and audio recordings captured by families across the world, as well as the daily journals some people keep, these resources might as well have not existed for the anthropologists, historians and other groups interested in the rich knowledge of human experience this data contained. The advent of the Internet completely changed the availability, access, and even perception, of what was once very private and inaccessible data. 1 http://memory.loc.gov/wpaintro/wpafwp.html 58 According to Matthew Gray [51] there were only 130 unique websites in June of 1993; however in only 7 years, the number exploded to well over 20 million 2 . Around this time, in the late 1990s, blogging was introduced to the world and descriptions of common daily activities by ordinary people started to appear in small numbers. In the years following, new tools were created allowing non-technical users to easily publish their content online [14], dramatically increasing the amount of user generated con- tent. The popularity of blogging has only increased since then, as indicated by Techno- rati’s [130] 2008 State of the Blogosphere Report. In it they disclose they have indexed more than 133 million weblogs since 2002 and nearly 1 million blogs are updated in a 24 hour period, highlighting the huge shift in the way content is published on the Web compared to a decade ago. Now, social networking sites, such as myspace.com, face- book.com and twitter.com have almost completely transformed the primary purpose of the Web into a medium for sharing our own personal experiences. Social networking sites are also a great new source for discovering the types of activities and experiences people are engaging in on a daily basis. 
However, much of this data is either written in an extremely abbreviated form, provided through meta-data, built-in functionality, or shared through pictures, video and other multi-media. On the other hand, weblogs, as the name conjures up, have been a more conventional online avenue for individuals to express their thoughts and experiences in a more complete and traditionally structured way. However, despite the suggestive name, weblogs are not synonymous with narrative journal entries nor even with personal content at all. From the beginning, weblogs have been categorized into two distinct types [14]. Filter-style blogs act as portals to existing content, where users may comment and post opinions; for example, blogs that link to news items and discuss the issues raised by these articles. 2 netcraft.com [94] 59 The other type is journal-style blogs, which are entries for users to keep a log of their activities online, much like they do in a personal diary. These types of blogs contain countless stories ranging from the courageous battles of cancer survivors to mundane life events, such as a visiting sibling. The combination of an active, user-driven community, along with a format resem- bling a traditional genre of composition, make journal-style weblogs the perfect source for mining personal narratives. However, even though journal-style weblogs provide an abundant data source for personal content, simply aggregating them is insufficient to produce a high quality corpus of personal narratives. While these two types of blogs may be distinct and well defined, they are generally lumped together on the Web and there is no readily available way to distinguish one from the other. To compound the problem, only a fraction of personal weblog entries are composed of narrative discourse. Many personal entries are philosophies on life, political opinions, rants about friends and a wide variety of other non-narrative content. The remainder of this chapter will describe the methods used to find and extract only the personal narrative blog entries that are hiding in plain site among the vast sea of web pages. 4.2 Previous Automated Story Collection The automated story collection methodology described in this chapter is based on research conducted by Gordon & Ganesan [49]. This work began by trying to iden- tify story passages from conversational speech recorded by the authors. A corpus of stories was created by transcribing the audio streams of five recorded sessions by hand and manually annotating textual passages, as either Story or Not Story. Each story was annotated by a unique pair of raters (out of five pairs) using the following definition: 60 The definition of a story is somewhat ambiguous. Generally, the stories that people tell are about events that have happened in the past. Accordingly, people use a lot of past tense verbs (e.g. said, went, gave) when telling sto- ries. However, not all descriptions of events that happened in the past count as stories. Stories give descriptions of specific events that actually occurred, not generalizations over multiple events or times. Stories generally have a sequential structure to them, providing a description of events that happened one after another. Collectively, these events are composed to create a com- plete narrative. Finally, stories usually have some point to them: the reason that the person is telling the story in the first place. Sometimes stories are truly pointless, though, but some message is usually still conveyed. 
To assess the level of agreement between the raters and establish the viability of automating this task, Cohen's κ statistic was computed between each pair of raters (each line of the printed transcript was used as the basic unit of judgment). The average reported score over the five raters was 0.68 with a standard deviation of 0.18.

To automatically find the stories within these transcripts, Gordon & Ganesan treated the problem as a binary classification task. Using a portion of the transcribed speech data, they trained a Naïve Bayes classifier over a sliding window of text using unigram and bigram word features. A mean-average filter was then run over the resulting probabilities to help smooth the fragmentation that this binary classification approach introduces. Although this approach did not work well when applied to transcripts generated by an off-the-shelf automatic speech recognition system (IBM ViaVoice for Windows Release 10), the results on hand-transcribed data were much more favorable (0.530 precision, 0.629 recall, 0.575 F1-score).

In addition to their attempt on conversational speech data they also applied their technique to Web data. Also choosing weblogs, they collected 400,000 unique URLs using the application programming interface of Technorati.com, a popular weblog search engine, by submitting thousands of queries formed of keywords taken from an existing broad-coverage knowledge-base of human activities [47]. 150 of these URLs were chosen at random and annotated using the same guidelines stated above. A new classifier was trained on this data and then applied to each of the 400,000 unique URLs every day over the course of 393 days. In total, 4.5 million stories were identified, totaling 1.06 billion words.

More recently, another attempt was made to extract stories using sentences as a basic unit of classification instead of arbitrary spans of text [48]. The main differences lie in the classification algorithms used and in the types of features considered. In the previous approach, the sliding window technique effectively limits the system to considering only very shallow features, such as bag-of-words. Splitting the documents into sentences beforehand allows for a more natural boundary for annotation (as opposed to a line in a transcript) and facilitates more linguistically motivated features. As an additional benefit, the stories identified in this manner always begin and end on a complete sentence (at least when the sentence boundary detection works), even when a false negative is encountered.

4.3 Story Collection

The previous work showed the possibility of automatically mining narrative stories from the Web. Unfortunately, the approaches described in the previous section suffer from several flaws that limit the ability to use the resulting corpus for this work. In this section I will describe the steps taken to create a better suited collection of stories and how these steps address some of the problems of the previous approaches.

4.3.1 Corpus Selection & Annotation

The method used in section 4.2 that identified the 400,000 candidate URLs introduces a selection bias. First, since only weblogs ranking highly against certain predetermined keywords were included in the set, it is possible that certain types of stories are overrepresented in the data, while others are absent entirely.
It would be preferable if all weblogs were considered so that at least every genre of story is represented by the frequency in which people tell them. As a beneficial side effect, this would also give us an accurate estimate of the amount of personal narrative that is written by people in comparison with other types of commentary. Second, the URLs were identified only on the basis of a small sample of content. There is only limited evidence that the content on these pages will continue to be good candidates in the future, as was assumed by the previous research. Crawling the Web is a non-trivial exercise that requires significant bandwidth and hardware resources. Ping servers, such as the one provided by welogs.com, help reduce some of these requirements by providing a service with which weblog authors can sub- scribe 5 . Once an author has subscribed, the ping server will then be alerted anytime a blog has been updated or a new entry has been posted. Every so often, the ping server will publicly post a list of all the weblogs that have been updated since the last list was issued. These lists can be used by search engines or other interested parties to more efficiently crawl only the new content, without having to sweep the entire web on every pass. Even though ping servers simplify the crawling procedure, mining this set of sites is still a significant challenge. Despite the greatly reduced set of pages, over 30,000 5 This is usually done behind the scenes automatically by whatever blogging software the author chooses to use. 63 updates typically occur in a 5 minute period, which still requires a considerable amount of engineering and resources to deal with. Fortunately, this is a relatively common problem for small research groups and other organizations interested in this data, which has spawned a new community of inter- mediate service providers that compile this data and supply it in a more convenient and easy to manage package. One of these companies, Spinn3r.com, released 44 mil- lion weblog entries, which they are willing to take full responsibility for redistributing, spanning a two month period as part of the 2009 International Conference on Weblogs and Social Media. In addition to simply crawling the sites and providing the raw con- tent, Spinn3r.com also includes meta-data associated with each post provided by web feed information, such as Atom Syndication and RSS. This includes things like the name of the author, the time and date the post was published and any category tags the author used to describe the post. Along with the syndicated meta-data, Spinn3r.com also includes information from several custom applications that attempt to identify spam, infer the language and extract the main content of the post. Although the Spinn3r.com dataset represents a smaller timeslice than the original corpus developed by Gordon & Ganesan, two months as opposed to a year, the advan- tages of being an unbiased, comprehensive set that has already been preprocessed in useful ways motivated the decision to adopt this dataset as the basis of the new story corpus. For the time being only the static, publicly released dataset is being used, how- ever Spinn3r.com offers an academic license to their data stream and a continuously updated corpus could be incorporated in the future. 
The original intention of the previous story identification techniques, to narrowly extract only the narrative content of a body of text, is compelling and would be highly valuable if the accuracy of such a system were considerably greater than what can reasonably be achieved. Even state-of-the-art structured learning algorithms that optimize the set of classifications over the entire sequence of data points are not likely to achieve adequate performance that would eliminate the fragmentation problems caused by excessive false negatives, simply because the inter-rater agreement is not high enough to learn a precise classifier. Instead, this work uses an entire weblog entry as the basic unit of classification, taking advantage of the short, but complete, nature of these entries. Typically weblogs are updated on the order of days, limiting the entry to a single meaningful event in the author's life. Even though some fraction of weblogs contain discourse on multiple disjointed topics, the advantages of keeping entire stories intact, as well as being able to use advanced features over larger discourse units, make up for any small amount of noise introduced by including irrelevant content.

A new gold standard training corpus was developed using a small randomly chosen subset of 5270 English language weblog entries. Each of these entries was annotated using the previous definition specified in section 4.2 along with the following, slightly more technical, characterization amended to it:

Discourse that describes a specific series of causally related events in the past, spanning a period of time of minutes, hours, or days, where the storyteller or a close associate is among the participants.

In order to gauge inter-rater agreement, the subset of weblog entries was hand labeled by two separate annotators in an iterative process until all the examples were given a categorization both annotators agreed upon. A new agreement study was performed for several reasons. First, the original study was performed on transcripts from conversational interviews. It seems reasonable to assume that identifying stories in Web text, which is not presumed to have any narrative content at all, is a more difficult task. Second, agreement is hypothesized to be greater for classifications made at the document level than for classifications of arbitrary spans of text. Third, the evaluation of the agreement study is used to bootstrap a better gold standard corpus.

The first round of annotation began with two annotators independently labeling each entry using the updated definition. In case an entry elaborated on multiple subjects, it was only labeled a positive example if more than 50% of the content adhered to the definition. To assess the level of agreement between the two annotators Cohen's κ was computed, which was again found to be 0.68. Although this indicates relatively high agreement, and is in line with the previous experiments, only 203 of the entries were classified as stories and we disagreed upon 177 entries.

After completing the first round of annotation the two annotators discussed, in general terms without looking at or referencing actual blog entries, some of the problematic issues that were encountered. In light of this discussion each of the blog entries that were disagreed upon were reshuffled and classified again by both annotators. Once complete, κ was recomputed, which improved to 0.867. The number of entries classified as stories increased to 229 and only 66 disagreements remained.
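For readers who want to reproduce these agreement numbers, the following is a minimal sketch of the Cohen's κ computation used above. It is written in Python purely for illustration; the two label lists are hypothetical stand-ins for the annotators' actual judgments.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if each annotator labeled independently at their own base rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a.keys() | counts_b.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical example: two annotators labeling five weblog entries as Story (S) or Not Story (N).
print(cohens_kappa(["S", "N", "N", "S", "N"], ["S", "N", "S", "S", "N"]))  # ~0.615
```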
To resolve the final differences each of the remaining 66 entries were openly dis- cussed between the annotators until a final judgment was settled upon. Of these entries, there were three prominent reasons for disagreement. The first reason arises due to an ambiguity in the causal structure. Pure lists of events chronicling a person’s experiences, which occur quite frequently, do not satisfy the given definition of a story because they lack the necessary causal structure. However, there is a continuous gradation between a simple chronology and a full narration, which leads to unpredictable classification between annotators. The second source of disagreement arose from the authorial intent of the blog author. Typically the positive story examples that were initially agreed upon were authored specifically for the purpose of telling a story. However, many of the 66 conflicting examples were a mixture of two distinct intents. For example, a cooking recipe might be described by telling the story of the first time the author prepared the dish. Another common example of this type of disagreement is technical help posts, for example a post on an automotive repair site telling a story of how the author’s car broke down. Although the purpose of these posts is to share (or ask for) knowledge and not explicitly to tell a story, it was decided to treat them as positive examples none-the-less. The third source of disagreement was based on the time scale of the story. A few entries closely followed our definition, but occurred over several years or even the entire life of a person. While not strictly adhering to our definition, a judgment was made on a case by case basis. At the end of this process the gold standard corpus contained 4985 entries 6 of which 267 were annotated as stories. 4.3.2 Data Preparation & Features Investigated Given the newly created corpus, this section describes the data preparation necessary for extracting the features used in the new story identification system. Discourse theories, such as Rhetorical Structure Theory [75], have been proposed as a good method for analyzing different genres of text and it could be argued that the differences in discourse elements are what define the characteristics of each genre. Although some automated tools for rich discourse processing are available, the perfor- mance of these systems on open domain text is inadequate for obtaining any useful information. However, various levels of approximation have been developed that may provide useful features. 6 The annotators disregarded any entry from the initial set of 5270 if it was no longer available on the Web. 67 Several of the features investigated in this section use syntactic parse trees labeled with dependency relations. In order to obtain this information, the sentence bound- ary detection system that was described in Chapter 3 was first applied to each weblog document. The resulting sentences were then submitted to an off the shelf shift-reduce dependency parser [113] that is among the fastest and most accurate available. Using the shallow information readily available as well as syntactic and some discourse structure, a range of features were used to model each document. Lexical & Miscellaneous Features: At the most basic level, several lexical features were investigated. Although very little structural information is provided, the distribu- tion of words and phrases in a document can be very revealing of its meaning or genre. 
The most typical means of capturing this information is using a bag-of-words distribution or n-grams, which generalize the bag-of-words idea to multi-word phrases. Standard preprocessing steps, such as lowercasing and removing stop words, were applied.

To more specifically target the types of words likely to be used in narrative text, according to our definition, four sets of keyword features were hand-crafted. The first keyword feature counted the relative frequency of first-person pronouns in the document. The second feature counted the relative frequency of past-tense verbs. The third counted the relative frequency of temporal words, such as today, yesterday, week, month, day, last, and next. The final keyword set contained a list of the top 10 most frequent (non-stop) words found in the stories of the development set, and again the relative frequency in the document was used as the feature's value. As a more general approach, the intuition in the previous paragraph was implemented using an n-gram approach applied to the part-of-speech tags automatically assigned to each sentence by the parser.

Finally, a bag-of-words approach that takes advantage of the HTML formatting was investigated. The reasoning behind this type of feature is that narratives generally contain fewer words in <BLOCKQUOTE> and other special formatting elements than other genres of text, such as filter-style weblogs that often refer to other content. The features for this approach were simply encoded by appending the parent HTML tag to each word in the document. However, the feature values were encoded in two different ways and each method was included as a separate feature. The first approach treated each pair exactly as if it were a normal bi-gram. For example, the HTML snippet <P>I went to the beach yesterday</P> would be encoded as the following bag of words: P/i:0.25, P/went:0.25, P/beach:0.25, P/yesterday:0.25 (the true distribution would of course be over the entire document and not a single sentence or HTML segment). The second approach modeled the feature value as the probability of a word in the document given its parent's HTML tag.

Syntactic & Shallow Discourse: In addition to the simple lexical features, several other feature sets were developed that tried to leverage implicit structural elements of the document. The first feature set, Dep. Rel, uses the syntactic structure discovered by the dependency parser to model the distribution of dependency triples in the document; for example, the relation John:SUBJ:ate, signifying that John is the subject of the verb ate. In effect these act like syntactic tri-grams that can capture longer range dependencies between words than typical n-grams. They can also be seen as capturing a small amount of semantic knowledge through their partial predicate argument representation.

Features based on entity-grids [8] were used to try to model syntactic relationships that occur between sentences of a document, unlike dependency relations, which only model relationships within a sentence. An entity-grid is a two-dimensional matrix in which the columns correspond to the words of the document and the rows correspond to each sentence. A cell in the table is populated by the syntactic dependency relation of the given word in the corresponding sentence (or null if it is not present). In the previous example, John would have the relation SUBJ in the table for the appropriate sentence number.
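To make the data structure concrete before the feature extraction is described next, here is a minimal sketch of building a toy entity-grid from per-sentence dependency triples and reading off adjacent-row transition counts. The input format and helper names are illustrative assumptions, not the implementation used in this work.

```python
from collections import Counter

def build_entity_grid(parsed_sentences):
    """parsed_sentences: one {word: dependency relation} dict per sentence.
    Returns {word: [relation or '-' per sentence]}, i.e. the columns of the grid."""
    vocabulary = {word for sentence in parsed_sentences for word in sentence}
    return {w: [sentence.get(w, "-") for sentence in parsed_sentences] for w in vocabulary}

def transition_features(grid):
    """Count how each word's relation changes between adjacent sentences
    (bigram-style) and normalize the counts into a distribution."""
    counts = Counter()
    for column in grid.values():
        for prev, curr in zip(column, column[1:]):
            counts[(prev, curr)] += 1
    total = sum(counts.values())
    return {transition: count / total for transition, count in counts.items()}

# Hypothetical parses for: "John ate lunch." / "John left."
grid = build_entity_grid([{"John": "SUBJ", "lunch": "OBJ"}, {"John": "SUBJ"}])
print(transition_features(grid))  # {('SUBJ', 'SUBJ'): 0.5, ('OBJ', '-'): 0.5}
```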
Features are extracted by counting the adjacent column entries, in a similar fashion to n-grams, and using the final distribution as the feature set. For a more complete example of an entity grid refer to Figure 2.2.

An entity-grid is a data structure that enables generating features that model the distribution of how dependency relations for particular words change throughout the document. Although this has been proven to be an extremely useful feature for automatically evaluating the coherence of a document, it is unclear that these features are relevant to discriminating texts of different genres. A more appropriate distribution is tracking how other properties of the words change within a specific dependency relation across sentences in the document. In particular, it is hypothesized that the distribution of how the plurality of an individual dependency relation changes throughout the document will be a more discerning feature for this task than entity-grid based features. For example, the subject of a narrative is likely to be relatively constant throughout the story, meaning the number (e.g. singular) will not change often. However, this is likely to be more variable in a news story and extremely sporadic in a list of events. Figure 4.1 illustrates a simplified partial example of a number-grid and how the features are generated from the data-structure.

The plurality of a word was only considered for nouns and pronouns and was determined using a set of deterministic rules. When the number of a pronoun could unambiguously be determined, it was assigned its value using a look-up table. For ambiguous pronouns, such as you, it was marked as plural or singular (PL/SING). Wh-pronouns, such as who, were labeled as unknown (UNK). All other nouns were labeled using the part-of-speech tag as a guide, for example NN signifies a singular noun, whereas NNS signifies a plural noun.

Figure 4.1: A partial number-grid for the document: I drove to the beach this morning. I found $20 in the sand. You never know what you'll find.

  (a) Number-grid
         SBJ        OBJ
    1    SING       -
    2    SING       PLUR
    3    PL/SING    UNK

  (b) Feature vector representation
    SBJ:S→S    SBJ:S→P/S    OBJ:-→P    OBJ:P→U
    0.25       0.25         0.25       0.25

Topic Modeling: The final set of feature types explored use Latent Dirichlet Allocation [13] to model the distribution of topics between document genres. LDA is a hierarchical Bayesian network that models the joint probability of the observed sequence of words in a document and a set of corresponding latent topics, defined in the following way:

p(w, z, θ | α, β) = p(θ | α) · ∏_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)

In this equation w represents the observed word, z is the latent word topic and θ is a multinomial distribution from which the word topics are drawn. α and β are parameters of the Dirichlet priors used to smooth the document and word distributions respectively. Any approximate graph inference algorithm can be used to estimate the probability distributions specified by the model (Gibbs sampling was used in this work, using the GibbsLDA++ implementation). For the experiments in section 4.3.3, three separate LDA models were learned, each of which was assigned 100 latent topics. One model was trained on all of the stories in the training corpus (SLDA), one on all of the non-stories in the training corpus (NLDA) and one model on all of the documents in the training corpus (ALDA).

The probability distributions that are learned by LDA can be used in many different ways to generate features that might be useful in classifying text.
The first method of generating features from the LDA models was simply to apply the ALDA model to each document in order to obtain the maximum likelihood topic assignment for each word. Once these topics were found, the tri-gram distribution over these topics was used as the feature representation. The second method applied all three LDA models to each document and used each word/model pair as the base feature. The feature's value was specified as the probability p(w|z) from the corresponding model. Similarly, the third method used each topic/document pair as the base feature and used the probability p(z|θ) as the feature's value. The fourth method computed the probability of the document using SLDA and NLDA and added several features based on these values, such as the raw probability and the difference between the two.

4.3.3 Chapter Experiments & Results

Much like the previous story identification work, the approach taken here was to treat the problem as a binary classification task. To assess the quality of the different feature sets, the gold standard corpus described in section 4.3.2 was split into a development, training, and test set. The development set consisted of 250 weblog entries, the training set consisted of 3985, and the test set contained 750. All engineering and debugging was performed on the training and development sets until all the feature sets were completely worked out.

Support Vector Machines and Maximum Entropy are typically the most common classifiers used in natural language processing applications. However, recently several online algorithms have been proposed that rival the performance of these popular models, but are extremely efficient to train and apply. A set of experiments was run to compare the performance of several different classification algorithms on the feature sets described in the previous section. Similar to the experiments run in section 3.6, the Perceptron [111], Perceptron with uneven margins [69] and Confidence-Weighted linear classifiers [37] were used.

Unsurprisingly, the standard Perceptron had the lowest performance on nearly every feature set. As is common with linear classifiers, as the number of features tends to increase so does the performance of the classifier. However, even the best performing feature set (i.e. 20, which includes all lexical and semantic features except those induced from the LDA topic models) only achieves a precision, recall and F1-score of 0.467, 0.300 and 0.365 respectively. A full summary of the results obtained using the Perceptron algorithm can be seen in table 4.1.

As mentioned in section 3.6, there are several simple modifications to the Perceptron algorithm that can dramatically improve its performance. In these experiments a learning rate parameter, an initial bias and a separate margin for positive and negative examples were implemented. The same type of hill-climbing grid search as described in section 3.6 was performed to find the parameters that maximized the 10-fold cross-validated F1-score on the training data. The results of several of the more promising feature sets are presented in table 4.2. Many of the results are now better than before, and the best performing feature set (i.e. 22, a combination of all the features) is able to reach a precision, recall and F1-score of 0.536, 0.429 and 0.476.

Overall, the Confidence-Weighted linear classifier performed the best.
For each feature set, it was among the highest performing systems and was also able to achieve the highest accuracy amongst all the classifiers (with feature set 20), reaching a precision, recall and F1-score of 0.591, 0.414 and 0.487. A complete summary of the results of this classifier can be seen in table 4.3.

It is disappointing, but not surprising, that tri-grams perform so well, since they are a strong baseline in many natural language processing tasks. It is a little surprising that entity-grid and number-grid features are such poor indicators of genre, although number-grids are slightly better, which was the expectation. It is encouraging that at least one of the topic modeling approaches seemed to produce relatively discriminative features, although it is somewhat perplexing that a combination of these features actually seems to hurt performance.

Table 4.1: Performance using the standard Perceptron with the default parameters.

                              Training (CV)                   Testing
  Features              ID    P      R      F1     κ         P      R      F1     κ
  Lexical
    1-Gram               1    0.161  0.757  0.222  0.080     0.124  0.829  0.216  0.064
    2-Gram               2    0.117  0.789  0.203  0.057     0.121  0.814  0.210  0.057
    3-Gram               3    0.107  0.819  0.188  0.036     0.115  0.829  0.202  0.045
  POS
    1-Gram               4    0.212  0.740  0.313  0.187     0.283  0.243  0.262  0.192
    2-Gram               5    0.238  0.735  0.335  0.218     0.225  0.486  0.308  0.206
    3-Gram               6    0.233  0.760  0.330  0.209     0.186  0.586  0.282  0.163
    4-Gram               7    0.241  0.762  0.334  0.215     0.182  0.700  0.289  0.165
  Syntactic
    Dep. Rel (B)         8    0.159  0.740  0.260  0.135     0.161  0.771  0.267  0.138
    Entity Grids         9    0.098  0.802  0.174  0.015     0.000  0.000  0.000  0.000
    Number Grids        10    0.140  0.577  0.217  0.079     0.375  0.043  0.077  0.059
  Misc.
    Hand Crafted        11    0.090  0.900  0.164  0.000     0.063  0.014  0.023  -0.012
    HTML                12    0.227  0.790  0.251  0.108     1.000  0.129  0.228  0.211
  LDA
    3-Gram              13    0.170  0.860  0.249  0.099     0.474  0.129  0.202  0.169
    P(w|z)              14    0.485  0.532  0.428  0.359     0.226  0.443  0.300  0.200
    P(z|θ)              15    0.607  0.060  0.108  0.094     0.111  0.757  0.193  0.036
    P(w,z,θ)            16    0.820  0.106  0.148  0.134     0.144  0.300  0.194  0.078
  Combos
    3,6                 17    0.218  0.785  0.322  0.196     0.156  0.757  0.259  0.124
    3,12                18    0.225  0.812  0.249  0.104     0.900  0.129  0.225  0.206
    3,6,11,12           19    0.227  0.836  0.259  0.114     0.786  0.157  0.262  0.238
    8,9,10,19           20    0.167  0.924  0.236  0.074     0.467  0.300  0.365  0.315
    13,14,15,16         21    0.238  0.818  0.227  0.084     0.500  0.286  0.364  0.317
    20,21               22    0.304  0.753  0.341  0.232     0.281  0.414  0.335  0.252

Table 4.2: (a) Performance of the standard Perceptron after finding parameters that maximize the F1-score on the training data (10-fold cross-validation) and test data. (b) The associated parameters for each model.

  (a) Performance using the standard Perceptron with parameters optimized for F1-score.

                              Training (CV)          Testing
  Features              ID    P      R      F1       P      R      F1     κ
    3                    3    0.38   0.66   0.46     0.149  0.730  0.248  0.110
    3,6                 17    0.25   0.75   0.34     0.130  0.129  0.130  0.041
    3,12                18    0.37   0.45   0.38     0.273  0.386  0.320  0.236
    3,6,11,12           19    0.48   0.60   0.50     0.453  0.343  0.390  0.337
    8,9,10,19           20    0.56   0.52   0.53     0.463  0.271  0.342  0.294
    13,14,15,16         21    0.433  0.524  0.470    0.257  0.414  0.317  0.228
    20,21               22    0.642  0.512  0.564    0.536  0.429  0.476  0.429

  (b) Parameters for each model that achieved the highest average F1-score over a 10-fold cross-validation on the training set.

  Features              ID    Rate   Bias     −Margin  +Margin  Iters
    3                    3    1.28   196.88   0.00     148.44   132
    3,6                 17    0.01   193.75   0.00     146.88   71
    3,12                18    0.98   198.44   4.69     129.69   102
    3,6,11,12           19    0.69   199.61   3.52     3.52     115
    8,9,10,19           20    1.37   193.75   25.00    137.50   145
    13,14,15,16         21    0.25   187.50   0.00     143.75   50
    20,21               22    1.37   187.50   25.00    62.50    50

Despite the failure of these features
to obtain higher performing results it is probably not wise to discount this approach for genre classification in general, because the dataset (about 4000 documents total and only 267 stories) is far too small to learn a reliable LDA model.

4.4 Chapter Summary

These results suggest that using a Confidence-Weighted linear classifier is the most appropriate choice for building the automatic story identification system used to populate the story corpus. However, it is less clear which set of features will result in the highest quality corpus. The single most predictive feature is lexical tri-grams, but the combination of all lexical and syntactic features was able to achieve a slightly higher F1-score because of the better balance between precision and recall. Although the precision of feature set 20 was slightly lower, this was offset by an increase in the recall, which is more important for our purposes.

Table 4.3: Performance using the Confidence-Weighted linear classifier with the default parameters.

                              Training (CV)                   Testing
  Features              ID    P      R      F1     κ         P      R      F1     κ
  Lexical
    1-Gram               1    0.382  0.523  0.423  0.352     0.417  0.429  0.423  0.362
    2-Gram               2    0.550  0.527  0.525  0.474     0.547  0.414  0.472  0.425
    3-Gram               3    0.590  0.533  0.543  0.495     0.628  0.386  0.478  0.438
  POS
    1-Gram               4    0.222  0.684  0.288  0.164     0.124  0.743  0.212  0.062
    2-Gram               5    0.339  0.481  0.383  0.303     0.179  0.657  0.281  0.158
    3-Gram               6    0.371  0.578  0.417  0.340     0.265  0.514  0.350  0.258
    4-Gram               7    0.362  0.622  0.415  0.334     0.319  0.529  0.398  0.319
  Syntactic
    Dep. Rel             8    0.600  0.343  0.432  0.387     0.543  0.271  0.362  0.320
    Entity Grids         9    0.238  0.271  0.242  0.154     0.185  0.071  0.103  0.054
    Number Grids        10    0.129  0.588  0.209  0.072     0.118  0.414  0.184  0.044
  Misc.
    Hand Crafted        11    0.254  0.225  0.231  0.164     0.053  0.029  0.037  -0.031
    HTML                12    0.464  0.410  0.432  0.373     0.453  0.343  0.390  0.337
  LDA
    3-Gram              13    0.369  0.463  0.396  0.324     0.280  0.200  0.233  0.169
    P(w|z)              14    0.414  0.353  0.368  0.306     0.173  0.443  0.249  0.133
    P(z|θ)              15    0.282  0.623  0.361  0.261     0.333  0.400  0.364  0.291
    P(w,z,θ)            16    0.920  0.138  0.202  0.183     0.145  0.300  0.195  0.079
  Combos
    3,6                 17    0.444  0.607  0.488  0.423     0.391  0.486  0.433  0.368
    3,12                18    0.527  0.439  0.475  0.423     0.490  0.343  0.403  0.354
    3,6,11,12           19    0.543  0.497  0.515  0.464     0.510  0.386  0.440  0.340
    8,9,10,19           20    0.565  0.515  0.533  0.484     0.591  0.414  0.487  0.445
    13,14,15,16         21    0.431  0.442  0.428  0.364     0.316  0.171  0.222  0.168
    20,21               22    0.612  0.512  0.554  0.509     0.333  0.200  0.250  0.194

Figure 4.2 plots learning curves for a few of the better performing features to help choose between the best feature sets. The Marquardt-Levenberg nonlinear least-squares algorithm (using the gnuplot implementation) was also used to fit each set of data points to the curve a·log(x) + b and is plotted along with the raw data. The curves are quite steep and indicate that more training data would be beneficial in all cases. The experiments in section 4.3.3 used 1000 training examples for development and testing, which provides an additional 25% of training data for the final system to learn from. Unfortunately, these curves do not provide a definitive answer as to which feature set should be used. While feature set 20 achieves the best performance on all of the training data, feature set 3 has a slightly steeper curve suggesting that it may perform better once the additional training examples are included. However, the slope of feature set 20 is less steep, at least in part, because it performed better on less training data, which seems to unfairly put it at a disadvantage.
The variance is also quite high for all of the feature sets, especially feature set 3, adding only more uncertainty to the curve's prediction. Although the argument could be made for choosing feature set 3, ultimately feature set 20 was used. This decision was made due to the higher recall value, the reduced variance as the data set increases and because the syntactic parses generated in this process will also be useful in Chapters 5, 6 and 7.

Figure 4.2: Confidence-Weighted linear classifier learning curves (F1-score on the test data versus the percentage of training data used) for feature sets 3, 18 and 20, along with the best-fit curve for each. Extended to show the projected performance if the training and development data is used for training.

Before committing the final results of the extraction procedure to the database, several additional post-processing steps were necessary. Although a minimal amount of clean up was performed in section 3.4.1, these procedures do not address several issues related to the quality of the corpus from a usability standpoint. Since a sentence is the fundamental unit of the generation procedure, it is important that each sentence committed to the database be as complete as possible, even if this introduces errors when reconstructing an entire story.

Multiple sentence quotations (and parentheticals) are a good example of this problem. The sentence delimiter will, hopefully correctly, split these segments into their appropriate parts. However, this leaves the first and last sentence with dangling quotation marks, which are disconcerting to the user when displayed by themselves. To address this problem, quotations were closed when they could be determined using a set of deterministic rules and left open when they could not (in the future I believe it would be better to simply remove dangling quotations if the rules failed to close them). A similar set of rules was used to close brackets and also handle simple cases when more than one type of bracket was left open.

A significant amount of noise is included in the raw stories in the form of excessively repeated characters. These typically come in two forms: those used for visual formatting and those arising from intentional misspellings used for linguistic flair. The first type are usually single non-word characters such as '-' or '*' being used to delineate content on a page. However, there is no limit to the creativity of people in finding ways to introduce strange repetitive character sequences into their blogs. The other kind involves words like 'cool' being spelled 'coooooool', usually for dramatic effect. Although these issues are rather different, they were handled in a similar manner. First, any sentence that did not contain at least one purely alphabetic word was completely removed. Then a sequence of regular expressions was used to shorten repeating character sequences and then repeating word sequences (the matching expressions used are not strictly regular, but most modern regex libraries support some form of memory and lookahead that allow these patterns to be used). A character sequence, other than a '.', repeated more than twice was replaced by the same sequence repeated only twice. Similarly, periods were handled as a special case to accommodate ellipses.
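The following is a minimal sketch of the character-level clean-up just described, using Python's re module with backreferences. It handles only single-character runs and uses assumed thresholds; the system's actual patterns (including the word-sequence rule discussed next) are more involved.

```python
import re

def squash_repeats(sentence):
    """Collapse a character (other than '.') repeated more than twice down to two
    copies, and cap long runs of periods at a three-period ellipsis."""
    sentence = re.sub(r"([^.])\1{2,}", r"\1\1", sentence)   # e.g. 'coooool' -> 'cool', '***' -> '**'
    sentence = re.sub(r"\.{4,}", "...", sentence)           # '.......' -> '...'
    return sentence

def has_alphabetic_word(sentence):
    """Sentences containing no purely alphabetic word were removed entirely."""
    return any(token.isalpha() for token in sentence.split())

print(squash_repeats("That was sooooo coooool....... ***"))  # That was soo cool... **
```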
One other regular expression was used to deal with repeating word sequences such as: "| click here | click here | click here |".

As a relic of both sentence boundary detection errors and human-induced mistakes, several other miscellaneous issues were also handled. For one reason or another, many of the extracted sentences did not start with a capitalized letter. Regardless of the original intent, all sentences were modified to begin with a capitalized word. Similarly, the use of all lowercase words has become commonplace in informal English writing, and the easy case of uppercasing 'I' was taken into account. Analogous to the first word capitalization problems, not all sentences in the corpus end correctly with a punctuation mark. This case was slightly more complicated to deal with, because not all sentences should, strictly speaking, end with a punctuation mark. For example, sentences that end with a quotation might legitimately have the punctuation mark within the quotes, causing the sentence to end in a '"' character. A few simple rules were authored to look for special cases and add a period to the end of the sentence if necessary.

Using a model trained on all of the annotated data available, the story identification process and preprocessing steps described above were applied to the 25 million English language weblogs present in the Spinn3r.com corpus. 6.03% of the entries were extracted, resulting in 1,605,480 identified stories with an average length of 26.24 sentences. Of the 1.6 million stories, 80,142 were held out for development purposes and as independent training data for various types of models that will be used in Chapter 6. After excluding these stories, 1,525,338 stories were included in the final dataset used by the system.

Chapter 5
Retrieval

The story corpus described in Chapter 4 implicitly contains a wealth of information about human psychology and sociology. As humans, we are easily able to read these stories and learn about the goals people find important and the types of activities they commonly engage in. Even beyond our own experiences, these narratives also tell us what events constitute a particular activity (or goal), the consequences of these events, what order they occur in, and what happens when something goes wrong. Given this vast amount of collective knowledge and the ease with which we tap into it, it is easy to forget how difficult it is to represent natural language formally in a machine readable format suitable for any type of inference.

The concepts and relations implicit in natural language are difficult to symbolize formally. Even when possible, formal reasoning with the huge range of predicates and relations involved would be intractable without major simplifying assumptions. However, Say Anything does not require the full inferential power of a formal logic such as predicate calculus. Instead, it must only be able to understand enough about the previous chain of events in order to make a prediction about what will happen next. While this is still a daunting task, the primary inferential mechanism can essentially be reduced to forward chaining over temporal and causal relationships, which also greatly simplifies the transformation from natural language to a more suitable machine readable format.
81 An ideal characterization of these causal and temporal relations would be discrimi- native enough to operate at the level of event structure. In fact, various discourse theories have been used to annotate several corpora, such as the Rhetorical Structure Theory Dis- course Treebank [21], the Penn Discourse Treebank [87] and TimeBank [106], that try to capture these relationships at this level of detail. Unfortunately, all of these corpora suf- fer some severe limitations that prevent their use for this task. First, despite the success of many complicated automated linguistic tasks, such as syntactic parsing, the perfor- mance of automated systems on recognizing discourse structure is greatly reduced. This is at least partly due to lower agreement between annotators, but also because the depen- dencies between relations often occur between a much greater distance. Second, only a limited amount of annotated data is available and is virtually all news genre text. During a preliminary study, we found that a discourse parser trained on an existing news genre corpus suffered a dramatic loss in accuracy when applied to weblog data. Given the already low performance on the in-domain data, the results on the out of domain web data were not acceptable for this work. Instead of a fine-grained representation of event structure, two simplifying assump- tions are made to coarsely model the necessary relationships. First, it is assumed that narratives are told in the temporal order in which the events occur. Although this seems to be true on average, it is not of much concern when stories violate this assumption, since the actual goal of the system is to predict what will happen next in the story and not what would happen next in reality. The other assumption is that events can be ade- quately represented by an entire sentence. While this is again an inaccurate depiction of the world, it adequately models the information in a format best suited to this applica- tion. 82 Given these two simplifying assumptions, reasoning and prediction are performed in a case-based manner, rather than a deductive first principles approach. If we can find a story segment that closely matches the sequence of events in the user’s story we will know what to do, not from any “deep” understanding, but because we have seen it before. The primary concern for this methodology is formally defining what constitutes “closely matches” and how we can find these story segments in an efficient way. The remainder of this chapter describes a complete first step in answering both of these questions. 5.1 Information Retrieval Finding information in large collections of data is a well studied area in Computer Sci- ence. Many efficient algorithms have been developed in the Information Retrieval (IR) community that solve large scale document retrieval problems in fractions of a second. Manning, Raghavan & Sh¨ utze [76] provide an overview of some of the fundamental algorithms that are widely used and many toolkits exist that implement these and other more sophisticated techniques. The Apache Lucene toolkit [50] was initially investi- gated for this purpose; however, the implementation provided by Terrier [99] was found to be far more efficient. Retrieval is accomplished by building an index over the docu- ment collection, in this case sentences, and then using a scoring function to rank a subset of these documents that match a given search query. 
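As a concrete, heavily simplified illustration of the index-then-rank pipeline just outlined, the following sketch builds a small in-memory inverted index over a few invented sentences and ranks them with a PL2-style weight (defined formally below). It is a toy stand-in for the Terrier implementation actually used, and the average document length here stands in for Terrier's standard length normalization.

```python
import math
from collections import Counter, defaultdict

class ToyIndex:
    """A small in-memory inverted index with PL2-style scoring (a stand-in for Terrier)."""

    def __init__(self, sentences, c=1.0):
        self.docs = [s.lower().split() for s in sentences]
        self.c = c
        self.n_docs = len(self.docs)
        self.avg_len = sum(len(d) for d in self.docs) / self.n_docs
        self.postings = defaultdict(dict)                 # term -> {doc_id: term frequency}
        for doc_id, tokens in enumerate(self.docs):
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf

    def _weight(self, term, doc_id):
        tf = self.postings[term][doc_id]
        # Normalize tf to a standard document length (here: the average length).
        tfn = tf * math.log2(1.0 + self.c * self.avg_len / len(self.docs[doc_id]))
        lam = sum(self.postings[term].values()) / self.n_docs   # expected corpus frequency
        return (1.0 / (tfn + 1.0)) * (tfn * math.log2(tfn / lam)
                                      + (lam - tfn) * math.log2(math.e)
                                      + 0.5 * math.log2(2.0 * math.pi * tfn))

    def search(self, query, k=3):
        query_counts = Counter(query.lower().split())
        max_qtf = max(query_counts.values())
        scores = Counter()
        for term, qtf in query_counts.items():
            for doc_id in self.postings.get(term, {}):
                scores[doc_id] += self._weight(term, doc_id) * (qtf / max_qtf)
        return scores.most_common(k)

index = ToyIndex(["I went to the beach yesterday",
                  "We drove to the lake last summer",
                  "The beach was crowded yesterday"])
print(index.search("beach yesterday"))   # returns the matching doc ids ranked by score
```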
An index is the key data structure that enables fast lookup of a document that matches a query of keywords or phrases. There are many variants of this data structure that provide additional features and flexibility for querying the indexed documents; however, for maximal efficiency, only a very basic variant was used in this work. The index is essentially created by passing over all the documents in a corpus and collecting crucial bits of information about each individual word. This information is used to build a table mapping each unique word in the corpus to the set of documents that mention it. To prevent recalculation each time, the total number of documents is stored separately in the index. Additionally, the number of times a word is seen in a document and the number of documents containing a particular word are also stored with each entry. The entries in the index are then sorted alphabetically, allowing the index to be stored on disk in a format efficient to search.

Many scoring models are available in Terrier and the default PL2 scoring function was used for its generally good early precision performance (early precision is any number of precision measures that consider the top k documents, where k is small). PL2 is a divergence from randomness method, which weights a term based on the inverse probability that it will occur in a document given some model of randomness. The exact formula for computing the score is derived from the base model of randomness, Poisson for PL2, and several normalization factors. The first normalization tries to reduce the influence of rare words by including a balancing risk factor that penalizes according to the information gain of the term. In addition to smoothing out the weight of rare words, term frequencies are also renormalized to a standard document length to help provide consistent scores across different queries. The exact formula for computing term weights for a document is given below:

w(t, d) = 1 / (tfn + 1) · ( tfn · log₂(tfn / λ) + (λ − tfn) · log₂ e + 0.5 · log₂(2π · tfn) )    (5.1)

tfn is the normalized term frequency, according to the standard document length (sl) and original document length (dl), defined by the formula tfn = tf · log₂(1 + c · sl/dl), where c is computed automatically by the method given in He & Ounis [58]. λ is the expected frequency of the term, given by its total count in the corpus divided by the total number of documents in the collection. A complete document score is then computed using the simple formula:

score(d, Q) = Σ_{t ∈ Q} w(t, d) · qtw    (5.2)

where d is the proposed document and Q is the set of query terms, w(t, d) represents equation 5.1 and qtw is the relative frequency of the term in the query between itself and the most frequent term.

5.2 A First Generation Model

The information retrieval techniques, along with the database schema, provide the mechanics for the first story generation model. The pseudo-code for this generation method is given in Algorithm 1. The key steps of the algorithm are generating a set of keywords from the input sentence, finding the most similar sentences and returning a set of corresponding next sentences. Although the sentence could be used directly as a query to the Terrier index, the analyze method performs some useful preprocessing to improve the quality of the query as well as reduce the number of irrelevantly matched documents. Before anything else, the same post-processing steps that were applied in the story creation phase, described in section 4.4, were applied to the input sentence.
Next, the sentence was tokenized on word boundaries using the dependency parse to determine the appropriate places. A set of keywords was used to remove functional and common words that are likely to contribute very little to the score and would match a large percentage of the documents in the corpus. In addition to removing noise from the weighting function, this also improves the response time because many fewer documents match the query. For the opposite reason, proper nouns, as determined by the part-of-speech tagger, were also removed, because they are likely to receive a high weight for being rare, but often not for any meaningful reason. This decision is not without its drawbacks, however. In many other cases, such as celebrities, movie titles and other pop culture references, the name is what is important and the removal of such words is detrimental to the performance of the query.

Algorithm 1: Simple IR-based generation algorithm
  Input: UserSentence, the most recently submitted user sentence
  Output: Candidates, the set of candidate next sentences for the user's story
  begin
    Candidates ← {}
    KeyWords ← analyze(UserSentence)
    IndexEntries ← searchIndex(KeyWords)
    while IndexEntry ∈ IndexEntries and len(Candidates) < 10 do
      WeblogSentence ← findInDatabase(IndexEntry)
      if hasNextSentence(WeblogSentence) then
        NextSentence ← getNextSentence(WeblogSentence)
        append(NextSentence, Candidates)
      end
    end
    if isEmpty(Candidates) then
      return error
    else
      return Candidates
  end

The remaining tokens are lowercased and used as keywords to the Terrier retrieval component, which returns a ranked list of entries indicating the location of the corresponding sentence in the database. Using the information obtained from the index entries, each of these sentences is looked up in the database. A check is made to ensure that the retrieved item is not the last sentence in a document (and hence has no following sentence). This step could have been eliminated by omitting the final sentence of each story from the database; however, the index was kept as general as possible to allow future applications to use the same index for different purposes. If the check succeeds, then this next sentence is added to the list of candidates. Once all of the similar sentences are exhausted, or a maximum of 10 candidates are found, the results of the process are returned to the user. If for some reason no candidates could be found, an error is reported to the user and they are free to try again. A more detailed description of the underlying architecture and database schema that enable this algorithm, for those interested in recreating this type of system, is given in appendix 8.3.

5.3 Chapter Evaluation

Evaluating how well a particular story generation model performs is a difficult task, because it is reliant on a large-scale user study. It requires both a broad sample of stories to be written by numerous unique individuals, and a semi-independent group of users to provide independent ratings of these same stories. Swanson & Gordon [126] and later Swanson & Gordon [127] showed the viability of a similar story generation approach using unigram and bigram queries (the algorithm in all of these approaches is identical; however, the data, data clean-up and index implementation all differed in the previous work). At the beginning of each story one of the competing generation models was randomly chosen behind the scenes and used for the duration of the story. As a baseline comparison, a generation model that took random sentences from the database regardless of the user's input was also included. The results of our experiments showed an improvement over the baseline, as well as bigrams over unigrams; however, the size of the study casts some doubt on the reliability of the conclusions.
Although it was one of the larger academic studies of any interactive narrative system, our evaluation relied on only a handful of volunteers composed of students, staff and faculty of the University of Southern California's Institute for Creative Technologies. As a consequence the results were based on only 101 stories split between three models and judged by 22 users who provided only 96 independent ratings.

The evaluation in this chapter is a two phase process closely following the previous work of Swanson & Gordon. The first phase involves collecting a large sample of stories written with one of the generation models and the author's own appraisal of each story he or she writes. The main screen of the interface presents the story authoring workspace, shown in figure 5.1a. At the top of the page is an initially empty panel containing the user's story as it develops. Below the story panel is a text box for typing in the user's sentence. Next to the text box are controls for submitting the sentence and ending the story. After the user has submitted their sentence to the system, a modal dialog box appears and takes focus of the screen (figure 5.1b). The up-to-date story is presented at the top and a choice of 10 sentences, ranked by the system from best to worst, is given at the bottom. While the dialog remains open the user may change their selection as many times as they would like; however, once they close the dialog, using a confirmation button, the sentence is permanently added to the story.

Figure 5.1: User interface for the writing component: (a) writing interface, (b) selection modal, (c) rating interface.

The user is presented with several choices for a couple of reasons. As a practical matter, none of the generation models discussed here and in the following chapters are likely to produce a single candidate that fits the story with sufficient consistency. As a result it would often be difficult for users to continue their stories in a meaningful way and the process would be more frustrating than it's worth. However, even though the system cannot always (or even usually) return the single best sentence to the user, it can often find at least one sentence out of 10 that is acceptable. Giving the users this choice greatly reduces the chance of total failure and enables the users to continue their stories for a longer period of time.

While deficiencies of the generation models certainly play a role in finding a single best continuation, it is also a matter of people's personal interests, which are as varied as their experiences. Even if the generation model is able to find a single sentence that reasonably advances the story, there is no guarantee that a particular user will find the sentence compelling enough to use it. However, there are often several choices within the set of 10 alternatives that are equally coherent, even if none of them are ranked the
This par- ticular sentence is well represented in the collection of weblog stories and consequently nearly all 10 of the computer responses are highly appropriate 5 . With the generation model described in chapter 6, at least four distinct types of sentences were presented. One type continued the fantastical fairytale theme, exemplified by the candidate: There was a fairy that lived in the North. Another type maintained the fairytale archetype, but discarded the mythical motif, as in: A recent graduate from SJSU School of Educa- tion, Masters in hand, I sought employment. A third type of sentence used the fairytale archetype, but was ambiguous as to the mythical nature of the world, illustrated in this platitude: While I’m walking towards happily every after. Finally, there was a fourth sentence type that barely, if at all, followed the standard fairytale template, such as this returned sentence: My dating history (what little there is of it, since I met DH when I was 19) is filled with musicians. Each of these possibilities offers a unique direction to take the story and it is up to the user whether they want to pursue a tried and true path, or be adventurous and go with something a little more creative. So, these choices not only smooth over problems of poor generation, but also provide a beneficial game-play element that helps give the feel of a Choose Your Own Adventure book pioneered by Edward Packard [101]. Once user has completed the minimum number of turns writing and selecting sen- tences they are asked to assess the quality of their story on several commonsense criteria. There are several areas where the system typically has trouble, which these questions try to measure. One of the primary goals from a natural language processing perspective is to model and generate coherent text. There are many ways in which the system can 5 Although some generation models do better than others 90 fail to do so, but the two most common problematic cases are pronoun, name and verb agreement (syntactic/structural problems) and temporal/causal inconsistencies (seman- tic problems). From an interactive entertainment perspective it is also essential to deter- mine how much the user enjoys the process of creating his or her narrative. Finally, from a user interface point of view, it is important to measure the usability of the sys- tem. With these issues in mind, the users were asked to rate the following questions on a scale from 1 (bad) to 5 (good): 1. Does the story make sense? (Coherence) 2. Is the story believable? (Believability) 3. Did you have fun writing the story? (Entertainment) 4. How easy was the story to write? (Usability) The first question attempts to get the user’s overall impression of how coherent the story is. The second question is similar, but is trying to bias away from structural coherence problems and focus more on the semantic issues the system may introduce. The third question directly asks the user whether he or she enjoyed the experience or not. The fourth question is geared toward gauging the usability of the system. The actual interface provided to the users is shown in figure 5.1c Although other metrics evaluating the narrative qualities of the story, such as charac- ter and plot development, would also be interesting, there are several reasons these are not appropriate at this time. One reason is that the expected length of an average story will be too short to categorize segments of the discourse into meaningful formal narra- tive constructs. 
Additionally, there is no requirement that the user has any significant creative writing experience and so their ability to form theoretically sound narratives is unlikely to be consistent. Finally, it is already very difficult for the computer to respond coherently and in a topical manner. It would be exceedingly difficult to try to solve all 91 the problems at once and including too many criteria could muddy the interpretation of the more fundamental results. After the first phase is completed, a second group of raters evaluates the stories independently on a similar set of criteria given below: 1. Does the story make sense? 2. Are the events in this story plausible, given the setting the writer established? 3. Is this story entertaining? In addition to a slight modification in the evaluation criteria from the previous studies, this phase also introduces a new baseline that attempts to provide a different kind of controlled comparison. In addition to the stories written with the system in the first phase, a handful of human authored weblog stories are also randomly included in the evaluation set for the second phase 6 . Although the human authored stories should pro- vide an upper bound on the level of coherence for our hybrid stories, it should not limit our expectations in terms of how entertaining the stories are to read. The remainder of this section will describe Amazon’s Mechanical Turk and how it was used to obtain more than an order of magnitude more stories and more than 100 times more independent user ratings. 5.3.1 Crowd Sourcing With Mechanical Turk There are many tasks that are trivially easy for a human to complete, but are still too difficult to automate reliably, for example identifying arbitrary objects in an image or transcribing conversational audio recordings. Traditionally obtaining this data was done by an individual, a small group of volunteers or a temporary workforce. Unfortunately, 6 The human authored blog stories are taken from the held out set of automatically identified story corpus described in section 4.4. 92 there is usually too much work for a small group of volunteers to gather a sufficient amount of data. On the other hand, there is usually not enough work to justify the high cost of hiring a temporary work force and so many times these projects are abandoned altogether. Amazon created the Mechanical Turk website to solve this problem as “a market- place for work that requires human intelligence”. It is a centralized cyber-location where requesters post work to be done (a Human Intelligence Task) and workers com- plete the jobs for a small monetary reward. Requesters are free to offer whatever they believe is a fair price for the task and workers can choose to work on HITs that interest them. Upon a worker submission, a requester can accept or reject the work done by the worker based on criteria of their own choosing. The acceptance rate of each worker is tracked as part of their profile and used as a quality measure in a similar way positive and negative feedback is used on auction sites, such as eBay. For mediating this exchange, Amazon charges a 10% commission to the requester, or $0.005 if the reward is less than 5 cents. HITs can be created by requesters using several different methods provided by Ama- zon. The web services API and command line interface provide the greatest control and the most options, however, the web based interface, is the simplest method for creating a basic HIT and is sufficient for the purposes of this evaluation. 
Setting up a HIT using the web based interface is a three step process. In the first stage several of the basic properties of the job are defined, such as the monetary reward for completing the task. A title, description and keywords must also be specified to help give the worker an idea of what the HIT is about and to allow them to search for HITs they might be interested in completing. There are several other properties that can be defined to help restrict which users can work on the task. For example, qualifications can be set to restrict the HIT to workers whose acceptance rate is higher than a specified threshold. A few other basic qualifications can also be specified using the web based form, such as the geographic location of the worker and how many unique workers can complete the same HIT (custom qualifications can be defined using either of the other, more advanced APIs).

The presentation of the HIT is designed in the second stage of the creation process. Designing a HIT is accomplished using a subset of HTML, including the basic formatting tags and form elements to collect the user's input. Amazon also supplements the basic HTML with a simple syntax for introducing variables. These variables allow the layout to be used as a template for many different individual HIT instances. If graphics, video or other more complex designs are necessary, then the other APIs provide some protocols for accessing data from external servers. It is also possible to simply direct the worker to an external site or, in some cases, have them perform a task in the real world.

The final stage consists of uploading any required data that is used to fill the variables defined in the previous step. The data format is a simple comma-separated values file with a single header row whose columns match the variable names; each subsequent row corresponds to a single HIT instantiated by the given values.
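As a concrete illustration of the upload format, the sketch below writes a batch input file for a hypothetical HIT layout. The column name, batch size and code format are invented for this example; Mechanical Turk layouts typically reference such a column through a placeholder like ${login_code}.

# Sketch: generate a batch input file, one row per HIT instance.
# "login_code" is a hypothetical column; the HIT layout would reference it
# with a placeholder (e.g. ${login_code}) in its HTML.
import csv
import random
import string

def random_code(length=8, rng=random):
    """An 8-character alphanumeric code like the ones attached to each HIT."""
    return "".join(rng.choice(string.ascii_uppercase + string.digits) for _ in range(length))

with open("story_authoring_batch.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["login_code"])   # header row must match the template variable
    for _ in range(100):              # a batch of 100 HITs
        writer.writerow([random_code()])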
5.3.2 Story Authoring HIT

To gather a large sample of stories, a HIT was created using the web based interface and posted for workers to complete. The design of the HIT was very simple, including a brief set of instructions at the top and a link directing the worker to the external site where the story writing interface was hosted. Although the process of setting up a HIT described in section 5.3.1 is extremely simple and straightforward, several precautions must still be taken to avoid some potential hazards. The primary concern is preventing spam and other deceitful actions workers undertake to complete the task without actually fulfilling the requirements in good faith. Despite Amazon's approval rating scheme, it is very easy to be overrun by unscrupulous users who flood the results with invalid data.

A few additional elements were included in the simple design in order to reduce the chance of being inundated by bots and spammers. Since the monetary reward is generally relatively low and the number of identical HITs is also limited, very simple protection schemes can be effective, because it is not worth the effort to crack even the simplest of tactics. In this case, two 8-character random alphanumeric strings were associated with each individual HIT. One was given to the user up front as a login code to gain access to the writing interface. The other was returned to the user when they completed writing their story, with instructions to copy it onto the form before submitting the HIT. To prevent users from writing extremely short (or empty) stories, an eight sentence (4 turn) minimum requirement was imposed before the user could end their story (and get the success code). Requiring the use of these codes adds just enough manual effort that a user will probably not accept the HIT unless they really want to do the task, but is easy enough not to be overly burdensome for genuine workers. The codes also provide an easy way to automatically verify which story a worker has written, or whether they have written one at all. Figure 5.2 is a screenshot of the actual HIT design.

Figure 5.2: Story authoring HIT design.

A few small batches of HITs, each limited to 10 stories, were published to the Mechanical Turk website. These batches were used for a small pilot study to find an appropriate reward and set of qualifications. To start, a reward of $0.20 and a 98% approval rating restriction were tried. These initial HITs were completed so quickly that the monetary reward was eventually reduced to $0.12 and the approval rating filter was dropped to 96% without any noticeable decrease in quality (in hindsight, the monetary reward could probably have been cut in half or more). With these specifications in place, larger batches consisting of 100 HITs each were posted. Although all of the HITs could have been uploaded in one batch, it was discovered that running several smaller batches in parallel increased the throughput without having to increase the monetary reward. In general, three batches were posted in parallel and new batches were manually uploaded as previous ones completed. In total 13 batches were completed.

It would have been possible, using the Mechanical Turk command line API, to write a script that automatically accepted or rejected the submitted HITs based on whether the login code matched the success code on the submission form. However, automatically determining whether a story was written in good faith is extremely difficult. Legitimate users also take their acceptance rating very seriously and can get justifiably upset when they are rejected unfairly, which can easily happen with an automated script. Fortunately, the web-based interface made it easy to manually detect the conspicuous frauds, because the result code was either obviously incorrect or not present at all. To ensure that real stories were actually being written, a large sample of the stories was read by hand. For the most part, nearly all of the submitted stories seemed to be genuinely attempted, and a large number of invalid stories are not expected in the dataset. Unfortunately, there were a few cases of abuse, and the access codes were cross-checked to identify the offending users. These users were then banned from working on any additional HITs and their stories removed from the collection.

In less than one week, 601 stories written with the unigram index and 567 stories written with the bigram index were collected. Although 1,300 HITs were published to Mechanical Turk, only 1,168 were usable as evaluation data. Some of the 132 stories missing from the total were rejected for failing the requirements (e.g. clearly spam or not done in good faith). However, due to a browser compatibility bug, a small percentage of the user logins did not properly record the access code. Even though these users could successfully write a story with the system, the author's ratings could not be traced back to their story.
Without this information it was not possible to use this data in the first phase of the evaluation and were also omitted from the second phase for consistency. 97 5.3.3 Story Rating HIT The previous section describes the method used for acquiring the large sample of stories needed for the first phase of the evaluation. This section describes the analogous process for acquiring a set of ratings that are independent of the story’s author. As before, a Mechanical Turk HIT was created to gather a sufficient number of ratings. The layout of this HIT is again extremely simple. The instructions and candidate story were located near the top of the page and the required questions were located below the story towards the bottom. Completing the design to prevent untrustworthy users and data is more difficult in this case, however. Determining whether a particular rating is a true reflection of a person who has actually read the story is extremely difficult. What is rated a 1 for one person may legitimately be a 5 to another. Although there will always remain some doubts about the validity of any particular rating, a few strategies are described below that help minimize the influence of dishonest workers and spurious ratings. One way to minimize the amount of illegitimate data is to maximize the quality of the workers performing the task. The easiest, but relatively ineffective, way to do this is using Amazon’s approval rating filter, which was set to 97% for this HIT. Another fairly simple way to discourage dishonest workers is the use of a protection mechanism, like the access codes in the previous section. However, designing an effective scheme for this type of HIT is also much more difficult, because the task is so easy (and lower paid), any significant amount of extra work would likely ward off a substantial number of legitimate workers. As a simple mechanism, an objective question, that is easily generated automatically, was also posed to the worker in addition to the 3 subjective judgment questions. The form of this question was simply, “What is the n th word of the story?”. The correct answer, along with 5 additional random words from the story, 98 were presented to the worker in a random order. n was kept small so that counting the number of words was not excruciatingly tedious. Although this protection mechanism cannot prevent dishonest workers from completing the task “correctly” without actually reading the story, it does prevent them from clicking through as fast as they can without any way of knowing they at least read the directions. In this vein, the set of radio buttons included a default choice labeled Change me to prevent workers from simply clicking the submit button. Figure 5.3 presents a screenshot of the actual form given to the Mechanical Turk workers. Another way to reduce the impact of spurious ratings by unprincipled workers is through redundancy. It is simple, using the web-based HIT design wizard, to specify a maximum number of unique workers that can complete the same HIT, which can be used to gather multiple independent ratings per story. It is not clear what the optimum number is for any given task, but other research has shown that the quality of judgments improves, sometimes reaching the level of expert annotators, as more redundancy is added [20]. Even though some of the ratings may be arbitrary, averaging over all the ratings of a story will hopefully still reflect the intent of the genuine workers. 
For this analysis 8 workers were assigned to complete each HIT and were paid $0.01 for their opinion. Although the layout of the HIT and protection mechanism were designed to easily automate the process of accepting and rejecting worker submissions, several considera- tions discovered in a small initial experiment prompted the decision to manually oversee the process once again. Despite the seeming simplicity of the form, workers who were clearly performing the task genuinely 9 , would still fail to fill out all of the required items with a much greater frequency than expected. Given the relatively high number of 9 For example, leaving comments about a story with relatively high frequency, even though it was not a mandatory requirement, was a strong indication. 99 Figure 5.3: Story rating HIT design untrustworthy workers, it is important to have as many “known” good workers as possi- ble to complete the task. To keep these good workers, it is a good idea to accept, and pay for, their submissions even when they do not complete all of the specified requirements. Similarly it is important to block known bad workers. Although there are a couple of fairly obvious indicators of deceitful behavior, it is not trivial to implement an automated script that interprets these indicators with a high degree of accuracy. 100 One way of assessing the integrity of a user is by examining how long it takes them to complete an individual HIT. The task is not complicated and most of the stories are very short, so it should not take the worker much time to complete. However, even though a worker may form their opinions almost instantaneously, there is some minimal amount of time necessary to actually read the story. It is difficult to know a priori how long it takes a worker to read any given story, but there are a couple of ways an educated guess can be made. One way is to estimate from experience and going through the rating process personally several times. A more empirical way is to use the mean and standard deviation as a guide after a sufficient number of HITs have been completed by real workers. Another useful heuristic to consider is the number of errors made on a HIT and the rate at which a worker’s submissions contain errors; however, care must be taken to not be overly aggressive for the reasons mentioned in the previous paragraph. In future projects the heuristics described above could be used as the basis of an automatic approval script. However, legitimate users take their HIT approval rating very seriously and one should take extra care to avoid automatic rejection based on rigid thresholds. Instead, a better approach would be a tiered system that first issued paid warnings, followed by unpaid warnings and then ultimately blocking consistent abusers. This type of approach would rely on the redundancy to filter general noise, but preventing prolific spammers from spoiling large chunks of the data pool, while still allowing some flexibility for legitimate mistakes by honest workers. Although automating the process would be possible, the design of the HIT, along with the heuristics described above, allowed for easy visual verification of the data using Amazon’s web-based interface. Since several hundred submissions could be verified in only a few minutes, the effort to write a robust automatic filter did not seem justified. 
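Although no automatic filter was ultimately used, a rough sketch of what one might look like is given below. It combines the objective check question and the completion-time heuristic described above; the record field names, the reading-time floor and the one-standard-deviation cutoff are assumptions made for illustration, not parameters of the actual study.

# Sketch: build the objective check question and flag suspicious rating HITs.
import random
import statistics

def build_check_question(story_words, n=3, num_distractors=5, rng=None):
    """'What is the nth word of the story?' with the answer plus random distractors."""
    rng = rng or random.Random()
    answer = story_words[n - 1]
    pool = [w for w in story_words if w != answer]
    options = rng.sample(pool, min(num_distractors, len(pool))) + [answer]
    rng.shuffle(options)
    return {"prompt": f"What is word number {n} of the story?",
            "answer": answer, "options": options}

def flag_suspicious(submissions, min_seconds=15.0):
    """Flags ratings that missed the check question or were submitted implausibly fast."""
    times = [s["work_seconds"] for s in submissions]
    floor = max(min_seconds, statistics.mean(times) - statistics.pstdev(times))
    return [s for s in submissions
            if s["check_answer"] != s["expected_answer"] or s["work_seconds"] < floor]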
Manually reviewing the results allowed for more flexibility in the approval process and 101 gave valuable intuition about nefarious workers; however, this could also raise questions about potential bias in the data. In this case it is unlikely to have severely altered the results, because nearly 95% of the HITs were accepted and more than half of the rejec- tions came from only a few individual workers. In other cases, it may be necessary to implement a stricter policy (e.g., through an automated script) to avoid even the appear- ance of bias, however this should not be a major concern for this work considering the high acceptance rate. Once the HIT design was finalized, batches of 400 stories, each allowing 8 work- ers for a total of 3200 HITs, were published to Mechanical Turk. The first batch was published alone and a new batch was started each subsequent day until a maximum of three batches were running in parallel. In only 7 days, the total set of 15,384 HITs were completed with 14,404 being approved. 5.4 Chapter Experimental Results The previous section has described the user study that will be used to evaluate the two simple IR generation methods. It also explained how Mechanical Turk was used to crowd source a large number of workers to obtain more than an order of magnitude more data than in previous studies. This section will discuss the results of this new user study, examining both the author’s experience of writing a story and the outside reader’s perspective of the story artifacts. 5.4.1 The Author’s Own Assessment Table 5.1 shows the average scores given by the authors of their own stories, along with the standard deviation for each rating category. Similar to the previous work, the 102 bigram index outperforms the unigram model on all categories. The improvement is most pronounced for the Usability measure, however, everything other than Believability was markedly improved. The standard deviation is quite large among all the scores so a two-tailed t-test was performed to assess the statistical significance. The results were significant in all cases except Believability 10 . In addition to the required questions asked of the users, they were also given the opportunity to leave an open-ended comment if they wanted to. This space allowed users to describe any technical problems and also to leave any other remarks they had about the experience. Although most users left the text box empty, about 13% of the submissions did leave some form of comment. In general the remarks were very positive and helpful. At least 20% of the comments used the word fun to describe how much they enjoyed the game and many others expressed a similar sentiment in other ways. Not everyone shared these glowing reviews, however. A small percentage of users complained that it didn’t work at all and even that the idea itself was terrible. The other comments left by users were a mix of constructive criticism and actual technical problems encountered with the system. See appendix 8.3 for a complete tabulation of the comments left by the users. Although the workers were only asked to evaluate their story on four subjective questions, there are also a number of other objective criteria that can be collected in order to augment the evaluation. Table 5.2 provides several of these other statistics 11 . One measure of success is the length of the stories written using the model. 
Presumably, a model that returns better sentences will be easier and more fun to use, resulting in longer stories on average. In accordance with the Usability scores, users did in fact write slightly longer stories with the bigram model.

Table 5.1: Author rating results (mean ± standard deviation).
Model    # Stories  Coherence    Believability  Usability    Entertainment
Unigram  601        3.46 ± 1.11  3.53 ± 1.16    3.08 ± 1.19  3.99 ± 1.05
Bigram   567        3.63 ± 1.11  3.59 ± 1.20    3.27 ± 1.19  4.14 ± 1.02

(Footnote 10: p < 0.05 for Entertainment and p < 0.01 for Coherence and Usability. Footnote 11: none of the results in table 5.2 are statistically significant between the two models, p > 0.05.)

Another measure that is useful to consider is how often the user picks the top ranked sentence. If the model is doing a good job, then the first sentence presented to them should be a good candidate for their story and no other choices would have to be considered. On this measure, the bigram model also obtains a slightly higher percentage of top ranked sentences chosen, although the difference is extremely small and the standard deviation is as large as the mean. Similarly, the mean reciprocal rank (MRR) is a standard metric for evaluating processes that produce a ranked set of candidates in response to a query. The number reported in table 5.2 is the mean reciprocal rank of the user's selection over all the sentences in which the user did not choose the top ranked sentence. Interestingly, despite producing a higher percentage of top ranked sentences, the bigram model actually has a slightly lower MRR than the unigram model does.

The final objective measure used to evaluate the story authoring process is the average amount of time it took a worker to complete their story. The measurement times were obtained using statistics provided through the Mechanical Turk service, which tracks the amount of time between when a HIT is accepted by a user and the time it is submitted. The average total amount of time a user spent writing a story with the unigram model is about 7 minutes and 40 seconds, while it took about 8 minutes and 12 seconds using the bigram model.

Table 5.2: Story authoring statistics (mean ± standard deviation).
Model    Max Len  Avg Len      % Top        MRR          Time (s)        Time (s)/Sen
Unigram  27       9.41 ± 2.31  0.08 ± 0.09  0.36 ± 0.30  460.6 ± 411.8   44.9 ± 32.0
Bigram   25       9.50 ± 2.51  0.09 ± 0.10  0.34 ± 0.29  492.4 ± 463.7   47.9 ± 35.6

There are several concerns with the reliability of this measure, however. First, we would expect the total amount of time to complete a story with the bigram model to be a little longer, simply because users write slightly more sentences with this model. As an alternative, the average number of seconds to write a sentence is also considered. Unfortunately, this measure is also likely to be highly unreliable because of the extremely high variance in the data. The environment in which the workers perform the task is completely unknown, and it is assumed that the story writing task is not always completed in one continuous session. Workers could be doing the HITs at home, in a coffee shop or even at work. Regardless of where they do these tasks, it is likely that interruptions and distractions, such as telephone calls, prevent some workers from completing the task as quickly as they otherwise could. To get a more accurate assessment of how long it actually takes a worker to complete the task without any interruptions, a simple filter was used to remove outliers: anything beyond one standard deviation was discarded before calculating the time per sentence values reported in table 5.2. Although the difference is no longer as pronounced, it still took more time for people to write stories with the bigram model than with the unigram model.
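For readers who want to reproduce the objective columns of table 5.2, the following minimal sketch shows how the fraction of top-ranked selections, the MRR over non-top selections and the outlier-filtered time per sentence could be computed from logged data. The record fields (chosen_ranks, seconds) are hypothetical names, not the actual logging schema.

# Sketch: objective authoring statistics from per-story logs.
import statistics

def authoring_stats(stories):
    """stories: list of records with the 1-based rank chosen at each turn
    and the total authoring time in seconds."""
    ranks = [r for s in stories for r in s["chosen_ranks"]]
    top_fraction = sum(1 for r in ranks if r == 1) / len(ranks)
    # MRR is computed only over selections where the top candidate was not chosen.
    non_top = [r for r in ranks if r != 1]
    mrr = statistics.mean(1.0 / r for r in non_top) if non_top else 0.0
    # Time per sentence, dropping values beyond one standard deviation of the mean.
    per_sentence = [s["seconds"] / len(s["chosen_ranks"]) for s in stories]
    mean, sd = statistics.mean(per_sentence), statistics.pstdev(per_sentence)
    kept = [t for t in per_sentence if abs(t - mean) <= sd]
    return top_fraction, mrr, statistics.mean(kept)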
5.4.2 Independent Rater Assessment

Table 5.3 presents the results of the independent user evaluations of the stories. As described in section 5.3.3, a HIT was created to gather ratings from 8 unique workers for each story. Unsurprisingly, the human written stories achieved the highest coherence. As expected, stories written with the bigram model were also judged to be more coherent than those written with the unigram model. Although these results are not unexpected, it is somewhat remarkable how low the human written stories are rated. Part of the reason for this outcome is the methodology used to select the human authored stories. Since the stories picked for the evaluation set were drawn randomly from the automatically extracted story set, some false positives were included; at least a few of these documents were advertisements and had very little discourse structure in general. In Chapter 8 another reason will also be explored.

The believability scores are approximately the same as each model's coherence rating, perhaps suggesting that these criteria are difficult for the workers to distinguish. However, there is a small difference in the entertainment ratings between the models. Unfortunately, all of the models, including the human authored stories, were rated below 3, indicating that people have more fun writing these types of stories than they do reading them. There is, however, a slight increase in enjoyment when using the bigram model, though it falls just short of the weblog stories.

Table 5.3: Independent story rating results (mean ± standard deviation; number of ratings in parentheses).
Model    Coherence           Believability       Entertainment
Human    3.65 ± 1.24 (878)   3.70 ± 1.26 (876)   2.96 ± 1.26 (876)
Unigram  3.29 ± 1.33 (3901)  3.30 ± 1.32 (3901)  2.84 ± 1.26 (3895)
Bigram   3.42 ± 1.27 (3505)  3.41 ± 1.28 (3509)  2.91 ± 1.23 (3506)

(Footnote 12: human/unigram, p << 0.001 for Coherence and Believability and p < 0.01 for Entertainment; human/bigram, p << 0.001 for Coherence and Believability and p < 0.26 for Entertainment; unigram/bigram, p < 0.001 for Coherence and Believability and p < 0.02 for Entertainment.)

5.4.3 Qualitative Analysis

Although the subjective and objective quantitative analysis described above provides some insight into the relative capabilities of the two models, it does not tell us much about the actual stories created or about the problems these models tend to suffer from. This section takes a more qualitative approach by examining several example stories written with the two models. Figure 5.4a presents an entire story written by a worker using the unigram model, along with part of the weblog story used as the proxy to generate the shared sentence highlighted in light-gray. Although it is a little longer than average, it is fairly typical of a relatively successful story written using this model; on average, the eight independent raters gave it a 3.50, 3.75 and 3.13 on coherence, believability and entertainment respectively. Reading through the story and taking a deeper look at the proxy story illustrates several facets of the simple IR generation model.
Focusing only on the user’s story for a moment, this example demonstrates that even with a weak model, relatively long stories can be written that maintain a moderate degree of coherence. The story only very weakly adheres to some of the richer elements of even a basic narrative theory, such as a climax or conflict resolution, however the user is still able to impose some basic structure on the narrative. The introduction begins with a generality, introducing the subject of the story, which contextualizes the specific narrative events that follow. It continues with some background information, such as when the events take place and that the main character is married, that sets the scene 107 User Story Weblog Story Untitled Even the best laid plans go awry. I knew when the moment arrived and it happened the day before we were flying. Snowstorms were for- cast throughout the entire midwest. I started the morning out by dropping my wife off to work. On the way homne, I got a flat tire. The irony is that I remember thinking that morning when I got in the car. ”Man I wish we had a 3day weekend coming up soon, I could use a day off to get some new tires and maintenance done on the car.” A four hour wait fro roadside assis- tance caused me to miss my lunch meeting. By the time I was done with my dental cleaning, the repair serviceman had already finished the work, and just needed me to sign the paperwork. When I tried to pay my dental bill, my credit card was declined. This was going from bad to worse. ... continued above ... All through this, I cried 2 more times. And again when we get to the body shop. Agnes, who owns the place, told my mom what might happen in regards to the insurance situation. It’s looking a bit grim. I don’t understand the ins and outs of insurance policies, but I’m pretty sure that because I’m not insured, the insurance company is probably not going to let my mom make any claims. And even if they do, it’ll prob- ably be less or something. I’m not sure. My mom hasn’t been able to get a hold of her agent, yet. The police report is going to take a few days, but I’m going to call the station tomorrow to see if they might have gotten around to it. The biggest problem is money. My mom has such a low income nowadays that I don’t think she can afford to rent a car and wait until the car is fixed. Now I had no way to pay for the repairs to my car or the dentist. Then she would have to pay for the repairs ($3000-$4000) AND the car rental. Where are we going to get this money? Where are we going to get this money? Plus our vacation starts tomorrow with no way to pay for expenses. My 2 weeks in tjeschie were extremely boring, because there was noth- ing to do, except visiting castles. I should have known as soon as I got that flat tire that things just weren’t going smoothly. (a) Introduction Rising Action Resolution Falling Action Premise: Plans don't always work out Character and scene setting Unexpected events disrupt the plans Trip is bad because of it I learned my lesson (reevaluate at first sign of trouble) (b) Figure 5.4: (a) The left column shows an entire user story. The right column looks at part of the weblog story used as a proxy for generating the highlighted sentence. The words highlighted in dark-gray show the overlap leading to the highest similarity score. (b) Simple narrative arc of the user’s story. 108 for the remainder of the story. Most of the remainder of the story describes the events constituting the plot of the narrative. 
The plot itself is primarily a sequence of unfortunate events that does very little to advance the story to a climactic moment. However, the user does something clever with the last sentence that enables a basic narrative arc to be constructed despite the previous lack of structural elements. Essentially, they provide a moral to the story that explains why all of the other discourse was necessary. It is a common device used in both the collection of weblog stories and in the user generated stories created during these experiments. In this case, it seems to work quite effectively by providing a clear, albeit mundane, resolution, as well as enabling the second to last sentence to function as a climax or falling action. A simple interpretation of the complete narrative arc is illustrated in figure 5.4b. Although the user’s story is successful on many levels, this example also illuminates several problematic issues and other areas of concern. One of the biggest shortcomings of the model is its tendency to wander off topic in a only a few turns. This not surprising, since only a single preceding sentence is used as a query and no constraints are given to ensure consistency within the user’s story. This is a symptom encountered by many other generative processes, particularly those making a Markovian assumption that the current state only depends on a limited history of previous states. In the user’s story, this is manifested by local consistencies that do not move the story forward and are sometimes contradictory with previously stated events. It is also interesting to focus on issues between the user’s story up to the sentence, Now I had no way to pay for the repairs to my car or the dentist, and the weblog story used as a proxy at this point. Despite relatively high overlap of “important” words in the corresponding sentences (i.e. pay, repairs and car), the two sentences actually mean 109 substantially different things. However, in spite of this difference, the next sentence is actually quite apt for the user’s story. Similarly, the meaning and narrative discourse of the user and weblog stories are almost entirely different, yet the next sentence of the weblog story is still highly appropriate. This suggests that the type of Markovian assumption made in the simple IR generation algorithm is fairly well justified, because certain words and events constrain the possible future outcomes so tightly that no other context is needed. However, it also hints that improving the model will be more complex than simply adding context to the query component. For comparison, a complete story written using the bigram model is given in fig- ure 5.5. In this example, each user sentence is presented with the corresponding weblog sentence to give a better feel for what types of keywords and phrases are contribut- ing the most to the retrieval process. Although the fundamental problems remain the same for the bigram model, the use of short two word phrases, as opposed to single keywords, seems to restrict the possible outcomes enough to improve the coherence slightly, as indicated by the independent raters assessment. Intuitively, this makes sense when considering the types of things that could happen after the phrase harm’s way, versus the keywords harm and way independently. For many more complete stories see appendix 8.3. 5.5 Chapter Summary This chapter introduced the first type of narrative generation model and a cost effective way of obtaining a large evaluation set using Amazon’s Mechanical Turk. 
The results presented in section 5.4 indicate that even the simple unigram baseline was sufficient for people to enjoy using the system. Using bigram keywords helped to both improve the use of the system and the quality of the stories written as judged by independent 110 User Story Matching Sentences All in a day’s work 1. The day began in the usual way . Work went the usual way. 2. I spun like a ballet dancer and everyone ate their order hot and thirsted not. 3. The diner was busy, the music was lively, and I was gleefully happy! The diner was at the edge of the square. 4. It was the best place in town. 5. And I was thrilled to be one of their danc- ing waitresses. I was thrilled to hear that. 6. Then. 7. Outside, a car veered wildly out of control and headed for the big front window. I fell forward, wildly out of control , arms splayed out. ”YAAUUUGGGHH!” 8. I exclaimed, in utter surprise, as my body lurched in mid-air. 9. I dove toward the crowded tables, trying to get people out of harm’s way . I push the bike over to the curb, out of harm’s way . 10. I am sweating but I’m cold. 11. The wnidow shatters as the car plunges into the diner. Suddenly, Rhett plunges into the water, swims about 4 feet, grabs the duck and swims back to the edge. 12. I was shocked! 13. Luckily, everyone was out of the way and the passengers in the car were merely stunned. Figure 5.5: An example story using the bigram model. The matching sentences from each weblog story are presented next to the corresponding user sentence and the over- lapping bigram phrases are highlighted. raters. Section 5.4.3 gave a few example stories and analyzed a few of the strengths and weaknesses of the simple retrieval based algorithm. In the next chapter an extension to this retrieval model will be introduced that attempts to address several of the issues discussed in the previous section. 111 Chapter 6 Reranking The sentence by sentence, turn based, interactive storytelling paradigm used in this the- sis has many advantageous qualities. It is a game that many people are already familiar with through various existing (live) versions of it, such as Fortunately or Unfortunately. It is both popular as a purely recreational activity, and for its use as a tool for collabora- tive group bonding exercises in organizational settings. This type of game has also been used in educational settings to help children improve their language skills and encourage them to write more often [124, 132]. The methods developed in this chapter will also show how an automated version of this game can enable research in two related areas. Similar to the cognitive studies that can be used to aid children’s writing abilities, a trace of the interactive process with Say Anything produces valuable data that can be used to model discourse coherence computationally and assess the quality of text generated by a machine. Second, as the coherence model reaches a sufficient proficiency it becomes possible to study and model user preferences. In this case we are not only interested in merely predicting if the story continuations are valid, but also that they are interesting to the users. For example, we would like to be able to populate the list of candidate sentences with a coherent set of sufficiently divergent possibilities, so that they have a true opportunity to explore the virtual world. The previous chapter presented a simple narrative generation algorithm that enables an end-to-end system that begins to partially satisfy these goals. 
It does not matter if the user's story is simple or complex, commonplace or obscure; the system is virtually guaranteed to find some way to continue it. This already represents a shift from most existing interactive narrative systems, which have a very limited domain of knowledge and often fail even within their targeted area of expertise. However, despite the relative success of the simple IR based method at finding a sentence to continue the user's story, it still often fails to find a relevant sentence that is coherent within the author's fictional universe. This chapter will take a more in depth look at some of the difficult problems hindering this simple approach and introduce a modification to the algorithm that significantly improves upon the retrieval-only approaches.

6.1 Story Analysis

This section will take a more detailed look at places where the simple IR model breaks down and will investigate what is necessary to do a better job. Figure 6.1 is another partial story fragment written using the system. It is annotated with several link types to illustrate the depth of knowledge required for the task, and to suggest several simple discourse attributes that are important to the ultimate success of generating coherent responses. Although the figure is a seemingly unintelligible jumble, the remainder of this section will untangle the spaghetti and show that the different link types are neither difficult to interpret nor, in many cases, particularly hard to discover automatically.

Figure 6.1: Story analysis graph.

At a basic level, a story is nothing more than a description of the state of the world, how these states change over time, and the effect these changes have on the characters involved. An event calculus that had sufficient breadth to cover all the possible states of the world, and the axioms to describe how the world is transformed from one state to another, would be an ideal computational representation of a story. Unfortunately, the world has infinitely many possible states and axioms, rendering a complete solution virtually impossible. However, the number of actual activities, events and states people are likely to engage in is vastly more limited. Even in this reduced domain, it is unlikely a sufficient formal theory could be engineered by hand to adequately represent all of the things people regularly do. The severe lack of breadth in this narrower, but still extremely large, domain is one of the primary bottlenecks of most state-of-the-art storytelling systems today. Instead, hope rests on finding a much looser informal representation of events that can be automatically learned from raw text, but still has enough of the expressiveness a more formal theory would provide.

The finely dotted lines in Figure 6.1 are used to loosely represent partial internal event structure (within a sentence) and the external relationships between consecutive events across sentences. Within a sentence, the links are used to indicate the sub-structure that encompasses the primary meaning of a specific event (i.e. the dependency structure of the head verb, which could be obtained automatically).
For example, the 114 first sentence is an eventuality describing a property of the current state of the world, in particular, that of being a cold and blustery day. In a formal representation this might be expressed asbe(day;cold) &be(day;blustery). Fortunately, the dependency tree struc- ture used in the figure conveys nearly the same information, but is easier to acquire. The finely dotted links between sentences represent causal, temporal and other rela- tions between (possibly) separate events. For example, the link between the first and second sentence might specify that the events are contemporaneous, or that one elabo- rates on the other by providing more detailed information about the same state of affairs. Although long distance dependencies among the events of this story exist (and are not shown in the figure), the local dependencies are enough to show that many types of rela- tionships between events impose a structural order in which these events are narrated. For example, it only makes sense to be warm during the winter if you are somewhere heated, such as in a moving car with the heater turned on. Representing the events of a discourse in the way depicted in figure 6.1 suggests learning a theory of event structure could be done empirically and several recent papers have explored different approaches for doing so [11, 27, 77]. With a highly accurate dependency and discourse parser that provides both the sen- tence level and document level relationships, it would be possible to make accurate predictions about changes in the world simply by following the appropriate links in the graph. Although the tools and mechanics are available to extract this information auto- matically, there is a severe data sparsity issue. For example, the event/state of being a cold blustery day in Tiny Town has very probably never been described textually before (in any lexical permutation) on the Web, despite its enormous size. To maintain tractabil- ity, certain concessions must be made to limit the amount of information used, but comes at the cost of omitting details that are necessary for a correct interpretation. In the first 115 sentence, for example, it is probably not crucial to the state of the world that the cold and blustery day is in Tiny Town. However, in the second sentence it is important that Win- ter had closed its grip on the land, since it might otherwise be confused with gripping a physical object, such as the car door. An alternative to using the full dependency structure of the head verb to make pre- dictions is to simply use informative keywords, as illustrated by the dark solid lines in figure 6.1. This is essentially just a generalization of the simple IR based method described in the previous chapter. The reason this generation method works so well is because informative words in one sentence are often very indicative of the high value words in the following sentence. As can be seen in the figure, nearly every sentence has at least one word that is very predictive of another word in the following sentence. For example, cold is very predictive of Winter, Winter is very predictive of froze, car is very predictive of drove and so on. However, there are also much longer dependencies on key- words than a sentence or two, such as cold predicting warming (or more aptly warming being explained by it being cold), or the words treacherous and blustery. 
Even though properly identified keywords tend to be very predictive of each other, they can also lead to some very ill informed interpretations. Although this story is relatively unaffected by the following coincidental word association, a keyword based system could easily be fooled into relating the grip in sentence 2 with the door in sentence 3 (dash-dotted line). Although the previous paragraphs have emphasized the importance of event predic- tion in generating a topical continuation of the user’s story, it is not the only aspect of the discourse that determines the acceptability of a sentence presented to the user. Not only must the candidate sentence be of the appropriate topic, but it must also be struc- turally correct in terms of its linguistic discourse construction. For example, there can be cases where a sentence is about the right thing, but does not make sense because it 116 fails other discourse constraints. Two of the most common areas where the simple IR algorithm fails is in regards to coreference resolution and in verb agreement. Problems often arise when the main action of a predicted event is exactly what you would expect, but the agency of the characters involved is inconsistent with the previous aspects of the narration. Although the coreference between the entities in figure 6.1 can be fairly eas- ily interpreted, the thin solid lines hint at how much world knowledge and experience would be needed to correctly determine what noun phrases refer to the same physi- cal entity. Presumably, Patrick and the narrator I are distinct individuals, but are both included in the interpretation of we. However, it is potentially unclear, without sufficient world knowledge, that the we in sentence 5 does not actually refer to the large crowd in Sentence 4 (dash-dotted line). He in the last sentence could be introducing a new char- acter, but is likely to be referring to Patrick mentioned three sentences earlier. However, without further processing to identify Mike’s Hard Lime as a single inanimate object, it would also be difficult to rule out that He refers to a person named Mike. 6.2 An Enhanced Retrieval Model The previous section examined some of the crucial elements in accurately predicting the next event in a sequence and also some of the required elements needed to ensure the returned sentence preserves the linguistic integrity of the given narrative. Although the simple IR based model remains the core of the new algorithm, several modifications are needed in order to leverage the insights from the previous section. The remainder of this section will describe a new generation model and the features it uses to model the story in more detail. The keyword based information retrieval approach described in the previous chap- ter was not chosen for the reason that it is the most accurate retrieval mechanism for 117 this type of scenario. More advanced information retrieval methods would likely per- form better on (theoretical) precision and recall metrics. For example, many sentences have very similar meaning, but are expressed using a completely disjointed set of words. Retrieving sentences that only match particular lexical items will, then, potentially over- look a large number of relevant sentences. Several IR methods have been developed in an attempt to address these types of issues. Latent Semantic Indexing [35], for example, indexes a corpus based on a reduced set of latent topics independent, but derived from, the lexical words of individual documents. 
Another example is a k-nearest neighbors approach [122] that allows documents to be indexed by arbitrary features instead. However, despite the advantages of these types of approaches, they are much more difficult to scale to large data sets, making it hard to meet the real-time requirements of the system. What is needed is an approach that has nearly the same retrieval latency as the simple keyword based IR approach, but allows a richer set of features to contribute to the final score (and rank) of a candidate sentence.

The proposed solution investigated in this chapter is a two phase algorithm. The first phase is nearly equivalent to the simple IR based method from the previous chapter. In the second phase of the algorithm, the candidates obtained in the first phase are reranked based on an arbitrary set of features. Algorithm 2 illustrates this method in pseudo-code.

Algorithm 2: Reranking based generation algorithm
Input: UserStory, the user's complete story (including the most recent sentence)
Output: Candidates, the set of candidate next sentences for the user's story
begin  // Phase 1
    Candidates ← {}
    UserSentence ← getLastSentence(UserStory)
    KeyWords ← analyze(UserSentence)
    IndexEntries ← searchIndex(KeyWords)
    while IndexEntry ∈ IndexEntries and len(Candidates) < n do
        [WeblogStory, WeblogSentence] ← findInDatabase(IndexEntry)
        if hasNextSentence(WeblogSentence) then
            NextSentence ← getNextSentence(WeblogSentence)
            append([WeblogStory, NextSentence], Candidates)
        end
    end
    if isEmpty(Candidates) then
        return error
    end
end
begin  // Phase 2
    Candidates ← rerank(Candidates, UserStory)
    return Candidates
end

Algorithm 2 is nearly identical to the one introduced in Section 5.2, with only a few minor, but important, modifications. First, the entire user's story must be given as an input. Although the retrieval component in Phase 1 still only uses the most recent sentence written by the user, the reranking component will use features derived from the entire user story. Similarly, when we find the initial candidates, it is also important to retrieve the corresponding weblog story that goes along with each one. Much of the implementation complexity is hidden, but conceptually Phase 2 is very simple. All it does is rerank the retrieved candidates based on a mixture of features from the candidate's weblog story as well as the user's story. Although we still return only a small number of sentences at the end of Phase 2, we may arbitrarily increase the number of sentences found in Phase 1. The only restriction on this number is the ability of the reranker to process the candidates within the real-time constraints of the game. For the experiments in Sections 6.3 and 6.4, 40 sentences were retrieved in Phase 1.
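A minimal Python sketch of the retrieve-then-rerank flow in Algorithm 2 is given below. The index, database and extract_keywords arguments are the same kind of hypothetical stand-ins used in the earlier retrieval sketch, and featurize and weights stand in for the feature extraction and learned weight vector described in the next two sections.

# Sketch: Phase 1 retrieval followed by a Phase 2 linear rerank.
def generate_with_reranking(user_story, index, database, extract_keywords,
                            featurize, weights, phase1_size=40, return_size=10):
    """Phase 1: collect candidate (weblog_story, next_sentence) pairs for the
    last user sentence.  Phase 2: rescore them with a linear model over features
    of both the candidate's weblog story and the full user story."""
    candidates = []
    for entry in index.search(extract_keywords(user_story[-1])):
        if len(candidates) >= phase1_size:
            break
        weblog_story, sentence = database.lookup(entry)
        nxt = database.next_sentence(sentence)
        if nxt is not None:
            candidates.append((weblog_story, nxt))
    # Phase 2: score = w . f(candidate, user_story), then sort best-first.
    def score(candidate):
        return sum(weights.get(name, 0.0) * value
                   for name, value in featurize(candidate, user_story).items())
    candidates.sort(key=score, reverse=True)
    return candidates[:return_size]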
6.2.1 Reranking

The rerank method in Phase 2 of Algorithm 2 hides a considerable amount of detail. For example, it is important to understand how the function makes its ranking decisions and what kind of performance penalty we incur for using it. This section will describe the mechanics of the reranking function in more detail, while Section 6.2.2 will explain the features used to make the ranking decisions.

Shen and Joshi [119] survey several ranking and reranking algorithms that have been proposed for problems where the input is a set of candidate elements and the desired result is an ordered list. Many of these algorithms are directly analogous to linear classification algorithms, such as the Perceptron and Support Vector Machines, and share many of the same properties of their counterparts. For example, a maximum margin ranking algorithm finds the set of rankings for the candidates that maximizes the margin between consecutive pairs, which generally helps performance on unseen data, but is relatively costly to train and apply. In contrast, a Perceptron based algorithm, similar to the one described in Collins [31], is extremely efficient, can be trained online and is a simple modification to the classification algorithm already used in Chapters 3 and 4. Although it does not have the same theoretical performance advantages as the maximum margin approaches, a simple Perceptron based algorithm was chosen because of its efficiency and ability to train online (the online learning capability was not used in this work, but it is a clear avenue for future research).

Training the Perceptron based reranker is almost identical to the Perceptron classification algorithm and is illustrated in Algorithm 3. The only real difference is the format of the input examples. In classification, each training or application example is represented by a single feature vector, whereas each example in a ranking problem is an ordered set of multiple vectors. This change leads to two slight alterations in the training algorithm. First, every lower ranked feature vector in the example set is compared with the top ranked candidate, and if the score of the top ranked element is less than the lower ranked one, then the weights need to be adjusted. Second, the weights are updated slightly differently. In the standard Perceptron the weights are increased (or decreased) to move the decision boundary just enough to put the training instance on the correct side. However, there is no such single boundary for the ranking problem. Instead, the weights are adjusted so that the score of the top ranked candidate is just above the lower ranked candidate.

In addition to these straightforward modifications of the standard Perceptron algorithm, the pseudo-code in Algorithm 3 also includes a margin and a learning rate, similar to what is described in Section 3.6. The margin is directly analogous to the margin used in the classification algorithm. However, the learning rate is slightly more complex, inspired by suggestions in Pattern Classification [38]. In general, a learning rate is used to reduce one of the major drawbacks of the standard Perceptron algorithm. Although the simple weight update scheme is guaranteed to find a decision boundary when the data is linearly separable, most real world data sets are not. In these cases, the simple update rule can cause a significant problem. For example, a few outliers, or even a single data point near the end of the training loop, can move the decision boundary a great distance, which could negatively affect the classification of many other data points. One of the simplest ways to mitigate this problem is to reduce the distance the decision boundary moves on each update. One way to accomplish this is by using a real valued learning rate 0 < η < 1. Although this will not immediately classify (or rank) the current training example correctly after each update, over time the boundary will inch closer to the optimal position without making large jumps that affect too many other data points.

The learning rate approach used in this reranker takes this idea two steps further. First, a separate learning rate is used for each individual feature. By itself this does not help much, because there are potentially millions of features and it would be impossible to know up front how to set each parameter. To solve this problem, an exponential decay is introduced into the learning rate for each parameter. The learning rate always starts at 1 and decays according to e^(−λ·c_i), where λ is a real value that controls the properties of the decay and c_i is the number of times feature i has been updated. The idea is to allow large changes in the weights early in the process, but to limit the movement as we become more confident in the values after having seen the feature many times. Setting the margin γ to 0 and the decay parameter λ to 0 reduces this approach to the standard Perceptron algorithm.

Algorithm 3: Perceptron ranker training algorithm
Input: A set of training examples X of size N
Input: A positive value λ to control the decay of the learning rate
Input: A positive margin γ
Input: A weight vector ~w
Define: c_i counts the number of times feature i has been updated
Define: ~η ≡ ⟨e^(−λ·c_1), e^(−λ·c_2), ..., e^(−λ·c_m)⟩
Define: score(~x_i) ≡ ~w · ~x_i
begin
    for i ← 1 to MaxIterations do
        for x ∈ X do
            for j ≠ 1 do
                if score(~x_1) ≤ score(~x_j) + γ then
                    ~w ← ~w + ~η ∘ (~x_1 − ~x_j)    // element-wise product: each feature uses its own decayed learning rate
                end
            end
        end
    end
end

Ranking a set of candidates using this methodology is extremely easy. One simply computes the score of each candidate using the parameters of the weight vector learned in the training phase, and then sorts the list according to these scores.
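The following is a compact Python sketch of the training loop in Algorithm 3, using dictionaries as sparse feature vectors. It assumes the first vector in each training example is the candidate the user actually selected; the default margin and decay values are placeholders rather than the settings used in the experiments.

# Sketch: Perceptron ranker with a margin and per-feature decayed learning rates.
import math
from collections import defaultdict

def train_ranker(examples, max_iterations=10, margin=1.0, decay=0.01):
    """examples: list of ranked lists of sparse feature dicts; the first
    vector in each list is the preferred (user-selected) candidate."""
    weights = defaultdict(float)
    update_counts = defaultdict(int)   # how many times each feature was updated

    def score(x):
        return sum(weights[f] * v for f, v in x.items())

    for _ in range(max_iterations):
        for example in examples:
            best = example[0]
            for other in example[1:]:
                if score(best) <= score(other) + margin:
                    for f in set(best) | set(other):
                        diff = best.get(f, 0.0) - other.get(f, 0.0)
                        if diff != 0.0:
                            # learning rate decays as the feature is updated more often
                            weights[f] += math.exp(-decay * update_counts[f]) * diff
                            update_counts[f] += 1
    return dict(weights)

def rank(weights, candidates):
    """Sorts candidate feature dicts by their linear score, best first."""
    return sorted(candidates,
                  key=lambda x: sum(weights.get(f, 0.0) * v for f, v in x.items()),
                  reverse=True)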
Algorithm 3: Perceptron ranker training algorithm
  Input:  a set of training examples X of size N
  Input:  a positive value λ to control the decay of the learning rate
  Input:  a positive margin τ
  Input:  a weight vector ~w
  Define: η_i counts the number of times feature i has been updated
  Define: ~η ← ⟨e^{-λη_1}, e^{-λη_2}, ..., e^{-λη_m}⟩
  Define: score(~x_i) ← ~w · ~x_i
  begin
    for i ← 1 to MaxIterations do
      for x ∈ X do
        for j ≠ 1 do   // compare each lower ranked candidate with the top ranked one
          if score(~x_1) ≤ score(~x_j) + τ then
            ~w ← ~w + ~η ⊙ (~x_1 − ~x_j)
          end
        end
      end
    end
  end

By itself, a per-feature learning rate does not help much, because there are potentially millions of features and it would be impossible to know up front how to set each parameter. To solve this problem, an exponential decay is introduced into the learning rate for each parameter. The learning rate always starts at 1 and decays according to e^{-λη_i}, where λ is a real value that controls the properties of the decay and η_i is the number of times feature i has been updated. The idea is to allow large changes in the weights early in the process, but to limit the movement as we become more confident in the values after having seen the feature many times. Setting the margin τ to 0 and the learning rate parameter λ to 0 reduces this approach to the standard Perceptron algorithm.

Ranking a set of candidates using this methodology is extremely easy. One simply computes the score of each candidate using the parameters of the weight vector learned in the training phase, and then sorts the list according to these scores.
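As a concrete illustration, the following is a minimal Python sketch of this ranker. It follows the structure of Algorithm 3 but is not the dissertation's code: the sparse-dictionary feature representation, the class name, and the toy training example are all assumptions made for the sketch.

```python
import math
from collections import defaultdict

# Sketch of a Perceptron-style ranker with a margin and a per-feature
# exponentially decaying learning rate, in the spirit of Algorithm 3.
class PerceptronRanker:
    def __init__(self, decay=0.01, margin=1.0, iterations=10):
        self.w = defaultdict(float)        # weight per feature
        self.updates = defaultdict(int)    # times each feature has been updated
        self.decay = decay                 # lambda: controls the decay
        self.margin = margin               # tau: required separation
        self.iterations = iterations

    def score(self, x):
        return sum(self.w[f] * v for f, v in x.items())

    def train(self, examples):
        # Each example is a list of sparse feature dicts; index 0 is the
        # candidate the user actually selected (the "top ranked" one).
        for _ in range(self.iterations):
            for candidates in examples:
                top = candidates[0]
                for other in candidates[1:]:
                    if self.score(top) <= self.score(other) + self.margin:
                        for f in set(top) | set(other):
                            rate = math.exp(-self.decay * self.updates[f])
                            self.w[f] += rate * (top.get(f, 0.0) - other.get(f, 0.0))
                            self.updates[f] += 1

    def rank(self, candidates):
        return sorted(candidates, key=self.score, reverse=True)

# Toy usage: after training, the selected candidate should be scored highest.
ranker = PerceptronRanker()
ranker.train([[{"EG_SBJ": 1.0}, {"EG_SBJ": 0.1, "len": 2.0}]])
print(ranker.rank([{"EG_SBJ": 0.1, "len": 2.0}, {"EG_SBJ": 1.0}]))
```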
6.2.2 Story Modeling Features

The previous section explained the mechanism for reranking a set of candidate sentences in Phase 2 of the new generation algorithm. However, a major part of the success of this approach lies not in the mechanics, but in the features used to represent each candidate sentence. This section explores several features based on the analysis in Section 6.1, as well as a few others that might be useful.

Many features will depend on relatively deep analysis of the discourse of the candidate story and of the user's story; however, there are a few potentially useful features that can be obtained with little or no analysis at all. One of these is the PL2 retrieval model score (RMS) found in Phase 1 of the retrieval process, which requires no additional processing at all. Two features were created using the PL2 score as a basis. The first used the log of the PL2 score directly as its value. The second used the difference between the higher weighted score and the lower weighted one.

Another set of features that requires very little processing is based on the length of the candidate and user sentences (SenLen). Given the nature of the game, and the broad range of users expected to play it, one would expect there to be an optimal range of lengths for a candidate sentence irrespective of its content. This is especially true for longer sentences. A user is much less likely to thoroughly read, if at all, a sentence longer than 30-40 words, even if that sentence is a valid response to the input. Unfortunately, the sentence length conjecture does not necessarily hold for short sentences, because they tend to have very little information content and can often follow almost any input sentence; for example, the single word sentence Wow!. Including only a feature based on absolute length might cause a severe bias toward one or two word sentences. While these might be able to continue a story in many cases, too many of them could dramatically hurt the overall quality of the resulting story. To help overcome this problem, an additional feature was added. Instead of the absolute length of the candidate sentence, the relative length of the candidate sentence compared with the previous (user) sentence was used. One would expect that in a high quality story most of the sentences would be roughly the same length, and there would be few transitions from very long sentences to very short ones, and vice versa.

So far, including Phase 1, all of the features used to determine the worth of a candidate sentence have been extremely local. Only the relative sentence length feature has made any use of the additional discourse available, and even there it only considered one additional sentence in the user's story. One of the problems encountered with the simple IR generation model was the lack of context used to determine the similarity. Swanson & Gordon [127] previously tried incorporating context into their generation model by combining two indexes. The first index modeled each information retrieval document as a sentence, exactly as described in Section 5.2. The second index, however, modeled each document as a story prefix, creating as many IR documents in the index as there were sentences in a weblog story. To find the set of similar candidate sentences, two separate queries were constructed. The first used the words of the most recent sentence, as before, and the second used all of the words in the entire user's story. The weights of the two queries were then combined using Apache Lucene's default mechanism for combining queries. Unfortunately, this approach performed both significantly and substantially worse than the single sentence based method.

One of the reasons Swanson & Gordon's attempt failed is that the simple default scoring combination gives too much weight to the contextual index. The context of a story is most useful, for our purposes, in disambiguating two candidate sentences that appear similar on the surface, but hinge on different background assumptions. For example, consider the following user input sentence: I dove right in. Based on this sentence alone, at least two possible candidate sentence types are valid: one type having to do with water, pools, lakes, etc., and another, more general experiential type having to do with getting involved in something quickly. The context of the weblog stories from which these candidate sentences come should be very helpful in determining which type is actually appropriate for the user's story, for example, if a pool is mentioned somewhere else in the weblog story. Combining the scores of a local similarity metric and a more global one is a good idea, but in Swanson & Gordon's previous approach no method was devised for finding the appropriate balance between the two indexes. In this work, however, the reranker provides exactly the mechanism we need to find a weight that combines the two models in a more principled and effective manner. To this end, a feature set (DocSim) was constructed based on the similarity of the entire user's story and the prefix of the weblog story up to the candidate sentence. These portions of each story were converted into bag-of-words feature vectors that used TF-IDF weights computed from frequency counts on the held out weblog story data. The cosine similarity between these vectors was used as the value of this feature. Figure 6.2a provides an illustration of how these features are used to model the story generation process.
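The DocSim computation reduces to TF-IDF weighting plus cosine similarity, as in the minimal sketch below. The toy held-out corpus, tokenization, and function names are assumptions for illustration; in practice the IDF table would come from the held-out weblog stories.

```python
import math
from collections import Counter

# Sketch of the DocSim feature: cosine similarity between TF-IDF bag-of-words
# vectors for the user's story and the weblog story prefix.

def idf_table(documents):
    n = len(documents)
    df = Counter(w for doc in documents for w in set(doc))
    return {w: math.log(n / df[w]) for w in df}

def tfidf(tokens, idf):
    counts = Counter(tokens)
    return {w: c * idf.get(w, 0.0) for w, c in counts.items()}

def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy held-out corpus used only to estimate IDF weights.
held_out = [["pool", "water", "swim"], ["work", "office", "meeting"], ["pool", "party"]]
idf = idf_table(held_out)

user_story = "we walked to the pool and i dove right in".split()
weblog_prefix = "the water in the pool was warm".split()
print(round(cosine(tfidf(user_story, idf), tfidf(weblog_prefix, idf)), 3))
```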
So far the meaning, or semantics, shared between the user's story and the weblog story has played a central role in modeling the merit of a candidate sentence. However, many of the candidates are problematic not because they are off topic, but because of invalid syntactic or structural constructions; for example, when it is difficult or impossible to interpret the candidate sentence within the context of the user's entire story. Assuming Patrick is a boy in Figure 6.1, it would be difficult to interpret the story if the next sentence after "Patrick was having trouble opening his car door that was froze shut." was "As she drove away I could still hear the laughter of the large crowd."
[Figure 6.2: Six ways of modeling a story. (a) Document similarity; (b) machine translation (IBM Model 1); (c) entity grids; (d) coreference grids; (e) verb agreement; (f) latent Dirichlet allocation.]

Although this is not completely incoherent, it is certainly less acceptable than the current version of the sentence (where the pronoun is we). The next several feature sets examine different ways to model the quality of a candidate in terms of its coherence within the user's story alone.

Discourse coherence is a complex relationship between many entities that can often span long distances in a text. However, Marcu [78] argues that any globally coherent text must also be locally coherent. So, tackling the easier problem of modeling local coherence is a good first step and may be sufficient in many cases. One interesting way to model local coherence is by treating one sentence as the translation of another, where the words in the next sentence are generated by the words in the first. Soricut & Marcu [123] show that this is an effective method for the similar task of reordering a set of scrambled sentences. As in their work, we use IBM Model 1 [17] to derive our features.

Statistical machine translation, in general, tries to estimate the likelihood of a sentence t in a target language being the translation of a sentence s in a source language as p(t|s). IBM Model 1 is one particular instance of this framework that makes several simplifying assumptions that ease the mathematical formulation of the problem. It is a word based alignment model in which any single word in the source language can generate any single word in the target language. A word in the source language may align with (generate) one word in the target sentence, more than one word, or no words at all. It is also possible that a word in the target sentence does not correspond to any words in the source sentence, represented as being generated by a hidden null token. Given these assumptions, and that the length of any sentence in the target language has an equal probability, Brown et al.
[17] show that the probability p(t|s) is given by:

p(t \mid s) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} tr(t_j \mid s_i)    (6.1)

where m is the length of the target sentence, l is the length of the sentence in the source language, and ε is the constant probability assigned to any target sentence length. tr(t_j | s_i) is the probability that word j in the target sentence is a translation of word i in the source sentence.

The description of the model is very intuitive, and it is easy to see how it could be applied to scoring candidate sentences by treating them as target sentences of the user's input. The only mystery remaining is how to obtain the translation probabilities. The standard way of estimating these probabilities is the expectation-maximization (EM) algorithm run over a large corpus of known translation pairs. The training corpus in this work was created by extracting pairs of consecutive sentences from the held out story data described in Section 4.4, for a total of 2,035,966 training pairs. The statistical machine translation toolkit GIZA++ [98] was then used to learn the translation probabilities from these sentence pairs.

Two different feature sets were created using the translation probabilities learned from the EM training. The first set of features (IBM(BG)) was created by extracting all pairs of words across the user's sentence (source) and the candidate sentence (target), similar to a bigram. The values for these features were then assigned based on the translation probability of the source and target words. The second feature set (IBM1) contained only one actual feature, whose value was determined by Equation 6.1, taking the input sentence as the source and the candidate sentence as the target. Figure 6.2b is a visual representation of how IBM Model 1 is used to represent the story generation process.
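The sketch below shows how both feature sets could be derived from a learned translation table. The toy table, the ε placeholder value, and the small probability floor for unseen pairs are assumptions made for illustration; in practice the table would be estimated with EM (e.g., via GIZA++).

```python
# Sketch of the two IBM Model 1 feature sets, given a translation table
# tr[(target_word, source_word)] -> probability.

EPSILON = 1.0  # uniform target-length constant; a placeholder value

def ibm1_score(source, target, tr):
    """Equation 6.1: p(target | source) under IBM Model 1 with a null token."""
    src = ["<null>"] + source                       # l + 1 source positions
    prob = EPSILON / (len(src) ** len(target))
    for t in target:
        prob *= sum(tr.get((t, s), 1e-9) for s in src)  # tiny floor for unseen pairs
    return prob

def ibm_bigram_features(source, target, tr):
    """IBM(BG): one feature per (source, target) word pair, valued by tr."""
    return {f"IBM_{s}_{t}": tr.get((t, s), 0.0) for s in source for t in target}

tr = {("water", "dove"): 0.2, ("in", "dove"): 0.1, ("water", "<null>"): 0.01}
user_sentence = ["i", "dove", "right", "in"]
candidate = ["the", "water", "was", "cold"]

print(ibm1_score(user_sentence, candidate, tr))
print(ibm_bigram_features(user_sentence, candidate, tr))
```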
Several other, slightly less local, coherence feature sets were also investigated. Although it is a common thread in this chapter that highly detailed and long distance discourse relationships are very hard to extract automatically, it is relatively easy to pull out inexact local relationships. Entity grids (EG), which were introduced in Sections 2.3 and 4.3.2, are one representation of a document that allows simple discourse relations to be easily converted into a format usable by the reranker. Briefly reiterating, an entity grid is a matrix-like representation of a document whose columns are the unique words of the document and whose rows correspond to the individual sentences. The cells of the matrix contain the dependency relation of the corresponding word (e.g., subject) in the given sentence. n-gram style features are then obtained by traversing down each column and assigning a value to each feature based on its relative frequency in the grid. Each candidate sentence will have different words and dependency relationships assigned to it, causing each entity grid to produce a different set of features and relative frequency distributions. Barzilay & Lapata [8] have shown that these features are highly predictive of coherent text in the sentence ordering task, and we should expect they will be useful in this domain as well. Figure 6.2c shows a partial entity grid for the example story presented earlier in the chapter.

One of the areas of concern mentioned in Section 6.1 that has yet to be directly addressed is the problem of coreference resolution. Identifying the lexical items that actually refer to the same underlying real world object is an extremely difficult problem in its own right. General coreference resolution is so difficult that often only a subset of entities is considered, such as pronouns within the same sentence. Unfortunately, for our purposes, the coreference relationship across sentences is the important aspect we need to consider. On the other hand, we do not actually have to definitively resolve the relationships either. Instead, it might be enough to track the distribution of nouns and pronouns between neighboring sentences, similar to an entity grid (COREF). Like the entity grid, each row of the matrix corresponds to a sentence in the document; however, each column now corresponds to a dependency relation. Since this is not a very precise organization and not all dependency relations are equally important, tracking all the nouns and pronouns between sentences is likely to introduce more noise than signal. For the most part, we are really only interested in following the subjects and objects in the document, and so these are the only relations used in the model. The cells of the grid are filled in with the word corresponding to the appropriate row and column of the table. To reduce data sparsity, all non-pronoun words were replaced with a special token ENTITY_i. The index i starts at 1 for the first seen word and is incremented each time a unique new word is encountered. If the exact same lexical item is seen again, it is replaced by the special token and assigned the same index as before. The grid is constructed in order, so that the replaced entities are assigned numbers based on their order of appearance.

Although the coreference construction is very similar to the entity grid construction, the slight change introduces a problem that does not emerge with entity grids. Since a verb can have more than one object, and a sentence can have more than one clause (each with its own subject), sometimes more than one word can be entered into a cell. To address this issue, each column is treated like a graph, which has a node for every word in a cell and a link from every word in that cell to every word in the next cell down. Every possible path in the graph is enumerated, and features are then extracted in the same way as before by treating every enumerated path as a separate column. These features should capture, for example, the intuition that a first person story will have a high distribution of subject → I → I → I features. An illustration of the coreference model is given in Figure 6.2d.

A story can also be thrown off track by incorrect verb agreement (VA), even when the meaning of that verb is pertinent to the user's story. These cases are handled in a similar way to coreference above. The initial data structure is a single column with each cell containing the part-of-speech of any verb in the sentence (excluding modals). As before, the column is treated as a graph and features are extracted from all possible paths. See Figure 6.2e for a visual illustration of this model.
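A minimal sketch of the column-path feature extraction shared by the COREF and VA models is given below. The toy columns, the ENT_i placeholder tokens, and the normalization choice are assumptions for illustration; dependency parsing and grid construction are omitted.

```python
from itertools import product
from collections import Counter

# Each cell may hold several items (multiple subjects, multiple verbs), so every
# path down the column is enumerated and turned into n-gram transition features.
def path_features(column, prefix, n=2):
    """column: list of cells, each a list of tokens (one cell per sentence)."""
    features = Counter()
    for path in product(*column):               # one token chosen per cell
        for i in range(len(path) - n + 1):
            features[prefix + "_" + "->".join(path[i:i + n])] += 1
    total = sum(features.values()) or 1
    return {f: c / total for f, c in features.items()}  # relative frequencies

# Toy subject column for a short first-person story; ENT_1, ENT_2 stand in for
# named entities, and one sentence has two clauses (two subjects in one cell).
subject_column = [["it"], ["ENT_1"], ["ENT_2"], ["we", "i"], ["we"]]
print(path_features(subject_column, "SBJ"))

# The same routine serves the VA model, where cells hold verb POS tags.
verb_column = [["VBD"], ["VBD", "VBN"], ["VBD", "VBG"], ["VBD"]]
print(path_features(verb_column, "VA"))
```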
Language models are a different type of approach for representing the story authoring process and assessing the quality of candidates. Lapata [68] first proposed a language modeling approach for discourse coherence, discussed in Section 2.3. Following a standard approach, she models the probability of a set of N sentences as:

P(S_1 \cap S_2 \cap \cdots \cap S_N) = P(S_N \mid S_{N-1} \ldots S_1) \, P(S_{N-1} \mid S_{N-2} \ldots S_1) \cdots P(S_1)    (6.2)

Applying the Markov assumption, this is approximated by:

P(S_1 \cap S_2 \cap \cdots \cap S_N) \approx \prod_{i=1}^{N} P(S_i \mid S_{i-1})    (6.3)

Each sentence is represented by a set of features, such as the main verb, and the relevant statistics are computed over these features. Barzilay & Lee [9] propose a similar language modeling approach using HMMs instead. In the remainder of this section a new language modeling technique is suggested for evaluating the coherence of a document.

Latent Dirichlet Allocation [13] is a graphical language model composed of several factors designed to estimate the probability of an entire document. It was first discussed in Section 4.3.2 as a potential method for distinguishing story documents from non-story documents. However, this type of model needs a large amount of training data to learn the probability of the individual factors in order to reliably estimate the total probability. In the story identification task, only a small number of labeled story documents were available for training, which resulted in an inability to discriminate between the two genres.

[Figure 6.3: Sentence LDA model. (a) Original LDA model; (b) sentence LDA model.]

For the coherence modeling task there is no shortage of training data, since any reasonably well written document will suffice. However, the original LDA model represents an entire document as a bag of words and does not have adequate internal structure to assess the (coherence) quality of an individual sentence within a document. As a reminder, the original LDA model is defined as:

p(w, z, \theta \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)    (6.4)

Figure 6.3a is a plate representation of the LDA model, which makes clear the lack of structure required for modeling sentence level coherence. To adapt this idea for our purposes, a new sentence layer is added to the model. The plate diagram for this modification is shown in Figure 6.3b. The equation for this new model is given as:

p(w, z, \psi, \theta \mid \alpha, \gamma, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(\psi_n \mid \theta) \prod_{o=1}^{O} p(z_o \mid \psi_n, \gamma) \, p(w_o \mid z_o, \beta)    (6.5)

In this equation ψ is a new variable representing the sentence topic and γ is a prior introduced to smooth the distribution. Gibbs sampling was again used to estimate the probabilities for each variable in the model, given 100 latent word and sentence topics. The training set consisted of 24,103 documents from the held-out portion of the story corpus. This set of documents contained 475,342 sentences, 3,554,440 words and a vocabulary of 141,867 types. The priors were chosen using a linear regression over a large set of trials with parameter variations on a (much smaller) separate set of held out training data.

Similar to the IBM model, the individual pieces of this model give us several options for designing features. Several different approaches were considered separately as a comparison. The first feature set (LDAZ) uses the most likely word topic assignments and creates relative frequency unigram and bigram features for the document based on these assignments. The exact same process was performed for the second feature set, except the unigrams and bigrams were constructed from the most likely sentence topic assignments (LDAΨ). The final feature set (LDAE) considered the per-word entropy of each document including the candidate sentence, calculated simply as -log_2 p(x). A diagram of the partial sentence LDA graph is shown in Figure 6.2f.
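Assuming topic assignments and word probabilities have already been inferred (the Gibbs sampling step itself is outside the scope of this sketch), the three feature sets could be computed roughly as follows. The averaging choice for the entropy feature and the bigram construction over topic ids are interpretive assumptions.

```python
import math
from collections import Counter

# Sketch of the three LDA-derived feature sets, given already-inferred
# assignments from the sentence LDA model.

def topic_ngram_features(assignments, prefix):
    """LDAZ / LDA-Psi: relative-frequency unigrams and bigrams over topic ids."""
    grams = Counter()
    for i, z in enumerate(assignments):
        grams[f"{prefix}_{z}"] += 1
        if i + 1 < len(assignments):
            grams[f"{prefix}_{z}_{assignments[i + 1]}"] += 1
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def per_word_entropy(word_probs):
    """LDAE: average -log2 p(w) over the document including the candidate."""
    return sum(-math.log2(p) for p in word_probs) / len(word_probs)

word_topics = [3, 3, 7, 1, 7]                  # most likely topic per word
sentence_topics = [12, 12, 40]                 # most likely topic per sentence
word_probs = [0.01, 0.002, 0.05, 0.01, 0.02]   # p(w) under the sentence LDA model

print(topic_ngram_features(word_topics, "LDAZ"))
print(topic_ngram_features(sentence_topics, "LDA_PSI"))
print(round(per_word_entropy(word_probs), 2))
```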
6.3 Offline Experiments

The previous sections have suggested several features useful for reranking candidate sentences. However, so far it has been left unexplained where the actual data for extracting these features and learning a reranking model comes from. This section describes how this training data is obtained and presents several offline experiments to measure the value of each feature set described above.

The experiments conducted in the previous chapter gathered over one thousand stories written by dozens of people. During the process of writing these stories, people took a total of 5,310 turns writing sentences with the system. It might be argued that every one of these turns provides a training example for our reranker. At a minimum, the training process of the reranker requires that we know the actual top ranked candidate. However, the particular update rule in Algorithm 3 does not require the other candidates to be ranked in any particular order. So, we could in theory include all of the candidates for every turn of the system as part of our training data. This is not necessarily the most reliable approach to take, however. Given the way English text is read (left to right and top to bottom), it is assumed that the user has at least read the first sentence. However, due to visual stimuli and other psychological factors, there is no guarantee the user will read the other sentences in any particular order, if they read them at all.

To address these user selection issues, it is better to include only those turns in which a user explicitly chose a candidate sentence other than the default first sentence. Of the 5,310 possible turns, the users selected one of the other 9 alternate sentences 4,395 times. On each of these turns we can assume that one of two things happened: either the default sentence did not make sense with the story at all (semantically or syntactically), or the user found one of the other sentences more interesting. However, our reranking algorithm is agnostic to the particular reason and should work equally well for both.

Following the standard approach taken in the previous chapters, a development, training and testing dataset were created from the 4,395 pairs of sentences. The development set was composed of 100 stories and a total of 381 sentence pairs. 3,305 sentence pairs from 900 stories were included in the training set. The remaining 184 stories and 688 sentence pairs were used for the test set. Although training the reranker using only pairs of sentences looks much like a classification problem, it is slightly different despite the similarity. The primary difference is that there is no single decision boundary. Instead, each instance can be thought of as defining its own boundary based on the position of the top ranked candidate. Second, while the training may be similar to the classification problem, the learned model is applied to all 10 of the candidate sentences.

The experiments with this data explore several facets of the components discussed in Sections 6.1 and 6.2.2. Each feature set described in Section 6.2.2 was applied on its own, and several additional feature sets were also created through various combinations. The feature sets were first trained using the standard Perceptron ranker, without a margin or learning rate. Then, using a hill climbing technique similar to the one described in Section 3.6, the feature sets were also tested with the enhanced ranking algorithm.
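For reference, the following is a rough sketch of how the offline evaluation can be scored: for each turn, the candidate the user selected is reranked against the others and its new position is compared with its original retrieval rank. The metric names anticipate the tables that follow; the data layout and the DummyRanker stub are assumptions made for the sketch.

```python
# Offline evaluation sketch: compute Rank0, Rank1, %1st and %Above for a ranker
# that exposes score(), as in the earlier PerceptronRanker sketch.

def evaluate(ranker, test_examples):
    """test_examples: list of (candidates, selected_idx); candidates are feature
    dicts listed in their original retrieval order (index = original rank)."""
    n = len(test_examples)
    rank0 = rank1 = first = above = 0
    for candidates, selected in test_examples:
        order = sorted(range(len(candidates)),
                       key=lambda i: ranker.score(candidates[i]), reverse=True)
        new_rank = order.index(selected)
        rank0 += selected                     # original retrieval rank (0 = best)
        rank1 += new_rank                     # rank after reranking
        first += new_rank == 0                # reranked to the top of the list
        above += new_rank < order.index(0)    # outranks the retrieval-only top choice
    return {"Rank0": rank0 / n, "Rank1": rank1 / n,
            "%1st": 100 * first / n, "%Above": 100 * above / n}

class DummyRanker:
    def score(self, features):
        return features.get("EG_SBJ", 0.0)

example = ([{"EG_SBJ": 0.1}, {"EG_SBJ": 0.9}], 1)  # user picked candidate 1
print(evaluate(DummyRanker(), [example]))
```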
The results are summarized in Tables 6.1 and 6.2. Each table shows the average original rank of the candidate selected by the user (Rank0), from 0 (best) to 9 (worst). Accompanying this information is the average rank of the selected candidate after it has been reranked using the corresponding feature set (Rank1). The percentage of selected candidates that were reranked to the top of the collection is also reported (%1st), as well as the percentage of selected candidates that were reranked above the top ranked sentence according to the retrieval-only model (%Above). The results on the left hand side of each table report a 10-fold cross-validation on the training data, with an average of 330.5 test examples in each fold. The values on the right hand side indicate the performance on the testing data, which contained 688 examples.

Several results shown in Table 6.1 are particularly interesting. While not even close to the best performing feature, it is surprising how well the simple sentence length heuristic works, improving the position of the selected candidate by almost a full rank. Similarly, in light of Swanson & Gordon's previous attempt, it is reassuring to see that the similarity of the entire document can be used to improve the ranking of a candidate sentence. Entity grids are clearly the best performing feature set. It is astounding that they are almost always able to rank the selected candidate sentence at the top of the list. Although not quite as good, the coreference features are also considerably better than every other feature set besides entity grids. On the other hand, it is also surprising how poorly the LDA topic modeling approach performed; it actually caused the sentences to be ranked lower than before. It is not clear exactly what is going on, but one explanation seems to be that the model is highly sensitive to the input parameters (i.e., the priors) and to the input data itself. While it performed extremely poorly on the test data, it was actually one of the best feature sets on the development set. One explanation is the very high variance across the cross-validation folds, with a standard deviation of 0.9, whereas the standard deviation of the IBM Model 1 features, for example, was only 0.1.

Table 6.2 presents the results after finding an optimal margin, learning rate and number of iterations to pass over the training data. These adjustments improve the performance of almost all of the feature sets on both the training and test data. Most of the feature sets showed only a mild improvement; however, the cross-entropy LDA feature set showed a large improvement over the previous results. Using the enhanced training algorithm, LDAE has gone from the worst feature set on the test data to the 3rd best. The variance is also drastically reduced, with a standard deviation of 0.1 on the 10-fold cross-validation. This result is curious, however, because the ranking algorithm uses a linear combination of the features. Since there is only one feature in this set, the results should be unaffected.
The most reasonable explanation seems to be an instability in the LDA inference mechanism due to an insufficient number of Gibbs sampling iterations. However, despite the improvements, the performance of the entity grids remains about the same, and they are still the single most important feature.

                                    Training (CV)                     Testing
    Features                  Rank0  Rank1   %1st  %Above     Rank0  Rank1   %1st  %Above
  Simple
     1  RMS                    4.79   4.21   10.7   100.0      4.73   4.27   10.8   100.0
     2  SenLen                 4.79   3.94   15.0    58.3      4.73   3.80   15.3    59.9
  Semantic
     3  DocSim                 4.79   3.98   13.9    52.6      4.73   3.76   15.3    55.4
  Coherence
     4  IBM(BG)                4.79   3.97   18.0    60.8      4.73   4.10   17.7    59.2
     5  IBM1                   4.79   3.34   16.4    64.6      4.73   3.17   15.6    64.4
     6  EG                     4.79   0.54   82.3    94.0      4.73   0.32   88.7    95.8
     7  COREF                  4.79   1.94   44.3    79.8      4.73   1.75   43.2    82.7
     8  VA                     4.79   4.25   10.0    53.7      4.73   4.07   12.6    53.9
     9  LDAZ                   4.79   4.51    9.8    50.0      4.73   4.61   11.0    48.4
    10  LDAΨ                   4.79   4.47   10.1    51.0      4.73   4.55    9.7    47.4
    11  LDAE                   4.79   5.71    4.8    37.3      4.73   6.15    2.3    31.1
  Combos
    12  2,3,4,6,7,11           4.79   0.88   69.0    90.5      4.73   0.58   80.2    93.5
    13  2,3,4,5,6,7,8,9,10,11  4.79   1.09   60.2    87.3      4.73   0.91   65.8    88.7
    14  2,6,7,8                4.79   0.90   66.6    89.8      4.73   0.38   85.3    94.8

Table 6.1: Performance using the standard Perceptron with the default parameters.

                                    Training (CV)                     Testing
    Features                  Rank0  Rank1   %1st  %Above     Rank0  Rank1   %1st  %Above
  Simple
     1  RMS                    4.79   4.21   10.7   100.0      4.73   4.27   10.8   100.0
     2  SenLen                 4.79   3.77   15.2    61.0      4.73   3.71   16.1    59.4
  Semantic
     3  DocSim                 4.79   3.88   14.5    53.3      4.73   3.76   15.3    55.4
  Coherence
     4  IBM(BG)                4.79   3.82   20.4    61.0      4.73   3.96   19.8    60.6
     5  IBM1                   4.79   3.34   16.4    64.6      4.73   3.17   15.6    64.4
     6  EG                     4.79   0.36   89.0    95.9      4.73   0.33   88.7    95.3
     7  COREF                  4.79   1.50   50.3    84.9      4.73   1.40   50.4    85.0
     8  VA                     4.79   3.91   12.1    58.4      4.73   3.77   14.8    58.3
     9  LDAZ                   4.79   4.42   10.0    52.6      4.73   4.63    8.9    46.7
    10  LDAΨ                   4.79   4.40   11.3    52.2      4.73   4.55   10.2    47.8
    11  LDAE                   4.79   3.02   21.8    65.6      4.73   2.85   23.8    68.9
  Combos
    12  2,3,4,6,7,11           4.79   0.40   84.9    95.6      4.73   0.36   85.5    94.9
    13  2,3,4,5,6,7,8,9,10,11  4.79   0.64   74.6    92.8      4.73   0.58   77.8    92.9
    14  2,6,7,8                4.79   0.40   84.7    95.8      4.73   0.37   84.7    95.3

Table 6.2: Performance using the enhanced Perceptron.

Although entity grids are easily the most predictive feature for these offline experiments, it is not 100% clear that using these features in isolation will translate into the most entertaining and usable sentences for the users during live testing (i.e., story writing). For example, the simple IR method may be doing such a poor job that only one reasonable sentence is present among the candidates, and in this case entity-grid features correlate well with this type of sentence. However, once the set of candidates is increased beyond 10 in the retrieval phase, more plausible candidate sentences might be found. While entity grids would probably rank them all high, they might lack the fine-grained lexical details that could differentiate them relative to each other. The only true way to find out is to have users write stories with each of the models. Unfortunately, this is not the most practical approach in terms of time and resources. Although the empirical evidence was not overwhelming for any particular model, in the end Feature Set 14 was chosen as the underlying model for the live authoring and testing. This feature set was chosen because of its excellent performance (0.37 Rank1 on the test set), because it combines several disparate types of features, and because it is extremely efficient to apply (relative to the other feature combinations).
After learning an enhanced Perceptron model for this feature set, it is not much worse than entity grids alone, but may provide slightly more balance from the other lexical features it includes. Table 6.3 shows some of the most highly (and least) valued features of this set.

    Positive Valued                Negative Valued
    COREF/SBJ he→he                EG →→
    COREF/SBJ i→i                  EG →
    COREF/OBJ ENT_2                EG
    COREF/SBJ i                    VA VBD→VB→VBP
    VA VBG→VB                      COREF/SBJ you
    VA VBN→VBP→VBD                 COREF/SBJ →you
    COREF/SBJ ENT_3                VA VBG→VB→VBP
    EG NMOD                        VA VBG→VBD
    VA VB→VBZ→VBZ                  COREF/SBJ
    VA VBD→VBP→VBD                 COREF/SBJ ENT_1→i

Table 6.3: The most indicative features for Feature Set 14.

6.4 Online Experiments

The previous section demonstrated the success of combining the ranking algorithm described in Section 6.2.1 with the types of characteristic features of a story discussed in Sections 6.1 and 6.2.2. Given a set of ten sentences, several of the feature sets were capable of producing almost exactly the same outcome as the human choices. Although this is highly related to the goal of our story generation system, it is not exactly the same. Here, our goal is to produce sentences that are interesting to the person and coherent with the story at large. Even though the sentence our reranker selects is highly correlated with the human chosen one, it does not necessarily mean that sentence is what the user wished, desired or hoped it would be. It may only mean that it was the only sentence remotely plausible or coherent within the story. By increasing the number of candidate sentences the simple IR retrieval mechanism fetches in Phase 1, we hope to increase our chances of finding a larger number of desirable sentences. However, there is always the possibility that none of the sentences in the expanded set will actually be any more interesting than before.

In order to test how the reranker affects the actual story generation process, a set of experiments was performed similar to those in Section 5.4. As before, the data was collected in two stages. In the first stage, a set of Amazon Mechanical Turk HITs was created offering a reward to workers for writing a story with the system. The HIT design, reward, and qualifications were all kept the same as in the simple IR generation approach. However, every story was generated using the bigram retrieval model and the reranker paired with Feature Set 14. The second stage collected ratings of these stories from other users in the same way as they were accumulated before.

6.4.1 Author Ratings

Table 6.4 summarizes the ratings given to the stories by the authors themselves for each of the generation algorithms discussed so far. (Significance of the reranking model relative to unigrams: p > 0.1 for coherence, p > 0.1 for believability, p ≪ 0.01 for usability and p ≪ 0.01 for entertainment; relative to bigrams: p > 0.05 for coherence, p > 0.5 for believability, p ≪ 0.01 for usability and p ≪ 0.01 for entertainment.)

    Model      # Stories   Coherence      Believability   Usability      Entertainment
    Unigram    601         3.46 ± 1.11    3.53 ± 1.16     3.08 ± 1.19    3.99 ± 1.05
    Bigram     567         3.63 ± 1.11    3.59 ± 1.20     3.27 ± 1.19    4.14 ± 1.02
    Reranking  443         3.51 ± 1.15    3.62 ± 1.20     3.85 ± 1.05    4.38 ± 0.83

Table 6.4: Author rating results.

Somewhat surprisingly, the authors did not believe their stories were any more coherent with the new reranking model, despite its excellent performance in the offline experiments. Although they rated their stories slightly more believable, the result was not dramatic, nor statistically significant. However, there was a large, and statistically significant, jump in both usability and entertainment.
This is actually a quite perplexing result. Presumably, if the system is easier and more fun to use, it is because the responses are more coherent and appropriate. So, some other explanation must account for this discrepancy. One possibility is that the system is in fact performing better locally. Each sentence returned is much more appropriate given the local context (i.e., the previous sentence only); however, the candidate sentence still fails to be cohesive within the entire story. Psychological factors may be another possibility for this dissonance. It seems reasonable to believe that many people enter into their stories with a predefined vision of how they should proceed. So, even though the model may be returning sentences that do make (better) sense for the story that is actually there, they do not conform to the story the author imagines should be there.

Table 6.5 presents the objective statistics that can be directly gathered from the authoring process and the story artifacts themselves. (Significance of the reranking model relative to unigrams: p > 0.1 for Avg Len, p > 0.05 for %Top, p < 0.01 for MRR, p > 0.05 for Time and p < 0.01 for Time/Sen; relative to bigrams: p > 0.5 for Avg Len, p < 0.01 for %Top, p > 0.1 for MRR, p < 0.05 for Time and p ≪ 0.01 for Time/Sen.)

    Model      Max Len   Avg Len        %Top           MRR            Time (s)         Time (s)/Sen
    Unigram    27        9.41 ± 2.31    0.08 ± 0.09    0.36 ± 0.30    460.6 ± 411.8    44.9 ± 32.0
    Bigram     25        9.50 ± 2.51    0.09 ± 0.10    0.34 ± 0.29    492.4 ± 463.7    47.9 ± 35.6
    Reranking  27        9.53 ± 2.68    0.07 ± 0.08    0.28 ± 0.07    399.2 ± 294.3    40.1 ± 22.8

Table 6.5: Story authoring statistics.

Contrary to what was expected, the percentage of top ranked candidates selected and the mean reciprocal rank have both decreased. At first glance, it seems strange that the offline reranking model can determine, after the fact, which candidate the user selected with extraordinarily high accuracy, yet it is completely unable to make the same prediction in advance of the user's actual selection. At first, this might seem to undermine the benefit of the whole approach, but there are several potential reasons for this discrepancy, and other results presented here and in the next section demonstrate the relative effectiveness of the reranking generation algorithm.

One possible explanation why users do not choose the top candidate more often with the reranking model, even when it is a better choice, has to do with a user's sense of control. They want to feel like they are actively engaged with the process and are in control of their own story. In other words, they do not choose the top candidate because it is the top candidate. Additionally, the reranker should theoretically be populating the candidate list with an overall higher quality set of sentences, increasing the chance that any alternative presented is a desirable one.

In addition to the possible psychological and other hypothesized reasons for the lower than expected coherence ratings, some of the other authoring statistics provide evidence supporting the effectiveness of the reranking model. Average story length has been discussed as one way to measure the quality of the algorithm, because it indicates how willing the user is to continue using the system, even when there is no additional monetary reward for doing so. The new generation model does show a slight improvement over the other models, giving some additional indication that it is an improvement over the others. However, looking at the time required to write the stories compared with the other models is much more compelling.
The total time to write a story with this model has dropped by more than a minute (a 13.3% decrease) and the standard deviation has been reduced by almost two minutes. Adjusting for the length of the story and removing outliers as described in Section 5.4.1, nearly 5 seconds (10.6%) have been trimmed from the average length of time needed to write a sentence. These results suggest that the reranking model is presenting a set of sentences, or at least ordering the set, in a way that makes finding a reasonable continuation much easier than either of the previous models. Additionally, the fact that the stories are not rated significantly worse and were written in much less time supports the hypothesis that some unknown factors are biasing the coherence ratings users report.

6.4.2 User Ratings

The results in the previous section show that certain aspects of the new narrative generation algorithm improve the user's experience with the system, indicated by their subjective ratings and the reduced amount of time needed to write a story without a significant reduction in coherence. However, the coherence and believability ratings provide a less clear picture of the overall quality of the stories generated with the reranking model. The lack of improvement in these ratings could simply reflect that these stories are not any better (or more coherent) than before. On the other hand, other factors discussed in the previous section could also be at play. This section examines the results of independent users rating the stories using the same methodology as Section 5.4.2 and sheds some light on which of these scenarios is more likely.

Table 6.6 presents the results of the independent user evaluation for the stories written so far. (Significance of the reranking model relative to human stories: p ≪ 0.01 for coherence, p < 0.05 for believability and p > 0.5 for entertainment; relative to unigrams: p ≪ 0.01 for coherence, p ≪ 0.01 for believability and p ≪ 0.01 for entertainment; relative to bigrams: p ≪ 0.01 for coherence, p ≪ 0.01 for believability and p < 0.05 for entertainment.)

    Model      Coherence (# Ratings)   Believability (# Ratings)   Entertainment (# Ratings)
    Human      3.65 ± 1.24  (878)      3.70 ± 1.26  (876)          2.96 ± 1.26  (876)
    Unigram    3.29 ± 1.33  (3901)     3.30 ± 1.32  (3901)         2.84 ± 1.26  (3895)
    Bigram     3.42 ± 1.27  (3505)     3.41 ± 1.28  (3509)         2.91 ± 1.23  (3506)
    Reranking  3.55 ± 1.24  (3067)     3.51 ± 1.26  (3064)         2.98 ± 1.24  (3058)

Table 6.6: Independent story rating results.

These ratings tell a different story than when the authors rated their own compositions. Now, when a fresh pair of eyes judges the quality of the narratives, they are rated significantly higher than the previous two models on every category. They are rated almost as highly on coherence as the human authored stories and are even rated more enjoyable to read. These results lend credence to the idea that people are not good at stepping back from their own stories and providing impartial coherence judgments. Even though the authors themselves did not think their stories made any more sense as they wrote them, other people clearly thought they were easier to read and more coherent.

6.5 Chapter Summary

This chapter introduced a new story generation model that attempts to model the user's story in a more detailed manner.
Instead of considering only lexical similarity between sentences, this new model has the ability to take into account a more contextual, semantic view of the similarity between the stories, as well as the ability to consider other internal coherence measures that help ensure the candidate sentence makes sense regardless of its semantic similarity to the proxy story. A deeper examination of a particular story was conducted in Section 6.1 to illustrate some of the essential characteristics that enter into our analysis of how we, as humans, might decide to continue a story. Although many of these features could not be encoded in the precise manner described, Section 6.2.2 showed several approximations that were possible to extract automatically.

The stories written for the experiments in the previous chapter provided sufficient training data to learn several competing models for assessing the quality of a candidate sentence. The offline experiments showed that all of the feature sets were able to improve the ranking of the selected candidates. Entity grids in particular performed incredibly well, as did some feature combinations that were ultimately used in a new story gathering phase. The results of the authoring process were not quite what was hoped for. Although certain measures, such as the usability ratings and the time to complete a story, were substantially better, the self reported coherence ratings were not significantly different from either of the previous models, and the mean was actually slightly lower than the bigram-only model. The independent user ratings painted a different picture, however. The ratings for all categories were significantly higher, and in particular the coherence improved so much that it is closer to the human level of performance than either of the previous models.

The ultimate goal for any interactive storytelling system is to allow the user to act freely in the environment, but to still have the ability to direct the user in a way that both makes sense with the events that have happened so far and is exciting for the user to participate in. The results presented in this chapter suggest that significant progress has been made toward achieving these goals. Even though the authors do not seem to think so, the candidates presented to them must make more sense in general than those from the previous models, because independent readers find them much more coherent. Other metrics, such as the reduced authoring time, increased story length and improved usability scores, also tend to support this interpretation of the results. Although not quite as clear cut, there is also some evidence that this reranking approach is not only learning a coherence quality model, but also a user preference model. For example, the writers rated their stories more entertaining than with the retrieval only method, suggesting the reranking generation approach is more closely aligned with the user's own internal authoring process. This is also reflected by the fact that the independent users found these stories more entertaining to read. While this approach has made large strides in the right direction, it has not solved the problem completely. The next chapter examines some of the weaknesses of the reranking model and presents one approach that tries to address them.

Chapter 7

Adaptation

Linguists often refer to the infinite productivity of language [72], which refers to a human's ability to speak and understand an infinite number of possible utterances.
One of the consequences of this productivity is that even two relatively short sentences expressing a similar thought have an incredibly low chance of being communicated using the exact same surface form. This poses a daunting challenge for computational systems whose goal is to reason about the meaning of natural language utterances. Traditional expert systems, and storytelling engines, have typically attacked the problem by trying to abstract the lexicalized surface form into a much smaller set of essential concepts that embody the same meaning. To characterize the aspects of the world relevant to the problem at hand, only one reduced set of axioms needs to be authored based on this conceptual abstraction. This type of approach can often be extremely successful for small targeted domains.

However, to say that abstracting all of the core concepts and relations encompassing human experience is extremely difficult would be a major understatement. People have been building ontologies of the world for thousands of years, and there is no reason to believe this will not continue for thousands more. One of the reasons is that the determination of a successful ontology is a moving target. Concepts are continually added and modified through discovery, invention, and changing philosophical world views. Unfortunately, fixing the vocabulary of concepts does not make the endeavor much more manageable. The structural organization of the concepts themselves is contextually dependent on the purpose for which it is created and, separately, on the purpose for which it is used. These properties render it theoretically impossible to define a single complete or consistent ontology for general reasoning for any arbitrary purpose. Even though the noble goal of creating a universal ontology is unattainable, this is not to say that real world, large-scale practical ontologies cannot be useful. OpenCyc (http://www.opencyc.org/), CLIB (http://userweb.cs.utexas.edu/users/mfkb/RKF/clib.html) [6], SUMO (http://suo.ieee.org/SUO/SUMO/index.html) [96], and ConceptNet (http://conceptnet.media.mit.edu/) [56] are all large-scale ontologies that have been extensively used to help solve real world problems.

It should be clear by now, however, that this is not the approach advocated in this thesis. Instead, the collection of stories authored by ordinary people in natural language is the ontology. The concepts and categorical organization are defined by the actual events and relationships as reported by people in the real world. This allows a fluidity of the concepts based on their use and the context in which they appear in a story. It also entails that these concepts and relations will have all of the ambiguity that makes interpreting natural language correctly so difficult. However, using this approach eliminates the need to define (or, in a platonic view, discover) the essential concepts of the world, and we do not have to explicitly prescribe a mapping from natural language utterances to these concepts. Likewise, we no longer have to hand author the axioms of the world, at least with respect to people's commonsense perception of it, because that is what the stories are implicitly about. We know that there is a vast amount of knowledge contained in the million and a half stories about complex relationships between entities of the world, such as cause and effect. However, the primary problem with this approach is how to extract and make use of this knowledge buried in the complexity of natural language.
For example, if we want to know what happens when a person drops a glass, in a formal, conceptual approach we would map the entities into a set of predicates and search for an axiom whose antecedent sufficiently matches these conditions. The two obstacles facing this approach are mapping the natural language description into the appropriate predicates, and the likely scenario that there is no such axiom in the knowledge base. In our approach we do not have to reformulate the antecedent, and it is much more likely that a relevant "axiom" exists in the story derived knowledge base. However, even though an explicit translation from natural language to a formal language is not needed, some kind of translation is still needed to find an antecedent in the corpus that has the same meaning as the query.

The IR method in Chapter 5 is the most basic solution, essentially ignoring the translation problem altogether and relying solely on exact word overlap. The only reason this works at all is the large number of stories in the corpus. Despite the infinite productivity of language, there is still a significant amount of redundancy in the types of stories people tell and the actual words used to express them. It seems reasonable to believe that the frequency of new content introduced by each new story follows some kind of Zipfian or exponentially decaying curve, suggesting that a good deal of knowledge about common events is available with a relatively small set of stories. Although we have seen that this approach will often find a consequent in the database that is generally pertinent to the given antecedent, the specific language conveying the response is sometimes inappropriate for the particular story we are trying to continue; for example, when the coreference between named entities cannot be resolved in a coherent manner.

The previous chapter described one approach to improving these discourse coherence problems by reranking the candidate responses using a set of suitable features to constrain the results. This had a considerable effect and improved many aspects of the process, including the usability and the overall quality of the generated stories. However, it was not the solution to every problem of the system. Even though the authors rated the system more entertaining and easier to use, they still had some doubts about the coherence of their stories. It has been suggested that the authors' problems do not reflect the original linguistic intent of the coherence metric, but have more to do with an inability to find sentences that conform to their vision of the developing story.

There are many possible avenues for trying to address this problem. The simplest is to change the interface to allow the user to access more candidates, which was actually suggested by several users. This is certainly viable but also has some drawbacks. It is not clear how far down the list, on average, a user would need to go before they found an ideal candidate, if they found one at all. If it is too far, then giving them this option may lead to frustration more often than satisfaction. Second, it has serious repercussions for the game play aspects of the system. Although part of the scientific and engineering goal is to predict an appropriate response, much of the success of the game comes from being forced to work within the constraints you are given. So allowing the user to essentially fish for any sentence in the entire collection would be counterproductive to the intent of the game.
Another possible solution, suggested earlier, is to use a different information retrieval method that generally achieves higher recall, such as latent semantic indexing. Alternatively, the size of the story corpus could be dramatically increased. Given an adequate amount of resources, both of these solutions are likely to improve the performance of the system significantly. However, irrespective of the amount of hardware thrown at the problem or how many stories are collected, the characteristics of the curves describing the information gain of each new story suggest that some critical gaps will always remain. Although 1.6 million stories is large in comparison to previous story collection efforts, it is tiny in comparison to the number of possible stories, and there will always be a long tail of interesting but rare stories.

This chapter explores an alternative approach. Much like the user is forced to work with the choices they are given, we too will try to work with and improve the candidates available from the techniques described so far. The proposed method implements one of the most important steps in case-based analogical reasoning, which is to adapt the retrieved item so that it integrates better with the target scenario. In our case this means making modifications to a retrieved candidate sentence so that it is more coherent with the user's developing story.

7.1 Adaptation Based Generation

It has been established that, without access to an infinite repository of all the possible stories people could author about their experiences in the real world, our collection will be incomplete. The solution advocated in this chapter is to work with what we do have and try to change portions of a retrieved sentence to better fit the user's story. As we have seen in the previous chapters, a retrieved sentence often has the correct underlying semantic relevance, but is incoherent because of a relatively minor discrepancy. In the previous chapter we saw at least two cases where a slight modification to a relatively minor difference between the sentences could have a profound impact on the coherence of the sentence within the user's story. The first concerns named entity resolution, including pronouns, and the other has to do with verb tense agreement.

Stories are incredibly complex and dynamic data structures that could theoretically be amenable to many other types of adaptation beyond the previous examples. For example, consider the following two related stories. First, a story about a young boy who becomes rich and famous as a baseball player through determination and endless practice. Second, a narrative about a young woman who overcomes many obstacles to become a successful corporate executive by achieving high grades in school and enduring long working hours at her company. At a certain level these stories are significantly different. One focuses on a man's triumphs in sports, while the other is about a woman having corporate success in her professional career. However, neither the characters nor their occupations seem to capture the essence of either of these stories. In both cases, the purpose of the story is to convey the fundamental proposition that working hard is a critical ingredient of success.

At a surface level these two stories are not particularly good proxies for one another, at least in terms of how they would operate in our narrative generation system.
Although there would probably be sufficient lexical overlap to rank at least some of the corresponding sentences highly, many of the entities in the proposed sentences would likely contain erroneous domain-specific references. For example, both would probably mention phrases like work hard and long hours, but also contain domain-specific words such as baseball, games, or executive. It would be an incredible benefit, drastically reducing the necessary size of the case library, if the relevant aspects of these stories could be used and the domain-specific aspects of one could be altered to reflect the domain of the other. For example, one could imagine a mapping between athletic training facilities in the sports domain and an academic institution in the other, or from a baseball stadium to corporate headquarters.

Unfortunately, adapting candidate sentences based on a deep semantic mapping like the one in the previous example is beyond the scope of this work. However, it is not out of the realm of possibility that some of the more surface-level aspects of a sentence, such as verb, pronoun and named-entity agreement, could be adjusted to improve the quality of the candidates. The remainder of this section will describe a new adaptation-based generation model and the algorithm used to modify the candidate sentences.

Similar to the reranking model, the adaptation generation model is not much different from the previous methods when viewed at a high level. The algorithm now operates in three phases, illustrated in pseudo-code by Algorithm 4. The only new method in this algorithm is adapt in what is now Phase 2. However, just like rerank, this function hides a considerable amount of detail.

Algorithm 4: Adaptation-based generation algorithm
  Input:  UserStory, the user's complete story (including the most recent sentence)
  Output: Candidates, the set of candidate next sentences for the user's story
  // Phase 1
  Candidates <- retrieve(UserSentence)
  if isEmpty(Candidates) then return error
  // Phase 2
  Candidates <- adapt(Candidates, UserStory)
  // Phase 3
  Candidates <- rerank(Candidates, UserStory)
  return Candidates

The sentence adaptation algorithm is a five-step process. Step (1) begins by identifying all of the pronouns and proper names used in the subject or object position in the parse tree. For each identified position a set of valid replacement words is created using a replacement table for each type of word that could appear. The first set of tables corresponds to five different classes of pronouns: subjective (e.g., I, he, she, we), objective (e.g., me, him, her, us), reflexive (e.g., myself, himself, herself, ourselves), possessive (e.g., mine, his, hers, ours) and possessive determiners (e.g., my, his, her, our). If the target word is contained within one of these tables, then this set is used for the candidate replacements. Proper names are handled slightly differently. Similar to before, they can also be replaced from a set of pronouns, based on whether the noun functioned as the subject or object. However, they can also be replaced by other proper names. This is accomplished by maintaining an extra data structure that keeps track of all previous mentions of proper names in the story (i.e., the cast of characters).
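To make Step (1) concrete, the following is a minimal sketch of how the replacement set for a single target word might be assembled. The abbreviated pronoun tables, the simplified notion of a "target word" passed in as a string, and the function name are illustrative assumptions; the actual system operates over parse trees and a fuller set of tables.

```python
# A minimal sketch of Step (1): building the replacement set for a target word.
# The pronoun classes are abbreviated and the helper is hypothetical.

PRONOUN_CLASSES = {
    "subjective": {"i", "he", "she", "we", "you", "one", "they"},
    "objective": {"me", "him", "her", "us", "you", "them"},
    "reflexive": {"myself", "himself", "herself", "ourselves", "themselves"},
    "possessive": {"mine", "his", "hers", "ours", "theirs"},
    "determiner": {"my", "his", "her", "our", "their"},
}

def replacement_set(target, grammatical_role, cast_of_characters):
    """Return the valid replacements for a pronoun or proper name.

    target             -- the word occupying a subject or object position
    grammatical_role   -- "subject" or "object", taken from the parse tree
    cast_of_characters -- proper names mentioned earlier in the story
    """
    word = target.lower()
    # Pronouns are only replaced by other members of their own class.
    for members in PRONOUN_CLASSES.values():
        if word in members:
            return members - {word}
    # Proper names can become pronouns of the appropriate role, or other
    # previously mentioned proper names (the "cast of characters").
    pronouns = (PRONOUN_CLASSES["subjective"] if grammatical_role == "subject"
                else PRONOUN_CLASSES["objective"])
    return (pronouns | set(cast_of_characters)) - {target}

print(replacement_set("Jane", "subject", ["Jane", "Patrick"]))
```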
Step (2) of the process involves generating a new sentence for every possible combination of the replacements in each target word set. Unfortunately, the number of combinations for sentences with more than a few target replacement candidates becomes prohibitively large. To prevent the set of candidates from exploding, a simple heuristic was used to limit the total number of possibilities. For any given target word, a maximum of two alternatives were selected as possible replacements. These alternatives were chosen by sampling the entire set of valid possibilities based on each word's relative frequency of occurrence in the entire story. For example, the valid replacements for the pronoun he are the subjective pronouns: I, we, you, she, one and they. So, if I had been seen 4 times, she 2 times and you 1 time, then the relative frequency of each term would be I/0.38, we/0.07, you/0.14, she/0.21, one/0.07 and they/0.07 (add-one smoothing is used to prevent zeros and allow a small chance for any pronoun to be selected). These first steps, (1) and (2), are illustrated in Figure 7.1a.

[Figure 7.1: The first four steps of the adaptation process. (a) Shows the potential replacements along with their relative frequency. The arrows indicate the subset of items chosen by sampling. (b) Illustrates how verb agreement is fixed using a dictionary. (c) Shows how the coreference between pronouns is resolved.]
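A rough sketch of the sampling heuristic in Step (2) is given below. The story-frequency counts, the add-one smoothing and the limit of two alternatives mirror the description above, while the function name and the flat token-list representation of the story are assumptions made for illustration.

```python
import random
from collections import Counter

def sample_alternatives(valid_replacements, story_tokens, k=2, seed=None):
    """Choose up to k alternative words, weighted by each word's smoothed
    relative frequency of occurrence in the user's story so far."""
    counts = Counter(t.lower() for t in story_tokens)
    candidates = sorted(valid_replacements)
    # Add-one smoothing: every valid replacement keeps a small, non-zero chance.
    weights = [counts[w.lower()] + 1 for w in candidates]

    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < min(k, len(candidates)):
        chosen.add(rng.choices(candidates, weights=weights, k=1)[0])
    return chosen

# Example from the text: replacements for "he", where "I" was seen 4 times,
# "she" twice and "you" once in the story so far.
story = ["I"] * 4 + ["she"] * 2 + ["you"]
print(sample_alternatives({"I", "we", "you", "she", "one", "they"}, story, k=2))
```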
It is hoped that the characters participating in the events of one of these alternative sentences (or the original) will more closely adhere to the narrative intentions implied by the user's story. However, small changes to the subject of a verb can lead to ungrammatical agreement between the two. For example, when the subject of the sentence "I have more than one lemonade." is changed to He, the new sentence is no longer grammatical. Step (3) combats this issue with a special dictionary [33] that provides, where applicable, the number, person, gender and tense for every lexical entry in the dictionary. The entry containing all lexical variations of any verb whose subject has been adapted is looked up in the dictionary using the number and person information available from the unaltered sentence. The lexical variation corresponding to this entry that matches the number and person information of the new adapted subject is then used to replace the previously ungrammatical verb. Figure 7.1b gives an example of this process. I is the original subject of the verb intend. However, the subject is changed to the pronoun he, which renders the verb agreement incorrect. A specific entry for intend is found in the dictionary by matching the lexical features of the subject (i.e., first person and singular) with the lexical form of the verb as seen in the original sentence. This entry contains many inflected lexical forms depending on the subject's person and number, as well as the verb's tense. The new lexical form is determined by using the new person and number (i.e., 3rd person and singular), while maintaining the same tense. The two relevant inflections are shown in the dictionary entry, but the others have been excluded due to space constraints.

Changing the subject and object can also cause more unintended side effects than just incompatible verb agreement. Altering a noun in the sentence can also disrupt the coreference interpretation within the candidate sentence. For example, consider the following sentence:

  Jane(1) purchased me(2) some lemonade and I(2) intended to give her(1) one of mine(2).

Given only the information present in this sentence, we would expect the coreference between the pronouns to be assigned according to the given indexes. However, if we change Jane to Patrick to better fit our story from Section 6.1, then the interpretation changes considerably. Step (4) attempts to preserve the coreference interpretation of the unaltered sentence in the adapted one. To do this a simple coreference resolution algorithm, similar to the one proposed by Hobbs [60], was used. Pronouns in the unaltered sentence are resolved in the following way. Starting with the pronouns lowest in the parse, the tree is traversed to the closest node in the upper left portion of the tree that matches the number and gender of the originating pronoun. The gender of proper names was estimated using frequency data collected as part of the 1990 United States census (http://www.census.gov/genealogy/names/). Using a procedure similar to Step (1), these coreference assignments are used in the adapted sentence to change the pronouns of nodes lower in the tree to correspond with their assigned coreferent higher in the tree, if necessary. Once again, this may cause problems with the verb agreement in the sentence and is addressed in the same way as Step (3). A simplified illustration of the main idea is shown in Figure 7.1c.
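The inflection lookup in Step (3) can be pictured as a table-driven substitution, sketched below. The tiny in-memory dictionary imitates the structure of the electronic dictionary entry shown in Figure 7.1b; the real resource [33] is far richer, and the helper names and feature encoding here are assumptions for illustration only.

```python
# Sketch of Step (3): repairing subject-verb agreement after a pronoun swap.
# Each entry maps (person, number, tense) to an inflected form, mirroring the
# <inflection> records of the dictionary entry illustrated in Figure 7.1b.

INFLECTIONS = {
    "intend": {
        ("1", "sing", "ind"): "intend",
        ("3", "sing", "ind"): "intends",
        # ... remaining inflections omitted, as in the figure
    },
}

PRONOUN_FEATURES = {  # person and number of a few common subject pronouns
    "i": ("1", "sing"), "he": ("3", "sing"), "she": ("3", "sing"),
    "we": ("1", "plur"), "they": ("3", "plur"),
}

def fix_agreement(verb_lemma, original_form, new_subject, tense="ind"):
    """Replace the verb's inflected form so it agrees with the new subject,
    keeping the tense observed in the unaltered sentence."""
    person, number = PRONOUN_FEATURES[new_subject.lower()]
    entry = INFLECTIONS.get(verb_lemma, {})
    # Fall back to the original surface form if the dictionary has no match.
    return entry.get((person, number, tense), original_form)

# "I intend ..." adapted to the subject "he" becomes "he intends ..."
print(fix_agreement("intend", "intend", "he"))  # -> intends
```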
The four steps outlined above provide an adaptation mechanism that addresses some of the most prominent concerns discussed in the previous chapter. However, there are still a few problems that this approach introduces. First, even with the sampling restrictions in Step (2) and only a small number of sentences retrieved in Phase 1 of Algorithm 4, the total number of new candidates can prevent them from being processed in real time. Second, entity-grids, the most important feature set during reranking, are not lexicalized. In other words, the changes made to the adapted sentences will have only a small impact on the overall score given by the reranker. This means that the adapted variants of an unaltered sentence with a high reranking score will also have a high score. In effect, this could end up populating the list of candidates returned to the user with only a single sentence and its minor variations.

To avoid these issues a Step (5) is included for each candidate sentence. In this step all of the adaptation candidates for a sentence are processed by the reranker. The top two of these alternatives, plus the original unaltered sentence, are pushed onto a global set of candidates. After all of the candidates from Phase 1 have been processed, the global set of candidates is reranked and finally the top 10 are sent to the user as before.

7.2 Chapter Experiments

Similar to Sections 5.4 and 6.4, Mechanical Turk was used to solicit new stories and to have independent users rate the compositions. Similar to the reranking experiments in the previous chapter, only the bigram index was used in Phase 1 of the generation process and a set of 40 seed candidates was sent to the adaptation component. After applying the initial adaptation and reranking steps to identify the 3 best adapted sentences for each candidate, a total of up to 120 sentences was given to the reranking component to order globally.

7.2.1 Authoring Results

The results of the authors' subjective ratings of their own stories are presented in Table 7.1. (None of the results are statistically significant between the reranking and adaptation algorithms; however, the usability and entertainment ratings remain statistically significant between the adaptation model and the simple IR-based methods.)

Model        # Stories   Coherence     Believability   Usability     Entertainment
Unigram      601         3.46 ± 1.11   3.53 ± 1.16     3.08 ± 1.19   3.99 ± 1.05
Bigram       567         3.63 ± 1.11   3.59 ± 1.20     3.27 ± 1.19   4.14 ± 1.02
Reranking    443         3.51 ± 1.15   3.62 ± 1.20     3.85 ± 1.05   4.38 ± 0.83
Adaptation   429         3.46 ± 1.07   3.55 ± 1.19     3.90 ± 1.02   4.33 ± 0.85
Table 7.1: Author rating results.

Similar to the reranking results in Section 6.4, these outcomes are somewhat perplexing. The average coherence and believability ratings have fallen below the reranking model and have dropped to the level of the simple IR unigram model. However, the usability has slightly surpassed the reranking model and the entertainment value remains above the simple retrieval-only methods.

The conflict between the usability and coherence measures discussed in Section 6.4 remains the most puzzling aspect of the results. This issue is actually even more pronounced for the adaptation model, because the extra processing required for the adaptation itself and the increased number of candidates handled by the reranker cause a delay in the responsiveness of the system. Although the delay is not typically very long, on average it is noticeable compared with the previous algorithms. Yet, despite this delay, the authors rated this algorithm among the highest in usability.

Model        Max Len   Avg Len       % Top         MRR           Time(s)         Time(s)/Sen
Unigram      27        9.41 ± 2.31   0.08 ± 0.09   0.36 ± 0.30   460.6 ± 411.8   44.9 ± 32.0
Bigram       25        9.50 ± 2.51   0.09 ± 0.10   0.34 ± 0.29   492.4 ± 463.7   47.9 ± 35.6
Reranking    27        9.53 ± 2.68   0.07 ± 0.08   0.28 ± 0.07   399.2 ± 294.3   40.1 ± 22.8
Adaptation   36        9.63 ± 3.07   0.04 ± 0.07   0.23 ± 0.04   406.1 ± 286.5   39.3 ± 20.6
Table 7.2: Story authoring statistics.

The objective statistics in Table 7.2 only seem to confound these issues even more. The stories are on average longer than with any of the previous models, and the maximum-length story has 9 more sentences (33% longer) than the next longest story. Also, the total length of time to write a story is not much different from the bigram model, and when accounting for story length, users took less time per sentence than with any of the other models, despite the increased latency. However, both the mean reciprocal rank and the percentage of time the selected candidate was ranked at the top of the list were significantly worse than for all other models.
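For reference, the % Top and mean reciprocal rank statistics in Table 7.2 are computed from the rank of the candidate the author actually selected on each turn, roughly as in the sketch below (the function and variable names are illustrative, and the table additionally averages these values per story and per model).

```python
def selection_statistics(selected_ranks):
    """Compute % Top and MRR from the 1-based rank of the candidate the
    author selected on each turn."""
    n = len(selected_ranks)
    pct_top = sum(1 for r in selected_ranks if r == 1) / n   # fraction ranked first
    mrr = sum(1.0 / r for r in selected_ranks) / n           # mean reciprocal rank
    return pct_top, mrr

# e.g., three turns where the chosen sentence was ranked 1st, 4th and 10th
print(selection_statistics([1, 4, 10]))  # -> (0.333..., 0.45)
```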
7.2.2 Story Rating Results

The previous chapter showed that there can be a substantial difference in opinion between the beliefs of an author and those of an independent reader of a story. Table 7.3 compares the average ratings from the independent pool of raters for all the models. Again, we see that the adaptation model is rated significantly higher in coherence and believability than the unigram and bigram models (p-values are not reported here, but they are similar to those in the previous chapter), despite having the poorest ratings amongst the authors. Although the differences relative to reranking are not significant for any of the categories, it is disappointing that the adaptation model appears to perform slightly worse than reranking alone.

Model        Coherence (# Ratings)   Believability (# Ratings)   Entertainment (# Ratings)
Human        3.65 ± 1.24 (878)       3.70 ± 1.26 (876)           2.96 ± 1.26 (876)
Unigram      3.29 ± 1.33 (3901)      3.30 ± 1.32 (3901)          2.84 ± 1.26 (3895)
Bigram       3.42 ± 1.27 (3505)      3.41 ± 1.28 (3509)          2.91 ± 1.23 (3506)
Reranking    3.55 ± 1.24 (3067)      3.51 ± 1.26 (3064)          2.98 ± 1.24 (3058)
Adaptation   3.51 ± 1.22 (3048)      3.48 ± 1.27 (3049)          2.94 ± 1.22 (3042)
Table 7.3: Independent story rating results.

The reranking model is successful because it is able to push incoherent sentences out of the list of sentences returned to the user, while bringing more coherent sentences to the top. However, many of the incoherent sentences that the reranker removes from the list are potentially more appropriate choices for the user, except for minor problems with the sentence, such as an incompatible pronoun. The primary motivation of the adaptation component was to try to leverage the best of both worlds. We would like the reranker to find better choices based on a richer set of features than the simple word-based index, but still be able to keep the semantically relevant choices by fixing their inconsistencies. In a best-case scenario this should produce a candidate list filled with extremely relevant and coherent sentences.

Unfortunately, it appears the adaptation component had the opposite effect. For reasons stated at the end of Section 7.1, a highly ranked unmodified candidate sentence will probably produce alternative adapted sentences that are also highly ranked. If one sentence makes it into the set returned to the user, the other two variations will also. Even though one of these alternatives is more likely to be coherent with the user's story, it also removes two other distinct options the user can choose from. It then becomes a question of whether it is better to present the user with a narrow, but more likely to be correct, selection of relevant candidates or a broader range of possibly incoherent candidates. These results suggest that without a more sophisticated method for adapting the candidate sentences, it is more effective to provide a broader range of sentences using only the reranker.

7.3 Chapter Summary

The adaptation component introduced in this chapter was unable to significantly improve upon the reranking-based approach developed in the previous chapter, and in some cases actually showed a slight decrease in performance. In the short term, it is fairly clear that the most beneficial way to improve the system is by increasing the size of the corpus and emphasizing the reranking component.
Given the near-human levels of performance achieved using the reranking model, it is not worth redressing the problems with the adaptation component for the purpose of a turn-based narrative generation game. However, the basic case-based architecture has proved to be relatively successful and could be applied to other applications, such as a more general natural language understanding mechanism. In those cases it would be important that a single inference is drawn on the first try, as opposed to a list of potential candidates. It is also important that this inference is both plausible, in the sense that a human believes it to be true, and that all of the variable bindings are correct, i.e., there are no invalid coreference interpretations.

In Section 7.1 it was argued that a relatively small corpus (millions of stories) contains a large amount of information about real-world human activities and events, but the amount of new information obtained with each additional story diminishes rapidly. For a maximally broad-coverage inferential system, some form of analogical reasoning or adaptation will be needed. To support this type of reasoning in future systems it is important to understand why the methods employed in this chapter failed and where they need to be improved.

Although not the most satisfying answer, the most direct explanation is the extraordinary difficulty of the problem. Some of the problems occur because the linguistic resources available, such as the inflection dictionary, are incomplete, insufficiently detailed, and sometimes inaccurate. However, there is a much more fundamental reason this is such an intractable problem. Consider part of the story first presented in Figure 6.1:

  It was a cold blustery day in Tiny Town. Winter had closed it's grip on the land. Patrick was having trouble opening his car door that was froze shut. As we drove away I could still hear the laughter of the large crowd. Finally we started warming up on the treacharous trip to the store.

and now the potential candidate sentence:

  Jane had purchased him a soda and he intends to drink it.

Assuming for a moment that the sentence can be interpreted correctly without any adaptation, the potential number of coreferent antecedents for the emphasized targets is large even for this short story fragment. He, for example, could refer to either Patrick or the narrator I. However, it could refer to just about anything in the discourse, including the soda, the trip, the large crowd, the laughter, the land, Winter, Tiny Town or the cold blustery day. In the best case, when the sentence is already known to be coherent, a tremendous amount of world knowledge is needed to sort out the correct interpretation, and state-of-the-art systems are only accurate about 75% of the time [36]. However, our situation is more complex, because we don't even know if the target entities are interpretable in the discourse as a whole, which in this case would require some mental gymnastics to incorporate the sudden introduction of a new character. Not only must we resolve the entities, but we must also actively engage in selecting alternatives that are more reasonable interpretations. However, as mentioned previously, the number of possible pronoun combinations, even for this short sentence, is so large that they cannot realistically be enumerated.
If we take the most generous viewpoint, that we can enumerate all possible combinations and use a state-of-the-art resolution algorithm that can identify coreferent entities along with some kind of confidence score, the adaptation approach in this chapter would still fail. At 75% accuracy, one of the interpretations in the example above has a good chance of being wrong already, and this sentence is not a particularly complex representative of the data. At best, this essentially leaves us in the same situation that we are in now. The system will be correct more often than not, meaning an interpretable combination will be highly ranked by its score, but it will probably not be the most highly ranked combination. To reach this sentence we would again have to keep a human in the loop by allowing them to select from among several candidates. However, including several variations of the same sentence pushes out other potentially interesting sentences and reduces the diversity of options available to the user. Given the complexity of the problem and the number of possible ways coreference assignment can go wrong, it is actually somewhat encouraging that the performance did not significantly decrease.

One could imagine many different ways to try to improve the effectiveness of the adaptation process; however, taking advantage of the reranking mechanism already in place is a clear way to go. Although one of the feature sets did try to capture some intuition about the locality of coreferent entities, it is a relatively weak model that only uses pairwise decisions and simply enumerates all possibilities between sentences without any preference between them. To make significant headway in the adaptation component it is reasonable to believe a much stronger model of coreference is needed.

Another related deficiency of the reranking model is the lack of lexicalized features used by the reranker. Most of the features in Section 6.2.2 are unlexicalized, meaning they do not make use of any of the actual words in the sentence or story. For example, entity-grids achieved the best offline performance of any feature set, but do not contain any information about the words used in the discourse. In retrospect, despite the large memory requirements, relative inefficiency and lack of offline performance improvement, including the IBM(BG) features might have been advantageous for their lexical specificity.

Finally, the discrepancy between the offline ranking predictions and the user selections in the online experiments in the previous chapter indicates another potential avenue for improvement. Along with the suggested feature improvements above, it might be doubly beneficial to utilize the online learning aspects of the ranking algorithm. First, it is a rare occasion that more training data is not helpful, especially in the high-dimensional spaces produced by lexical features. Second, it sets up a more symbiotic relationship between the user and the learning mechanism. The desire to feel actively engaged in choosing the direction of the story is one of the reasons given for the large discrepancy in the rank of the selected candidate between the online and offline experiments. Adjusting to the input after each turn by updating the reranker might lead to a more accurate user model, including preferred coreference assignments.
Although this would probably not have an immediate impact, it would allow the system to adjust as the user becomes aware of idiosyncrasies of the current model and prevent the user from gaming the system over the long run.

Chapter 8
Conclusions

Interactive narrative is a genre of storytelling that allows active participation by the reader to affect the development and outcome of the story. It is a particularly interesting topic because it intersects the central aspects of many areas of research. It raises issues that are fundamental to literary analysis as well as artificial intelligence. We must try to answer both What is a good story? and How can we automatically generate one?

Section 2.1.2 discussed the most typical ways that AI researchers have attempted to tackle these problems. These approaches usually model the major events (i.e., scenes or beats) of the narrative in a formal language. The most common is a STRIPS-like language in which one specifies the characteristics of the scene, including the preconditions necessary for activation and the post-conditions that specify the changes to the world caused by this event taking place. Stories are generally produced from these representations using some kind of planning mechanism [108] or hierarchical task networks [24]. As the user changes the state of the world through his or her interaction, the planning algorithm can reformulate the goals of the story and adjust its direction to suit the events as they have actually unfolded.

This methodology is perfectly reasonable, intuitive, and has been shown to be quite effective for many use cases. However, a major component of its operation and success has been completely glossed over. Where does the content come from, and how do we know the causal and temporal structure of the scenes? In almost all cases the answer is: from experts. Unfortunately, this poses a very significant problem on many different levels. Writing stories is usually considered a creative art or humanities endeavor, yet the representation of interactive narratives is most often defined in formal notation and mathematical algorithms. This introduces a serious gap between the best-qualified authors and the people who have the technical expertise about the system's mechanics. Even if this gap could be narrowed through intelligent user interfaces, which is an ongoing area of research [28, 85, 105], the vast majority of creative writing courses only teach traditional narrative techniques. Regardless of any technical challenges inhibiting literary experts from engaging further in the process, there would still be a severe lack of qualified authors for the task.

Finally, one of the biggest problems, and the one addressed in this thesis, is the issue of breadth. Assuming the other challenges could be surmounted, there is virtually no pool of human experts large enough to author all of the events necessary to correctly handle unrestricted interaction by the user. Instead, the contribution of this thesis is to show that the content about each scenario is readily available in the millions of stories that ordinary people write on their weblogs every day. By judiciously modeling the narrative and utilizing other natural language processing techniques, not only do the stories provide the content, but they also enable the pre- and post-conditions of events to be implicitly determined. Put together, these components enable Say Anything to react to a wider variety of possible actions than any other interactive narrative system.
The remainder of this chapter will solidify this claim with a summary of the results and then discuss future directions for the research.

8.1 Overall Summary of the Results

In each of the preceding chapters, the most recent results were compared with the other models up to that point, until all of them had been considered. However, it is worth reexamining the results as a whole and looking at them from a slightly different perspective.

One of the important objective metrics for assessing the success of a particular generation method was measuring the length of stories written with it. Although Table 7.2 provides some statistics on the maximum and average length of the stories for each model, a lot of information is hidden by this format. Figure 8.1 is a much better visualization of the differences between the models. The x-axis tracks the number of sentences in a story and the y-axis shows the percent of stories that have the corresponding number of sentences. While we know from the tabular results that the longest story was written using the adaptation model, this graph also shows that it was not a unique outlier, since two other stories are also considerably longer than those of any other model. The best-fit lines also show the clear superiority of the adaptation model over all the others and the larger gap between the reranking model and the unigram and bigram models.

[Figure 8.1: The (log) percentage of stories with a given length for each model.]

The amount of time taken to write each sentence was also an important statistic in comparing the systems and making sense of some of the more subjective ratings. Figure 8.2 plots the average time in seconds for the human to author a sentence for each story written with each model, using the same preprocessing techniques described in Section 5.4.1. The graphs in each column are grouped together because they are relatively similar, and this enables an easy visual comparison between the simple IR-based methods and the more advanced models. Both the unigram and bigram models have a relatively high mean and a large number of points all over the plot; however, the points for the enhanced models are much more densely located. The reduced standard deviation hinted that this might be the case, but these charts clearly show a different use-pattern between the types of models.

[Figure 8.2: Comparison of authoring time per sentence for each of the models: (a) Unigrams, (b) Bigrams, (c) Reranking, (d) Adaptation.]

It is also worth considering a more visual representation of the subjective ratings of the stories. Figure 8.3a charts the average self-reported ratings by the authors for each of the models. Comparatively, the coherence and believability ratings between the models are relatively stable. However, there is a strong upward trend in usability and entertainment as we are more discriminating in the retrieval process through the bigram index and the reranking rules. Similarly, Figure 8.3b presents Table 7.3 as a bar chart. As in the table, this shows that as the retrieval process becomes more discriminative, the judgments of all of the criteria improve. Although there is no glaring evidence that the relative performance between the models is particularly inaccurate, a few issues raise some doubts about the accuracy of the absolute values of these ratings.
Looking at the human levels of performance suggests that something is slightly askew. For reasons stated earlier, it was not expected that they would be rated close to 5 (the maximum), but a coherence rating around 3.6 seemed suspiciously low. Reconsidering the design of the page presented to the users did highlight a potential problem, however. The radio button for each of the criteria was initially placed in an invalid position to the left. We certainly expected dishonest workers to introduce some noise; unfortunately, with this design choice the noise is not likely to be random. Instead, there is probably a strong bias toward 1, since it was the closest and easiest button to select. To try to filter this non-random noise, a different method for reporting the ratings was also devised. The judgment that received the most votes was selected to represent each story and then these representatives were averaged to obtain the final score. If there was a tie, then the average of the tied values was taken. Figure 8.3c shows the result of plotting the ratings using these majority-rules representatives. The overall trends are nearly identical to the other plot. However, there is an increase in the scores across the board, and the differences between the models are amplified. Assuming the human-rated stories should be rated relatively high, it is likely that these numbers more accurately reflect the intentions of the legitimate users.

[Figure 8.3: Summary of the subjective ratings by the authors and independent raters: (a) authors' self-reported ratings, (b) average independent user ratings, (c) majority-value independent user ratings.]
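The majority-rules filtering just described can be summarized with the small sketch below: each story is represented by its most frequent rating value (ties are averaged), and the per-story representatives are then averaged across stories. The function names are illustrative, not the actual analysis code.

```python
from collections import Counter

def majority_representative(ratings):
    """Return the most frequent rating for one story; if several values tie
    for the most votes, return the average of the tied values."""
    counts = Counter(ratings)
    top = max(counts.values())
    tied = [value for value, c in counts.items() if c == top]
    return sum(tied) / len(tied)

def majority_rules_score(ratings_per_story):
    """Average the per-story majority representatives (as in Figure 8.3c)."""
    reps = [majority_representative(r) for r in ratings_per_story]
    return sum(reps) / len(reps)

# Two stories: the first has a clear majority of 4, the second ties 2 and 5.
print(majority_rules_score([[4, 4, 1, 5], [2, 5, 2, 5]]))  # -> (4 + 3.5) / 2 = 3.75
```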
Having summarized and reexamined the results of all the systems, one concern still remains. The stories were collected in two phases: first using the retrieval-only models and then using the enhanced models trained on the data from the first. This could lead to a situation where a small number of prolific authors, who were particularly good (or bad) at the game, could bias the results for one type of model or the other. So it is also important to look at the distribution of stories written by individual users. Although a rigorous statistical analysis was not performed, the graph in Figure 8.4 provides some evidence that this type of bias is not a major concern. The top half of the figure shows the average (independent) ratings for the stories written by the 50 most prolific authors. The overall rating is further subdivided into the contribution of each model to the overall score. This contribution is just a weighted average of the ratings for a particular model. The bottom portion of the graph shows the number of stories written by the corresponding user. This graph highlights several things. First, most authors primarily wrote stories from one phase (retrieval-only) or the other (reranking and adaptation), so it is unlikely that the improved ratings of the reranking models are due to long-standing authors getting accustomed to the system. It is true that a few users contributed a disproportionate number of stories. However, the long tail shown in the lower half of the figure shows the converse is also true: a large number of users contributed a significant percentage of the total story collection.

[Figure 8.4: The independent user ratings for the 50 most prolific users of the system.]

8.2 Future Directions

The results summarized above highlight one of the primary contributions of this thesis: that it is possible to construct a narrative generation architecture that is capable of finding a relevant and coherent way to continue the user's story with virtually no restrictions on the domain. This is a significant achievement in the interactive storytelling community, in which the branching storylines and their dependencies normally have to be constructed by hand. The development of this thesis has also demonstrated several other secondary contributions beyond the interactive storytelling community. In addition to user modeling, a trace of the interaction with the system has also been shown to be a valuable resource for coherence modeling in general. While sentence ordering has been the preferred metric and source of data for computational models, the sentence pairs generated by user selections are an orthogonal source of data that can help tackle coherence modeling from a different angle.

These successes are, of course, not the final word in open domain interactive storytelling or computational coherence modeling. Within the same application space, there are still many technical areas that could be investigated to improve the performance of the system. The self-reported coherence ratings by the authors suggest that there are not enough relevant candidates or not enough diversity between them. Explicitly addressing the diversity problem is more difficult; however, there are two clear ways to address the relevancy issue. The problem arises because the lexical nature of the retrieval method has very low recall. Requiring exact word overlap severely restricts the space of candidates to choose from. As the size of the corpus grows the recall becomes less important, so simply gathering more stories would almost certainly improve the situation. Another approach mentioned in a previous chapter worth investigating is replacing the retrieval method altogether.
Some kind of latent semantic indexing [35], where the similarity between queries does not explicitly depend on lexical overlap, is one method for improving the recall. Another appealing approach is a k-nearest neighbors approach that uses an arbitrary feature vector representation, which could potentially improve the recall and also the precision even before filtering the results through the reranker. However, these approaches require a much more clever data structure to find the similar documents. kd-trees [12] are a multi-dimensional extension of the binary tree that allow fast querying algorithms analogous to a binary search. Unfortunately, kd-trees do not scale well to high-dimensional spaces, and even in more optimal cases they typically have a longer response time. More recently, new data structures, approximation techniques and dimensionality reduction approaches [62, 63, 67, 81] have improved the searching efficiency of high-dimensional spaces and are something to consider in the future.

There are many other areas which could be improved, some of which have already been discussed elsewhere. In a broad sense, improvements in the natural language processing pipeline are one of the most important areas to focus on. For example, a richer model of coreference is important both for the discriminative selection of sentences during reranking and for making better predictions when adapting the sentence to the actual discourse of the user's story. At a minimum, this requires better models to be developed than the simple pairwise co-occurrence (during reranking) and the closest node in the tree (during adaptation). Regardless of the new approach taken, the models will still be dependent on the parse trees and the discourse structure of the document. However, the existing tools, such as the dependency parser, already have significantly reduced performance on out-of-(news)-domain text. For substantially improved performance, some attention will also need to be given to ensuring that both the existing tools and any newly developed ones are accurate on the target narrative domain.

Without considering any new technical improvements, the system has already demonstrated the ability to coherently predict narrative sequences with surprising accuracy. This opens up the possibility of trying to combine the best aspects of traditional interactive narrative systems with the ones developed here. Most importantly, we would like to retain the control over the narrative structure provided by the planning approaches, but allow the freedom of expression our new approach enables. New hybrid approaches combining these advantages could be extremely valuable in a wide variety of interactive narrative applications. For example, they could be used to improve the current turn-based approach to enforce high-level narrative goals. However, the benefit would be even more profound in other applications. Hybrid approaches could be used to develop extremely deep autonomous non-player characters in virtual simulations or to generate the entire content for interactive games of all types.

Aside from interactive storytelling, there are other application areas that could be investigated with this type of system that are not directly related to what has been discussed so far. Previous studies [132] have looked at live versions of this game between real people to determine the effect it has on writing proficiency in children.
Although a dramatic improvement in skill level was not observed between children who played this game and those who wrote by themselves, there was a measurable temporary improvement and a much larger difference in the amount of enjoyment the children had in writing their stories. It would be valuable to reexamine some of these experiments over a longer period of exposure to this type of exercise now that it can be performed without requiring a large gathering of children for each run. It is also worth looking at other benefits this game might have for different cognitive abilities. The process is like neither traditional reading nor writing. It is more active than reading, since it requires some participation, namely in writing. However, it is also different from normal writing, because one is continually forced to reinterpret and adapt his or her vision of the discourse before he or she writes his or her contribution. It seems possible that habitual use of the system could enhance a different subset of skills than the union of reading or writing when exercised individually.

The alternating turns of the system between information gathering (e.g., reading) and information production (e.g., writing) are highly analogous to exercises in other disciplines, such as improvisational theater. Adapting in real time to partial scene information and unexpected scenarios is a common element of live performances and learning exercises. The type of automated system described in this thesis could provide a unique alternative to traditional methods of seeding these activities.

Finally, the ability of the system to accurately produce coherent sentences for virtually any story a user could imagine suggests that this type of large-scale data-driven analogical reasoning framework could be useful in a broader scope of AI research. For example, the task undertaken in this thesis used individual sentences, and to some degree the entire stories to which they belonged, as the base analogical unit in order to predict what happens next in a narrative. However, with very little effort this exact process could be reversed in order to provide an explanation for the events that have already happened. As natural language discourse parsing tools improve, this approach can also be extended to answer more complex questions about relationships in the world. For example, rich discourse structure would allow for actual temporal prediction of events, as opposed to the type of discourse prediction done here.

8.3 Summary of Contribution

This thesis has described a new data-driven, case-based reasoning approach for addressing one of the primary roadblocks of interactive narrative systems. Giving the user freedom to behave or do what they want, while still maintaining a coherent narrative, is a central goal for almost all narrative technologies. However, past systems have focused almost entirely on the problem of maintaining authorial control and narrative coherence. The lack of sufficient attention to content leaves the user with an extremely limited number of choices to direct the story and inevitably leads to a prescribed narrative in which the user does not feel a great deal of participation in creating.

The new approach proposed in this thesis emphasizes content first. Using a case-based reasoning approach that allows us to separate the authoring task from the application task enables us to leverage the vast amount of narrative content that is authored every day on ordinary people's weblogs.
Having identified millions of personal stories, the central challenge of this method was to understand and manipulate the natural language in order to find a response that was both topical to the subject of the story and grammatically coherent with the discourse. Even with the simple retrieval-only algorithm introduced in Section 5.2, surprisingly good results could be achieved. However, by digging deeper with more sophisticated natural language processing techniques, a richer model of the story and interaction could be developed. These models provided valuable features to a reranking algorithm that was able to significantly improve the performance on almost all of the evaluation criteria. Not only were the improvements across the board, but these stories were independently rated nearly indistinguishable from human-authored stories in terms of coherence and were even slightly more entertaining to read.

In addition to the accomplishments in the domain of interactive storytelling, the turn-based game developed here has been shown to be valuable in several other areas of research. This type of automated game could provide useful new teaching devices for encouraging writing skills in children or as a new type of improvisational tutor. Alternatively, traces of the system can be another important resource for computational linguistics research, aside from sentence ordering, for studying and evaluating discourse coherence. These traces also provide an interesting new way to study and develop models of user preference.

In summary, this thesis has developed a data-driven analogical approach for predicting and producing the next discourse utterance of a given narrative. Users of the system report high usability and enjoyment of the system, and the story artifacts that are produced are nearly as coherent as human-authored stories and even more enjoyable to read. These results have the potential to fundamentally change how content is incorporated into future interactive narrative systems and have sprouted many new research questions to be investigated.

Bibliography

[1] John Aberdeen, David S. Day, Lynette Hirschman, Patricia Robinson, and Marc B. Vilain. MITRE: description of the alembic system used for MUC-6. In Proceedings of the 6th Conference of Message Understanding, pages 141–155, Morristown, NJ, USA, 1995.
[2] Craig Eilert Abrahamson. Storytelling as a pedagogical tool in higher education. Education, 118(3):440–452, March 1998.
[3] Aristotle and S. H. Butcher. Poetics. 1997.
[4] Byung-Chull Bae and R. Michael Young. A use of flashback and foreshadowing for surprise arousal in narrative using a Plan-Based approach. In Proceedings of the 1st Joint International Conference on Interactive Digital Storytelling: Interactive Storytelling, pages 156–167, Erfurt, Germany, 2008. Springer-Verlag.
[5] William Michael Bain. Case-based reasoning: a computer model of subjective assessment. PhD thesis, Yale University, 1986.
[6] Ken Barker, Bruce Porter, and Peter Clark. A library of generic concepts for composing knowledge bases. In Proceedings of the 1st international conference on Knowledge capture, pages 14–21, Victoria, British Columbia, Canada, 2001. ACM.
[7] Regina Barzilay. Information fusion for multidocument summarization: paraphrasing and generation. PhD thesis, Columbia University, 2003.
[8] Regina Barzilay and Mirella Lapata. Modeling local coherence: an entity-based approach.
In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 141–148, Ann Arbor, Michigan, 2005. Association for Computational Linguistics.
[9] Regina Barzilay and Lillian Lee. Catching the drift: Probabilistic content models, with applications to generation and summarization. In HLT-NAACL 2004: Proceedings of the Main Conference, pages 113–120, 2004.
[10] Joseph Bates. Virtual reality, art, and entertainment. Presence: Teleoper. Virtual Environ., 1(1):133–138, 1992.
[11] Cosmin Adrian Bejan. Unsupervised discovery of event scenarios from texts. In 21st International FLAIRS Conference, Coconut Grove, FL, May 2008.
[12] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, 1975.
[13] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
[14] Rebecca Blood. Weblogs: A history and perspective, September 2000.
[15] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, Pittsburgh, Pennsylvania, United States, 1992. ACM.
[16] John Seely Brown. New learning environments for the 21st century, 2005.
[17] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: parameter estimation. Comput. Linguist., 19(2):263–311, 1993.
[18] Kevin Burton, Akshay Java, and Ian Soboroff. The ICWSM 2009 spinn3r dataset. In Proceedings of the Third Annual Conference on Weblogs and Social Media, San Jose, CA, May 2009.
[19] Charles B. Callaway and James C. Lester. Narrative prose generation. Artif. Intell., 139(2):213–252, 2002.
[20] Chris Callison-Burch. Fast, cheap, and creative: evaluating translation quality using amazon's mechanical turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 286–295, Singapore, 2009. Association for Computational Linguistics.
[21] Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. Building a discourse-tagged corpus in the framework of rhetorical structure theory. In Proceedings of the Second SIGdial Workshop on Discourse and Dialogue - Volume 16, pages 1–10, Aalborg, Denmark, 2001. Association for Computational Linguistics.
[22] Justine Cassell. Towards a model of technology and literacy development: Story listening systems. Journal of Applied Developmental Psychology, 25(1):75–105, 2004.
[23] Marc Cavazza, Fred Charles, and Steven J. Mead. Character-Based interactive storytelling. IEEE Intelligent Systems, 17(4):17–24, 2002.
[24] Marc Cavazza, Fred Charles, and Steven J. Mead. Interacting with virtual characters in interactive storytelling. In Proceedings of the first international joint conference on Autonomous agents and multiagent systems: part 1, pages 318–325, Bologna, Italy, 2002. ACM.
[25] Marc Cavazza, Jean-Luc Lugrin, David Pizzi, and Fred Charles. Madame bovary on the holodeck: immersive interactive storytelling. In Proceedings of the 15th international conference on Multimedia, pages 651–660, Augsburg, Germany, 2007. ACM.
[26] Marc Cavazza and David Pizzi. Narratology for interactive storytelling: A critical introduction. In Technologies for Interactive Digital Storytelling and Entertainment, pages 72–83. 2006.
[27] Nathanael Chambers and Dan Jurafsky.
Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797. Association for Computational Linguistics, 2008.
[28] Yun-Gyung Cheong, Yeo-Jin Kim, Wook-Hee Min, Eok-Soo Shim, and Jin-Young Kim. PRISM: a framework for authoring interactive narratives. In Interactive Storytelling, pages 297–308. 2008.
[29] Yun-Gyung Cheong and Michael Young. A computational model of narrative generation for suspense. In AAAI, Boston, Massachusetts, 2006.
[30] Michael Collins. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 16–23, Madrid, Spain, 1997. Association for Computational Linguistics.
[31] Michael Collins and Nigel Duffy. New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 263–270, Philadelphia, Pennsylvania, 2001. Association for Computational Linguistics.
[32] F. Michael Connelly and D. Jean Clandinin. Stories of experience and narrative inquiry. Educational Researcher, 19(5):2–14, June 1990.
[33] Blandine Courtois and Max D. Silberztein. Dictionnaires électroniques du français. Langue Française. Larousse, Paris, 1989.
[34] Will Crowther and Don Woods. Adventure. 1977.
[35] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.
[36] Pascal Denis and Jason Baldridge. A ranking approach to pronoun resolution. In Proceedings of the 20th international joint conference on Artificial intelligence, pages 1588–1593, Hyderabad, India, 2007. Morgan Kaufmann Publishers Inc.
[37] Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In Proceedings of the 25th international conference on Machine learning, pages 264–271, Helsinki, Finland, 2008. ACM.
[38] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley-Interscience, 2 edition, October 2000.
[39] Daniel Choy Edelson. Learning from stories: indexing and reminding in a socratic case-based teaching system for elementary school biology. PhD thesis, Northwestern University, 1993.
[40] Micha Elsner and Eugene Charniak. A unified local and global model for discourse coherence. In The North American Chapter of the Association for Computational Linguistics, Rochester, NY, April 2007.
[41] Micha Elsner and Eugene Charniak. Coreference-inspired coherence modeling. In The Association for Computational Linguistics, Columbus, Ohio, 2008.
[42] William Ferguson, Ray Bareiss, Lawrence Birnbaum, and Richard Osgood. Ask systems: an approach to the realization of Story-Based teacher. Journal of the Learning Sciences, 2(1):95–134, January 1992.
[43] P. Foltz, W. Kintsch, and T. Landauer. The measurement of textual coherence with latent semantic analysis. Discourse Processes, 25(2-3):285–307, 1998.
[44] Sudeep Gandhe and David Traum. Evaluation understudy for dialogue coherence models. In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, pages 172–181, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[45] Dedre Gentner and Russel Landers. Analogical reminding: A good match is hard to find.
In The International Conference on Systems, Man and Cybernetics, Tucson, Arizona, November 1985.
[46] Dan Gillick. Sentence boundary detection and the problem with the U.S. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 241–244, Boulder, Colorado, 2009. Association for Computational Linguistics.
[47] Andrew S. Gordon. Browsing image collections with representations of common-sense activities. Journal of the American Society for Information Science and Technology, 52:925–929, 2001.
[48] Andrew S. Gordon, Qun Cao, and Reid Swanson. Automated story capture from internet weblogs. In Proceedings of the 4th international conference on Knowledge capture, pages 167–168, Whistler, BC, Canada, 2007. ACM.
[49] Andrew S. Gordon and Kavita Ganesan. Automated story capture from conversational speech. In K-CAP, pages 145–152, 2005.
[50] Otis Gospodnetic and Erik Hatcher. Lucene in Action. In Action series. Manning Publications, December 2004.
[51] Matthew Gray. Web growth summary. http://www.mit.edu/people/mkgray/net/web-growth-summary.html, 1996.
[52] Barbara J. Grosz, Scott Weinstein, and Aravind K. Joshi. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21:203–225, 1995.
[53] Gary Gygax and Dave Arneson. Dungeons and dragons. http://www.wizards.com/, 1974.
[54] Kristian J. Hammond. Case-based planning: viewing planning as a memory task. Academic Press Professional, Inc., 1989.
[55] Kristian J. Hammond, Colleen M. Seifert, and Kenneth C. Gray. Functionality in analogical transfer: A hard match is good to find. The Journal of the Learning Sciences, 1(2):111–152, 1991.
[56] Catherine Havasi, Robert Speer, and Jason Alonso. ConceptNet 3: a flexible, multilingual semantic network for common sense knowledge. In Recent Advances in Natural Language Processing, 2007.
[57] James Hays and Alexei A. Efros. Scene completion using millions of photographs. In ACM SIGGRAPH 2007 papers, page 4, San Diego, California, 2007. ACM.
[58] Ben He and Iadh Ounis. Term frequency normalisation tuning for BM25 and DFR model. In Proceedings of ECIR 2005, pages 200–214, 2005.
[59] R. Hill, J. Gratch, W. L. Johnson, C. Kyriakakis, C. LaBore, R. Lindheim, S. Marsella, D. Miraglia, B. Moore, J. Morie, J. Rickel, M. Thiebaux, L. Tuch, R. Whitney, J. Douglas, and W. Swartout. Toward the holodeck: integrating graphics, sound, character and story. In Proceedings of the fifth international conference on Autonomous agents, pages 409–416, Montreal, Quebec, Canada, 2001. ACM.
[60] Jerry Hobbs. Pronoun resolution. Research Report 76-1, Department of Computer Sciences, City College, City University of New York, August 1976.
[61] Jerry Hobbs. On the coherence and structure of discourse. Technical Report CSLI-85-37, Stanford University, Center for the Study of Language and Information, 1985.
[62] Michael E. Houle and Jun Sakuma. Fast approximate similarity search in extremely High-Dimensional data sets. In Proceedings of the 21st International Conference on Data Engineering, pages 619–630. IEEE Computer Society, 2005.
[63] Piotr Indyk. Dimensionality reduction techniques for proximity problems.
In Pro- ceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, pages 371–378, San Francisco, California, United States, 2000. Society for Indus- trial and Applied Mathematics. [64] Thorsten Joachims. Making large-scale support vector machine learning prac- tical. In Advances in kernel methods: support vector learning, pages 169–184. MIT Press, 1999. [65] Thorsten Joachims. Optimizing search engines using clickthrough data. In Pro- ceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133–142, Edmonton, Alberta, Canada, 2002. ACM. [66] Tibor Kiss and Jan Strunk. Unsupervised multilingual sentence boundary detec- tion. Comput. Linguist., 32(4):485–525, 2006. [67] Sachin Kulkarni and Ratko Orlandic. High-Dimensional similarity search using Data-Sensitive space partitioning. In Database and Expert Systems Applications, pages 738–750. 2006. [68] Mirella Lapata. Probabilistic text structuring: Experiments with sentence order- ing. In Proc. of the Annual Meeting of the Association for Computational Lin- guistics, pages 545—552, 2003. 183 [69] Yaoyong Li, Hugo Zaragoza, Ralf Herbrich, John Shawe-Taylor, and Jaz S. Kan- dola. The perceptron algorithm with uneven margins. In Proceedings of the Nine- teenth International Conference on Machine Learning, pages 379–386. Morgan Kaufmann Publishers Inc., 2002. [70] Dekang Lin. Dependency-based evaluation of MINIPAR. In Proceedings of Workshop on the Evaluation of Parsing Systems, First International Conference on Language Resources and Evaluation., 1998. [71] Jo Lonsdale. Active learning through digital storytelling. In Solstice, Edge Hill University, UK, May 2007. [72] John Lyons. Semantics: Volume 1. Cambridge University Press, revised edition, June 1977. [73] Jerre Mangione. The Dream and the Deal: The Federal Writers’ Project, 1935- 1943. Syracuse University Press, June 1996. [74] William Mann and Sandra Thompson. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):281, 243, 1988. [75] William Mann and Sandra Thompson. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):281, 243, 1988. [76] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schtze. Introduction to Information Retrieval. Cambridge University Press, 1 edition, July 2008. [77] Mehdi Manshadi, Reid Swanson, and Andrew S. Gordon. Learning a proba- bilistic model of event sequences from internet weblog stories. In Twenty-first International Conference of the Florida AI Society, Applied Natural Language Processing track, Florida, 2008. [78] Daniel Marcu. The rhetorical parsing of natural language texts. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 96–103, Madrid, Spain, 1997. Association for Computational Linguistics. [79] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: the penn treebank. Computational Linguistiscs, 19(2):313–330, 1993. [80] Michael Mateas. Interactive drama, art and artificial intelligence. PhD thesis, Carnegie Mellon University, 2002. 184 [81] Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. 
In Proceed- ings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 169–178, Boston, Massachusetts, United States, 2000. ACM. [82] Janice McDrury and Maxine Alterio. Learning through storytelling in higher education. 2003. [83] James Richard Meehan. The metanovel: writing stories by computer. PhD thesis, Yale University, 1976. [84] Manish Mehta, Tina Lacey, Iulian Wande Radu, Abhishek Jain, and Ashwin Ram. Creating behavior authoring environments for everyday users. In International Conference on Computer Games, Multimedia, and Allied Technologies, Singa- pore, May 2009. [85] Manish Mehta, Santiago Ontan, Tom Amundsen, and Ashwin Ram. Authoring behaviors for games using learning from demonstration. In Workshop on Case- Based Reasoning for Computer Games, Seattle, Washington, July 2009. [86] David Milam, Magy Seif El-Nasr, and Ron Wakkary. Looking at the interactive narrative experience through the eyes of the participants. In Proceedings of the 1st Joint International Conference on Interactive Digital Storytelling: Interactive Storytelling, pages 96–107, Erfurt, Germany, 2008. Springer-Verlag. [87] Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. The penn discourse treebank. IN PROCEEDINGS OF LREC 2004, 2004. [88] Nick Montfort. Twisty Little Passages: An Approach to Interactive Fiction. The MIT Press, December 2003. [89] Nick Montfort. Generating narrative variation in interactive fiction. PhD thesis, University of Pennsylvania, 2007. [90] J Moon. Reflection in higher education learning. Technical Report PDP Working Paper 4, LTSN Generic Centre, 2002. [91] Bradford W. Mott and James C. Lester. U-director: a decision-theoretic narra- tive planning architecture for storytelling environments. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent sys- tems, pages 977–984, Hakodate, Japan, 2006. ACM. [92] Arturo Nakasone and Mitsuru Ishizuka. Storytelling ontology model using RST. In Proceedings of the IEEE/WIC/ACM international conference on Intelligent Agent Technology, pages 163–169. IEEE Computer Society, 2006. 185 [93] Gonzalo Navarro and Mathieu Raffinot. Flexible Pattern Matching in Strings. Cambridge University Press, 1 edition, June 2002. [94] Netcraft. Web server survey archives - netcraft. http://news.netcraft.com/archives/web server survey.html, 2009. [95] Vincent Ng and Claire Cardie. Improving machine learning approaches to coref- erence resolution. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 104–111, Philadelphia, Pennsylvania, 2001. Association for Computational Linguistics. [96] Ian Niles and Adam Pease. Towards a standard upper ontology. In Proceed- ings of the international conference on Formal Ontology in Information Systems - Volume 2001, pages 2–9, Ogunquit, Maine, USA, 2001. ACM. [97] Franz Josef Och. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 160–167, Sapporo, Japan, 2003. Association for Computational Linguistics. [98] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Comput. Linguist., 29(1):19–51, 2003. [99] I. Ounis. Research directions in terrier: a search engine for advanced retrieval on the web. http://eprints.gla.ac.uk/14096/, 2007. [100] Sara H. Owsley, Kristian J. Hammond, David A. Shamma, and Sanjay Sood. Buzz: telling compelling stories. 
In Proceedings of the 14th annual ACM inter- national conference on Multimedia, pages 261–268, Santa Barbara, CA, USA, 2006. ACM. [101] Edward Packard. Sugarcane Island. Vermont Crossroads Press, 1976. [102] David D. Palmer and Marti A. Hearst. Adaptive multilingual sentence boundary disambiguation. Comput. Linguist., 23(2):241–267, 1997. [103] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311– 318, Philadelphia, Pennsylvania, 2001. Association for Computational Linguis- tics. [104] Federico Peinado and Pablo Gervas. Creativity issues in plot generation. Edin- burgh, UK, 2005. 186 [105] David Pizzi and Marc Cavazza. From debugging to authoring: Adapting pro- ductivity tools to narrative content description. In Proceedings of the 1st Joint International Conference on Interactive Digital Storytelling: Interactive Story- telling, pages 285–296, Erfurt, Germany, 2008. Springer-Verlag. [106] James Pustejovsky, Patrick Hanks, Roser Saur, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. The TIMEBANK corpus. In Corpus Linguistics, Lancaster Uni- versity, Lancaster UK, 2003. [107] Jeffrey C. Reynar and Adwait Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the fifth conference on Applied natural language processing, pages 16–19, Washington, DC, 1997. Association for Computational Linguistics. [108] Mark Owen Riedl and R. Michael Young. An Intent-Driven planner for Multi- Agent story generation. In Proceedings of the Third International Joint Confer- ence on Autonomous Agents and Multiagent Systems - Volume 1, pages 186–193, New York, New York, 2004. IEEE Computer Society. [109] Mark Owen Riedl and R. Michael Young. Story planning as exploratory creativ- ity: Techniques for expanding the narrative search space. New Gener Comput, 24(3):303–323, 2006. [110] Christopher K. Riesbeck and Roger C. Schank. Inside Case-Based Reasoning. Lawrence Erlbaum, 1 edition, July 1989. [111] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. In Neurocomputing: foundations of research, pages 89–114. MIT Press, 1988. [112] Jon Rowe and Derek Partridge. Creativity: a survey of AI approaches. Artificial Intelligence Review, 7(1):43–70, February 1993. [113] Kenji Sagae and Alon Lavie. A best-first probabilistic shift-reduce parser. In Pro- ceedings of the COLING/ACL on Main conference poster sessions, pages 691– 698, Sydney, Australia, 2006. Association for Computational Linguistics. [114] Ruth Sawyer. The Way of the Storyteller. Penguin (Non-Classics), January 1977. [115] Roger C. Schank. Dynamic Memory. Cambridge University Press, New York, 1982. [116] Roger C. Schank. Tell Me a Story: A New Look at Real and Artificial Memory. Atheneum, January 1991. 187 [117] Roger C. Schank, Ray Bareiss, Andrew Fano, Richard Osgood, and William Fer- guson. Agents in the story archive. Technical Report 27, Northwestern University, Evanston, IL., The Institute for the Learning Sciences. [118] Roger C. Schank, William Ferguson, Lawrence Birnbaum, Jorn Barger, and Matt Greising. ASK TOM: an experimental interface for video case libraries. Tech- nical Report 10, Northwestern University, Evanston, IL., The Institute for the Learning Sciences, 1991. [119] Libin Shen and Aravind K. Joshi. 
Appendix A User Comments This is the collection of all comments left by users during the story authoring phase. Each individual comment is separated by a new line. A.1 The comments Maybe you program should pull names from the user's sentences? And maybe the computers sentences should be a little more general? That was a blast! :-) I wanted the story to end with my sentence, not having to choose another computer sentence. Interesting! I liked doing it. The computer is presenting random lines picked up from other writing. How very fun! Thank you. This was an interesting assignment.
I love to write, and this has to be one of the most fun things that I’ve done in a while. It’s like the cure-me-all for Writer’s Block! Kudos to whoever wrote this code, it’s amazing. I’d pay for something like this, if somebody charged. :D This was fun, thank you! nice experience Forgot the title again! I was so focused on whether this one made sense that I wasn’t thinking. ”Bob and Margaret in D.C.” I didn’t know what you meant by select the index to use, I thought a subject. So I wrote ’cats’, but that ended up being my first sentence, which isn’t a sentence, but I couldn’t change it. This is really interesting, though. I like it. Fun hit! This was actually really fun. I think you should continue to develop this. Started at welcome page and was not able to proceed after using copy & paste to login. Code was not accepted at login. I am copying and pasting the code, but it keeps coming back ”invalid code.” This one also rejects the code that I have copied and pasted. I get an ”invalid code” message. 190 It skipped the screen with my success code, mine was the story with Kamaswami the Guru Invalid code again. I love this! Ignore the first two sentences; I put in a title instead of a first sentence. These are great! Thanks Oops–forgot the title: ”After Work.” I forgot to write a title—is there a way to backpage? I didn’t see one. I thought I was writing the title at first, so ignore the first two sentences. Had some trouble with a previous hit, it said invalid login code. I made some tense mistakes in this story and didn’t even catch them until I read them at the end. Wish there was someway to go back and correct them. this is really fun! There’s one sentence (the one in parentheses) that really doesn’t make sense but other wise it turned out pretty good! This one was a lot of fun! These are great hits. very entertaining. I struggled with this one, but then was very pleased that the computer’s limitations forced me into developing a cute little pun at the end. Thanks! Title: Flying Monkey That was pretty fun. This has been a fun experience, creating a story with a computer! Lots of fun! But the grammar and spelling for many of the choices was incorrect. (By the way, I like how insanely specific the choices are - makes you think harder!) It fell apart at the end - the last set of responses had email addresses and the like. Interesting. Please could you let us know if we’re allowed to submit this HIT more than once? I wouldn’t mind doing another one. Really had a good time with this one! Story got messed up and added one of my sentences twice. Sometimes when I tried to click a sentence, the computer chose another sentence. Noticed a few typos also, great idea though! 191 Loading was slow and choices I made were not the next line in the story. Only one of the response groups did not give me something that at least made some sense. The biggest issue seems to be that I started in the 3rd person and it kinda forced me into the first person, and there seemed to be conflicting tenses in some of the responses. It might make sense to say ”write a story in the 1st person in the present tense” unless you are trying to get the computer to figure out the tenses based on the input which might be more to chew on than just getting content from the context, as it were. I enjoyed completing this task. It was very interesting and fun. cool idea When I hit next sentence, it closed the story. story picked up from last hang up. Invalid Code(2) the app. 
messed up my story by repeating a sentence and not letting me choose a computer response Invalid access code. The story was really fun and easy to write. I had no problems and could usually pick a sentence out of the ones the computer chose. Theyvwere all mostly relevant to what I wrote. Maybe you could have the option to choose from a second set of sentences if the first set doesn’t fit at all? When copying and pasting this code, it came back as invalid code. This was very fun! Thanks! It messed up and momentarily displayed a computer response I did not select before reverting back what I actually selected upon my submitting my next sentence. This resulting in me being unable to make a story that made sense. Is this a one time hit? Invalid Code (2) An interesting exercise. How would such a game be presented? As some kind of creative writing tool, perhaps? The program had an error and posted my sentence twice and selected its own sentence without me selecting it. 192 Once you’ve reached the point where you can end the story (8+ sentences), you should provide an option to reject all sentences and end the story instead. Mine more or less made sense until the last sentence because I didn’t like any of the choices and tried hitting ”Accept” without selecting anything to see if it would let me end the story there. Instead, it picked the first sentence on the list and added it to the story. I have given this feedback before but there seems to be no way in which the computer recognizes what tense or person (or gender) you are writing in. That was fun! Thanks. It is a little more difficult when some of the computer’s sentences are duplicates. It was fun trying to tie it all together to make sense. I wish I could have edited my text, if only, for spelling. I used blow instead of blew in one of my sentences but caught the error too late to change. A little hard to write the story, but its fun I really enjoyed completing this task. It was very interesting and fun. I think my 7 year old daughter would enjoy this. Yet again a great story between myself and the computer. I was working along nicely–this is a fun thing to do! A bit later on, I clicked the ’next sentence’ button and got an HTTP 500 exception error. When I tried using the back arrow button on my browser, I eventually got back to the page, but the story was gone. I went ahead and started a new story, just to see if I might get the error again. (I didn’t get the error again.) You should perhaps warn workers about the adult content the will encounter in this hit. Thanks, that was the most fun I have had an a hit. I think If I were trying to write a serious story I would have had more trouble, but for being silly it added a lot to the process :) good It Seems Answers offered as the next sentence didn’t seem to go with what i had written this is really neat i love it :) A little challenging. Thanks. 193 Sorry, I was having so much fun doing it that it ended up not making a whole bunch of sense. Something I noticed is that if you repeat certain phrases, you get the same set of sentences, regard- less of the words before the phrases, for example when I used Poetic License, at least four of the sentences were the same. Also, I assumed that 5 meant it was easy to write and 1 meant it was hard to write. Invalid code, or so the site claims. No problems. Fun HIT to do. The directions could be improved and there should be a way to delete a sentence. Otherwise, that was pretty fun. Fun, I guess. 
This access code did not work, keeps returning as invalid code. These sentences that the computer makes are not so good. Pretty fun. A little confusing but obviously a keyword based algorithm based as the sentence generator. Invalid code(2) Invalid code Some of the computer sentences were nonsensical. it seemed like the sentences I was highlight weren’t actually selected. I choose the wrong computer selection for the last computer sentence. I must have accidently selected that sentence while navigating to ”accept” (using a touchpad). I liked two sentences for the ending, but this one seemed more thought provoking. This was very interesting work. Thank you. Sorry, didn’t save 1st login code, so it won’t let me in with a new one. The access code wouldn’t work, I am submitting it anyway like you said to, thank you. I love these HITs! They are lots of fun! The sentence choices were waaay too random, and made any attempt of story flow impossible. 0Gp5ApFd is the code given to log in, but it says ”Invalid code”. I tried several times and verified that I had copied it correctly. Sometimes the computer’s sentences didn’t seem to make a whole lot of sense with the one’s I put up. All in all though, this is pretty cool. :) 194 It was difficult to work with the computer. It’s choices were very limited. I thought the first line was meant to be a title so it wasn’t a full sentence. This may be an interesting program to help people break writers block. I hope we can do these more than once. It’s so much fun! Fun HIT! ”Select the index to use for this story:” is confusing. It contains derogatory terms? ”gringos” This was a really cool thing. Do you plan on making this ”available” for public? If you do, then I’d love to hear more about it and try it out. Email me on purespirit@gmail.com with updates if possible.. Some of the sentences were offensive, so it may be good to warn people before they accept the HIT. Thanks! I enjoyed this it was fun to see where the story would turn I didn’t realize I hadn’t clicked on anything before clicking accept and it gave me the first item on the list (not an error message). I enjoyed the experience. THe program froze on me three times and made writing the story difficult because it would like it was starting over, and then it would add it to the already written story...and it added it’s own options - I never chose the ”her name was ...” line, but somehow it ended up in the story. ”How difficult was it to write?” Does 5 mean very difficult to write (breaking the ”high/good” theme), or does 5 mean very easy to write (but not making sense at ”It was high difficulty to write”) I put 3, but I would say it was moderately easy. 1 in whichever direction means ”easy” Really made you think and try to plan ahead, computer made it tough! The login code did not work, I received the following message: (2) Invalid code 0IHyjaL1 Invalid code 0iliY9Dc invalid code This one made me laugh Sometimes the way the story goes can make you laugh. Lots of fun! invalid code 195 For that last question: I answered ’2’ to mean it was quite difficult (not easy) to write the story. This one went a lot better code 0KtbT1E8 doesn’t work on login page Welcome! Please enter your access code: (2) Invalid code when i typed cookies i kept getting things that say deliver to this email addy and stuff there was nothing to good under I like cookies. That was really fun and pretty hilarious! Wow, I like this. Its fun!! I want to play with this more. 
i would have enjoyed it more if there were more sentence choices, and situation varieties. Sometimes none of the sentence options fit, but it is a great Madlibs style story generator. The success code wasn’t there, and I just logged in with nothing in the box and it let me, the code at the beginning wouldn’ twork again, said invalid, so I clicked to log in with it blank-the code is 0m4mHyJt I am sorry, I messed up on my first try and accidentally typed the title as my first sentence. After I rejected the hit and accepted more, it still kept parts of my story. I would like to try more of these though, it seems interesting. interesting! The sentences that the computer generated didn’t really match up with my story at all. another invalid code I love doing this!! I hope all this helps in your development process. I entered in the code, and the page said (2) Invalid code. I was unable to login. Fun! invalid code I was unable to login: Welcome! Please enter your access code: 196 (2) Invalid code The third result didnt make much sense I was unable to login with that code. I got the following message: (2) Invalid code Title: ”Making Sushi” (forgot to put that in). i enjoyed that Great fun! This is one of the harder ones I have done as far as the choices I had. That was actually pretty fun The system was generating sentences that were header lines from email messages, which I don’t believe would go too well in any story. FYI: the site rejected the login code I am unable to proceed because it says invalid code. The sentences given were hard to work with and most made no sense, even when decent keywords were given in my own sentences. Most of the sentences that the computer gave me had no connection to anything going on in the story, and would not have made any sense. I think the program is a great idea, though. A built in spell checker would be nice (although this suggestion doesn’t let me off the hook - sorry for the typos!) This one turned out kind of weird Not sure if I am supposed to create a title??? It was fun but it didn’t always work...I kind of ended up going on a tangent. I think having even more sentence options would be a good idea. I was unable to login with that code: Please enter your access code: (2) Invalid code Darn! Another invalid code. ”Invalid code” invalid code 197 Appendix B Story Examples These are the top 15 stories from each generation model ranked by the average over all the judgment criteria by the independent raters. The computer generated sentences are highlighted in light gray. B.1 Retrieval-Only She kept herself very fit and trim. She ran with her son. Was a regular at the gym. And by regular, boy do I mean regular. She was there every morning and every evening like clockwork. And we all know how much I HATE mornings. But this woman seemed to be full of energy and pep no matter how early in the day. And then she got cancer. We haven’t seen her very much lately but I say a little prayer for her every morning. Overall 4.42 Believability 4.63 Coherence 4.88 Entertainment 3.75 I barely made it to the bathroom in time. Can I just say that there really is nothing worse in a person’s life than throwing up? Of course with me, it is an everyday occurance. And you know its just... wrong. If only I didn’t drink so much every night I would probably not be so sick every morning. Or ever. If only my life didn’t suck so much maybe I wouldn’t have to drink every night and get sick every morning. However, I make it a point to live mentally healthy so I stopped. 
That made my life much, much worse so I started again. Overall 4.38 Believability 4.63 Coherence 4.75 Entertainment 3.75 198 The rays of the sun glistened through the mist as it rose between the mountains, covering the landscape with a wet cloak. The sight was so romantic, breathtakingly superb. Squinting his eyes against the shimmering light, Andy Sturgil stood in awe of the morning’s beauty. The lighting was cosy and it made me wanna scream. The dew made everything on the ground sparkle, and reaffirmed his belief that this was truly God’s country. There was only one man he trusted enough to ask for help. Andy’s thoughts touched on Turner as he made his way down the serpentine path. He had no idea where he was going; he just wanted to get away. The bottom of his trousers swayed heavily with his strides as they collected the dew from the dense grass and brush. Everything was quiet, almost peaceful in a way, but it wasn’t. Overall 4.33 Believability 4.38 Coherence 4.38 Entertainment 4.25 I can’t remember how long it has been since I have seen my sister. So much has happened in that time. I got married and had 6 children. I have a wonderful life. I just want my sister to be a part of it. My memory fails me here... I can’t even remeber what we had fought about. O.o hmm... I remember now. It was over a stupid boy. Jeez. That is not a valid reason to abandon your sister. Don’t know... Maybe I should call her and make ammends. I don’t know if I can. At that very moment my phone rang and it was my sister. What a wonderful surprise! We talked for hours as if nothing had ever come beteween us. It was great. Overall 4.33 Believability 4.75 Coherence 4.75 Entertainment 3.5 The rays of the sun glistened through the mist as it rose between the mountains, covering the landscape with a wet cloak. The sight was so romantic, breathtakingly superb. Squinting his eyes against the shimmering light, Andy Sturgil stood in awe of the morning’s beauty. The lighting was cosy and it made me wanna scream. The dew made everything on the ground sparkle, and reaffirmed his belief that this was truly God’s country. There was only one man he trusted enough to ask for help. Andy’s thoughts touched on Turner as he made his way down the serpentine path. He had no idea where he was going; he just wanted to get away. The bottom of his trousers swayed heavily with his strides as they collected the dew from the dense grass and brush. Everything was quiet, almost peaceful in a way, but it wasn’t. Overall 4.33 Believability 4.38 Coherence 4.38 Entertainment 4.25 199 With the electric not working, we had no lights in the room. The wiring is so old it has gone brittle and snapped when the power kept on and off during the storm! ”We need to stay calm, and start figuring out what we should do.” he said. Spencer looks sideways, trying to think of something to say. Spencer’s wound was bleeding terribly., he was beginning to go into shock. His eyes rolled to the back of his head and passed out in the pool of blood around him. His pulse was still strong, but we knew he was going into shock. ”How is he?” ”Can’t you see for yourself? Look at him, how do you think he is?” Kevin said standing up. ”Everyone just calm down, we need to keep him warm”, Kevin said. Overall 4.33 Believability 4.5 Coherence 4.38 Entertainment 4.13 ”Don’t be afraid,” he said as he approached me. ”Yes?” I asked. ”I’m not here to hurt you; I just want to talk to you for a minute,” stated. 
”There’s nothing to talk about.” ”Believe me,” he said ”we have a lot to talk about.” He gave a sort of half frown before nodding. ”You have every right to not want to talk to me, but I am here to tell you how sorry I am for what I did to you.” I didn’t expect it. An apology from one who had beat me almost to a pulp for no reason; how was I to take this? It wasn’t just busted lips and blackened eyes it was bloodbath. I was in the hospital for three days after what he did to me. I would never be the same. ”Well since you don’t want to talk, I will just say that I was wrong, and I love you - and I’m sorry,” he said before turning and walking out of my life forever. Overall 4.33 Believability 4.14 Coherence 4.57 Entertainment 4.29 I didn’t know the name of the girl asleep next to me. ”I can tell you a lot of things like she’s fun to be with and she’s cool and I like talking and probably playing with her and she’s quite beautiful and she has a nice voice and she can dance pretty well and...” ..I can’t remember her name for the life of me. Isn’t that horrible? Waking up she smiles at me and says ”Good Morning”. I smile too. ”Good morning, you”, I say. ”Get ready and meet me and Denise in the lobby in thirty minutes.” Who in the hell is Denise, I wonder? I climb out of bed and run my hands though my hair and make my way to the door. I only have half and hour to get downstairs to meet with Denise and other girl. Overall 4.29 Believability 4.00 Coherence 4.57 Entertainment 4.29 200 This morning I was running late for work. I decided to stop and grab breakfast anyway. I went to Starbucks and picked up a latte and a muffin. I thought I’d ordered the apricot blueberry (which is low fat) and wound up with a regular blueberry muffin instead. Now, not only was I running late, but I was eating extra fat and calories! Having to walk two miles didn’t help, considering I don’t walk that fast. As as I was walking I tripped and fell, spilling my coffee. Luckily I just got a bad bruise but did not break anything! I hate Mondays! Overall 4.29 Believability 4.57 Coherence 4.57 Entertainment 3.71 The weather ouside indicated tomorrow would be a perfect ski day for Billy. A new day full of chances. Maybe tomorrow would be the day that Billy would finally confess his crush to Meghan. The operative word in the statement being...Maybe. Billy had been in love with Meghan for 12 years, but he was sure that she did not feel anything beyond friendship towards him. At all. But would risking a friendship and a good time on the slopes be worth it to find out? Nope. Overall 4.25 Believability 4.63 Coherence 4.13 Entertainment 4.00 Dora had not seen Scott in 15 years. Over the last year I have tried to get together with h... Things just never seemed to work out. Yesterday I finally had a chunk of time. So i looked him up on the internet It turns out he directed one of the WORST MOVIES I HA VE EVER SEEN IN MY LIFE: Blood- rayne. how funny is that High-larious. Overall 4.25 Believability 4.63 Coherence 4.38 Entertainment 3.75 She was looking for love in all the wrong men. I wonder when she’s going to stop trying to heal her broken heart with everything that has been tried and failed in the past: drugs, alcohol, bad relationships with men that are bad for her. It makes me wonder if we are doomed to repeat our mistakes repeatedly. LOVE is so Complicated. And the ego so fragile. Mine was shattered. I chose to withdraw rather than throw myself out there. Funny thing about that. I’m not sure which of us chose the right path. 
Overall 4.24 Believability 4.29 Coherence 4.14 Entertainment 4.29 201 The car veered off the road. When I came to, I immediately pulled the wheel away from the ditch I was headed towards, but I did so too quickly and began to slide the other directs. I was headed right for a tree. The people around me SCREAMED. I slammed into the tree, totalling my car. After a bit I was able to breathe but it was really hard to difficult...I felt faint, saw stars and was in an incredible amount of pain! The bystanders ran to my rescue. I was just about to throw myself. Luckily they were able to pull me out before the car caught fire. Overall 4.24 Believability 4.58 Coherence 4.58 Entertainment 3.57 After working at the firm for five years I finally got the promotion. ”You?” my co-worker asked with disdain in his voice. He sounded livid. I was happy about that because my work paid off and his brown nosing didn’t. Still no chance to give him my little green gift bag. I couldn’t wait because I prepared it with a big wad of toilet paper, he would know what it meant. ”Yeah, I know that.” ”Well, I know that you’re disappointed,” I said because I wanted to rub it in so the gift would have to wait until tomorrow. Overall 4.24 Believability 4.29 Coherence 4.43 Entertainment 4.00 I have no more cigars It’s weird. I usually always have at least 100 in my humidor. But I dont at the moment and it freaks me out. I dont know what happened to them all. This week has been very weird. It went by in a blur. I met some incredible people. Overall 4.21 Believability 4.13 Coherence 4.38 Entertainment 4.13 202 B.2 Reranking The car flipped, turned and rolled The roof was crushed in. it shocked me. and yet we all made it out alive Barely. i broke my arm, shelly had a huge gash on her forearm There was blood dribbling down her chin and everything, and she didn’t even scream or yell or anything. I felt bad for her, i took a cloth and cleaned her up. She doesn’t even know what is going on. Overall 4.65 Believability 4.88 Coherence 4.75 Entertainment 4.25 I woke up this morning with a headache. Felt like I’d been kicked in the face. My sister came over to take care of me and help with the housework. ”It’s not in our hands,” I whispered, and for a moment I wondered if the words sounded as hollow and meaningless to her ears as they did to mine. I must have said it because I was feeling disoriented due to the pounding in my head One of these days I will learn to just pass on red wine, it almost always causes a migraines in any quantity. By the time she had fixed us lunch, I was feeling much better. I was no longer dizzy, no longer warding off the marching band in my head. It’s always nice to have family around. Overall 4.58 Believability 4.88 Coherence 4.75 Entertainment 4.00 I went into work and found out I was going to be laid off from my job. While I saw it coming and was already thinking that company just wasn’t for me, it did still hurt. I have never been without a job since I was 16 years old and found myself in a panic because I had no idea what to do. Then, my cell phone rang. My sister in law said their nanny quit. Later at the bar... we decided that I’d work for her as a nanny on a trial basis. Then I don’t know what is going to happen next. But I guess this is what it’s like in the adult world. You know what I mean. Overall 4.56 Believability 4.75 Coherence 4.75 Entertainment 4.15 203 I went into a coffee bar to get a skinny latte. I hadn’t been there for about a week–I’d been going to a place near school. 
There was a new barrista at the espresso machine, and I thought he was cute. ”If you’re bored during the day, you should come here for more coffee.” Those were his words as he handed my my coffee, along with a strangely alluring smirk. So I did. That’s how I met Brian; we’ve been going out now for five weeks. I’m so happy. Overall 4.42 Believability 4.38 Coherence 4.63 Entertainment 4.25 Why do I eat so much? I know it’s bad for me. I am borderline obese. This needs to stop. But that beer and steak sounds so good. Sigh. Maybe I will try to get more exercise. Maybe. Overall 4.33 Believability 4.50 Coherence 4.63 Entertainment 3.88 His teeth flashed in the light as he came towards me with a knife. Panting, finding it hard to breathe, Rogue scooted backwards, twisting her neck around quickly, searching for what she could possibly use as a weapon. This was not a good place to be right now. Sam had gone into one of his crazy fits. Now they believed me. I was getting all this on tape. No one else believed my tales of Sam’s multiple personality disorder. What a bad timing. Apparently this alternate Sam was not too happy at the prospect of being exposed. He hastily pulled his nightshirt back over his shoulders. And his face lost its maniacal glare. My Sam, the love of my life, had come back to me. Overall 4.33 Believability 4.25 Coherence 4.38 Entertainment 4.38 Her hair snaked down her back. Her face, free of makeup, was a little puffy, and her body was completely hidden by the long skirt and loose peasant shirt she wore. She was not a pretty woman, but she had a wonderful personality. And she is personally responsible for teaching me that there is no such thing as washing your hands too much. She was the one who had always taken care of me when I was younger. They know me better than my own mom. She was my rock, my support throughout my life for all the things in life which my own mother could not provide. We won’t know what we really got til it’s gone. Overall 4.29 Believability 4.75 Coherence 4.38 Entertainment 3.75 204 I once knew a person who had tourettes syndrome. His primary concern, was making a million. ”Making a million what?”, I asked him. As he just stared blankly into space. ”A million people laugh, a million mistakes, a million what?” His beer bottle was sweating and so was he. ”Don’t get upset”, I said, just answer my question please. But maybe, by not answering it, you actually answered it? He just walked away rapidly mumbling a million swears under his breath. Overall 4.29 Believability 4.50 Coherence 4.38 Entertainment 4.00 I admit I am not a very good test taker. It may seem like a lame excuse but it’s true. So there I was, in the middle of the SAT, snapping through pencils every 30 seconds. I still have the shakes. As if I wasnt doing poorly enough, the high school band decided to practice in the yard right by our window. This includes the dancers and flag twirlers. So the teacher gets up and runs out to tell them to quiet down, and I take my chance I mean come on!? I came away that day with a perfect SAT score! Overall 4.29 Believability 4.00 Coherence 4.88 Entertainment 4.00 Her intense hunger caused her to fade in and out of consciousness. He watched her chest rise and fall, counting each breath. This was not going to happen, not here, not now! Damn. After all this time and struggle, all these months spent surviving in these harsh environments, he had to find her something to eat. I am not exaggerating when I say this. She seemed like she was on her deathbed. 
Then it got to a point where we realized she wasn’t going to be her ever again, she’d be a shell, empty, not what she would have wanted. Overall 4.29 Believability 4.63 Coherence 3.88 Entertainment 4.38 It was in the middle of the night. And Claire greeted me at her door. She opened the door slowly and stood there. Poking her head out she scanned the corridor. She saw that I was alone and invited me in. All of a sudden everyone I could think of showed up family and all my friends, current and previ- ous. That’s the reason why she invited me over, it was a surprise birthday party for me. I loved it. Overall 4.29 Believability 4.29 Coherence 4.57 Entertainment 4.00 205 I love Julia Roberts movies. Watched every single one of them, yep. My favorite is Pretty Woman, with Richard Gere. RDA’s favorite is the next one. Or maybe it was the one before it, called Mystic Pizza. Either way it was a dud. But I still watched it over twenty times! hehe! My mom bought us some close-to-the-real thing costumes based on the movie. Overall 4.29 Believability 4.71 Coherence 4.43 Entertainment 3.71 She got a new cell phone today. Andrew bought the same one and Aaron has it too. They always have to copy me. That doesn’t impress Management. I think I will buy another cell phone now so I can have something different. Haha. They’ll probably run out and buy it too! Without me knowing. Overall 4.25 Believability 4.25 Coherence 4.50 Entertainment 4.00 Henry whirled around, the colt 45 coming out from his holster with a practiced ease. He had to be dreaming, or hallucinating. ”I killed you back in Wyoming” Henry said, his Colt leveled at the phantom’s chest. But I didn’t just kill you.. ”I watched them bury you. I wanted to see you six feet under, you miserable son of a bitch”. Max screeched lunging for him but stopped when he raised his gun. ”Damn you” said Max. ”It wasn’t me. It was my brother.” You shouldn’t... ”Shouldn’t have what?” snarled Max. ”Tried to avenge my brother’s death? Hunt down the man who made my family’s life hell for the past decade?” Hahaha. ”Guess I did a better job than I thought” , said Henry. ”If you love your brother so much, maybe you should join him. His Colt barked, and Max sprawled liifeless on the floor. Overall 4.24 Believability 4.14 Coherence 4.43 Entertainment 4.14 Its rainning outside, i really wish it wasnt. I want to go to the beach. To feel the sand, to watch the girls, to play in the ocean No. I m stuck here, watching the rain drip, drip, drip Wondering where the time went Just wondering how I got stuck here Just three years ago. Sun!!!!! why did you have to leave me? There was no answer. Overall 4.24 Believability 4.86 Coherence 4.58 Entertainment 3.29 206 B.3 Adaptation I smoke too much pot, and then go to school. And im slacking off in school. I’m supposed to graduate this spring but am flunking some of my classes. Have an exam in about 16 hours. Don’t even ask which class because I forgot, drugs do fry the brain. If you have thoughts or embellishment ideas for me, however, please feel free to share. But I’ve heard it all before. ”Your nothing but a worthless piece of scum!” That’s what I hear the most, at least from my father. Grin* Me:... Overall 4.48 Believability 4.71 Coherence 4.71 Entertainment 4.00 The puppy needed a home. He turned his gaze to her. She picked him up and carried him into the house. He wasn’t that heavy, but it got tiring carrying him after a while. ”Here let me take him.” Her husband gently took the puppy from her arms. 
”He’s so adorable!” He’s probably getting hungry. Plus last night she would not go to sleep until around 10:00pm. She returned from the kitchen carrying the puppy’s food dish. I think I need a house with a backyard that he can run in. Overall 4.46 Believability 4.63 Coherence 4.63 Entertainment 4.13 My birthday was yesterday. I had the day off from school, so I puttered around doing nothing. A few friends came over to spend time with me. I haven’t done that for so long, it just left me with a nice warm feeling:) They baked a cake, sang Happy Birthday, and gave me some presents. It was extremely tasty. So I invited them to stay the evening and we cooked dinner together. It was the best roast dinner I have ever eaten in my life! Overall 4.42 Believability 4.88 Coherence 4.88 Entertainment 3.50 207 World War II was a very crazy time. Or at least I thought it was. There were so many people affected by it in the world. There was me, my mom, my aunt, my uncle, my grand-mom, my great aunt, and my cousins Olivia, Ben, and Jake. and we were stuck right in the middle of it. It was very stressful. A lot of people got hurt and even more were killed. A LOT of people. It was a sad time, but at least it’s over now. Overall 4.42 Believability 4.88 Coherence 5.00 Entertainment 3.38 Taboo is such a fun game on a Friday night. What’s oddest is I have all these games I want to finish, and yet I spend so many nights bored, watching web browsers hoping someone said something on a forum/IRC/IMs. And yet, I had never realized the joy of playing a boardgame with friends. That will be the future theme of my work: Joy. Hope and joy, I had never had real friends before. And it was fun. I loved having company, being able to share my thoughts with others. Anyone else loving it as much as I am? Overall 4.42 Believability 4.75 Coherence 4.63 Entertainment 3.88 The house on the hill overlooked a tree-lined valley The land seemed to go on forever. I used to love being on that land but now it held nothing but bad memories. I remember a time when flying to visit family was actually an enjoyable experience. Everyone grew apart and never kept in touch. Mostly it was to swap email jokes and stuff, but with occasional genuine exchanges thrown in. I guess that’s what happens to families that aren’t close even when they were living together as a family. We all can’t stay the same together forever, I know that. The house will no longer be a place of refuge. Overall 4.33 Believability 4.71 Coherence 4.58 Entertainment 3.71 It was a beautiful day in sunny California. I had lots of sunscreen on and my hat and my sunglasses. I called a few of my friends and we were off to go sunbathing. We didn’t need to resort to ”oh you are so right” ”no you are so right” ”oh you are so wonderful” ”oh you are such a shining star” interactions with each other, we could have real conversations. It was good to be around my old friends again after how long we had been apart. I adore them as the sky does the stars. Speaking of stars, I wonder if the girls would like to go stargazing later on tonight. It was one of my favorite things to do. Overall 4.33 Believability 4.88 Coherence 4.38 Entertainment 3.75 208 I went to the beach for the first time in months. Not much has changed. The ocean, waves, sand, and air all feel fantastic. I love it. I got a great tan and relaxed all day. I’ll return to Seattle and get rejected as an outsider with this skin. But eventually my tan will fade and I’ll forget how great the beach was. 
Before this trip I was considering living there at some point in my life, but I wasn’t sure if I’d fit in. The rush of an urban life is motivating, despite the lack of sun and relaxation. We spent the entire time with limited internet access at some points with limited access to any kind of communications at all. It was a nice place to visit, but no place to live. Overall 4.29 Believability 4.50 Coherence 4.63 Entertainment 3.75 Training Beagle puppies can prove to be challenging. Speaking of puppies I’ve come up with a new conspiracy theory. Puppies are actually little devils in disguise. Puppies take WORK! They think everything is theirs to chew up. Then the bad part. My puppy chewed up my favorite shoes. Mind you, they weren’t cheap. I have decided to invest in every chew toy at the puppy store! Overall 4.29 Believability 4.25 Coherence 4.50 Entertainment 4.13 The baseball player swung at the pitch, but missed by a lot. The home plate ump checked with the third base ump who said Casto had fouled the ball off. With that call, the game was still going and Casto had another chance to win the game for his team. The Giants had a young guy named Sanchez pitching who did well. With Casto’s repuation as a good hitter, this was an intriguing and exciting matchup - who would win the battle and therefore the game? Who would get it?!! The suspense seemed to grow with each second and each movement of the pitcher, catcher, or batter. Cursing myself for not having a camera with me, the camera phone would have to do, and snapped the photo above. That photo showed the elation on Casto’s face after getting the game-winning hit off of the tough pitcher Sanchez. Overall 4.29 Believability 4.63 Coherence 4.50 Entertainment 3.75 209 The man swept his eyes methodically over the hotel room, noticing his briefcase had been moved. It had important papers inside and he couldn’t afford to lose them. Important papers was an understatment. The fate of the free world rested on these papers, signed in secret by multiple heads of state that the public had no idea were actuallyworking together. The conscience of the people has been pricked. If word of this got out, chaos would ensue. Anarachy would be too mild a term. One from which she could not... would not... walk away. No, this situation had to be followed through, no matter what. If those papers didn’t reach their destiination in the next 24 hours, bad things were going to happen. Something really, really good? No, I said bad. But anyways, the man opened the briefcase, relieved to find the papers still there. The maid had probably just moved it while cleaning. Overall 4.29 Believability 4.00 Coherence 4.29 Entertainment 4.57 I really hate being a teenager. And being a mom. Both are hard enough on their own but at the same time–impossible! I’m done with people who fucking lie. Like my boyfriend did when he said he was wearing a condom. He was last seen on Manzanita View Rd. in alpine. Right after he found out about our baby. He said I used him. He’s in the Army and thinks I just wanted to leave this town but I wanted to go to college. Just leave. Now I am stuck here with a screaming 2 year old, unpaid child support, no job and no high school diploma–a total failure. In this never ending torture I call my life that is always getting some new twist to it. I’m pregnant again! Overall 4.29 Believability 4.57 Coherence 4.57 Entertainment 3.71 210 Driving drunk is a really bad idea. ”I trust you.” Is what your drunk friends say when you get behind the wheel. 
As a driver, you’re in charge of this machine that has nothing to do with what you are, a ton or so of metal and rubber and glass and we send it hurtling down the road at fifty- or sixty- or seventy-plus miles an hour along with all these other squishy distractable humans all with their own agendas and their own metal beasts, and half the time you’re close enough to the next car over that if you rolled down your window and even just barely stretched, you could put your hand on it–and this seems like a good idea? After drinking at least a 12 pack? But soon after we wanted out and about. Like the idiots we are, we decide to go to Taco Bell. Then we ordered too much food and ate it all. On our way home we say a cop sitting in a parking lot, or should I say, he saw us? I thought this was so amazing. Because I have the worst luck, I can never get a break. Crunch. I was so worried about the polcie car, I hit the one in front of me. Holy shit... Knowing I was at fauit and won’t pass a sobriety test, I decide to tell him the truth. I’m scared as fuck right now. The officer tells me I am doing the right thing as he snaps the cuffs on me. Overall 4.25 Believability 4.38 Coherence 4.50 Entertainment 3.88 So my dad is dating a lady that’s 81, we call her ”the cougar.” It stands for busy body. Well, not really, haven’t you heard of the show ”Cougar town?” ”Life isn’t fair.” And this is no exception. I just want answers. I want to know how he could date someone who is such a bitch after he was married to my sweet mother for 39 years? Anyway, I don’t know what I expected. But I really thought he’d wait more than 3 months after she died to start screwing the ladies from church. Overall 4.21 Believability 4.25 Coherence 4.63 Entertainment 3.75 I am pregnant for the first time! I am 12 weeks along, and doing ok. Actually more than okay, I cant wait to have this baby! Im not quite sure when im gonna see him. But my doctor said I will deliver at the beginning of June, so fingers crossed! It has been a lot cooler here of late, although just as sunny its quickly got cool... Makes me more comfortable, which I need in this condition. Weather forecast says partially cloudy. Overall 4.21 Believability 4.50 Coherence 4.75 Entertainment 3.38 211 Appendix C System Architecture C.1 Backend The simplicity of the user interface and the brief algorithm description conceal a consid- erable amount of complexity on the backend. Although not large in comparison with the Web, the corpus of stories collected in Chapter 4 is a considerable amount of data that must be stored in a format that allows flexible usage as well as the efficiency required to process requests in near real time. In addition to designing an efficient data man- agement strategy, each of the models uses various custom and off the shelf processing components to manipulate the data, such as an automatic syntactic parser. These com- ponents must be integrated seamlessly with the web application, but must again adhere to the real-time constraints of the system. In this section I will describe in more detail the database schema for maintaining both the statically collected stories as well as the dynamically added user content and the client server architecture used to integrate all of the necessary processing components. C.1.1 Database Schema Relational databases are a deep-rooted standard mechanism for storing large amounts of interdependent data and are the choice of technology used in this work. 
Several features of the specification and its standard implementations make this choice fairly obvious. Data can be arranged into arbitrary pieces (i.e., tables) based on user-defined attributes that make sense for the particular application. Is-a and containment relationships can also be easily defined between tables using foreign keys, which allows the underlying data model to closely reflect the data structures in the application code. Choosing the right schema of tables and relationships keeps the data properly modularized, allowing the model to be extended without breaking existing functionality while minimizing the amount of data retrieved for each request type.

The database schema is organized into two conceptually distinct sets of tables: those for managing the weblog stories collected in Chapter 4 and those for handling user content generated with the system. Conceptually, a story is the basic unit of data in the weblog story corpus, but in this application sentences are much more fundamental. Although a story table is created, it is included only for completeness and for recreating entire stories for debugging and illustrative purposes. The weblog sentence table, however, includes several vital attributes and stores additional information in three separate tables via containment relationships. The most significant attributes of the sentence table are the preprocessed text that is displayed to the user and the identifier of the next sentence in the story. The first auxiliary table stores the sentence's associated dependency parse, obtained during the story identification process described in Chapter 4. The other two tables store the two separate types of tokens used to index the database, as described in Chapter 5. These tables are included for convenience: they allow new types of indexes to be added easily offline, and they facilitate the indexing process described in Chapter 5. (Two other indexing strategies not discussed in Chapter 5 were also explored initially, but were excluded from the final system because of simplicity and efficiency concerns.)

On the user side, the concept of a story plays a more fundamental role in the functionality of the system. The user story table directly keeps track of basic user-generated attributes such as the author, title, and date the story was authored. Additionally, it records which index and generation model the application used to generate its responses to the user. A user sentence has some basic attributes, such as the actual text and its position in the story, but is also subclassed into two separate types. A human sentence is simply a user sentence that keeps track of its author. A computer sentence is both a user sentence and a weblog sentence. Although its text is taken directly from the weblog sentence, it may be slightly modified by the adaptation model described in Chapter 7. Additionally, whereas there is a one-to-one mapping from a human sentence to a position in the story, there can be up to 10 computer sentences for a given sentence number. For a more complete and schematic illustration of the database, please see Figure 5.

[Figure 5: SQL Schema]
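Since the schema diagram itself is not reproduced in this transcript, the DDL sketch below is a hypothetical reconstruction of the containment and is-a relationships described above. All table and column names, types, and constraints are invented for illustration; the deployed schema may differ.

```sql
-- Hypothetical reconstruction of the schema described above; names and types are illustrative only.
CREATE TABLE weblog_story (
    id BIGINT PRIMARY KEY
);

CREATE TABLE weblog_sentence (
    id               BIGINT PRIMARY KEY,
    story_id         BIGINT REFERENCES weblog_story(id),
    text             TEXT NOT NULL,                          -- preprocessed text shown to the user
    next_sentence_id BIGINT REFERENCES weblog_sentence(id)   -- following sentence in the source story
);

-- Containment relationships: auxiliary data attached to each weblog sentence.
CREATE TABLE weblog_sentence_parse (
    sentence_id BIGINT REFERENCES weblog_sentence(id),
    parse       TEXT NOT NULL                                -- serialized dependency parse
);

CREATE TABLE weblog_sentence_token (                         -- one such table per token type used for indexing
    sentence_id BIGINT REFERENCES weblog_sentence(id),
    position    INT,
    token       VARCHAR(255)
);

-- User-generated content.
CREATE TABLE user_story (
    id         BIGINT PRIMARY KEY,
    author     VARCHAR(255),
    title      VARCHAR(255),
    created_at TIMESTAMP,
    index_name VARCHAR(64),                                  -- index used for this story
    model_name VARCHAR(64)                                   -- generation model used for this story
);

CREATE TABLE user_sentence (
    id       BIGINT PRIMARY KEY,
    story_id BIGINT REFERENCES user_story(id),
    position INT,                                            -- sentence number within the story
    text     TEXT NOT NULL
);

-- Is-a relationships: human and computer sentences specialize user_sentence.
CREATE TABLE human_sentence (
    sentence_id BIGINT PRIMARY KEY REFERENCES user_sentence(id),
    author      VARCHAR(255)
);

CREATE TABLE computer_sentence (
    sentence_id        BIGINT PRIMARY KEY REFERENCES user_sentence(id),
    weblog_sentence_id BIGINT REFERENCES weblog_sentence(id) -- source sentence; up to 10 candidates per position
);
```

Under this sketch, the up-to-ten computer-sentence candidates for a given position are simply rows of user_sentence that share a story_id and position, each linked back to its source weblog sentence; how the original schema enforced (or relaxed) uniqueness at a position is not specified in the text.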
C.1.2 Client Server Architecture

The design of the system utilizes a standard client-server architecture for both the web interface and many of the background processes that are run. The user communicates with the system via the Apache HTTP web server and the GlassFish Enterprise application server. GlassFish is a J2EE-compliant application server, which facilitates maintaining state across HTTP requests and allows plain old Java objects to be used as the backing data structure for common web components, such as text boxes, labels and combo boxes. Using this framework, all of the application logic is easily implemented in Java, which can access any required services using either off-the-shelf or custom libraries.

Accessing a database is a prototypical example of the client/server model common to many data-intensive applications and is not very interesting on the surface. However, special care must be taken not to be lured in by the convenient amenities offered by the persistence APIs provided by the J2EE application server or by other third-party libraries. Although these can greatly simplify and reduce the amount of effort required to manage the data, the quality of the generated SQL can sometimes lead to extremely poor performance. Even though manually connecting and writing custom SQL for each type of data access is cumbersome and error prone, the efficiency afforded by the extra control was a deciding factor. Accessing the index is the only other area that relies on an off-the-shelf library to communicate with a remote service, and it was discussed in Chapter 5.

[Figure 5: SQL Schema]

The same parsing component that was used in the story extraction phase is also a necessary component for constructing queries (described in Chapter 5) and for constructing features in the reranking phase (described in Chapter 6). Parses for the user-authored sentences were obtained by wrapping a server application around the existing dependency parser. The interprocess communication was handled using Unix domain sockets, which are one of the fastest IPC mechanisms, yet still maintain a simple and standardized API. A similar server setup was constructed for the reranker discussed in Chapter 6. A schematic layout of the design is shown in Figure 6.

[Figure 6: Client/Server layout. Components shown: Application Server (Web Layer with Interface, Control and Adapter; Core Logic), Web Server, User, Database with Database Client, Indexes with Index Client, Parser with Parser Client, Reranker with Reranker Client, File System; communication over TCP/IP and Unix domain sockets (UDS).]
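To make the parser client/server arrangement concrete, below is a minimal sketch of a client that sends one sentence to a hypothetical parser server listening on a Unix domain socket and returns whatever the server writes back. It uses the UnixDomainSocketAddress API available in JDK 16 and later, which did not exist when this system was built, and the socket path and newline-terminated request format are assumptions made for the example rather than the system’s actual protocol.

import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Illustrative sketch only: a client for a hypothetical dependency-parser
// server reachable over a Unix domain socket. The newline-terminated
// request format is an assumption, not the system's actual protocol.
public class ParserClient {

    private final UnixDomainSocketAddress address;

    public ParserClient(Path socketPath) {
        this.address = UnixDomainSocketAddress.of(socketPath);
    }

    // Sends one sentence and returns the server's raw response, read until
    // the server closes its end of the connection.
    public String parse(String sentence) throws IOException {
        try (SocketChannel channel = SocketChannel.open(StandardProtocolFamily.UNIX)) {
            channel.connect(address);
            channel.write(ByteBuffer.wrap((sentence + "\n").getBytes(StandardCharsets.UTF_8)));

            ByteBuffer buffer = ByteBuffer.allocate(8192);
            StringBuilder response = new StringBuilder();
            while (channel.read(buffer) != -1) {
                buffer.flip();
                // Chunk-wise decoding is fine for ASCII parse output; a robust
                // client would use a CharsetDecoder across chunk boundaries.
                response.append(StandardCharsets.UTF_8.decode(buffer));
                buffer.clear();
            }
            return response.toString();
        }
    }

    // Example usage (assumes a server is already listening on this path):
    public static void main(String[] args) throws IOException {
        ParserClient client = new ParserClient(Path.of("/tmp/parser.sock"));
        System.out.println(client.parse("The officer tells me I am doing the right thing."));
    }
}

One motivation for wrapping the parser as a long-running server is presumably that the parser need only be loaded once, keeping per-sentence parsing within the system’s near-real-time constraints.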
Abstract
Digital interactive storytelling (DIS) is a compelling new medium for expressing and communicating ideas that seeks to transform a normally passive experience into active engagement in the creative process. Despite the enormous potential this medium holds, the cost and complexity of authoring compelling stories that are primarily driven by user actions remain prohibitive in many DIS systems. While the graphical capabilities of these systems and the physical means of interacting with them have advanced at a lightning pace, the ability to support open interaction in complex domains remains extremely constrained.
Asset Metadata
Creator
Swanson, Reid (author)
Core Title
Enabling open domain interactive storytelling using a data-driven case-based approach
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
06/10/2010
Defense Date
04/27/2010
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
artificial intelligence,case-based reasoning,interactive entertainment,natural language processing,OAI-PMH Harvest,Storytelling
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Gordon, Andrew S. (committee chair), Dane, Joseph A. (committee member), Teng, Shang-Hua (committee member)
Creator Email
reid.william.swanson@gmail.com,reid@reidswanson.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m3124
Unique identifier
UC1464959
Identifier
etd-Swanson-3800 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-360298 (legacy record id),usctheses-m3124 (legacy record id)
Legacy Identifier
etd-Swanson-3800.pdf
Dmrecord
360298
Document Type
Dissertation
Rights
Swanson, Reid
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu