Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Neural creative language generation
(USC Thesis Other)
Neural creative language generation
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
NEURAL CREATIVE LANGUAGE GENERATION by Marjan Ghazvininejad A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulllment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) August 2018 Copyright 2018 Marjan Ghazvininejad Dedication To my husband, Nima, and my parents and sister for believing in me, even when I did not. 2 Acknowledgement This journey would not be possible without the great encouragement, support, and guidance of my advisor, Kevin Knight. He helped me during all steps of this process, and spent tremendous time and eort to shape my academic personality. I am truly thankful for his patience, inspiration, full support and deep knowledge and I could not imagine having a better advisor and mentor. I would also like to thank my other committee members Morteza Dehghani, Andrew Gordon, Jerry Hobbs and Jonathan May for their insightful comments. I was very fortunate to have Chris Brockett, Yejin Choi, Bill Dolan, Yolanda Gil, Michel Galley, Daniel Marcu and Jonathan May as my mentors. Their encour- agement and guidance were a priceless privilege that I was very lucky to have. A very special gratitude goes out to Yonatan Bisk, Aliya Deri, Ulf Hermjakob, David Kale, Tomer Levinboim, Nanyun Peng, Jay Priyadarshi, Ashish Vaswani and Barret Zoph. It was fantastic to have the opportunity to work with you. Especially, I want to thank my great ocemate and collaborator Xing Shi. I thoroughly thank my parents Mahnaz Parandian and Davood Ghazvininejad and my sister Mahsa. This journey wouldn't be possible without your endless love and support and encouragement. You are always my eternal cheerleaders. Although I was far from my family in Iran, I was so fortunate to have great friends who made my graduate life fun. I want to thank my friends Shima, Mohsen, 3 Reza, Sepideh, Amir Soheil, Payman, Soheil and Sadaf for making these ve years a great period of my life. And nally, I want to thank my dearest friend, my best ocemate and my lovely husband, Nima. I can't even describe how much you make my life interesting and meaningful. Thanks for all the moments we spent together and thanks for believing in me even when I didn't. 4 Contents Dedication 2 Acknowledgement 3 List of Tables 8 List of Figures 10 Abstract 12 1 Introduction 14 1.1 Previous Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.2 Going Further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2 Memorable and Secure passwords 27 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.2 Password Generation Methods . . . . . . . . . . . . . . . . . . . . . 30 2.2.1 XKCD Baseline . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.2.2 First Letter Mnemonic . . . . . . . . . . . . . . . . . . . . . 31 2.2.3 All Letter Method . . . . . . . . . . . . . . . . . . . . . . . 32 2.2.4 Frequency Method . . . . . . . . . . . . . . . . . . . . . . . 33 2.2.5 Poetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.5 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
41 3 Generating Topical Poetry 42 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2 Vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3 Topically Related Words and Phrases . . . . . . . . . . . . . . . . . 46 5 3.4 Choosing Rhyme Words . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.1 Strict Rhyme . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.2 Slant Rhyme . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.4.3 Non-Topical Rhyming Words . . . . . . . . . . . . . . . . . 50 3.4.4 Rhyme word selection . . . . . . . . . . . . . . . . . . . . . 50 3.5 Constructing FSA of Possible Poems . . . . . . . . . . . . . . . . . 51 3.6 Path extraction through FSA with RNN . . . . . . . . . . . . . . . 52 3.7 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.8 Other Languages and Formats . . . . . . . . . . . . . . . . . . . . . 57 3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4 Interactive Poetry Generation System 62 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.2 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2.1 Example in Action . . . . . . . . . . . . . . . . . . . . . . . 66 4.2.2 Style Control . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.2.3 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.2.4 Learn a New Style Conguration . . . . . . . . . . . . . . . 70 4.2.5 Alexa Skill . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.3.1 Human-Computer Collaboration . . . . . . . . . . . . . . . . 72 4.3.2 Automatic tuning for quality . . . . . . . . . . . . . . . . . . 73 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5 Poetry Translation - Information Theory Foundation 75 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.3.1 Shannon Game Data Collection . . . . . . . . . . . . . . . . 80 5.3.2 An Estimation Problem . . . . . . . . . . . . . . . . . . . . 81 5.3.3 Machine Plays the Monolingual Shannon Game . . . . . . . 83 5.3.4 Modeling Human Guess Sequences . . . . . . . . . . . . . . 86 5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.5 Information Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 6 Poetry Translation 92 6.1 Human Poetry Translation . . . . . . . . . . . . . . . . . . . . . . . 93 6.2 Language Variation in Poetry Translation . . . . . . . . . . . . . . 97 6.3 Automatic Poetry Translation . . . . . . . . . . . . . . . . . . . . . 98 6.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.5 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6 6.5.1 Model A: Initial Model . . . . . . . . . . . . . . . . . . . . . 101 6.5.2 Model B: Biased Decoding with Unconstrained Translation 102 6.5.3 Model C: Biased Decoding with All Potential Translation . . 103 6.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.6.1 Automatic Evaluation . . . . . . . . . . . . . . . . . . . . . 106 6.6.2 Human Evaluation . . . . . . . . 
. . . . . . . . . . . . . . . 106 6.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 7 Conclusion and Future Work 109 7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Reference List 111 A Appendix 122 A.1 PPMC Compression . . . . . . . . . . . . . . . . . . . . . . . . . . 122 7 List of Tables 2.1 Comparison of methods that convert system-assigned 60-bit strings into English word sequences. Average word lengths range from 4 (XKCD) to 15 (First Letter Mnemonic). Average character lengths include spaces. LM score refers to the log probability assigned by a 5-gram English language model trained on the Gigaword corpus. Capacity tells how many English word sequences are available for an individual 60-bit input string. . . . . . . . . . . . . . . . . . . . 29 2.2 Mapping function between 4-bit sequences and English letters in the First Letter Mnemonic method. . . . . . . . . . . . . . . . . . . . . 32 2.3 Mapping function between bits and English characters in the All Letter Method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.4 Memorability of passwords generated by our methods. \Recalls" indicates how many participants returned to type their memorized English sequences, and \Correct Recalls" tells how many sequences were accurately remembered. . . . . . . . . . . . . . . . . . . . . . 39 2.5 User preferences among passwords generated by our methods. . . . 39 3.1 Sample word analyses. . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2 Dierent methods for extracting words related to the topic tennis. . 47 3.3 Users prefer the system that encourages the use of related words in the body (non-rhyme) portion of the poem. 40 poems are tested with 23 judges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4 Users prefer poems created with the encoder-decoder translation model over those that use only the RNN language model in gener- ation mode. 40 poems are tested with 25 judges. . . . . . . . . . . . 57 4.1 One poem generated in a 15-minute human/computer interactive poetry contest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2 Spoken conversation between a user and Alexa. . . . . . . . . . . . 71 5.1 Unigram probabilities of machine and human guesses, in both mono- lingual and bilingual conditions. Amounts of training data (in char- acters) are shown in parentheses. . . . . . . . . . . . . . . . . . . . 84 8 5.2 Entropies of monolingual test guess-sequences (1000 guesses), given varying amounts of context (c = number of previous characters, g = number of previous guess numbers) and dierent training set size (shown in parentheses). Witten-Bell smoothing is used for backo to shorter contexts. . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.3 Entropies of human guess-sequences (1000 test-set guesses), given varying amounts of context (c = number of previous characters , g = number of previous guess numbers) and dierent smoothing methods. Prediction models are trained on a separate sequence of 1283 guesses in the monolingual case, and 1378 guesses in the bilingual case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.4 Summary of our entropy bounds. . . . . . . . . . . . . . . . . . . . 89 6.1 Entropy bounds and compression rates of general translation vs. poetry translation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.2 BLEU scores of our dierent systems. . . . . . . . . . . . . . . . . . 
106 6.3 Users prefer translations generated by model A. . . . . . . . . . . . 107 6.4 Users prefer translations generated by model C. . . . . . . . . . . . 107 6.5 Quality of the translated poems by model C. . . . . . . . . . . . . . 108 9 List of Figures 3.1 Overview of Hafez converting a user-supplied topic word (wedding) into a four-line iambic pentameter stanza. . . . . . . . . . . . . . . 43 3.2 An FSA compactly encoding all word sequences that obey formal sonnet constraints, and dictating the right-hand edge of the poem via rhyming, topical words delight, chance, ... and joy. . . . . . . . 51 3.3 Sample sonnet generated from the topic phrase bipolar disorder. . . 55 3.4 Sample stanzas generated from dierent topic phrases. . . . . . . . 56 3.5 Stanzas generated with and without a encoder-decoder translation model for topic death. . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.6 Sample word analyses needed to construct Spanish Hafez. v- and -v indicate whether the word starts and/or ends with a vowel sound. . 58 3.7 Sample Spanish poem generated in classical soneta form, on the topic mariposa (butter y). . . . . . . . . . . . . . . . . . . . . . . . 60 4.1 Overview of Hafez . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2 A poem generated with (a) default style conguration and (b) user- adjusted style conguration. . . . . . . . . . . . . . . . . . . . . . 67 4.3 The distribution of poem star-ratings against normalized topical, concreteness, sentiment and repetition weights. Star ratings are computed as an oset from the version of the poem generated from default settings. We normalize all features weights by calculating their oset from the default values. The solid curve represents a quadratic regression t to the data. To avoid overlapping points, we plot with a small amount of random noise added. . . . . . . . . 73 5.1 Estimates of the entropy of English (in bits per character). Machine results are taken from actual compression algorithms (Zoph et al., 2015), while human results are computed from data elicited by the Shannon Game. The monolingual column is the original case studied by Shannon (1951). The bilingual column represents the number of additional bits needed to store English, given a Spanish source translation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 10 5.2 Bilingual Shannon Game interface. The human subject reads the Spanish source and guesses the translation, character by character. Additional aids include a static machine translation and a dynamic word completion list. . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.3 Example guess data collected from the Shannon Game, in both monolingual (top) and bilingual (bottom) conditions. The human subject's guesses are shown from bottom up. For example, in the bilingual condition, after seeing `...reason', the subject guessed `.' (wrong), but then correctly guessed `i' (right). . . . . . . . . . . . . 82 5.4 Guess number distributions from human monolingual Shannon Game experiments (training portion). Plot 5.4a shows all 1238 guesses, while plots 5.4b, 5.4c and 5.4d show guesses made in specic char- acter contexts ' ' (space),'a' and 'p'. The y-axis (probability of guess number) is given in log scale, so a geometric distribution is represented by a straight line. We observe that the single-parameter geometric distribution is a good t for either the head or the tail of the curve, but not both. . . . . . . . . . . . . . . . . . . . . . . . . 
87 6.1 Eleven human translations of the same source sentence (LDC2002T01). 94 6.2 Six human translations of the same poem by Wang Wei. . . . . . . 95 6.3 Four dierent human translations of the poem \Inferno" by Dante. 96 6.4 An example of poetry translation by models A and B. Biased decod- ing with unconstrained translation (Model B) produces a better translation compared to the baseline rhythm- and rhyme-constrained system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.5 An example of poetry translation by models B and C. Biased decod- ing with all potential translation (Model C) produces a better trans- lation compared to Model B. . . . . . . . . . . . . . . . . . . . . . . 105 6.6 A sample poem translated by our full system (Model C). . . . . . . 108 A.1 Arithmetic coding. . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 11 Abstract Natural language generation (NLG) is a well-studied and yet very challenging eld in natural language processing. One of the less studied NLG tasks is the generation of creative texts such as jokes, puns, or poems. Multiple reasons contribute to the diculty of research in this area. First, no immediate application exists for cre- ative language generation. This has made the research on creative NLG extremely diverse, having dierent goals, assumptions, and constraints. Second, no quanti- tative measure exists for creative NLG tasks. Consequently, it is often dicult to tune the parameters of creative generation models and drive improvements to these systems. Lack of a quantitative metric and the absence of a well-dened immediate application makes comparing dierent methods and nding the state of the art an almost impossible task in this area. Finally, rule-based systems for creative language generation are not yet combined with deep learning methods. Rule-based systems are powerful in capturing human knowledge, but it is often too time-consuming to present all the required knowledge in rules. On the other hand, deep learning models can automatically extract knowledge from the data, but they often miss out some essential knowledge that can be easily captured in rule-based systems. In this work, we address these challenges for poetry generation, which is one of the main areas of creative language generation. We introduce password poems as 12 a new application for poetry generation. These passwords are highly secure, and we show that they are easier to recall and more preferable compared to passwords created by other methods that guarantee the same level of security. Furthermore, we combine nite-state machinery with deep learning models in a system for gener- ating poems for any given topic. We introduce a quantitative metric for evaluating the generated poems and build the rst interactive poetry generation system that enables users to revise system generated poems by adjusting style conguration settings like alliteration, concreteness and the sentiment of the poem. The system interface also allows users to rate the quality of the poem. We collect users' rat- ing for poems with various style settings and use them to automatically tune the system style parameters. In order to improve the coherence of generated poems, we introduce a method to borrow ideas from existing human literature and build a poetry translation system. We study how poetry translation is dierent from translation of non-creative texts by measuring the language variation added during the translation process. 
We show that humans translate poems much more freely compared to general texts. Based on this observation, we build a machine trans- lation system specically for translating poetry which uses language variation in the translation process to generate rhythmic and rhyming translations. 13 Chapter 1 Introduction Enabling the computers to automatically generate natural texts is the motivation that started the eld of Natural processing as we know it today. Automatic Natu- ral Language Generation (NLG) branched into two areas: generating from another natural language text whether as in paraphrasing or translating from another lan- guage, and generating from non-textual data or from no data at all. In 1950, Alan Turing proposed a criteria for intelligence as the ability of the computer to indistinguishably impersonate a human in written conversation (Tur- ing, 1950). Authors of the rst automatic machine translation system in 1954 claimed that machine translation would be solved within three to ve years (Hutchins, 2005). In the 1960s, research extended to multiple areas of natural language gen- eration (NLG) including random generation of English sentences (Yngve, 1961), discourse generation (Klein and Simmons, 1963), paraphrasing (Klein, 1965b), essay generation (Klein, 1965a), and dialog generation (Weizenbaum, 1966). Though initially attractive, natural language generation proved to be a very hard problem (Pierce and Carroll, 1966). The rst successful applications of natu- ral language generation were mainly rule based systems that generated reports from statistics and structured data. Some examples are automatic generation of weather forecasts (Glahn, 1970; Goldberg et al., 1994) and summarization of nancial and business data (Anand and Kahn, 1992). The widespread use of the Internet intro- duced new NLG applications. Generating canned responses, customer service chat 14 bots, and conversation agents like Siri and Cortana are a few examples. Mean- while, improvements in statistical machine learning shifted the algorithms used in NLG from rule based to data-driven. In recent years, advances in neural network techniques and the increase in processing power have provided new possibilities to improve traditional NLG tasks or propose new applications. The NLG applications mentioned so far mainly concern generating a uent text in the desired context. They do not consider creativity or aesthetic aspects of the language. Creative language generation is a less studied branch of NLG that deals with this challenge. Specically, creative language generation addresses novel uses of language such as generating poetry, stories, puns and use of humor, metaphors, sarcasm, etc. Adding creativity can make content more appealing, add deeper levels of mean- ing to it and make it more memorable. This makes computational creativity an interesting problem to approach. While natural language generation still has a long way to go, and having cre- ative language generation might seem like a far picture, so was winning against a human in a game of chess or Go. In the near future computers may be able to write kid stories, tell jokes of their own, and produce song lyrics, or movie scripts. As a mid-way milestone, computers can be a big help to humans in generating creative contents by providing suggestions and ideas. Moreover, creative and novel use of language can have potential benets to other NLG tasks like chat bots and conversation agents. 
A big challenge for current chat bots and conversation agents is increasing their user engagement. Adding creativity can make the language generated by these systems more interesting. It can also prevent repetition, which is a common problem of these system (Li et al., 2017; Dai et al., 2017). These improvements will help keep users engaged. 15 1.1 Previous Works A diverse body of work exists on creative language generation. The main areas of research in this eld can be categorized into metaphor generation, joke and pun generation, story generation and poem generation. An overview of each category is brought here. Metaphor Generation. For many years, computational research around metaphors was based on heavily hand-coded rules which made the systems very hard to scale (Fass, 1991; Martin, 1990; Narayanan, 1997, 1999; Barnden and Lee, 2002; Agerri et al., 2007). In recent years, research has moved towards automatic recognition and interpretation of metaphors (Gedigian et al., 2006; Shutova et al., 2010; Dunn, 2013; Mohler et al., 2013; Hovy et al., 2013; Klebanov et al., 2014). Both metaphor interpretation and metaphor generation require knowledge about literal meanings of phrases, and how they can be stretched or translated into metaphoric meanings (Gatt and Krahmer, 2017). This knowledge can be acquired using web search queries to establish example cases of valid comparisons (Veale and Hao, 2007, 2008; Bollegala and Shutova, 2013; Li et al., 2013b; Harmon, 2015). A comparable technique is used by Veale (2013) to generate metaphors from the Google n-gram corpus to nd stereotypes. Joke and pun Generation. Humor is a knowledge-hungry process and highly depends on cultural knowledge. The most successful computational joke generation systems are mostly focused on pun generation (Gatt and Krahmer, 2017) as they often rely on spelling or word pronunciation or word sense ambiguity which are relatively easy to model. Most works in this area rely on generating very specic kind of jokes (Binsted et al., 2003; Ritchie, 2009; Petrovic and Matthews, 2013; Hempelmann, 2003; Miller and Gurevych, 2015; Jaech et al., 2016). The JAPE (Joke Analysis and Production 16 Engine) system (Binsted and Ritchie, 1994, 1997), which generates jokes in the format of \What's the dierence between X and Y?" or \What do you call X?" and the HAHAcronym system (Stock and Strapparava, 2005), which produces humorous acronyms, are some famous examples of this group of work. Story Generation. Automatic story generation eorts started in the 1970s with the TALE-SPIN system (Meehan, 1977). Early attempts in this eld relied on symbolic planning (Meehan, 1977; Lebowitz, 1987; Turner, 1993; Bringsjord and Ferrucci, 1999; Perez and Sharples, 2001; Riedl and Young, 2010) or case-based reasoning (Gervas et al., 2005). These systems are only able to generate stories for specic domains. Another area of research is using machine learning to learn the story model from existing datasets (Li et al., 2013a) or to generalize knowledge from existing stories to assemble new ones (Swanson and Gordon, 2012). In recent years, deep learning models are used to capture higher level structure in stories. Roemmele et al. (2017) use Skip-thought vectors (Kiros et al., 2015) to encode sentences, and a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to generate stories. Martin et al. (2017) train a recurrent encoder-decoder neural network (Sutskever et al., 2014) to predict the next event in the story. Poem Generation. 
Automated poem generation has been a popular but challenging research topic (Manurung et al., 2000; Gervas, 2001; Diaz-Agudo et al., 2002; Manurung, 2003; Wong and Chun, 2008; Jiang and Zhou, 2008; Netzer et al., 2009). Recent work attempts to solve this problem by applying grammatical and semantic templates (Oliveira, 2009, 2012), or by modeling the task as statistical machine translation, in which each line is a \translation" of the previous line (Zhou et al., 2009; He et al., 2012). Yan et al. (2013) propose a poem generation method based on summarization techniques. They retrieve candidate sentences from a 17 large corpus of poems based on user's query. They then cluster the constituent terms and summarize each cluster into a line of a poem. Greene et al. (2010) use unsupervised learning to estimate the stress patterns of words in a poetry corpus. They then use these patterns in a nite-state network to generate short English love poems. Several deep learning methods have recently been proposed for generating poems. Zhang and Lapata (2014) use an RNN model to generate 4-line Chi- nese poems. They force the decoder to rhyme the second and fourth lines, trusting the RNN to control the rhythm. Yi et al. (2016) also propose an attention-based bidirectional RNN model for generating 4-line Chinese poems. The only work in this area which tries to generate longer poems is from Wang et al. (2016a). They use an attention-based LSTM model for generating iambic poems. They train on a small dataset and do not use an explicit system for constraining rhythm and rhyme in their poems. Other Categories. Creative language generation is not limited to the cate- gories mentioned above. Other categories for creative natural language generation include novel word generation (Deri and Knight, 2015), sarcasm generation (Joshi et al., 2017), generating fake alternative names (morphs) for entities (Zhang et al., 2014), etc. Weaknesses of previous works The eld of automatic creative writing is still far from perfect. Here are the main limitations that are not yet addressed in the eld: Creative language generation methods do not have an immediate application yet. Despite recent improvements in creative language gener- ation, the quality of the systems is not good enough to be used in real world 18 applications. Having no immediate application has resulted in a diverse span of research in each category of creative language generation, each of which with its own assumptions and goals. This have made comparing related works and nding the state of the art an almost impossible task in this eld. For instance many poetry generation systems have been introduced in the last few years. Most of these systems are specialized for a specic language and format, and mostly generate poems in a specic domain. These dierences make direct comparison of these systems impossible. Diculty of quantitative evaluation limits system improvements. Automatic evaluation methods, like BLEU (Papineni et al., 2002) in machine translation, cannot measure the level of creativity or syntactic/semantic coherence in the text. Subjective evaluation requires evaluators to have rel- atively high literary training, so systems will receive limited feedback during the development phase. 1 These problems make it hard to evaluate creative generation systems frequently. Consequently it is often dicult to tune the parameters of these models and drive improvements to these systems. 
Individual use of either rule-based models or deep learning models limit the power of the system. Rule-based models and deep learning models each have their own benets and drawbacks. In rule-based systems, the world knowledge is captured in a set of rules. However, it is usually very hard and time consuming to represent all the world knowledge with rules. On the other hand, deep learning models can extract knowledge from the data, without requiring much human eort. However, they often miss out some essential knowledge than can be easily captured in rule based systems. 1 An example of subjective evaluation is the Dartmouth Turing Tests in the Creative Arts (bit.ly/20WGLF3), in which human experts judge generation quality, held once a year. 19 For instance, neural generative systems sometimes make obvious grammat- ical errors like wrong use of a and an (Ghazvininejad et al., 2016). As an other example, it is extremely hard for deep learning poetry systems to learn the rhythm and rhyme pattern of words and generate poems with correct rhythm and rhyme, while these constraints can be easily applied in rule based systems. The research goal of this dissertation is to address these challenges. In this thesis, I present a creative language generation system using a controlled RNN in poetry applications. 1.2 Going Further In previous section, I recounted the main limitations of the current creative lan- guage generation. I address the above-mentioned challenges in the following appli- cations: Poetry Passwords In this thesis, I introduce a novel and immediate application of creative language generation in generating secure and memorable passwords. When it comes to selecting a password, there is a trade-o between security and memorability. Most user-generated passwords (e.g., \Passw00rd!") are memorable but not secure. Many organizations attempt to address this problem by using dierent password composition policies. However these policies often do not lead to a substantial increase in the security (Inglesant and Sasse, 2010). On the other hand, system- assigned passwords are more secure but usually very hard to memorize. People 20 have a low tendency to use such passwords, thus they have been studied to a lesser degree (Shay et al., 2012). In ancient times, people recorded long, historical epics using poetry, in order to enhance memorability. I follow this idea and introduce a new immediate applica- tion for poetry generation in generating random poem passwords which are secure and easy to memorize. In Chapter 2, I propose an algorithm that converts a randomly generated 60-bit password such as 010110011000:::11101001 to a 2-line rhyming poem such as: Diversity inside replied, Soprano finally reside. Our study shows that poetry passwords are more ecient than other methods used for memorizing the 60-bit string like replacing the bit string with a sequence of random words or a plain sentence. The rhythm and rhyme in the poem makes it more appealing and easier to memorize for the user. Although it might be hard to memorize a 2-line poem for every password, the poetry passwords are ideal for more essential cases like master passwords and the private key of bitcoin wallets. 2 Besides proposing an immediate application for poetry generation, we propose a quantitative evaluation methodology for assessing the quality of password poems. This methodology measures both the memorability of the passwords and the ten- dency of the users to memorize them. 
This methodology brings a quantitative measure for comparing dierent settings of the same system or dierent individual systems. 2 Bitcoin is a special case where users have to memorize their private keys. Private keys can vary in length but often 128-bit keys are used. Our system can be easily modied to generate poems for 128-bit passwords instead of current 60-bit ones. We have been contacted by multiple bitcoin wallet companies for this purpose. 21 Topical Poetry Humans usually have a specic topic in mind when writing a poem. Similarly, it is desired from a poetry generation system to be able to generate a poem for a given topic. The poetry password system does not have this capability. I propose a method for generating topical poems. This method is able to generate any number of dierent poems for any given topic word or phrase. In this system, the rhythm and rhyme of the poems are enforced using a rule-based system, and the coherence and uency is controlled by a neural network. In Chapter 3, a method is proposed to integrate these two systems, guaranteeing formal correctness of the poems, while gaining coherence of long-distance neural networks. Furthermore, an on-line system is developed to generate poems for any given topic within few seconds. This system is free and available to use. 3 This system also enables users to revise and polish generated poems by adjusting various style congurations like alliteration, concreteness and the sentiment of the poem. Exper- iments demonstrate that such polishing mechanisms consider the user's intention and lead to better poems. This system won the rst place in the Dartmouth Turing Test in poetry gener- ation (bit.ly/20WGLF3) in 2016. The Dartmouth competition is an annual event in which human experts are employed to judge the quality of creative computer generations including poems. Participating in such competitions is a good way to evaluate and compare state of the art systems. However, a cheaper quantitative evaluation metric is required for fast and frequent parameter tuning and compar- ison. In order to address this need, a web interface is built where users can rate the quality of each poem with star ratings. Based on such feedback, the system learns to adjust its parameters to improve poetry quality. 3 http://52.24.230.241/poem/ 22 The proposed topical poetry method is generalizable to other languages and poetic formats. Examples poems generated in Shakespearean English sonnet and the classical Spanish soneta format are presented in Chapter 3. Poetry Translation The poems generated by the topical poetry system are locally coherent, but they do not follow any specic start to end ow. Coming up with a high level structured path is still a hard task for computers. Hence, to improve the system the method needs to borrow the high-level ow structure from somewhere else. A good source for borrowing this structure is human poetry. Human poetry translation has been highly debated over by scholars engaged in this eld throughout the history. It is a popular task as it highly involves cultures and their relations. It is also a very hard task as it is often impossible to translate the meaning and transfer all the other aspects of the poem such as its cultural backgrounds, feeling, format, style, etc. Inevitably, human translators have to improvise to keep a balance between preserving the meaning, keeping the poetry format and staying loyal to the style and feeling of the original poem. Similarly, poetry translation is a hard task for computers. 
Computers might be better than humans in some aspects like nding the right word that ts the format of the poem, but it is often hard for them to nd the right balance between preserving the meaning and keeping the format. This problem is demonstrated in the two previous automatic poetry translation works (Greene et al., 2010; Genzel et al., 2010). The goal of these methods is to translate the source poem into an English poem while fully preserving the meaning. However, they report a high failure rate in translating poem lines with xed rhythm and rhyme format. For 23 most of source lines, they could not nd any translation with the correct rhythm and rhyme format in their translation space. In Chapter 6, I will show why integration of rule-based systems and deep learn- ing methods is a good approach to tackle this problem. In addition quantita- tive evaluation methods will be proposed to measure the quality of the generated poems. 1.3 Contributions These are the contributions I will provide in my thesis: I introduce a new application for creative language generation in creating secure and memorable passwords. I propose dierent models for converting any arbitrary random bit string into a word sequence password (in both prose and poetry form) in a deterministically recoverable way. I show our poetry passwords have the highest recall rate (61%) among users, and 3:75 times more users prefer these passwords compared to a prior method with a similar recall rate (Ghazvininejad and Knight, 2015). I present a topical poetry generation system that can generate any number of distinct poems for any user supplied topic. In this system, I propose combining nite-state machinery with deep learning models, guaranteeing correctness of metrical and rhyme structure, while gaining coherence of long- distance RNN models. In the 2016 Dartmouth Poetix Competition, our system was selected as the best of the submitted systems. 4 In addition, I show the generality of the system with respect to other languages and poetic forms (Ghazvininejad et al., 2016). 4 bit.ly/20WGLF3 24 I describe the rst interactive poetry generation system that enables users to revise and polish system generated poems by adjusting style congura- tion settings like alliteration, concreteness and the sentiment of the poem. The system interface also allows users to rate the quality of the poem. Our studies show that users assign +1.4 more stars (in scale of 1-5) to the poems generated in their interaction with the system compared to the default set- ting. In addition, I collect users' rating for poems with various style settings and use them to automatically tune the system style parameters. Based on our experiments, in 59% of the cases, users prefer poems generated by the tuned system (Ghazvininejad et al., 2017). I provide an information-theoretic way to measure the language variation in the human translation process. That is, once a source text has been compressed, how many additional bits are required to encode its human translation? Data collected through novel bilingual Shannon game shows that human translators add 34% more information on top of the original text and machines are far worse than humans at predicting the behavior of human translators (Ghazvininejad and Knight, 2016). In addition, I study human poetry translation and measure the language vari- ation in this process. 
I study how human poetry translation is dierent from human translation in general and show that human translators add ???% more information in translating poetry compare to the general translation. I present the rst neural poetry translation system. Unlike previous works that often fail to produce any translation for xed rhyme and rhythm pat- terns, our system always translates a source text to an English poem. Human 25 evaluation of the translations ranks the quality as acceptable 78:2% of the time (Ghazvininejad et al., 2018). 26 Chapter 2 Memorable and Secure passwords One of the main drawbacks of previous works is lack of a potential immediate application. In this chapter, we introduce a novel application for creative language generation as creating secure and memorable passwords. User-generated passwords tend to be memorable, but not secure. A random, computer-generated 60-bit string is much more secure. However, users cannot memorize random 60-bit strings. In this chapter, we investigate methods for con- verting arbitrary bit strings into English word sequences (both prose and poetry), and we study their memorability and other properties. 2.1 Introduction Passwords chosen by users (e.g., \Scarlet%2") are easy to remember, but not secure (Florencio and Herley, 2007). A more secure method is to use a system- assigned 60-bit random password, such as 0010100010100:::00101001. However, this string is hard to memorize. In this chapter, we convert such strings into English phrases, in order to improve their memorability, using natural language processing to select uent passphrases. Our methods are inspired by an XKCD cartoon 1 that proposes to convert a randomly-chosen 44-bit password into a short, nonsensical sequence of English words. The proposed system divides the 44-bit password into four 11-bit chunks, 1 http://xkcd.com/936 27 and each chunk provides an index into a 2048-word English dictionary. XKCD's example passphrase is correct horse battery staple: 44-bit password English phrase --------------- -------------- 10101101010 -> correct 10010110101 -> horse 01010101010 -> battery 10110101101 -> staple The four-word sequence is nonsense, but it is easier to memorize than the 44-bit string, and XKCD hypothesizes that users can improve memorability by building an image or story around the four words. In this chapter, we investigate other methods for converting a system-generated bit string into a memorable sequence of English words. Our methods produce whole sentences, e.g. Fox news networks are seeking views from downtown streets. as well as short poems, e.g. Diversity inside replied, Soprano finally reside. We also move to 60-bit passwords, for better security. One source claims: As of 2011, available commercial products claim the ability to test up to 2,800,000,000 passwords a second on a standard desktop computer using a high-end graphics processor. 2 2 http://en.wikipedia.org/wiki/Password cracking 28 Method Name Average Number of Words Average Number of Characters AVG LM Score Capacity Sample Passwords XKCD 4 31.2 -62.42 1 fees wesley inmate decentralization photo bros nan plain embarrass debating gaskell jennie First Letter Mnemonic 15 87.7 -61.20 2 10 51 It makes me think of union pacic resource said it looks like most commercial networks . Some companies keep their windows rolled down so you don't feel connected to any community . Contains extreme violence and it was a matter of not only its second straight loss . 
All Letter Method 11.8 70.8 -58.83 3 10 56 Parking and utilities have been searching for a third straight road win . It was the same girl and now a law professor in the former east german town . I know a man who said he was chief of stas in a real and deep conversation . Frequency Method 9.7 55.5 -52.88 6 10 14 Fox news networks are seeking views from downtown streets . The review found a silver tree through documents and artifacts . These big questions are bothering me a bit stronger . Poetry 7.2 52.7 -73.15 10 6 Joanna kissing veried soprano nally reside Diversity inside replied retreats or colors justied Surprise celebrity without the dragging allison throughout Table 2.1: Comparison of methods that convert system-assigned 60-bit strings into English word sequences. Average word lengths range from 4 (XKCD) to 15 (First Letter Mnemonic). Average character lengths include spaces. LM score refers to the log probability assigned by a 5-gram English language model trained on the Gigaword corpus. Capacity tells how many English word sequences are available for an individual 60-bit input string. 29 If this is correct, a 44-bit password would take one hour to crack, while a 60-bit password would take 11.3 years. Our concrete task is as follows: Input: A random, system-generated 60-bit password. Output: An English word sequence with two properties: { It is memorable. { We can deterministically recover the original input 60-bit string from it. This implies that we map 2 60 distinct bit strings into 2 60 distinct English sequences. If a user memorizes the English word sequence supplied to them, then they have eectively memorized the 60-bit string. 2.2 Password Generation Methods We now describe our baseline password generation method, followed by four novel methods. In Section 2.3 we experimentally test their memorability. 2.2.1 XKCD Baseline Our baseline is a version of XKCD. Instead of a 2048-word dictionary, we use a 32,768-word dictionary. We assign each word a distinct 15-bit code. At runtime, we take a system-assigned 60-bit code and split it into four 15-bit sequences. We then substitute each 15-bit segment with its corresponding word. By doing this, we convert a random 60-bit code into a 4-word password. The rst row of Table 2.1 shows three sample XKCD passwords, along with other information, such as the average number of characters (including spaces). 30 2.2.2 First Letter Mnemonic XKCD passwords are short but nonsensical, so we now look into methods that instead create longer but uent English sentences. We might think to guarantee uency by selecting sentences from an already-existing text corpus, but no corpus is large enough to contain 2 60 ( 10 18 ) distinct sentences. Therefore, we must be able to synthesize new English strings. In our rst sentence generation method (First Letter Mnemonic), we store our input 60-bit code in the rst letters of each word. We divide the 60-bit code into 4-bit sections, e.g., `0100-1101-1101-...'. Every 4-bit sequence type corresponds to an English letter or two, per Table 2.2. We build a word-confusion network (or \sausage lattice") by replacing each 4-bit code with all English words that start with a corresponding letter, e.g.: 0100 1101 1111 ... 0011 ---- ---- ---- ---- income my frog ... octopus is miner feast ... of inner priest gratuitous ... oregon ... ... ... ... This yields about 10 74 paths, some good (is my frog. . . ) and some bad (income miner feast. . . ). 
To select the most uent path, we train a 5-gram language model with the SRILM toolkit (Stolcke, 2002) on the English Gigaword corpus. 3 SRILM also includes functionality for extracting the best path from a confusion network. 3 https://catalog.ldc.upenn.edu/LDC2011T07 31 Bit Sequence Mapped Character Bit Sequence Mapped Character 0000 e 1000 r,x 0001 t 1001 d,j 0010 a 1010 l,k 0011 o 1011 c,v 0100 i 1100 u,b 0101 n 1101 m,p 0110 s,z 1110 w,y 0111 h,q 1111 f,g Table 2.2: Mapping function between 4-bit sequences and English letters in the First Letter Mnemonic method. Table 2.1 shows sample sentences generated by the method. Perhaps surpris- ingly, even though the sentences are much longer than XKCD (15 words versus 4 words), the n-gram language model (LM) score is a bit better. The sentences are locally uent, but not perfectly grammatical. We can easily reconstruct the original 60-bit code by extracting the rst letter of each word and applying the Table 2.2 mapping in reverse. 2.2.3 All Letter Method Most of the characters in the previous methods seem \wasted", as only the word- initial letters bear information relevant to reconstructing the original 60-bit string. Our next technique (All Letter Method) non-deterministically translates every bit into an English letter, per Table 2.3. Additionally, we non-deterministically intro- duce a space (or not) between each pair of letters. This yields 4 10 84 possible output strings per input, 3 10 56 of which consist of legal English words. From those 3 10 56 strings, we choose the one that yields the best word 5-gram score. 32 Bit Sequence Mapped Characters 0 e, o, i, h, r, c, u, f, g, b, v, x ,q 1 t, a, n, s, d, l, m, w, y, p, k, j, z Table 2.3: Mapping function between bits and English characters in the All Letter Method. It is not immediately clear how to process a letter-based lattice with a word- based language model. We solve this search problem by casting it as one of machine translation from bit-strings to English. We create a phrase translation table by pairing each English word with a corresponding \bit phrase", using Table 2.3 in reverse. Sample entries include: din ||| 1 0 1 through ||| 1 0 0 0 0 0 0 yields ||| 1 0 0 1 1 1 We then use the Moses machine translation toolkit (Koehn et al., 2007) to search for the 1-best translation of our input 60-bit string, using the phrase table and a 5-gram English LM, disallowing re-ordering. Table 2.1 shows that these sentences are shorter than the mnemonic method (11.8 words versus 15 words), without losing uency. Given a generated English sequence, we can deterministically reconstruct the original 60-bit input string, using the above phrase table in reverse. 2.2.4 Frequency Method Sentence passwords from the previous method contain 70.8 characters on average (including spaces). Classic studies by Shannon (1951) and others estimate that printed English may ultimately be compressible to about one bit per character. 33 This implies we might be able to produce shorter output (60 characters, including space) while maintaining normal English uency. Our next technique (Frequency Method) modies the phrase table by assigning short bit codes to frequent words, and long bit codes to infrequent words. For example: din ||| 0 1 1 0 1 0 1 0 0 through ||| 1 1 1 1 yields ||| 0 1 0 1 1 1 0 1 Note that the word din is now mapped to a 9-bit sequence rather than a 3-bit sequence. More precisely, we map each word to a random bit sequence of length bmax(1;log P(word) +)c. 
By changing variables and we can vary between smooth but long sentences ( = 1 and = 0) to XKCD-style phrases ( = 0 and = 15). Table 2.1 shows example sentences we obtain with = 2:5 and =2:5, yielding sentences of 9.7 words on average. 2.2.5 Poetry In ancient times, people recorded long, historical epics using poetry, to enhance memorability. We follow this idea by turning each system-assigned 60-bit string into a short, distinct English poem. Our format is the rhyming iambic tetrameter couplet: The poem contains two lines of eight syllables each. Lines are in iambic meter, i.e., their syllables have the stress pattern 01010101, where 0 represents an unstressed syllable, and 1 represents a stressed syllable. We also allow 01010100, to allow a line to end in a word like Angela. 34 The two lines end in a pair of rhyming words. Words rhyme if their phoneme sequences match from the nal stressed vowel onwards. We obtain stress patterns and phoneme sequences from the CMU pronunciation dictionary. 4 Monosyllabic words cause trouble, because their stress often depends on context (Greene et al., 2010). For example, eighth is stressed in eighth street, but not in eighth avenue. This makes it hard to guarantee that automatically-generated lines will scan as intended. We therefore eject all monosyllabic words from the vocabulary, except for six unstressed ones (a, an, and, the, of, or). Here is a sample poem password: The le-gen-da-ry Ja-pan-ese # " # " # " # " Sub-si-di-ar-ies ov-er-seas # " # " # " # " Meter and rhyme constraints make it dicult to use the Moses machine trans- lation toolkit to search for uent output, as we did above; the decoder state must be augmented with additional short- and long-distance information (Genzel et al., 2010). Instead, we build a large nite-state acceptor (FSA) with a path for each legal poem. In each path, the second line of the poem is reversed, so that we can enforce rhyming locally. The details of our FSA construction are as follows. First, we create a nite- state transducer (FST) that maps each input English word onto four sequences that capture its essential properties, e.g.: 4 http://www.speech.cs.cmu.edu/cgi-bin/cmudict 35 create -> 0 1 create -> 0 1 EY-T create -> 1r 0r create -> EY-T 1r 0r Here, EY-T represents the rhyme-class of words like create and debate. The r indicates a stress pattern in the right-to-left direction. We then compose this FST with an FSA that only accepts sequences of the form: 0 1 0 1 0 1 0 1 X X 1r 0r 1r 0r 1r 0r 1r 0r where X and X are identical rhyme classes (e.g., EY-T and EY-T). It remains to map an arbitrary 60-bit string onto a path in the FSA. Let k be the integer representation of the 60-bit string. If the FSA contains exactly 2 60 paths, we can easily select thekth path using the following method. At each node N of the FSA, we store the total number of paths from N to the nal state|this takes linear time if we visit states in reverse topological order. We then traverse the FSA deterministically from the start state, usingk to guide the path selection. Our FSA actually contains 2 79 paths, far more than the required 2 60 . We can say that the information capacity of the English rhyming iambic tetrameter couplet is 79 bits! Some are very good: 36 Sophisticated potentates misrepresenting Emirates. The supervisor notified the transportation nationwide. Afghanistan, Afghanistan, Afghanistan, and Pakistan. 
while others are very bad: The shirley emmy plebiscite complete suppressed unlike invite The shirley emmy plebiscite complaints suppressed unlike invite The shirley emmy plebiscite complaint suppressed unlike invite Fortunately, because our FSA contains over a million times the required 2 60 paths, we can avoid these bad outputs. For any particular 60-bit string, we have a million poems to choose from, and we output only the best one. More precisely, given a 60-bit input string k, we extract not only the kth FSA path, but also thek +i 2 60 paths, with i ranging from 1 to 999,999. We explicitly 37 list out these paths, reversing the second half of each, and score them with our 5- gram LM. We output the poem with the 1-best LM score. Table 2.1 shows sample outputs. To reconstruct the original 60-bit string k, we rst nd the FSA path corre- sponding to the user-recalled English string (with second half reversed). We use depth-rst search to nd this path. Once we have the path, it is easy to determine which numbered path it is, lexicographically speaking, using the node-labeling scheme above to recover k. 2.3 Experiments We designed two experiments to compare our methods. The rst experiment tests the memorability of passwords. We asked partici- pants to memorize a password from a randomly selected method 5 and recall it two days later. To give more options to users, we let them select from the 10-best pass- words according to the LM score for a given 60-bit code. Note that this exibility is not available for XKCD, which produces only one password per code. 62 users participated in this experiment, 44 returned to recall the password, and 22 successfully recalled the complete password. Table 2.4 shows that the Poetry and XKCD methods yield passwords that are easiest to remember. In the second experiment, we present a separate set of users with passwords from each of the four methods. We ask which they would prefer to use, without requiring any memorization. Table 2.5 shows that users prefer sentences over poetry, and poetry over XKCD. 5 In all experiments, we omit the First Letter Mnemonic, due to its low performance in early tests. 38 Method Participants Recalls Correct Recalls XKCD 16 12 58:3% All Letter Method 15 9 33:3% Frequency Method 15 10 40:0% Poetry 16 13 61:5% Table 2.4: Memorability of passwords generated by our methods. \Recalls" indi- cates how many participants returned to type their memorized English sequences, and \Correct Recalls" tells how many sequences were accurately remembered. Method Name User preference XKCD 5% All Letter Method 39% Frequency Method 37% Poetry 19% Table 2.5: User preferences among passwords generated by our methods. 2.4 Analysis Table 2.4 shows that the Poetry and XKCD methods yield passwords that are easiest to memorize. Complete sentences generated by the All Letter and Fre- quency Methods are harder to memorize. At the same time Table 2.5 shows that people like the sentences better than XKCD, so it seems that they overestimate their ability to memorize a sentence of 10-12 words. 
Here are typical mistakes (S = system-generated, R = as recalled by user): 39 (S) Still looking for ruben sierra could be in central michigan (R) I am still looking for ruben sierra in central michigan (S) That we were required to go to college more than action movies (R) We are required to go to college more than action movies (S) No dressing allowed under canon law in the youth group (R) No dresses allowed under canon law for youth groups Users remember the gist of a sentence very well, but have trouble reproducing the exact wording. Post-experiment interview reveal this to be partly an eect of overcondence. Users put little mental work into memorizing sentences, beyond choosing among the 10-best alternatives presented to them. By contrast, they put much more work into memorizing an XKCD phrase, actively building a mental image or story to connect the four otherwise unrelated words. 2.5 Future Directions Actually, we can often automatically determine that a user-recalled sequence is wrong. For example, when we go to reconstruct the 60-bit input string from a user-recalled sequence, we may nd that we get a 62-bit string instead. We can then automatically prod the user into trying again, but we nd that this is not eective in practice. An intriguing direction is to do automatic error-correction, i.e., take the user-recalled sequence and nd the closest match among the 2 60 English sequences producible by the method. Of course, it is a challenge to do 40 this with 1-best outputs of an MT system that uses heuristic beam search, and we must also ensure that security is maintained. We may also investigate new ways to re-rank n-best lists. Language model scoring is a good start, but we may prefer vivid, concrete, or other types of words, or we may use text data associated with the user (papers, emails) for secure yet personalized password generation. 2.6 Related Work Gasser (1975), Crawford and Aycock (2008), and Shay et al. (2012) describe sys- tems that produce meaningless but pronounceable passwords, such as \tufritvi" . However, their systems can only assign 2 30 distinct passwords. Jeyaraman and Topkara (2005) suggest generating a random sequence of char- acters, and nding a mnemonic for it in a text corpus. A limited corpus means they again have a small space of system-assigned passwords. We propose a similar method in Section 2.2.2, but we automatically synthesize a new mnemonic word sequence. Kurzban (1985) and Shay et al. (2012) use a method similar to XKCD with small dictionaries. This leads to longer nonsense sequences that can be dicult to remember. 2.7 Conclusion We introduced several methods for generating secure passwords in the form of English word sequences. We learned that long sentences are seemingly easy to remember, but actually hard to reproduce, and we also learned that our poetry method produced relatively short, memorable passwords that are liked by users. 41 Chapter 3 Generating Topical Poetry In the previous chapter, we introduced a new application for creative language generation and proposed a method for converting any random bit string into a 2-line poem. In this chapter, we instead take our input from a user, who supplied a topic word or phrase, such as \love" or \civil war". Poets usually have a topic in the mind while writing a poem. Similarly, it is desired from a poetry generation system to be able to generate poems for any given topic. In this chapter, we describe Hafez, a program that generates any number of distinct poems on any user-supplied topic. 
Furthermore, we show the generality of the system with respect to other languages and poetic forms. 3.1 Introduction Automatic algorithms are starting to generate interesting, creative text, as evi- denced by recent distinguishability tests that ask whether a given story, poem, or song was written by a human or a computer. 1 In this chapter, we describe Hafez, a program that generates any number of distinct poems on a user-supplied topic. Figure 3.1 shows an overview of the system, which sets out these tasks: Vocabulary. We select a specic, large vocabulary of words for use in our generator, and we compute stress patterns for each word. 1 For example, in the 2016 Dartmouth test bit.ly/20WGLF3, no automatic sonnet-writing system passed indistinguishability, though ours was selected as the best of the submitted systems. 42 Figure 3.1: Overview of Hafez converting a user-supplied topic word (wedding) into a four-line iambic pentameter stanza. Related words. Given a user-supplied topic, we compute a large set of related words. Rhyme words. From the set of related words, we select pairs of rhyming words to end lines. Finite-state acceptor (FSA). We build an FSA with a path for every conceiv- able sequence of vocabulary words that obeys formal rhythm constraints, with chosen rhyme words in place. 43 word stress pattern strict rhyme class slant rhyme class (coarse version) needing 10 IY1 D IH0 NG IY1 * IH0 NG ordinary 1010 EH1 R IY0 EH1 * IY0 obligate 101 EY1 T last syllable stressed, no slant rhyme Table 3.1: Sample word analyses. Path extraction. We select a uent path through the FSA, using a recurrent neural network (RNN) for scoring. Sections 3.2-3.6 describe how we address these tasks. After this, we show results of Hafez generating 14-line classical sonnets with rhyme scheme ABAB CDCD EFEF GG, written in iambic pentameter (ten syllables per line with alternating stress: \da-DUM da-DUM da-DUM . . . "). We then show experiments on Hafez's parameters and conclude by showing the generality of the approach with respect to language and poetic form. Novel contributions of our work are: We combine nite-state machinery with deep learning, guaranteeing formal correctness of our poems, while gaining coherence of long-distance RNNs. By using words related to the user's topic as rhyme words, we design a system that can generate poems with topical coherence. This allows us to generate longer topical poems. We extend our method to other poetry formats and languages. 3.2 Vocabulary To generate a line of iambic pentameter poetry, we arrange words to form a sequence of ten syllables alternating between stressed and unstressed. For example: 44 010 1 0 10 101 Attending on his golden pilgramage Following our previous work (Ghazvininejad and Knight, 2015), we refer to unstressed syllables with 0 and stressed syllables with 1, so that the form of a Shakespearean sonnet is ((01) 5 ) 14 . To get stress patterns for individual words, we use CMU pronunciation dictionary, 2 collapsing primary and secondary stresses. For example: CAFETERIA K AE2 F AH0 T IH1 R IY0 AH0 becomes CAFETERIA 10100 The rst two columns of Table 3.1 show other examples. From the 125,074 CMU dictionary word types, we can actually only use words whose stress pattern matches the iambic pattern (alternating 1s and 0s). However, we make an exception for words that end in ...100 (such as spatula and Angela). To mimic how human poets employ such words, we convert all \...100" patterns to \...101". This leaves us with a 106,019 word types. 
Words with multiple syllable-stress patterns present a challenge. For example, our program may use the word record in a \...10..." context, but if it is a verb in that context, a human reader will pronounce it as \01", breaking the intended rhythm. To guarantee that our poems scan properly, we eject all ambiguous words from our vocabulary. This problem is especially acute with monosyllabic words, as most have a stress that depends on context. Greene et al. (2010) apply the EM algorithm to align human-written sonnets with assumed meter, extracting 2 http://www.speech.cs.cmu.edu/cgi-bin/cmudict 45 P(0jword) and P(1jword) probabilities. Using their method, we eject all monosyl- labic words except those with P(0jword)> 0:9 or P(1jword)> 0:9. A consequence is that our poetry generator avoids the words to, it, in, and is, which actually forces the system into novel territory. This yields 16,139 monosyllabic and 87,282 multisyllabic words. Because our uency module (Section 3.6) is restricted to 20,000 word types, we further pare down our vocabulary by removing words that are not found in the 20k-most-frequent list derived from the song lyrics corpus we use for uency. After this step, our nal vocabulary contains 14,368 words (4833 monosyllabic and 9535 multisyllabic). 3.3 Topically Related Words and Phrases After we receive a user-supplied topic, the rst step in our poem generation algo- rithm is to build a scored list of 1000 words/phrases that are related to that topic. For example: User-supplied input topic: colonel Output: colonel (1.00), lieutenant colonel (0.77), brigadier general (0.73), commander (0.67) ... army (0.55) ... This problem is dierent from nding synonyms or hypernyms in WordNet (Miller, 1995). For example, while Banerjee and Pedersen (2003) use WordNet to assign a 1.0 similarity score between car and automobile, they only give a 0.3 similarity between car and gasoline. A second method is to use pointwise mutual information (PMI). Let t be the topic/phrase, and letw be a candidate related word. We collect a set of sentences S that contain t, and sort candidates by 46 Method Window Corpus Phrases? Related words PMI n/a Gigaword no croquet, Romai, Carisbo, NTTF, showcourts ... CBOW 8 Gigaword no squash, badminton, golf, soccer, racquetball ... CBOW 40 Gigaword no singles, badminton, squash, ATP, WTA ... CBOW 40 Song Lyrics no high-heel, Reebok, steel-toed, basketball, Polos ... CBOW 40 Wikipedia no volleyball, racquet, Wimbledon, athletics, doubles ... CBOW 40 Wikipedia yes singles titles, grass courts, tennis club, hardcourt ... Table 3.2: Dierent methods for extracting words related to the topic tennis. Proportion of sentences in S containing w P(w) in general text Table 3.2 shows that PMI has a tendency to assign a high score to low frequency words (Bouma, 2009; Role and Nadif, 2011; Damani, 2013). A third method is word2vec (Mikolov et al., 2013a), which provides distributed word representations. We train a continuous-bag-of-words model 3 with window size 8 and 40 and word vector dimension 200. We score candidate related words/phrases with cosine to topic-word vector. We nd that a larger window size works best (Pennington et al., 2014; Levy and Goldberg, 2014). Table 3.2 shows examples. The training corpus for word2vec has a crucial eect on the quality of the related words. We train word2vec models on the English Gigaword corpus, 4 a song lyrics corpus, and the rst billion characters from Wikipedia. 
5 The Gigaword corpus produces related words that are too newsy, while the song lyrics corpus does not cover enough topics. Hence, we train on Wikipedia. To obtain related phrases as well as words, we apply the method of Mikolov et al. (2013b) to the Wikipedia corpus, which replaces collocations like Los Angeles with 3 https://code.google.com/archive/p/word2vec/ 4 https://catalog.ldc.upenn.edu/LDC2011T07 5 http://mattmahoney.net/dc/enwik9.zip 47 single tokens like Los Angeles. Word2vec then builds vectors for phrases as well as words. When the user supplies a multi-word topic, we use its phrase vector if available. Otherwise, we create the vector topic by element wise addition of its words' vectors. 3.4 Choosing Rhyme Words We next ll in the right-hand edge of our poem by selecting pairs of rhyming words/phrases and assigning them to lines. In a Shakespearean sonnet with rhyme scheme ABAB CDCD EFEF GG, there are seven pairs of rhyme words to decide on. 3.4.1 Strict Rhyme The strict denition of English rhyme is that the sounds of two words must match from the last stressed vowel onwards. In a masculine rhyme, the last syllable is stressed; in a feminine rhyme, the penultimate syllable is stressed. We collect phoneme and stress information from the CMU pronunciation dictionary. We pre- compute strict rhyme classes for words (see Table 3.1) and hash the vocabulary into those classes. 3.4.2 Slant Rhyme In practice, human poets do not always use strict rhymes. To give ourselves more exibility in choosing rhyme pairs, we allow for slant (or half) rhymes. By inspect- ing human rhyming data, we develop this operational denition of slant rhyme: 1. Let s1 and s2 be two potentially-rhyming phoneme sequences. 48 2. Replace ER with UH R in both sequences. 3. Let v1 and v2 be the last stressed vowels in s1 and s2. 4. Let w1 and w2 be last vowels in s1 and s2. 5. Let s1 = a1 v1 x1 w1 c1. Likewise, let s2 = a2 v2 x2 w2 c2. 6. Output NO under any of these circumstances: (a) v16= v2, (b) w16= w2, (c) c16= c2, (d) a16= NULL and a26= NULL and a1 = a2. 7. If x1 and x2 are single phonemes: (a) If x1 x2, then output YES. 6 (b) Otherwise, output NO. 8. If x1 and x2 contain dierent numbers of vowels, output NO. 9. Let p1 and q1 be the rst and last phonemes of x1. Let p2 and q2 be the same for x2. 10. If (p1 = p2) and (q1 q2), output YES. 11. If (p1 p2) and (q1 = q1), output YES. 12. Otherwise, output NO. Words whose last syllable is stressed do not participate in slant rhymes. Example slant rhymes taken from our generated poems include Viking/ghting, snoopy/spooky, baby/crazy and comic/ironic. We pre-compute a coarse version of slant rhyme classes (Table 3.1) with the pattern \v i * w i c i ". If two words hash 6 x y if phonemes x and y are similar. Two phonemes are similar if their pairwise score according to Hirjee and Brown (2010) is greater than -0.6. This includes 98 pairs, such as L/R, S/SH, and OY/UH. 49 to the same coarse class, then we subsequently accept or reject depending on the similarity of the intermediate phonemes. 3.4.3 Non-Topical Rhyming Words For rare topics, we may not have enough related words to locate seven rhyming pairs. For example, we generate 1000 related words for the topic Viking, but only 32 of them are found in our 14,368-word vocabulary. To give a chance for all topical words/phrases to be used as rhyme words, for each strict rhyme class, we add the most common word in our song lyric corpus to the list of related words. 
In addition, we add words from popular rhyme pairs 7 (like do/you and go/know) to the list of related words with a low topic similarity score. 3.4.4 Rhyme word selection We rst hash all related words/phrases into rhyme classes. Each collision generates a candidate rhyme pair (s1, s2), which we score with the maximum of cosine(s1, topic) and cosine(s2, topic). So that we can generate many dierent sonnets on the same topic, we choose rhyme pairs randomly with probability proportional to their score. After choosing a pair (s1, s2), we remove it, along with any other candidate pair that contains s1 or s2. Because a poem's beginning and ending are more important, we assign the rst rhyme pair to the last two lines of the sonnet, then assign other pairs from beginning of the sonnet towards the end. 50 Figure 3.2: An FSA compactly encoding all word sequences that obey formal sonnet constraints, and dictating the right-hand edge of the poem via rhyming, topical words delight, chance, ... and joy. 3.5 Constructing FSA of Possible Poems After choosing rhyme words, we create a large nite-state acceptor (FSA) that compactly encodes all word sequences that use these rhyme words and also obey formal sonnet constraints: Each sonnet contains 14 lines. Lines are in iambic pentameter, with stress pattern (01) 5 . Following poetic convention, we also use (01) 5 0, allowing feminine rhyming. Each line ends with the chosen rhyme word/phrase for that line. Each line is punctuated with comma or period, except for the 4th, 8th, 12th, and 14th lines, which are punctuated with period. To implement these constraints, we create FSA states that record line number and syllable count. For example, FSA state L2-S3 (Figure 3.2) signies \I am in line 2, and I have seen 3 syllables so far". From each state, we create arcs for each 7 http://slate.me/OhTKCA 51 feasible word in the vocabulary. For example, we can move from state L1-S1 to state L1-S3 by consuming any word with stress pattern 10 (such as table or active). When moving between lines (e.g., from L1-S10 to L2-S1), we employ arcs labeled with punctuation marks. To x the rhyme words at the end of each line, we delete all arcs pointing to the line-nal state, except for the arc labeled with the chosen rhyme word. For speed, we pre-compute the entire FSA; once we receive the topic and choose rhyme words, we only need to carry out the deletion step. In the resulting FSA, each path is formally a sonnet. However, most of the paths through the FSA are meaningless. One FSA generated from the topic natural language contains 10 229 paths, including this randomly-selected one: Of pocket solace ammunition grammar. An tile pretenders spreading logical. An stories Jackie gallon posing banner. An corpses Kato biological ... Hence, we need a way to search and rank this large space. 3.6 Path extraction through FSA with RNN To locate uent paths, we need a scoring function and a search procedure. For example, we can build a n-gram word language model (LM)|itself a large weighted FSA. Then we can take a weighted intersection of our two FSAs and return the highest-scoring path. While this can be done eciently with dynamic program- ming, we nd that n-gram models have a limited attention span, yielding poor poetry. 52 Instead, we use an RNN language model (LM). We collect 94,882 English songs (32m word tokens) as our training corpus, 8 and train 9 a two-layer recurrent net- work with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997). 
10 When decoding with the LM, we employ a beam search that is further guided by the FSA. Each beam state C t;i is a tuple of (h;s;word;score), where h is the hidden states of LSTM at step t in ith state, and s is the FSA state at step t in ith state. The model generates one word at each step. At the beginning,h 0;0 is the initial hidden state of LSTM,s 0;0 is the start state of FSA, word 0;0 = <START> and score 0;0 = 0. To expand a beam state C t;i , we rst feedh t;i andword into the LM and get an updated hidden stateh next . The LM also returns a probability distribution P (V ) over the entire vocabularyV for next word. Then, for each succeeding state s suc of s t;i in the FSA and the word w next over each edge from s t;i to s suc , we form a new state (h next ;s suc ;w next ;score t;i + log(P (w next ))) and push it into next beam. Because we x the rhyme word at the end of each line, when we expand the beam states immediately before the rhyme word, the FSA states in those beam states have only one succeeding state|LN-S10, where N = [1; 14], and only one succeeding word, the xed rhyme word. For our beam size b = 50, the chance is quite low that in those b words there exists any suitable word to precede that rhyme word. We solve this by generating the whole sonnet in reverse, starting from the nal rhyme word. Thus, when we expand the state L1-S8, we can choose 8 http://www.mldb.org/ 9 We use the toolkit: https://github.com/isi-nlp/Zoph RNN 10 We use a minibatch of 128, a hidden state size of 1000, and a dropout rate of 0.2. The output vocabulary size is 20,000. The learning rate is initially set as 0.7 and starts to decay by 0.83 once the perplexity on a development set starts to increase. All parameters are initialized within range [0:08; +0:08], and the gradients are re-scaled when the global norm is larger than 5. 53 from almost every word in vocabulary instead of just b possible words. The price to pay is that at the beginning of each line, we need to hope in thoseb words there exists some that are suitable to succeed comma or period. Because we train on song lyrics, our LM tends to generate repeating words, like never ever ever ever ever. To solve this problem, we apply a penalty to those words that already generated in previous steps during the beam search. To create a poem that ts well with the pre-determined rhyme words at the end of each line, the LM model tends to choose \safe" words that are frequent and suitable for any topic, such as pronouns, adverbs, and articles. During decoding, we apply a reward on all topically related words (generated in Section 4) in the non-rhyming portion of the poem. Finally, to further encourage the system to follow the topic, we train an encoder- decoder sequence-to-sequence model (Sutskever et al., 2014). For training, we select song lyric rhyme words and assemble them in reverse order (encoder side), and we pair this with the entire reversed lyric (decoder side). At generation time, we put all the selected rhyme words on the source side, and let the model to generate the poem conditioned on those rhyme words. In this way, when the model tries to generate the last line of the poem, it already knows all fourteen rhyme words, thus possessing better knowledge of the requested topic. We refer to generating poems using the RNN LM as the \generation model" and to this model as the \translation model". 3.7 Results and Analysis Sample outputs produced by our best system are shown in Figures 3.3 and 3.4. 
We nd that they generally stay on topic and are fairly creative. If we request a 54 Bipolar Disorder Existence enters your entire nation. A twisted mind reveals becoming manic, An endless modern ending medication, Another rotten soul becomes dynamic. Or under pressure on genetic tests. Surrounded by controlling my depression, And only human torture never rests, Or maybe you expect an easy lesson. Or something from the cancer heart disease, And I consider you a friend of mine. Without a little sign of judgment please, Deliver me across the borderline. An altered state of manic episodes, A journey through the long and winding roads. Figure 3.3: Sample sonnet generated from the topic phrase bipolar disorder. poem on the topic Vietnam, we may see the phrase Honky Tonkin Resolution; a dierent topic leads the system to rhyme Dirty Harry with Bloody Mary. In this section, we present experiments we used to select among dierent versions of our poem generator. The rst experiment tests the eect of encouraging topical words in the body of the poem, via a direct per-word bonus. For 40 dierent topics, we generate 2 sonnets with and without encouragement, using the same set of rhyme words. Then we ask 23 human judges to choose the better sonnet. Each judge compares sonnets for 10 dierent topics. Table 3.3 shows that using topical words increases the quality of the sonnets. Next, we compare the translation model with generation model. For each of 40 topics, we generate one poem with generation model and one poem with 55 Love at First Sight An early morning on a rainy night, Relax and make the other people happy, Or maybe get a little out of sight, And wander down the streets of Cincinnati. Girlfriend Another party started getting heavy. And never had a little bit of Bobby, Or something going by the name of Eddie, And got a nger on the trigger sloppy. Noodles The people wanna drink spaghetti alla, And maybe eat a lot of other crackers, Or sit around and talk about the salsa, A little bit of nothing really matters. Civil War Creating new entire revolution, An endless nation on eternal war, United as a peaceful resolution, Or not exist together any more. Figure 3.4: Sample stanzas generated from dierent topic phrases. translation model, using the same set of rhyme words. We ask 25 human judges to chose the better poem. Each judge compares sonnets for 10 dierent topics. This experiment is run separately for sonnets and stanzas. Table 3.4 shows how the translation model generates better poems, and Figure 3.5 compares two stanzas. We check for plagiarism, as it is common for optimal-searching RNNs to repeat large sections of the training data. We hypothesize that strong conditions on rhyme, meter, repetition, and ambiguously-stressed words will all mitigate against plagiarism. We nd that on average, each sonnet copies only 1:2 5-grams from the 56 Preference Encourages Does Not Encourage Cannot Decide Sonnets 54% 18% 28% Table 3.3: Users prefer the system that encourages the use of related words in the body (non-rhyme) portion of the poem. 40 poems are tested with 23 judges. Generation Another tiny thousand ashes scattered. Model And never hardly ever really own, Or many others have already gathered, The only human being left alone. Translation Being buried under ashes scattered, Model Many faces we forgotten own, About a hundred thousand soldiers gathered, And I remember standing all alone. Figure 3.5: Stanzas generated with and without a encoder-decoder translation model for topic death. training data. 
If we relax the repeated-word penalty and the iambic meter, this number increases to 7:9 and 10:6 copied 5-grams, respectively. Considering the lack of copying, we nd the RNN-generated grammar to be quite good. The most serious|and surprisingly common|grammatical error is the wrong use of a and an, which we x in a post-processing step. 3.8 Other Languages and Formats To show the generality of our approach, we modify our system to generate Spanish- language poetry from a Spanish topic. We use these resources: Preference Generation Model Translation Model Cannot Decide Stanzas 26% 43% 31% Sonnets 21% 57% 22% Table 3.4: Users prefer poems created with the encoder-decoder translation model over those that use only the RNN language model in generation mode. 40 poems are tested with 25 judges. 57 word stress rhyme v- -v consultado 0010 -ado yes aduciendo 0010 -endo yes yes r egimen 100 -egimen hospital 001 -al yes Figure 3.6: Sample word analyses needed to construct Spanish Hafez. v- and -v indicate whether the word starts and/or ends with a vowel sound. A song lyric corpus for training our RNN. We download 97,775 Spanish song lyrics from LyricWikia, 11 which amounts to 20m word tokens and 219k word types. A Spanish Wikipedia dump 12 consisting of 885m word tokens, on which we run word2vec to nd words and phrases related to the topic. Our vocabulary consists of the 20k most frequent lyric words. For each word, we compute its syllable-stress pattern and its rhyme class (see Figure 3.6). Because Spanish writing is quite phonetic, we can retrieve this from the letter strings of the vocabulary. For any given vocabulary word: 13 1. We remove silent h, and convert y into i. 2. We count the number of syllables by isolating vowel groups. In such groups, weak vowels (i;u) attached to strong vowels (a;e;o) do not form separate syllables, unless they are accented (d -as versus dios). Strong clusters are broken into separate syllables (eg, ca-er). 11 http://lyrics.wikia.com/wiki/Category:Language/Spanish 12 https://dumps.wikimedia.org/eswiki/20160305/eswiki-20160305-pages-meta- current.xml.bz2 13 http://community.dur.ac.uk/m.p.thompson/verse.htm 58 3. We determine which vowel (and therefore syllable) is stressed. If any vowel is accented, it is stressed. If the word is accent-free, then the second-to-last syllable is stressed, unless the word ends in a consonant other than n or s, in which case the last syllable is stressed. 4. We form the word's rhyme class by breaking o a letter sux starting at the last stressed vowel (as in English). Weak vowels do not participate (e.g., tienda! -enda, not -ienda). We remove h from the rhyme, so b uho rhymes with contin uo. Because rhyming is easier in Spanish than English, we do not need slant rhyme. Most Spanish poetic formats enforce some number of syllables per line, without meter. However, there are two caveats when counting syllables: 1. Sinalefa merges vowels across word boundaries. Thus, la obra is counted as two syllables instead of three, and va a hacer is counted as two syllables instead of four. A line may therefore have more words than syllables. 2. For the last word of a line (only), we count up to its last stressed syllable, then add one. This means that even though iambic meter is not employed, we still need stress patterns to correctly count syllables. 
We implement these constraints in the FSA framework, now with separate states for \I have seen M syllables, and the last word ended in a vowel sound" and \I have seen M syllables, and the last word ended in a consonant sound." Technically speaking, the FSA includes single-state cycles for the Spanish word a, due to sinalefa. Line-ending states can only be reached by words that have their syllable count adjusted as in point 2 above. Figure 3.7 shows a sample Spanish output. The format is the classical Spanish soneta, which consists of 14 eleven-syllable lines under the rhyme scheme ABBA 59 Mariposa Quieres saber d onde est a el escorpi on, Ni ayer ni antes vos sos corona dorada. Ya os ves m as tal cual tortuga pintada, A el nos gusta andar con cola marr on. Ella es qui en son las alas de alg un gorri on. Si al n pod es ver tu imagen manchada, O hoy vas bajo un cielo azul plateada, Por qu e est as tan lejos del aguij on. No hay luz que al sol se enreda en tus palmera. Ay por qu e eres v bora venenosa, Sin querer igual a un enredadera. Y si a un sue~ nas con ser mariposa, En vez de abrir los ojos y espera, Sabes muy bien que el amor no es gran cosa. Figure 3.7: Sample Spanish poem generated in classical soneta form, on the topic mariposa (butter y). ABBA CDC DCD. This scheme requires us to choose up to four words with the same rhyme. Overall, we also nd Spanish outputs to be uent, fairly creative, and on topic. Grammatical problems are a bit more common than in our English generator| for example, adjacent words sometimes disagree in number or gender. The RNN generalizations that permit these errors no doubt also permit creative phrasings. 3.9 Conclusion We have described Hafez, a poetry generation system that combines hard for- mat constraints with a deep-learning recurrent network. The system uses special 60 techniques, such as rhyme-word choice and encoder-decoder modeling, to keep the poem on topic. 61 Chapter 4 Interactive Poetry Generation System In the previous chapter, we described a system that generates any number of distinct poems for any user-supplied topic. However, the only input to this system is the topic, and the system does not let users adjust the generated poems. In this chapter, we extend our poetry generation system and design an interactive poetry generation system which enables users to revise and polish generated poems by adjusting various style congurations. We describe the web interface for the system where users can rate the quality of each poem. We explain how, based on such feedback, the system learns to adjust its parameters to improve poetry quality. 1 4.1 Introduction Automated poetry generation is attracting increasing research eort. Researchers approach the problem by using grammatical and semantic templates (Oliveira, 2009, 2012) or treating the generation task as a translation/summarization task (Zhou et al., 2009; He et al., 2012; Yan et al., 2013; Zhang and Lapata, 2014; Yi et al., 2016; Wang et al., 2016a; Ghazvininejad et al., 2016). However, such poetry generation systems face these challenges: 1 Work described in this chapter is joint with Xing Shi. 62 1. Diculty of evaluating poetry quality. Automatic evaluation methods, like BLEU, cannot judge the rhythm, meter, creativity or syntactic/semantic coherence, and furthermore, there is no test data in most cases. Subjective evaluation requires evaluators to have relatively high literary training, so systems will receive limited feedback during the development phase. 2 2. 
Inability to adjust the generated poem. When poets compose a poem, they usually need to revise and polish the draft from dierent aspects (e.g., word choice, sentiment, alliteration, etc.) for several iterations until satisfac- tion. This is a crucial step for poetry creation. However, given a user-supplied topic or phrase, most existing automated systems can only generate dierent poems by using dierent random seeds, providing no other support for the user to polish the generated poem in a desired direction. 3. Slow generation speed. Generating a poem may require a heavy search procedure. For example, our poetry generation system (Ghazvininejad et al., 2016) needs 20 seconds for a four-line poem. Such slow speed is a serious bot- tleneck for a smooth user experience, and prevents the large-scale collection of feedback for system tuning. This work is based on our previous poetry generation system introduced in chapter 3, called Hafez. This system generates poems in three steps: (1) search for related rhyme words given user-supplied topic, (2) create a nite-state accep- tor (FSA) that incorporates the rhyme words and controls meter, and (3) use a recurrent neural network (RNN) to generate the poem string, guided by the FSA. We address the above-mentioned challenges with the following approaches: 2 The Dartmouth Turing Tests in the Creative Arts (bit.ly/20WGLF3), in which human experts are employed to judge the generation quality, is held only once a year. 63 1. We build a web interface 3 for our poem generation system, and for each generated poem, the user can rate its quality from 1-star to 5-stars. Our logging system collects poems, related parameters, and user feedback. Such crowd-sourcing enables us to obtain large amounts of feedback in a cheap and ecient way. Once we collect enough feedback, the system learns to nd a better set of parameters and updates the system continuously. 2. We add additional weights during decoding to control the style of generated poem, including the extent of words repetition, alliteration, word length, cursing, sentiment, and concreteness. 3. We increase speed by pre-calculation, pre-loading model parameters, and pruning the vocabulary. We also parallelize the computation of FSA expan- sion, weight merging, and beam search, and we port them into a GPU. Overall, we can generate a four-line poem within 2 seconds, ten times faster than our previous CPU-based system. With the web interface's style control and fast generation speed, people can generate creative poems within a short time. Table 4.1 shows one of the poems generated in a poetry mini-competition where 7 people are asked to use Hafez to generate poems within 15 minutes. We also conduct experiments on Amazon Mechanical Turk, which show: rst, through style-control interaction, 71% users can nd a better poem than the poem generated by the default conguration. Second, based on users' evaluation results, the system learns a new conguration which generates better poems. 3 Live demo at http://52.24.230.241/poem/advance/ 64 Topic: Presidential elections To hear the sound of communist aggression! I never thought about an exit poll, At a new Republican convention, On the other side of gun control. Table 4.1: One poem generated in a 15-minute human/computer interactive poetry contest. 4.2 System Description Figure 4.1: Overview of Hafez Figure 4.1 shows an overview of Hafez. In the web interface, a user can input topic words or phrases and adjust the style conguration. 
This information is then sent to our backend server, which is primarily based on our previously-described work in chapter 3. First, the backend will use the topic words/phrases to nd related rhyme word pairs by using a word2vec model and a pre-calculated rhyme- type dictionary. Given these rhyme word pairs, an FSA that encodes all valid word sequences is generated, where a valid word sequence follows certain type of meter and puts the rhyme word at the end of each line. This FSA, together with the user- supplied style conguration, is then used to guide the Recurrent Neural Network 65 (RNN) decoder to generate the rest of the poem. User can rate the generated poem using a 5-star system. Finally, the tuple (topic, style conguration, generated poem, star-rating) is pushed to the logging system. Periodically, a module will analyze the logs, learn a better style conguration and update it as the new default style conguration. 4.2.1 Example in Action Figure 4.2 provides an example in action. The user has input the topic word \love" and left the style conguration as default. After they click the \Generate" button, a four-line poem is generated and displayed. The user may not be satised with current generation, and may decide to add more positive sentiment and encourage a little bit of the alliteration. After they move the corresponding slider bars and click the \Re-generate with the same rhyme words" button, a new poem is returned. This poem has more positive sentiment (\A lonely part of you and me tonight" vs. \A lovely dream of you and me tonight") and more alliteration (\My merry little love", \The lucky lady" and \She sings the sweetest song" ). 4.2.2 Style Control During the RNN's beam search, each beam cell records the current FSA state s. Its succeeding state is denoted ass suc . All the words over all the succeeding states forms a vocabularyV suc . To expand the beam stateb, we need to calculate a score for each word in V suc : score(w;b) = score(b) + logP RNN (w) + X i w i f i (w);8w 2 V suc (4.1) 66 (a) Poem generated with default style settings (b) Poem generated with user adjusted style settings Figure 4.2: A poem generated with (a) default style conguration and (b) user- adjusted style conguration. where logP RNN (w) is the log-probability of word w calculated by RNN. score(b) is the accumulated score of the already-generated words in beam state b . f i (w) is i-th feature function and w i is the corresponding weight. To control the style, we design the following 8 features: 1. Encourage/discourage words. User can input words that they would like in the poem, or words to be banned. f(w) = I(w;V enc=dis ) where I(w;V ) = 1 if w is in the word list V , otherwise I(w;V ) = 0. w enc = 5 and w dis =5. 67 2. Curse words. We pre-build a curse-word list V curse , and f(w) =I(w;V curse ). 3. Repetition. To control the extent of repeated words in the poem. For each beam, we record the current generated words V history , and f(w) = I(w;V history ). 4. Alliteration. To control how often adjacent non-function words start with the same consonant sound. In the beam cell, we also record the previous generated word w t1 , and f(w t ) = 1 if w t and w t1 share the same rst consonant sound, otherwise it equals 0. 5. Word length. To control a preference for longer words in the generated poem. f(w) =length(w) 2 . 6. Topical words. For each user-supplied topic words, we generate a list of related words V topical . f(w) =I(w;V topical ). 7. Sentiment. 
We pre-build a word list together with its sentiment scores based on SentiWordNet (Baccianella et al., 2010). f(w) equals to w's sentiment score. 8. Concrete words. We pre-build a word list together with a score to re ect its concreteness based on (Brysbaert et al., 2014). f(w) equals to w's concrete- ness score. 4.2.3 Speedup To nd the rhyming words related to the topic, we employ a word2vec model. Given a topic word or phrasew t 2V , we nd related wordsw r based on the cosine distance: w r = arg max wr2V 0 V cosine(e wr ;e wt ) (4.2) 68 where e w is the embedding of word w. Then we calculate the rhyme type of each related word w r to nd rhyme pairs. To speed up this step, we carefully optimize the computation with these meth- ods: 1. Pre-load all parameters into RAM. As we are aiming to accept arbitrary topics, the vocabulary V of word2vec model is very large (1.8M words and phrases). Pre-loading saves 3-4 seconds. 2. Pre-calculate the rhyme types for all words w2V 0 . During runtime, we use this dictionary to lookup the rhyme type. 3. Shrink V'. As every rhyme word/phrase pairs must be in the target vocabu- lary V RNN of the RNN, we further shrink V 0 =V\V RNN : To speedup the RNN decoding step, we use GPU processing for all forward- propagation computations. For beam search, we port to GPU the two most time- consuming parts, calculating scores with Equation 4.1 and nding the top words based the score: 1. We warp all the computation needed in Equation 4.1 into a single large GPU kernel launch. 2. With beam size B, to nd the top k words, instead of using a heap sort on CPU with complexityO(BjV suc jlogk), we do a global sort on GPU with complexityO(BjV suc jlog(BjV suc j)) in one kernel launch. Even though the complexity increases, the computation time in practice reduces quite a bit. Finally, our system can generate a 4-line poem within 2 seconds, which is 10 times faster than the previous CPU-based version. 69 4.2.4 Learn a New Style Conguration Except the xed weight for encouragement and discouragement of words, all other 7 weights form our style conguration space: W =fw i ji = 1::7g (4.3) and we denote the default conguration as W d , intuitively selected. Users will usually start with W d to generate their rst poem p d , and if they later adjust the style conguration and click \Re-generate with same rhyme words" button, the new poemp i will change the style accordingly, but still with the same rhyme words as p d . In the logging system, a unique hash h r is recorded to distin- guish dierent rhyme word sets in the poem. After proper processing, our logging system organizes the data points as a dic- tionary D =fh r : [(p d ;W d ;r d ); (p 1 ;W 1 ;r 1 );:::; (p n ;W n ;r n )]g. r d represents user's star rating for the poem with default setting(p d ) while r i 2f1;:::;ng represents user's star rating for p 1 ;:::;p n . To learn a new style congurationW new , we construct our training dataD train = fh r : [(W 1 ; r 1 );:::; (W n ; r n )]g, where W i = W i W d and r i = r i r d . Then, we t a quadratic regression between the rating change r and each weight change w j 2 W independently: r =a(w j ) 2 +bw j +c (4.4) and the new weight w j;new is 70 User: Alexa, start Custom Poetry Alexa: Welcome to Custom Poetry. Please give me a topic. User: Mountains Alexa: A hundred thousand Morrison formation, An ancient crown of gold or mountain chains, Mountains from the land of elevation, A northern storm across the hills and plains. Alexa: Feel free to give me another topic. User: Lakes Alexa: ... 
Alexa: Feel free to give me another topic. User: Stop Alexa: Thank you for trying Custom Poetry. Have a nice day! Table 4.2: Spoken conversation between a user and Alexa. w j;new =w j;d + arg max w j r (4.5) 4.2.5 Alexa Skill To feel beauty, rhythm, and emotion in poetry, it is better to listen to a spoken version. With this in mind, we also publish our system as an Amazon Alexa Skill (\Custom Poetry"), so that users can ask Alexa to compose and recite a poem on any topic. Table 4.2 shows a sample conversation between a user and Alexa. 4 4 This Alexa skill won nalist award in Amazon Alexa skill challenge in 2017. 71 4.3 Experiments We design an Amazon Mechanical Turk task to explore the eect of style options. In this task Turkers rst use Hafez to generate a default poem on an arbitrary topic with the default style conguration, and rate it. Next, they are asked to adjust the style congurations to re-generate at least ve dierent adjusted poems with the same rhyme words, and rate them as well. Improving the quality of adjusted poems over the default poem is not required for nishing the task, but it is encouraged. For each task, Turkers can select the best generated poem, and if subsequent human judges (domain experts) rank that poem as \great", a bonus reward will be assigned to that Turker. We gathered data from 62 completed HITs (Human Intelligence Tasks) for this task. 4.3.1 Human-Computer Collaboration This experiment tests whether human collaboration can help Hafez generate better poems. In only 10% of the HITs, the reported best poem was generated by the default style options, i.e., the default poem. Additionally, in 71% of the HITs, users assign a higher star rating to at least one of the adjust poems than the default poem. On average the best poems got +1:4 more stars compared to the default one. However, poem creators might have a tendency to report a higher ranking for poems generated through the human/machine collaboration process. To sanity check the results we designed another task and asked 18 users to compare the default and the reported best poems. This experiment seconded the original rank- ings in 72% of the cases. 72 -1 0 1 2 3 4 Normalized Topical Weight -4 -3 -2 -1 0 1 2 3 4 Normalized Score (a) -5 -4 -3 -2 -1 0 1 2 3 4 5 Normalized Concreteness Weight -4 -3 -2 -1 0 1 2 3 4 Normalized Score (b) -5 -4 -3 -2 -1 0 1 2 3 4 5 Normalized Sentiment Weight -4 -3 -2 -1 0 1 2 3 4 Normalized Score (c) -5 -4 -3 -2 -1 0 1 2 3 4 5 Normalized Repetition Weight -4 -3 -2 -1 0 1 2 3 4 Normalized Score (d) Figure 4.3: The distribution of poem star-ratings against normalized topical, con- creteness, sentiment and repetition weights. Star ratings are computed as an oset from the version of the poem generated from default settings. We normalize all features weights by calculating their oset from the default values. The solid curve represents a quadratic regression t to the data. To avoid overlapping points, we plot with a small amount of random noise added. 4.3.2 Automatic tuning for quality We learn new default congurations using the data gathered from Mechanical Turk. As we explained in Section 4.2.4, we examine the eect of dierent feature weights like repetition and sentiment on star ranking scores. We aim to cancel out the eect of topic and rhyme words on our scoring function. We achieve this by plotting the score oset from the default poem for each topic and set of rhyme words. Figure 4.3 shows the distribution of scores against topical, concreteness, 73 sentiment and repetition weights. 
In each plot the zero weight represents the default value. Each plot also shows a quadratic regression curve t to its data. In order to alter the style options toward generating better default poems, we re-set each weight to the maximum of each quadratic curve. Hence, the new weights encourage more topical, less concrete, more positive words and less repetition. It is notable that for sentiment, users prefer both more positive and more negative words to the initial neutral setting, but the preference is slightly biased towards positive words. We update Hafez's default settings based on this analysis. We ask 29 users to compare poems generated on the same topic and rhyme words using both old and new style settings. In 59% of the cases, users prefer the poem generated by the new setting. We thus improve the default settings for generating a poem, though this does not mean that the poems cannot be further improved by human collaboration. In most cases, a better poem can be generated by collaboration with the system (changing the style options) for the specic topic and set of rhyme words. 4.4 Conclusion We demonstrate Hafez, an interactive poetry generation system. It enables users to generate poems about any topic, and revise generated texts through multiple style congurations. We speed up the system by vocabulary pruning and GPU computation. Together with an easily-accessible web interface, we can collect large numbers of human evaluations in a short timespan, making automatic system tuning possible. 74 Chapter 5 Poetry Translation - Information Theory Foundation In Chapters 3 and 4, we described a poetry generation system that can generate any number of distinct poems for any user supplied topic. Furthermore, this system enables users to revise and polish the generated poems by adjusting dierent style conguration settings. Although the poems generated by this system have topical coherency, they do not follow any specic start to end ow. One way to face this problem is by modeling the structure, ow and goal of human generated poems and use these models in computer generation. It is pos- sible to model the structure of poetry using nite state machines (Greene et al., 2010) or discourse relation (Harris, 1989), but generating a coherent text to tell a specic story or convey a desired feeling is an extremely challenging task. Never- theless, computer algorithms are progressing in rewriting a concept into another format. Automatic caption generation from images (Xu et al., 2015), text simpli- cation (Wang et al., 2016b) and summarization (Gambhir and Gupta, 2017) are some examples. We decide to follow this path and let the computer borrow high- level ow structure from human literature, specically from poems. We propose a machine translation system designed to translate poems into a poetic format. 75 Many examples of human poetry translation suggest that poetry translation is far from literal translation. Translators change source poems to convey con- cepts and images into other cultures. In this chapter, we propose an information- theoretic way to measure the language variation in a quantitative way in the trans- lation process. Next, we use this technique to estimate the language variation in translation of non-creative text (Spanish to English translation of Europarl corpus v7 (Koehn, 2005)). Later, in Chapter 6, this tool is used to measure language variation in poetry translation and study whether a translator translates poetry more freely compared to a general text. 
5.1 Introduction Zoph et al. (2015) ask the question \How much information does a human trans- lator add to the original?" That is, once a source text has been compressed, how many additional bits are required to encode its human translation? If translation were a deterministic process, the answer would be close to zero. But in reality, we observe an amount of free variation in target texts. We might guess, therefore, that human translators add something like 10% or 20% extra information, as they work. To get an upper bound on this gure, Zoph et al. (2015) devise and implement an algorithm to actually compress target English in the presence of source Spanish. 1 The size of their compressed English is 68% of the size of compressed Spanish. This 1 Zoph et al. (2015) do not conrm that their bilingual data was indeed translated from Spanish to English, rather than vice-versa, or from a third language. We proceed similarly in this chapter, deferring whether one direction is more predictable than another. 76 bound seems rather generous. In setting up a common task, Zoph et al. (2015) encourage researchers to develop improved bilingual compression technology. 2 In this chapter, we investigate answering the same question. We do not do this by building better compression algorithms, but by seeing how well human beings can predict the behavior of human translators. Shannon (1951) explores this exactly question for the simpler case of estimating the entropy of free text (not translation). If a human subject were able to write a probability distribution for each subsequent character in a text (given prior context), these distributions could be converted directly into entropy. But it is hard to get these from human subjects. Shannon instead asks a subject to simply guess the next character until she gets it right, and he records how many guesses are needed to correctly identify it. The character sequence thus becomes a guess sequence, e.g.: T h e _ b r o k e n _ v 2 1 1 1 11 3 2 5 1 1 1 15 The subject's identical twin would be able to reconstruct the original text from the guess sequence, so in that sense, it contains the same amount of information. Let c 1 ;c 2 ;:::c n represent the character sequence, let g 1 ;g 2 ;:::g n represent the guess sequence, and let j range over guess numbers from 1 to 95, the number of printable English characters plus newline. (Shannon, 1951) provides two results. (Upper Bound). The entropy of c 1 ;c 2 ;:::c n is no greater than the unigram entropy of the guess sequence: 1 n log( Q i=1 P(g i )) = 1 n P n i=1 log(P(g i )) = P 95 j=1 P(j) log(P(j)) 2 www.isi.edu/natural-language/compression. 77 Monolingual Bilingual Machine 1.39 0.89 Human 1.25 0.42 (this thesis) Figure 5.1: Estimates of the entropy of English (in bits per character). Machine results are taken from actual compression algorithms (Zoph et al., 2015), while human results are computed from data elicited by the Shannon Game. The mono- lingual column is the original case studied by Shannon (1951). The bilingual column represents the number of additional bits needed to store English, given a Spanish source translation. This is because this unigram entropy is an upper bound on the entropy of g 1 ;g 2 ;:::g n , which equals the entropy of c 1 ;c 2 ;:::c n . In human experiments, Shannon obtains an upper bound of 1.3 bits per character (bpc) for English, sig- nicantly better than the character n-gram models of his time (e.g., 3.3 bpc for trigram). (Lower Bound). 
The entropy of c 1 ;c 2 ;:::c n is no less than: P 95 j=1 j [P(j) - P(j + 1)] log(j) with the proof given in his paper. Shannon reported a lower bound of 0.6 bpc. Figure 5.1 gives the context for our work, drawing prior numbers from Zoph et al. (2015). Novel contributions of our work are: A web-based bilingual Shannon Game tool. A collection of guess sequences from human subjects, in both monolingual and bilingual conditions. An analysis of machine guess sequences and their relation to machine com- pression rates. A method to estimate the amount of information added in human transla- tions. For English given Spanish, we obtain an upper bound of 0.48 bpc, 78 which is tighter than Shannon's method, and signicantly better than the current best bilingual compression algorithm (0.89 bpc). 5.2 Related works Compression has attracted research attention for a long time, e.g., (Rissanen and Langdon, 1981; Cleary and Witten, 1984; Witten et al., 1987; Brown et al., 1992; Zobel and Moat, 1995; Teahan and Cleary, 1996; Witten et al., 1999; Mahoney, 2005). The Hutter Prize (Hutter, 2006), a competition to compress a 100m-word extract of English Wikipedia, was designed to futher encourage research in text compression. Bilingual and multilingual text compression is a less-studied eld (Conley and Klein, 2008; Martinez-Prieto et al., 2009; Adiego et al., 2009, 2010; Sanchez-Martinez et al., 2012; Conley and Klein, 2013; Zoph et al., 2015). These papers provide dierent algorithms for compressing text in multilingual format, but they do not demonstrate how humans perform on this task. Shannon (1951) devised an experimental method to estimate the entropy of written English. After that, many other papers used Shannon's method to calcu- late the entropy of English on dierent passages and context lengths (Grignetti, 1964; Burton and Licklider, 1955; Paisley, 1966; Guerrero, 2009). Other papers use Shannon's technique to measure the entropy of other languages (Jamison and Jamison, 1968; Rajagopalan, 1965; Newman and Waugh, 1960; Siromoney, 1963; Wanas et al., 1976). Shannon's method was modied by Cover and King (1978) who asked their subjects to gamble on the next character. Nevill and Bell (1992) describe a parallel-text Shannon Game, but they work in an English-English paraphrasing scenario, with dierent versions of the Bible. Zoph et al. (2015) brie y mention a Shannon Game experiment in which human 79 subjects guessed subsequent characters in a human translation. They report a Shannon upper bound for English-given-Spanish guess sequences as 0.51 bpc, but they do not give details, and they do not separate testing sequences from training. We note that dierent text genres studied in the literature yield dierent results, as some genres are more predictable than others. Dierent alphabet sizes (e.g., 26 letters versus 95 characters) have a similar eect. Our interest is not to discover an entropy gure that holds across all genres, but rather to provide an information- theoretic way to measure this entropy. 5.3 Materials and Methods 5.3.1 Shannon Game Data Collection Figure 5.2 shows our bilingual Shannon Game interface. It displays the current (and previous) source sentence, an automatic Google translation (for assistance only), and the target sentence as guessed so far by the subject. The tool also sug- gests (for further assistance) word completions in the right panel. Our monolingual Shannon Game is the same, but with source sentences suppressed. 
To gather data, we asked 3 English-speaking subjects plus a team of 4 bilingual people to play the bilingual Shannon game. For each subject/team, we assigned a distinct 3-5 sentence text from the Spanish/English Europarl corpus v7 (Koehn, 2005) and asked them to guess the English characters of the text one by one. We gathered a guess sequence with 684 guesses from our team and a guess sequence with 1694 guesses from our individuals (2378 guesses in total). We also asked 3 individuals and a team of 3 people to play the monolingual Shannon game. We gathered a guess sequence with 514 guesses from our team and a guess sequence with 1769 guesses from our individuals (2283 guesses in total). 80 Figure 5.2: Bilingual Shannon Game interface. The human subject reads the Spanish source and guesses the translation, character by character. Additional aids include a static machine translation and a dynamic word completion list. Figure 5.3 shows examples of running the monolingual and bilingual Shannon Game on the same sentence. 5.3.2 An Estimation Problem Our overall task is now to estimate the (per-guess) entropy of the guess sequences we collect from human subjects, to bound the entropy of the translator's text. To accomplish this, we build an actual predictor for guess sequences. Shannon (1951) and Zoph et al. (2015) both use a unigram distribution over the guess numbers 81 Monolingual Shannon Game (no source sentence) I t _ i s _ d e f e n d e d _ t h r o u g h _ r e a s o n i n g . D w h i m e i f i a , m t a s _ _ c o n c i f m o d p t Bilingual Shannon Game (source sentence = "Se defiende con argumentos.") I t _ i s _ d e f e n d e d _ t h r o u g h _ r e a s o n i n g . w d . a Figure 5.3: Example guess data collected from the Shannon Game, in both mono- lingual (top) and bilingual (bottom) conditions. The human subject's guesses are shown from bottom up. For example, in the bilingual condition, after seeing `...reason', the subject guessed `.' (wrong), but then correctly guessed `i' (right). (in our case 1..95). However, we are free to use more context to obtain a tighter bound. For example, we may collect 2-gram or 3-gram distributions over our observed guess sequence and use those to estimate entropy. In this case, it becomes impor- tant to divide our guess sequences into training and test portions|otherwise, a 10-gram model would be able to memorize large chunks of the guess sequences and deliver an unreasonably low entropy. Shannon (1951) applies some ad hoc smoothing to his guess counts before computing unigram entropy, but he does not split his data into test and train to assess the merits of that smoothing. 82 We set aside 1000 human guesses for testing and use the rest for training|1378 in the bilingual case, and 1283 in the monolingual case. We are now faced with how to do eective modeling with limited training data. But before we turn to that problem, let us rst work in a less limited playground, that of machine guess sequences rather human ones. This gives us more data to work with, and further- more, because we know the machine's actual compression rate, we can measure how tight our upper bound is. 5.3.3 Machine Plays the Monolingual Shannon Game In this section, we force the state-of-the art text compressor PPMC (Witten et al., 1987) to play the monolingual Shannon Game. PPMC builds a context-dependent probability distribution over the 95 possible character types. We describe the PPMC estimator in detail in Appendix A.1. 
For the Shannon Game, we sort PPMC's distribution by probability, and continue to guess from the top down until we correctly identify the current character. We let PPMC warm up on 50m characters, then collect its guesses on the next 100m characters (for training data), plus an additional 1000 characters (our test data). For the text corresponding to this test data, PPMC's actual compression rate is 1.37 bpc. The simplest model of the training guess sequence is a unigram model. Table 5.1 shows the unigram distribution over training characters, for both machine and human guess data. 3 3 These numbers combine data collected from individuals and from teams (Section 5.3.1). In the bilingual case, teams outperformed individuals, guessing correctly on the rst try 94.3% of the time, versus 90.5% for individuals. In the monolingual case, individuals and teams performed equally well. 83 guess monolingual bilingual # machine human machine human (100m) (1283) (100m) (1378) 1 0.732 0.744 0.842 0.916 2 0.105 0.086 0.074 0.035 3 0.047 0.047 0.024 0.013 4 0.027 0.030 0.014 0.011 5 0.017 0.020 0.009 0.005 6 0.012 0.012 0.007 0.004 7 0.009 0.008 0.005 0.001 8 0.007 0.007 0.004 0.001 9 0.006 0.005 0.003 0.001 10 0.005 0.004 0.003 0 ... ... ... ... ... 93 7.05E-08 0 0 0 94 7.69E-08 0 0 0 95 1.09E-07 0 0 0 Table 5.1: Unigram probabilities of machine and human guesses, in both monolin- gual and bilingual conditions. Amounts of training data (in characters) are shown in parentheses. We consider two types of context|the g guess numbers preceding the current guess, and the c characters preceding the current guess. For example, if c = 3 and g = 2, we estimate the probability of the next guess number from previous 3 characters and previous 2 guess numbers. In: T h e _ c h a p 2 1 1 1 7 2 4 ? we calculate P(?j character context=c;h;a; guess context=2; 4). The context gives us more accurate estimates. For example, if c = 1 and the previous character is 'q', then we nd the machine able to correctly guess the next character on its rst try with probability 0.981, versus 0.732 if we ignore that 84 c g Machine Guessing Human (100m) (10m) (1m) (1k) (1283) 0 0 1.72 1.72 1.72 1.76 1.68 0 1 1.70 1.70 1.71 1.84 1.75 0 2 1.69 1.69 1.71 2.03 1.92 1 0 1.54 1.54 1.70 1.86 1.74 1 1 1.52 1.52 1.54 2.20 2.18 1 2 1.50 1.52 1.55 2.37 2.32 2 0 1.45 1.51 1.58 2.25 2.20 2 1 1.44 1.53 1.54 2.56 2.58 2 2 1.48 1.56 1.59 2.73 2.70 8 0 1.37 Table 5.2: Entropies of monolingual test guess-sequences (1000 guesses), given varying amounts of context (c = number of previous characters, g = number of previous guess numbers) and dierent training set size (shown in parentheses). Witten-Bell smoothing is used for backo to shorter contexts. context. Likewise, having g previous guesses allows us to model \streaks" on the part of the Shannon Game player. Asg andc grow, it becomes necessary to smooth, as test guess sequences begin to contain novel contexts. PPMC itself makes character predictions using c=8 andg=0, and it smooths with Witten-Bell, backing o to shorter n-gram contexts c = 1::7. We also use Witten-Bell, but with a more complex backo scheme to accommodate the two context streams g and c. If gc we back o to the model with g 1 previous guesses and c previous characters, and if g<c we back o to the model with g previous guesses and c 1 previous characters. Table 5.2 shows test-set entropies obtained from diering amounts of training data, and diering amounts of context. 
We draw several conclusions from this data: Character context (c) is generally more valuable than guess context (g). 85 With large amounts of training data, modest context (g=1,c=2) allows us to develop a fairly tight upper bound (1.44 bpc) on PPMC's actual compression rate (1.37 bpc). With small amounts of training data, Witten-Bell does not make eective use of context. In fact, adding more context can result in worse test-set entropy! The last column of Table 5.2 shows entropies for necessarily-limited human guess data, computed with the same methods used for machine guess data. We see that human guessing is only a bit more predictable than PPMC's. Indeed, PPMC's guesses are fairly good|its massive 8-gram database is a powerful counter to human knowledge of grammar and meaning. 5.3.4 Modeling Human Guess Sequences How can we make better use of limited training data? Clearly, we do not observe enough instances of a particular context to robustly estimate the probabilities of the 95 possible guess numbers that may follow. Rather than estimating the multinomial directly, we instead opt for a parametric distribution. Our rst choice is the geometric distribution, with one free parameterp, the chance of a successful guess at any point. For each context in the training data, we t p to best explain the observations of which guesses follow. This one parameter can be estimated more robustly than the 94 free parameters of a multinomial. Figure 5.4 shows that the geometric distribution is a decent t for our observed guess data, but it does not model the head of the distribution well|the probability of a correct guess on the rst try is consistently greater than p. Therefore, we introduce a smoothing method (\Frequency and Geometric Smooth- ing") that only applies geometric modeling to guess numbers greater thani, where 86 0 5 10 15 20 25 30 35 40 45 50 −3.5 −3 −2.5 −2 −1.5 −1 −0.5 0 Guess Number Log Probability (a) 0 5 10 15 20 25 30 35 40 45 50 −2.5 −2 −1.5 −1 −0.5 0 Guess Number Log Probability (b) 0 5 10 15 20 25 30 −2.5 −2 −1.5 −1 −0.5 0 Guess Number Log Probability (c) 0 1 2 3 4 5 6 7 8 −1.5 −1 −0.5 0 Guess Number Log Probability (d) Figure 5.4: Guess number distributions from human monolingual Shannon Game experiments (training portion). Plot 5.4a shows all 1238 guesses, while plots 5.4b, 5.4c and 5.4d show guesses made in specic character contexts ' ' (space),'a' and 'p'. The y-axis (probability of guess number) is given in log scale, so a geometric distribution is represented by a straight line. We observe that the single-parameter geometric distribution is a good t for either the head or the tail of the curve, but not both. data is sparse. For each context, we choose i such that we have seen all guess numbers 1::i at least k times each, where k =min( Number of samples seen in context 20 ; 4) 87 Monolingual Human Guessing Bilingual Human Guessing c g Witten-Bell Geometric Frequency and Witten-Bell Geometric Frequency and Smoothing Smoothing Geo. Smoothing Smoothing Smoothing Geo. 
Smoothing 0 0 1.68 2.02 1.62 0.54 0.73 0.67 0 1 1.75 2.06 1.62 0.56 0.72 0.66 0 2 1.92 2.06 1.65 0.61 0.72 0.67 1 0 1.74 1.57 1.50 0.65 0.57 0.48 1 1 2.18 1.55 1.48 0.84 0.56 0.48 1 2 2.32 1.52 1.49 0.93 0.56 0.49 2 0 2.20 1.65 1.60 0.94 0.63 0.63 2 1 2.58 1.57 1.57 1.10 0.63 0.62 2 2 2.70 1.59 1.58 1.18 0.63 0.63 Table 5.3: Entropies of human guess-sequences (1000 test-set guesses), given vary- ing amounts of context (c = number of previous characters ,g = number of previous guess numbers) and dierent smoothing methods. Prediction models are trained on a separate sequence of 1283 guesses in the monolingual case, and 1378 guesses in the bilingual case. Table 5.3 (left half) demonstrates the eect of dierent smoothing methods on estimated entropies for human guess data. The monolingual Witten-Bell smooth- ing column in this gure is the same of last column of Table 5.2. The right half of Table 5.3 shows the bilingual case for human guess sequence. 5.4 Results For calculating nal entropy bounds, we rst divide our guess sequence into 1000 for training data, 100 for development, and remainder for test (1183 for the mono- lingual case and 1278 for the bilingual case). We use the development set to nd the best context model and smoothing model. In all experiments, usingc = 1 pre- vious characters,g = 1 previous guesses, and Frequency and Geometric Smoothing works best. For the machine case, we use the algorithm of Zoph et al. (2015). Note that the machine and human subjects both make use of source-sentence context when 88 Guesser Shannon Our Improved Compression Rate Shannon Upper Bound Upper Bound Lower Bound Monolingual Machine 1.76 1.63 1.39 0.63 Human 1.65 1.47 1.25 0.57 Bilingual Machine 1.28 1.01 0.89 0.46 Human 0.54 0.48 0.42 0.21 Table 5.4: Summary of our entropy bounds. predicting. However, we do not use source context when modeling guess sequences, only target context. Table 5.4 summarizes our results. As shown in the gure, we also computed Shannon lower bounds (see Section 5.1) on all our guess sequences. For the bilingual case of English-given-Spanish, we give a 0.48 bpc upper bound and a 0.21 bpc lower bound. In the case of machine predictors, we nd that our upper bound is loose by about 13%, making it reasonable to guess that true translation entropy might be near 0.42 bpc. Hence, we can conclude that in this case, human translator adds 0.42 bpc information in the translation process. 5.5 Information Loss So far, we estimate how much information a human translator adds to the source text when they translate. We use H(EjS) to represent the conditional entropy of an English textE given Spanish textS, i.e., how many bits are required to reconstruct E from S. A related question is how much information from the original text is lost in the process of translation. In other words, how much of the precise wording of S is no longer obvious when we only have the translation E? We measure the number of bits needed to reconstruct the S from E, denoted H(SjE). We could estimate H(SjE) by running another (reversed) bilingual Shannon game in which 89 subjects predict Spanish from English. However, fortunately we can skip this time- consuming process and calculate H(SjE) based on the denition of joint entropy (Cover and Thomas, 2006): H(EjS) + H(S) = H(SjE) + H(E) (5.1) where H(E) and H(S) are the monolingual entropies of E and S. We can estimate H(S) using the monolingual Spanish Shannon game like what we did for estimating H(E). 
However, as we show in this chapter, PPMC compres- sion is close to what we get from the monolingual human Shannon game (1.39 vs. 1.25). So we can estimate H(S)' 1:26, using PPMC on Spanish Europarl data, as reported by Zoph et al. (2015). Using this estimate, we obtain the amount of information lost in translation as 1:26 + 0:42 1:39 = 0:29. We see that in the case Spanish and English, the translation process both adds and subtracts information. Other translation scenarios are asymmetric. For example, when translating the word \uncle" into Persian, we must add information (maternal or paternal uncle), but we do not lose information, as \uncle" can be reconstructed perfectly from the Persian word. 5.6 Conclusion We have presented an information-theoretic way to measure the language vari- ation in the human translation process. We estimate that a translator adds 0.42 / 1.25 = 34% in translating Spanish Europarl data to English. 4 4 Here, the denominator is monolingual English entropy, rather than Spanish, but we assume these are close under human-level compression. 90 It should be possible to reduce our 0.48 upper bound by better modeling of guess sequence data, and by use of source-language context. We also conjecture that the bilingual Shannon Game can be used for machine translation evaluation, on the theory that good human translators exhibit more predictable behavior than bad machine translators. 91 Chapter 6 Poetry Translation In Chapters 3 and 4, we introduced a novel poetry generation system. This system is capable of generating locally coherent poems, but the poems lack an overall start to end ow. In this chapter, we propose a poetry translation system as a means for generating globally coherent poems. In this chapter, we discuss that poetry translation is a highly controversial task. It is often impossible to stay fully loyal to the meaning and preserve the aesthetic aspects of the poem at the same time. As a result, translators normally compromise on literal accuracy to gain literary richness. This injects a relatively high amount of variation in the translation process, compared to translating non-creative texts. To support this hypothesis, we measure the language variation (a.k.a. the \added information") in poetry translation using the information theoretic method intro- duced in Chapter 5, and compare it to the language variation in translation of non-creative texts (Spanish to English translation of Europarl corpus v7 (Koehn, 2005)) measured in the previous chapter. We show that language variation almost doubles when poems get translated compared to non-creative texts. Following this observation, we present an automatic poetry translation system that is able to incorporate far more language variation into the translation process compared to traditional translation methods. Consequently, unlike previous works that often fail to produce any translation for xed rhyme and rhythm patterns, our system always translates a source text to an English poem. Human evaluation ranks translation quality as acceptable 78:2% of the time. 92 6.1 Human Poetry Translation It is highly debated that poetry cannot be translated. - Poetry, indeed, cannot be translated. { Samuel Johnson. - Poetry is what gets lost in translation. { Robert Frost. Nevertheless, in practice poems have always been translated and will continue to be translated between dierent languages and cultures. 
Poetry constitutes a large portion of the literature in many languages and has a notable presence in the culture of the corresponding regions. In order to introduce these cultures to people of other languages, it is essential to have a proper translation of this large cultural component. Poetry translation is a means for preserving, extending and learning about the culture, traditions and history of dierent regions in dierent time areas. Without translating masterpieces like Dante's Divine Comedy, or Homer's Iliad and Odyssey, these poets would be less known in today's world. Contemporary poets as well can gain more publicity by translating their works in multiple languages, this will also help more people to enjoy a wider variety of poetry options (Malmkjaer and Windle, 2011). However, poetry translation is an intricate task. Poems are often composed of multiple interconnected semantic and aesthetic levels. Translating all these levels together in the format of a single poem is not always possible. Most literary devices used in poems are either language specic or only make sense in a specic culture. For instance, sound based devices like alliteration, assonance, or consonance, or even rhythm and rhyme are features specic to the language of the poem and very hard to translate directly. Moreover, devices like allusion, idiom, irony, metaphor or symbol are normally specic to a culture and hard to translate without extra explanation. 93 上个星期的战斗至少夺取12个人的生命。 At least 12 people were killed in the battle last week. Last week's ght took at least 12 lives. The ghting last week killed at least 12. The battle of last week killed at least 12 persons. At least 12 people lost their lives in last week's ghting. At least 12 persons died in the ghting last week. At least 12 died in the battle last week. At least 12 people were killed in the ghting last week. During last week's ghting, at least 12 people died. Last week at least twelve people died in the ghting. Last week's ghting took the lives of twelve people. Figure 6.1: Eleven human translations of the same source sentence (LDC2002T01). These hardships make poetry translation a highly controversial task. One side of the solution is fully translating the meaning and sacricing a large portion of the aesthetic component. And at the opposite side, the solution is transferring the gist of the meaning while recreating the poem in the target language (Naghiyeva, 2015). Translators have to choose a position somewhere within this interval. This choice is nontrivial even when it comes to keeping or sacricing the most essen- tial component of the poem, i.e. the rhyme. One group of the translators argue that it is hard to nd proper rhyme words, especially for less exible target lan- guages (Osers, 1996), and forcing a rhyme pattern will result in semantic shifts that can destroy the integrity of the poem (Bly, 1982). The other group believe that rhyme is a non-separable part of the meaning of the poems, and if it is not possible to translate rhymes in poetry, it is better to only translate non-rhyming poems (Barnstone, 1984; Moett, 1989, 1999). The complexity of literal translation of the poems while keeping their aesthetic components has given the translators the liberty to improvise in order to create a balance between preserving the meaning and transferring the aesthetic aspects. As a result, we see much more variation in dierent translations of a poem compared to 94 空山不见人, 但闻人语响。 返景入深林, 复照青苔上。 So lone seems the hills; there is no one in the sight there. 
But whence is the echo of voices I hear? The ray of the sunset pierce slanting the forest, And in their re ection green mosses appear. {W.J.B Fletcher There seems to be no one on the empty mountain. And yet I think I hear a voice, Where sunlight, entering a grove, shines back to me from the green moss. {Witter Bynner & Kiang Kang-hu An empty hill, and no one in sight But I hear the echo of voices. The slanting sun at evening penetrates the deep woods And shines re ected on the blue lichens. {Soame Jenyns Through the deep wood, the slanting sunlight Casts motley patterns on the jade-green mosses. No glimpse of man in this lonely mountain, Yet faint voices drift on the air. {Chang Yin-nan & Lewic C. Walmsley On the lonely mountain I meet no one, I hear only the echo of human voices. At an angle the sun's rays enter the depths of the wood, And shine upon the green moss. {C.J. Chen & Michael Bullock On the empty mountains no one can be seen, But human voices are heard to resound. The re ected sunlight pierces the deep forest And falls again upon the mossy ground. {James J.Y. Liu Figure 6.2: Six human translations of the same poem by Wang Wei. 95 Nel mezzo del cammin di nostra vita, mi ritrovai per una selva oscura, ch e la diritta via era smarrita. Halfway through the journey we are living I found myself deep in a darkened forest, For I had lost all trace of the straight path. {James Finn Cotter Midway upon the journey of our life I found myself within a forest dark, For the straightforward pathway had been lost. {Henry Longfellow When I had journeyed half of our life's way I found myself within a shadowed forest for I had lost the path that does not stray {Allen Mandelbaum Midway in the journey of our life I came to myself in a dark wood, for the straight way was lost. {Charles Singleton Figure 6.3: Four dierent human translations of the poem \Inferno" by Dante. dierent translations of a sentence. For instance consider Figure 6.1 and Figure 6.2 where twelve dierent translations of a single Chinese sentence and six dierent translations of a single Chinese poem (Weinberger and Paz, 1987) are presented respectively. Figure 6.3 demonstrates another example of language variation in poetry translation. 96 6.2 Language Variation in Poetry Translation In the previous section, we argue that poetry translation is a hard task for humans. It is often impossible to completely preserve the meaning while transferring the aesthetic aspects of the poem. As a result, we see more language variation in translations of a poem compared to a non-creative text. In Chapter 5, we introduce an information-theoretic way to quantitatively mea- sure the language variation in the human translation process. We estimate that a human translator adds 0:42 bpc in translating Spanish Europarl data to English. In this section, we use this information-theoretic tool to measure how much information human translators add in translating Spanish poems to English, and whether poems are translated more freely. To do so, we choose four 20th-century Spanish poems and their English trans- lations from Twentieth-century Latin American poetry: a bilingual anthology Tap- scott (1996), and ask two human volunteers to play the bilingual Shannon game (see Section 5.3.1), by guessing the English translation character by character. As line crossing is common in translating poems, we provide the whole Spanish poem as the context for translation instead of just the current source sentence. We gather a guess sequence with 2247 guesses. 
Like the experiments in Section 5.4, we divide our guess sequence into 1000 for training, 100 for development, and the remaining 1147 guesses for test. We use the development set to nd the best context model and smoothing model. Table 6.1 shows our estimate for the amount of information a human translator adds in poetry translation. The General Translation row in Table 6.1 shows the entropy bound calculated for Europarl translation (see Section 5.4). We see that the compression rate in poetry translation is almost twice the compression rate in Europarl translation. 97 Entropy Compression Entropy Upper Bound Rate Lower Bound General Translation 0.48 0.42 0.21 Poetry Translation 0.95 0.83 0.43 Table 6.1: Entropy bounds and compression rates of general translation vs. poetry translation. 6.3 Automatic Poetry Translation Despite recent improvements in machine translation, automatic translation of poetry remains a challenging problem. As we discussed in the previous sections, this challenge is partially due to the intrinsic complexities of translating a poem. In this chapter, we introduce a method for automatic poetry translation. As an example, consider the following French poem: French poem: Puis je venais m'asseoir pr`es de sa chaise Pour lui parler le soir plus `a mon aise. (Literally: Then I came to sit near her chair To discuss with her the evening more at my ease.) Our goal is to translate this poem into English, but also to obey target rhythm and rhyme patterns specied by the user, such as 2-line rhyming iambic pentameter, ten syllables per line with alternating stress 0101010101, where 0 represents an unstressed syllable, and 1 represents a stressed syllable. Lines strictly rhyme if their pronunciations match from the nal stressed vowel onwards; slant rhyming allows variation. Overall, this is a dicult task even for human translators. In spite of recent works in automatic poetry generation (Oliveira, 2012; He et al., 2012; Yan et al., 2013; Zhang and Lapata, 2014; Yi et al., 2016; Wang 98 et al., 2016a; Ghazvininejad et al., 2016, 2017; Hopkins and Kiela, 2017; Oliveira, 2017), little has been done on automatic poetry translation. Greene et al. (2010) use phrase-based machine translation techniques to translate Italian poetic lines to English-translation lattices. They search these lattices for the best translation that obeys a given rhythm pattern. Genzel et al. (2010) also use phrase-based machine translation technique to translate French poems to English ones. They apply the rhythm and rhyme constraints during the decoding process. Both methods report total failure in generating any translations with a xed rhythm and rhyme format for most of the poems. Genzel et al. (2010) report that their method can generate translations in a specied scheme for only 12 out of 109 6-line French stanzas. This failure is due to the nature of the phrase-based machine translation (PBMT) systems. PBMT systems are bound to generate translations according to a learned bilingual phrase table. These systems are well-suited to unconstrained translation, as often the phrase table entries are good translations of source phrases. However, when rhythm and rhyme constraints are applied to PBMT, translation options become extremely limited, to the extent that it is often impossible to generate any translation that obeys the poetic constraints (Greene et al., 2010). In addition, literal translation is not always desired when it comes to poetry. 
As we showed in previous section, human translators add more language variation in translating poetry. However, PBMT is bound to translate phrase-by-phrase, and it cannot easily add, remove, or alter details of the source poem. In this chapter, we propose the rst neural poetry translation system and show its quality in translating French to English poems. Our system is much more exible than those based on PBMT, and is always able to produce translations into any scheme. In addition, we propose two novel improvements to increase the quality of the translation while satisfying specied rhythm and rhyme constraints. 99 Our proposed system generates the following translation for the French couplet mentioned above: French poem: Puis je venais m'asseoir pr`es de sa chaise Pour lui parler le soir plus `a mon aise. (Literally: Then I came to sit near her chair To discuss with her the evening more at my ease.) Our system: And afterwards I came to sit together. To talk about the evening at my pleasure. 6.4 Data We use a French translation of Oscar Wilde's Ballad of Reading Gaol Wilde (2001) by Jean Guiloineau 1 as our input poem, and the original Wilde's poem as the human reference. This test set contains 109 6-line stanzas, 29 of which we use for development. For each stanza, we require our machine translation to produce odd lines with iambic tetrameter and even lines with iambic trimeter, with even lines (2, 4, 6) rhyming. 1 https://bit.ly/2GN1ZGk 100 6.5 Proposed Method 6.5.1 Model A: Initial Model Unconstrained Machine Translation. The base of our poetry translation sys- tem is an encoder-decoder sequence-to-sequence model (Sutskever et al., 2014) with a two-layer recurrent neural network (RNN) with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997). It is pre-trained on parallel French-English WMT14 corpus. 2 Specically, we use 2-layer LSTM cells with 1000 hidden cells for each layer. For pre-training, we set the dropout ratio to 0.5. Batch size is set to 128, and the learning rate is initially set as 0.5 and starts to decay by 0.5 when the perplexity of the development set starts to increase. Gradients are clipped at 5 to avoid gradient explosion. We stop pre-training the system after 3 epochs. In order to adapt the translation system to in-domain data, we collect 16,412 English songs with their French translations and 12,538 French songs with their English translations (6M word tokens in total) as our training corpus, 3 and continue training the system (warm start) 4 with this dataset. This encoder-decoder RNN model is used to generate the unconstrained translation of the poems. Enforcing Rhythm in Translation. To enforce the rhythm constraint, we adopt the technique of Ghazvininejad et al. (2016). We create a large nite-state acceptor (FSA) that compactly encodes all word sequences that satisfy the rhythm constraint. In order to generate a rhythmic translation for the source poem, we constrain the possible LSTM translations with this FSA. To do so, we alter the 2 http://www.statmt.org/wmt14/translation-task.html 3 http://lyricstranslate.com/ 4 We continue training the system while we set dropout ratio to 0.2, and keep the other settings xed. 101 beam search of the decoding phase of the neural translation model to only generate outputs that are accepted by this FSA. Enforcing Rhyme in Translation. Ghazvininejad et al. (2016) x the rhyme words in advance and build an FSA with the chosen rhyme words in place. 
Unlike their work, we do not x the rhyme words in the FSA beforehand, but let the model choose rhyme words during translation. We do so by partitioning the vocabulary into rhyme classes and building one FSA for each class. This FSA accepts word sequences that obey the rhythm pattern and end with any word within the cor- responding rhyme class. Then we translate each line of the source poem multiple times, once according to each rhyme class. In the nal step, for each set of rhyming lines, we select a set of translations that come from the same rhyme class and have the highest combined translation score. In practice, we just make FSAs for each of the 100 most frequent rhyme classes (out of 1505), which covers 67% of actual rhyming word tokens in our development set. 6.5.2 Model B: Biased Decoding with Unconstrained Trans- lation Naive application of rhythm and rhyme constraints to the neural translation system limits the translation options of the system. Sometimes the beam search nds no related translation that satises the constraints, forcing the decoder to choose an unrelated target-language token. The system does not have a way to recover from this situation, and continues to generate a totally unrelated phrase. An example is rhythm- and rhyme-constrained translation of \Et buvait l'air frais jusqu'au soir" (\And drinking fresh air until the evening") to \I used to close my hair" by our initial system (Figure 6.4). 102 We therefore propose to use the output of unconstrained translation as a guide- line for the constrained translation process. To do so, we encourage the words that appear in the unconstrained translation during the decoding step of the constrained one. We encourage by multiplying their RNN log probabilities by 5 during beam search. Figure 6.4 shows how this technique addresses the problem. 6.5.3 Model C: Biased Decoding with All Potential Trans- lation Our poetry translation system is also challenged by rare words for which the system has not learned a good translation. The unconstrained system produces a special <UNK> token for these cases, but the FSA does not accept <UNK>, as it is not pronounceable. We can let the system produce its next guess instead, but<UNK> is a sign that the translation system is not sure about the source meaning. To overcome this problem, we use an idea similar to model B. This time, in addition to encouraging the unconstrained translated words, we encourage all poten- tial translations of the foreign words. To get the potential translations, we use the translation table (t-table) extracted from parallel French-English training data using Giza++ Och and Ney (2003). This way, the system receives an external signal that guides it toward selecting better translations for the rare foreign word. We run ve iterations of each of IBM models 1, 2, HMM, and 4 to get the t-table. An example of how this method improves the poem quality over model B can be observed in the fth line of the poems in Figure 6.5. 103 French poem: Sans mains tordues, comme ces hommes, Ces pauvres hommes sans espoir, Qui osent nourrir l'esp erance Dans le caveau du d esespoir: Il regardait vers le soleil Et buvait l'air frais jusqu'au soir. Human reference: He did not wring his hands, as do Those witless men who dare To try to rear the changeling Hope In the cave of black Despair: He only looked upon the sun, And drank the morning air. Unconstrained machine translation: Like these men These poor men without hope, Who dare to feed the hope. 
In the vault of despair He was looking to the sun And drinking fresh air until the evening. Rhythmic and rhyming translation by model A: Without a crooked hand as men. These hopeless people there. Who dare to feed the expectations. Surrounded by despair. He only looking at the sun. I used to close my hair. Rhythmic and rhyming translation by model B: Without a crooked hand as men. These hopeless people there. Who dare to feed the expectations. Surrounded by despair. He only looking at the sun. Was drinking fresh of air. Figure 6.4: An example of poetry translation by models A and B. Biased decoding with unconstrained translation (Model B) produces a better translation compared to the baseline rhythm- and rhyme-constrained system. 104 French poem: Il n'y avait que sable et boue O u s' etait ouverte la tombe. Le long des murs de la prison On ne voyait aucune tombe. Un petit tas de chaux ardente Servait de linceul a cette ombre. Human reference: For where a grave had opened wide, There was no grave at all: Only a stretch of mud and sand By the hideous prison-wall, And a little heap of burning lime, That the man should have his pall. Unconstrained machine translation: There was only sand and mud Where the grave opened. Along the walls of prison We saw no grave A little pile of <UNK> <UNK> to this shadow. Rhythmic and rhyming translation by model B: But there was only sand and mud. To where the grave was laid. Along the walls of prison wall. We saw no masquerade. A little lot of prostitutes. They used to shroud this shade. Rhythmic and rhyming translation by model C: But there was only sand and mud. To where the grave was laid. Along the walls of prison wall. We saw no masquerade. A little bunch of shiny lime. They used to shroud this shade. Figure 6.5: An example of poetry translation by models B and C. Biased decoding with all potential translation (Model C) produces a better translation compared to Model B. 105 6.6 Results 6.6.1 Automatic Evaluation We use the standard machine translation evaluation method, BLEU (Papineni et al., 2002) for automatic evaluation of our system. We translate 80 test French stanzas with dierent models and calculate the BLEU score for each system. The results can be found in Table 6.2. Overall, the observed BLEU scores are much lower than those of translation of non-creative texts with similar size of training data. This is expected due to the higher language variation in the task of poetry translation. In our experiments, unconstrained machine translation model reaches the highest BLEU score as it deals with no constraints. The initial constrained system (Model A) gets a BLEU score of 4.42. Table 6.2 clearly shows that Models B and C have increasingly improved the poetry translation to almost ll the gab between the initial and the unconstrained system. Method Name BLEU Score Unconstrained Machine Translation 6.63 Model A 4.42 Model B 5.40 Model C 5.94 Table 6.2: BLEU scores of our dierent systems. 6.6.2 Human Evaluation Our rst human evaluation experiment compares model A with model B. These systems generated non-identical translations for 77 (out of 80) of the test stanzas. We ask 154 Amazon mechanical turk judges to compare these translations (each 106 pair of translations was compared twice). We present the judges with the French poem for reference and did not mention that the poems are computer generated. Judges have the option to prefer either of the poems or state they cannot decide. 
The results in Table 6.3 clearly show that the model B generates better translations. In the second experiment, we compare model B with model C. We have 84 judges compare 42 dierent poems. Table 6.4 shows that judges preferred the outputs of model C by a 17:7% margin. We also ask 238 judges to rank translations of all the 80 stanzas of the test set as very bad, bad, ok, good, and very good. Table 6.5 shows the distribution of this ranking. We see that 78:2% of the judges ranked the output ok or more (49:6% of the poems were ranked good or very good). Figure 6.6 shows an example of the poems ranked as very good. Method Name User Preference Model A 18.2% Cannot Decide 19.5% Model B 62.3% Table 6.3: Users prefer translations generated by model A. Method Name User Preference Model B 26.7% Cannot Decide 28.9% Model C 44.4% Table 6.4: Users prefer translations generated by model C. 107 French poem: Tels des vaisseaux dans la temp^ ete, Nos deux chemins s' etaient crois es, Sans m^ eme un signe et sans un mot, Nous n'avions mot a d eclarer ; Nous n' etions pas dans la nuit sainte Mais dans le jour d eshonor e. Human reference: Like two doomed ships that pass in storm We had crossed each other's way: But we made no sign, we said no word, We had no word to say; For we did not meet in the holy night, But in the shameful day. Translation by our full system (model C): And like some ships across the storm. These paths were crossed astray. Without a signal nor a word. We had no word to say. We had not seen the holy night. But on the shameful day. Figure 6.6: A sample poem translated by our full system (Model C). Very Bad Bad OK Good Very Good 5.9% 15.9% 28.6% 35.3% 14.3% Table 6.5: Quality of the translated poems by model C. 6.7 Conclusion In this chapter we presented the rst neural poetry translation system and pro- vided two novel methods to improve the quality of the translations. We conducted human evaluations on generated poems and showed that the proposed improve- ments highly improve the translation quality. 108 Chapter 7 Conclusion and Future Work 7.1 Conclusion In this work we tackle the challenges facing creative natural language generation (NLG). We chose poetry generation as one of the major elds of creative language generation and develop tools and techniques for facing these challenges. The rst problem of creative NLG is lack of an immediate application to moti- vate research on this area and enable a meaningful comparison of dierent existing methods. We propose generating password poems as an immediate application of poetry generation. We implement a system for generating highly secure password poems and show that the generated passwords are more memorable and more pre- ferred compared to passwords generated by other methods that guarantee the same security. The second problem that we approach is the need to combine human knowledge with the power of deep learning methods to extract world knowledge from data. We propose a method for combining nite-state machinery with recurrent neural networks (RNN) in a system for generating poems for any given topic. Finite state machines guarantee correctness of metrical and rhyme structure, while the RNN ensures coherence of the poems. In order to improve the quality of generated poems, we propose to borrow ideas from human poems into the poems generated by our system and develop a poetry translation system. This system will also be based on combination of nite-state machines and RNNs. 
109 The last and one of the most important challenges of creative NLG is lack of a quantitative metric to measure the quality of the generations. Without this metric, it is dicult to tune the parameters of the system and drive improvements to it. Without a quantitative metric it is also very hard to compare dierent methods and nd the state of the art in the eld. We propose quantitative evaluation metrics for the password poems and for the topical poetry system. Furthermore we build an on-line interactive poetry generation system that enables users to revise system generated poems by adjusting style conguration settings. The system interface also allows users to rate the quality of the poem. We collect users' rating for poems with various style settings and use them to automatically tune the system style parameters. We also use quantitative metrics for poetry translation. 110 Reference List Joaqu n Adiego, Nieves Brisaboa, Miguel Mart nez-Prieto, and Felipe S anchez- Mart nez. A two-level structure for compressing aligned bitexts. In Procedings of String Processing and Information Retrieval, 2009. Joaqu n Adiego, Miguel Mart nez-Prieto, Javier Hoyos-Tor o, and Felipe S anchez- Mart nez. Modelling parallel texts for boosting compression. In Proceedings of Data Compression Conference, 2010. Rodrigo Agerri, John Barnden, Mark Lee, and Alan Wallington. Metaphor, infer- ence and domain independent mappings. In Proceedings of RANLP, 2007. Tej Anand and Gary Kahn. SPOTLIGHT: A data explanation system. In Pro- ceedings of the Eighth Articial Intelligence for Applications, 1992. Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the 7th Conference on Language Resources and Evaluation, 2010. Satanjeev Banerjee and Ted Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of IJCAI, 2003. John Barnden and Mark Lee. An articial intelligence approach to metaphor understanding. 6(1):399{412, 2002. Willis Barnstone. Preferences in translating poetry. Translation: Literary, Lin- guistic and Philosophical Perspectives, 1984. Kim Binsted and Graeme Ritchie. An implemented model of punning riddles. In Proceedings of AAAI, 1994. Kim Binsted and Graeme Ritchie. Computational rules for generating punning riddles. HUMOR-International Journal of Humor Research, 10(1):25{76, 1997. Kim Binsted, Benjamin Bergen, and Justin McKay. Pun and non-pun humour in second-language learning. In Proceedings of CHI Workshop on Humor Modeling in the Interface, 2003. 111 Robert Bly. The eight stages of translation. The Kenyon Review, 4(2):68{89, 1982. Danushka Bollegala and Ekaterina Shutova. Metaphor interpretation using para- phrases extracted from the web. PloS one, 8(9):e74304, 2013. Gerlof Bouma. Normalized (pointwise) mutual information in collocation extrac- tion. In Proceedings of Biennial GSCL Conference, 2009. Selmer Bringsjord and David Ferrucci. Articial intelligence and literary creativity: Inside the mind of brutus, a storytelling machine. Psychology Press, 1999. Peter Brown, Robert Pietra, Vincentand Mercer, Stephen Pietra, and Jennifer Lai. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31{40, 1992. Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. Concreteness rat- ings for 40 thousand generally known English word lemmas. Behavior research methods, 46(3):904{911, 2014. N. Burton and Joseph Licklider. 
Long-range constraints in the statistical structure of printed English. The American journal of psychology, 68(4):650{653, 1955. John Cleary and Ian Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396{402, 1984. Ehud Conley and Shmuel Klein. Using alignment for multilingual text compression. International Journal of Foundations of Computer Science, 19(01):89{101, 2008. Ehud Conley and Shmuel Klein. Improved alignment-based algorithm for mul- tilingual text compression. Mathematics in Computer Science, 7(2):137{153, 2013. Thomas Cover and Roger King. A convergent gambling estimate of the entropy of English. IEEE Transactions on Information Theory, 24(4):413{421, 1978. Thomas Cover and Joy Thomas. Elements of information theory (2nd edition). Wiley-interscience, 2006. Heather Crawford and John Aycock. Kwyjibo: automatic domain name genera- tion. Software: Practice and Experience, 38(14):1561{1567, 2008. Bo Dai, Dahua Lin, Raquel Urtasun, and Sanja Fidler. Towards diverse and natural image descriptions via a conditional GAN. arXiv preprint arXiv:1703.06029, 2017. Om Damani. Improving pointwise mutual information (PMI) by incorporating signicant co-occurrence. In Proceedings of ACL, 2013. 112 Aliya Deri and Kevin Knight. How to make a frenemy: Multitape FSTs for port- manteau generation. In Proceedings of HLT-NAACL, 2015. Belen Diaz-Agudo, Pablo Gervas, and Pedro Gonzalez-Calero. Poetry generation in COLIBRI. In Proceedings of ECCBR. 2002. Jonathan Dunn. What metaphor identication systems can tell us about metaphor- in-language. In Proceedings of the First Workshop on Metaphor in NLP, 2013. Dan Fass. met*: A method for discriminating metonymy and metaphor by com- puter. Computational Linguistics, 17(1):49{90, 1991. Dinei Florencio and Cormac Herley. A large-scale study of web password habits. In Proceedings of the 16th international conference on World Wide Web, 2007. Mahak Gambhir and Vishal Gupta. Recent automatic text summarization tech- niques: a survey. Articial Intelligence Review, 47(1):1{66, 2017. Morrie Gasser. A random word generator for pronounceable passwords. Technical report, Electronic Systems Division, Air Force Systems Command, USAF, 1975. Albert Gatt and Emiel Krahmer. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. arXiv preprint arXiv:1703.09902, 2017. Matt Gedigian, John Bryant, Srini Narayanan, and Branimir Ciric. Catching metaphors. In Proceedings of the Third Workshop on Scalable Natural Language Understanding. Association for Computational Linguistics, 2006. Dmitriy Genzel, Jakob Uszkoreit, and Franz Och. Poetic statistical machine trans- lation: rhyme and meter. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Lin- guistics, 2010. Pablo Gervas. An expert system for the composition of formal Spanish poetry. Knowledge-Based Systems, 14(3), 2001. Pablo Gervas, Belen Diaz-Agudo, Federico Peinado, and Raquel Hervas. Story plot generation based on CBR. Knowledge-Based Systems, 18(4):235{242, 2005. Marjan Ghazvininejad and Kevin Knight. How to memorize a random 60-bit string. In Proceedings of NAACL, 2015. Marjan Ghazvininejad and Kevin Knight. Humans outperform machines at the bilingual shannon game. Entropy, 19(1):15, 2016. 113 Marjan Ghazvininejad, Xing Shi, Yejin Choi, and Kevin Knight. Generating top- ical poetry. In Proceedings of EMNLP, pages 1183{1191, 2016. 
Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. Hafez: an interactive poetry generation system. In Proceedings of ACL Demo Track, 2017. Marjan Ghazvininejad, Yejin Choi, and Kevin Knight. Neural poetry translation. In Proceedings of NAACL, 2018. Harry Glahn. Computer-produced worded forecasts. Bulletin of the American Meteorological Society, 51(12):1126{1131, 1970. Eli Goldberg, Richard Kittredge, and Norbert Driedger. A new approach to the synthesis of weather forecast text. IEEE Expert (Special track on NLP), 1994. Erica Greene, Tugba Bodrumlu, and Kevin Knight. Automatic analysis of rhyth- mic poetry with applications to generation and translation. In Proceedings of EMNLP. Association for Computational Linguistics, 2010. Mario Grignetti. A note on the entropy of words in printed English. Information and Control, 7(3):304{306, 1964. Fabio Guerrero. A new look at the classical entropy of written English. arXiv preprint arXiv:0911.2284, 2009. Sarah Harmon. Figure8: A novel system for generating and evaluating gurative language. In Proceedings of ICCC, 2015. Mary Harris. Analysis of the discourse structure of lyric poetry. Computers and the Humanities, 23(4):423{428, 1989. Jing He, Ming Zhou, and Long Jiang. Generating Chinese classical poems with statistical machine translation models. In Proceedings of AAAI, 2012. Christian Hempelmann. Paronomasic puns: Target recoverability towards auto- matic generation. PhD thesis, Indiana University-Purdue University Indianapo- lis, 2003. Hussein Hirjee and Daniel Brown. Using automated rhyme detection to character- ize rhyming style in rap music. In Empirical Musicology Review, 2010. Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural com- putation, 9(8):1735{1780, 1997. Jack Hopkins and Douwe Kiela. Automatically generating rhythmic verse with neural networks. In Proceedings of ACL, 2017. 114 Dirk Hovy, Shashank Srivastava, Sujay Jauhar, Mrinmaya Sachan, Kartik Goyal, Huiying Li, Whitney Sanders, and Eduard Hovy. Identifying metaphorical word use with tree kernels. In Proceedings of the First Workshop on Metaphor in NLP, 2013. John Hutchins. The rst public demonstration of machine translation: the Georgetown-IBM system. Publicacion electronica, 2005. Marcus Hutter. 50,000 Euro prize for compressing human knowledge. http:// prize.hutter1.net, 2006. Accessed: 2017-07-25. Philip Inglesant and Angela Sasse. The true cost of unusable password policies: password use in the wild. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2010. Aaron Jaech, Rik Koncel-Kedziorski, and Mari Ostendorf. Phonological pun- derstanding. In Proceedings of NAACL, 2016. Dean Jamison and Kay Jamison. A note on the entropy of partially-known lan- guages. Information and Control, 12(2):164{167, 1968. Sundararaman Jeyaraman and Umut Topkara. Have your cake and eat it too{ infusing usability into text-password based authentication systems. In Proceed- ings of ACSAC, 2005. Long Jiang and Ming Zhou. Generating Chinese couplets using a statistical MT approach. In Proceedings of COLING, 2008. Aditya Joshi, Diptesh Kanojia, Pushpak Bhattacharyya, and Mark Carman. Sar- casm suite: A browser-based engine for sarcasm detection and generation. In Proceedings of AAAI, 2017. Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in neural information processing systems, 2015. 
Beata Klebanov, Chee Leong, Michael Heilman, and Michael Flor. Dierent texts, same metaphors: Unigrams and beyond. In Proceedings of the Second Workshop on Metaphor in NLP, 2014. Sheldon Klein. Automatic paraphrasing in essay format. Mech. Translat. & Comp. Linguistics, 8(3-4):68{83, 1965a. Sheldon Klein. Control of style with a generative grammar. Language, 41(4): 619{631, 1965b. 115 Sheldon Klein and Robert Simmons. Syntactic dependence and the computer generation of coherent discourse. Mechanical Translation, 7(2):50{61, 1963. Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Machine Translation Summit X, 2005. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Fed- erico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, and Richard Zens. Moses: Open source toolkit for statistical machine translation. In Pro- ceedings of the 45th Annual Meeting of the ACL, Demo and Poster Sessions. Association for Computational Linguistics, 2007. Stanley Kurzban. Easily remembered passphrases: a better approach. ACM SIGSAC Review, 3(2-4):10{21, 1985. Michael Lebowitz. Planning stories. In Proceedings of the cognitive science society, Hillsdale, 1987. Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In Proceed- ings of ACL, 2014. Boyang Li, Stephen Lee-Urban, George Johnston, and Mark Riedl. Story genera- tion with crowdsourced plot graphs. In Proceedings of AAAI, 2013a. Hongsong Li, Kenny Zhu, and Haixun Wang. Data-driven metaphor recognition and explanation. Transactions of the Association for Computational Linguistics, 1:379{390, 2013b. Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017. Matthew Mahoney. Adaptive weighting of context models for lossless data com- pression. Technical Report CS-2005-16, Florida Institute of Technology, 2005. Kirsten Malmkjaer and Kevin Windle. The Oxford handbook of translation studies. OUP Oxford, 2011. Hisar Manurung. An evolutionary algorithm approach to poetry generation. Ph.D. thesis, University of Edinburgh, 2003. Hisar Manurung, Graeme Ritchie, and Henry Thompson. Towards a computational model of poetry generation. In Proceedings of AISB Symposium on Creative and Cultural Aspects and Applications of AI and Cognitive Science. 2000. James Martin. A computational model of metaphor interpretation. Academic Press Professional, Inc., 1990. 116 Lara Martin, Prithviraj Ammanabrolu, William Hancock, Shruti Singh, Brent Harrison, and Mark Riedl. Event representations for automated story generation with deep neural nets. arXiv preprint arXiv:1706.01331, 2017. Miguel Martinez-Prieto, Joaquin Adiego, Felipe Sanchez-Martinez, Pablo Fuente, and Rafael Carrasco. On the use of word alignments to enhance bitext compres- sion. In Proceedings of Data Compression Conference, 2009. James Meehan. TALE-SPIN, an interactive program that writes stories. In Pro- ceedings of Ijcai, 1977. Tomas Mikolov, Kai Chen, Greg Corrado, and Jerey Dean. Ecient estimation of word representations in vector space. In Proceedings of NIPS, 2013a. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Je Dean. Dis- tributed representations of words and phrases and their compositionality. In Proceedings of NIPS, 2013b. George Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11), 1995. Tristan Miller and Iryna Gurevych. Automatic disambiguation of English puns. 
In Proceedings of ACL, 2015. Judith Moett. Playing scrabble without a board. In Translating Poetry, pages 144{160. Springer, 1989. Judith Moett. On formal translation. In Translation Of Poetry And Poetic Prose, pages 83{100. 1999. Michael Mohler, David Bracewell, David Hinote, and Marc Tomlinson. Semantic signatures for example-based linguistic metaphor detection. In Proceedings of the First Workshop on Metaphor in NLP, 2013. Shahla Naghiyeva. Does poetry lose or gain in translation? English Language and Literature Studies, 5(3):107, 2015. Srini Narayanan. Knowledge-based action representations for metaphor and aspect (KARMA). Computer Science Division, University of California at Berkeley dissertation, 1997. Srinivas Narayanan. Moving right along: A computational model of metaphoric reasoning about events. AAAI, 121127, 1999. Yael Netzer, David Gabay, Yoav Goldberg, and Michael Elhadad. Gaiku: Gener- ating haiku with word associations norms. In Proceedings of NAACL Workshop on Computational Approaches to Linguistic Creativity, 2009. 117 Craig Nevill and Timothy Bell. Compression of parallel texts. Information Pro- cessing & Management, 28(6):781{793, 1992. Edwin Newman and Nancy Waugh. The redundancy of texts in three languages. Information and Control, 3(2):141{153, 1960. Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational linguistics, 29(1), 2003. Hugo Oliveira. Automatic generation of poetry: an overview. In Proceedings of 1st Seminar of Art, Music, Creativity and Articial Intelligence, 2009. Hugo Oliveira. PoeTryMe: a versatile platform for poetry generation. Computa- tional Creativity, Concept Invention, and General Intelligence, 1, 2012. Hugo Gon calo Oliveira. A survey on intelligent poetry generation: Languages, features, techniques, reutilisation and evaluation. In Proceedings of the 10th International Conference on Natural Language Generation, 2017. Ewald Osers. Techniques of verse translation: German poetry. Salzburg Studies in English Literature, 101:468{476, 1996. William Paisley. The eects of authorship, topic, structure, and time of composi- tion on letter redundancy in English texts. Journal of verbal learning and verbal behavior, 5(1):28{34, 1966. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002. Jerey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, 2014. Rafael Perez and Mike Sharples. MEXICA: A computer model of a cognitive account of creative writing. Journal of Experimental & Theoretical Articial Intelligence, 13(2):119{139, 2001. Sasa Petrovic and David Matthews. Unsupervised joke generation from big data. In Proceedings of ACL, 2013. John Pierce and John Carroll. Language and machines: Computers in translation and linguistics. 1966. K. Rajagopalan. A note on entropy of kannada prose. Information and Control, 8 (6):640{644, 1965. 118 Mark Riedl and Robert Young. Narrative planning: Balancing plot and character. Journal of Articial Intelligence Research, 39:217{268, 2010. Jorma Rissanen and Glen Langdon. Universal modeling and coding. Information Theory, IEEE Transactions on, 27(1):12{23, 1981. Graeme Ritchie. Can computers create humor? AI Magazine, 30(3):71, 2009. 
Melissa Roemmele, Sosuke Kobayashi, Naoya Inoue, and Andrew M Gordon. An rnn-based binary classier for the story cloze test. LSDSem 2017, page 74, 2017. Francois Role and Mohamed Nadif. Handling the impact of low frequency events on co-occurrence based measures of word similarity|a case study of pointwise mutual information. In Knowledge Discovery and Information Retrieval, 2011. Felipe Sanchez-Martinez, Rafael Carrasco, Miguel Martinez-Prieto, and Joaquin Adiego. Generalized biwords for bitext compression and translation spotting. Journal of Articial Intelligence Research, 43:389{418, 2012. Claude Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50{64, 1951. Richard Shay, Patrick Gage Kelley, Saranga Komanduri, Michelle L Mazurek, Blase Ur, Timothy Vidas, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cra- nor. Correct horse battery staple: Exploring the usability of system-assigned passphrases. In Proceedings of the eighth symposium on usable privacy and secu- rity, page 7. ACM, 2012. Ekaterina Shutova, Lin Sun, and Anna Korhonen. Metaphor identication using verb and noun clustering. In Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010. Gift Siromoney. Entropy of Tamil prose. Information and Control, 6(3):297{300, 1963. Oliviero Stock and Carlo Strapparava. The act of creating humorous acronyms. Applied Articial Intelligence, 19(2):137{151, 2005. Andreas Stolcke. SRILM-an extensible language modeling toolkit. In Proceedings of INTERSPEECH, 2002. Ilya Sutskever, Oriol Vinyals, and Quoc Le. Sequence to sequence learning with neural networks. In proceedings of NIPS, 2014. Reid Swanson and Andrew Gordon. Say anything: Using textual case-based rea- soning to enable open-domain interactive storytelling. ACM Transactions on Interactive Intelligent Systems, 2(3):16, 2012. 119 Stephen Tapscott. Twentieth-century Latin American poetry: a bilingual anthology. University of Texas Press, 1996. William Teahan and John Cleary. The entropy of english using ppm-based models. In Data Compression Conference. IEEE, 1996. Alan Turing. Computing machinery and intelligence. Mind, 59(236):433{460, 1950. Scott Turner. Minstrel: a computer model of creativity and storytelling. 1993. Tony Veale. Once more, with feeling! using creative aective metaphors to express information needs. In Proceedings of ICCC, 2013. Tony Veale and Yanfen Hao. Comprehending and generating apt metaphors: a web-driven, case-based approach to gurative language. In Proceedings of AAAI, volume 2007, 2007. Tony Veale and Yanfen Hao. A uid knowledge representation for understanding and generating creative metaphors. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1. Association for Computa- tional Linguistics, 2008. M. Wanas, A. Zayed, M. Shaker, and E. Taha. First second-and third-order entropies of Arabic text. IEEE Transactions on Information Theory, 22(1): 123{123, 1976. Qixin Wang, Tianyi Luo, Dong Wang, and Chao Xing. Chinese song iambics generation with neural attention-based model. arXiv:1604.06274, 2016a. Tong Wang, Ping Chen, John Rochford, and Jipeng Qiang. Text simplication using neural machine translation. In Proceedings of AAAI, 2016b. Eliot Weinberger and Octavio Paz. Nineteen ways of looking at Wang Wei. Moyer Bell, 1987. Joseph Weizenbaum. ELIZA|a computer program for the study of natural lan- guage communication between man and machine. 
Communications of the ACM, 9(1):36{45, 1966. Oscar Wilde. Ballad of Reading Gaol. Electric Book Company, 2001. Ian Witten, Radford Neal, and John Cleary. Arithmetic coding for data compres- sion. Communications of the ACM, 30(6):520{540, 1987. Ian Witten, Alistair Moat, and Timothy Bell. Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, 1999. 120 Martin Wong and Andy Chun. Automatic haiku generation using VSM. In Pro- ceedings of ACACOS, 2008. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2015. Rui Yan, Han Jiang, Mirella Lapata, Shou-De Lin, Xueqiang Lv, and Xiaoming Li. I, Poet: Automatic Chinese poetry composition through a generative sum- marization framework under constrained optimization. In Proceedings of IJCAI, 2013. Xiaoyuan Yi, Ruoyu Li, and Maosong Sun. Generating chinese classical poems with RNN encoder-decoder. arXiv:1604.01537, 2016. Victor Yngve. Random generation of English sentences. In The 1961 Conference on Machine Translation of Languages and Applied Language Analysis, 1961. Boliang Zhang, Hongzhao Huang, Xiaoman Pan, Heng Ji, Kevin Knight, Zhen Wen, Yizhou Sun, Jiawei Han, and Bulent Yener. Be appropriate and funny: Automatic entity morph encoding. In Proceedings of ACL, 2014. Xingxing Zhang and Mirella Lapata. Chinese poetry generation with recurrent neural networks. In Proceedings of EMNLP, 2014. Ming Zhou, Long Jiang, and Jing He. Generating Chinese couplets and quatrain using a statistical approach. In Proceedings of Pacic Asia Conference on Lan- guage, Information and Computation, 2009. Justin Zobel and Alistair Moat. Adding compression to a full-text retrieval sys- tem. Software: Practice and Experience, 25(8):891{903, 1995. Barret Zoph, Marjan Ghazvininejad, and Kevin Knight. How much information does a human translator add to the original? In Proceedings of EMNLP, 2015. 121 Appendix A Appendix A.1 PPMC Compression PPM is the most well-known adaptive, predictive compression technique (Cleary and Witten, 1984). It predicts by producing a complete probability distribution for the next character P(Xjcontext), based on the previous n-1 characters. It adaptively constructs empirical character n-gram tables (usually n=1..5) as it com- presses. In a given context, a n-gram table may predict only a subset of characters, so PPM reserves some probability mass for an escape (ESC), after which it exe- cutes a hard backo to the (n-1)-gram table. PPM models are dierent in assigning probabilities to ESC. In PPMA, P(ESC) is 1/(1+D), where D is the number of times the context has been seen. PPMB uses q/D, where q is the number of distinct character types seen in the context. PPMC uses q/(q+D), also known as Witten-Bell smoothing. PPMD uses q/2D. For compression, after the model calculates the probability of the next character given the context, it sends it to the arithmetic coder (Rissanen and Langdon, 1981; Witten et al., 1987). Figure A.1 sketches the technique. We produce context- dependent probability intervals, and each time we observe a character, we move to its interval. Our working interval becomes smaller and smaller, but the better our predictions, the wider it stays. A document's compression is the shortest bit string that ts inside the nal interval. In practice, we do the bit-coding as we navigate 122 Figure A.1: Arithmetic coding. 
For compression, after the model computes the probability of the next character given the context, it passes that probability to the arithmetic coder (Rissanen and Langdon, 1981; Witten et al., 1987). Figure A.1 sketches the technique: we produce context-dependent probability intervals, and each time we observe a character, we move to its interval. Our working interval becomes smaller and smaller, but the better our predictions, the wider it stays. A document's compression is the shortest bit string that fits inside the final interval. In practice, we do the bit-coding as we navigate the probability intervals.

Figure A.1: Arithmetic coding.

Here, when we force the machine to play the Shannon game, we are only interested in the probability distribution of the next character, so we skip the arithmetic coding of the probabilities.
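The interval-narrowing of Figure A.1 can itself be sketched in a few lines. The function below is an illustrative toy, not the coder used in this work: it keeps the whole interval as exact fractions and emits bits only at the end, whereas a production coder renormalizes and outputs bits incrementally; model(prefix) stands in for any next-character distribution, such as the PPMC sketch above.

    from fractions import Fraction
    import math

    def arithmetic_encode(text, model, alphabet):
        # Start with the unit interval and narrow it to each observed
        # character's probability sub-interval, in a fixed symbol order.
        low, high = Fraction(0), Fraction(1)
        for i, ch in enumerate(text):
            dist = model(text[:i])
            width = high - low
            cum = Fraction(0)
            for sym in alphabet:
                p = Fraction(dist[sym]).limit_denominator(10**9)
                if sym == ch:
                    high = low + (cum + p) * width  # uses the old low
                    low = low + cum * width
                    break
                cum += p
        # Shortest bit string naming a dyadic point inside [low, high);
        # two guard bits absorb float rounding in the log.
        bits = max(1, math.ceil(-math.log2(float(high - low))) + 2)
        point = math.ceil(low * 2**bits)
        return format(point, "0{}b".format(bits))

Sharper predictions leave the final interval wider and the code shorter, so the per-character probabilities, not the coder, determine the compression rate. This is why the Shannon-game experiments can skip the coding step entirely.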
Abstract
Natural language generation (NLG) is a well-studied yet very challenging field in natural language processing. One of its less studied tasks is the generation of creative texts such as jokes, puns, or poems. Several factors make research in this area difficult. First, creative language generation has no immediate application, which has made research on creative NLG extremely diverse, with differing goals, assumptions, and constraints. Second, no quantitative measure exists for creative NLG tasks, so it is often difficult to tune the parameters of creative generation models and drive improvements to these systems. Together, the lack of a quantitative metric and the absence of a well-defined immediate application make comparing different methods and identifying the state of the art nearly impossible. Finally, rule-based systems for creative language generation have not yet been combined with deep learning methods. Rule-based systems capture human knowledge well, but encoding all the required knowledge as rules is often too time-consuming; deep learning models, on the other hand, extract knowledge automatically from data, but often miss essential knowledge that rules capture easily.

In this work, we address these challenges for poetry generation, one of the main areas of creative language generation. We introduce password poems as a new application for poetry generation: these passwords are highly secure, and we show that they are easier to recall and preferred over passwords created by other methods that guarantee the same level of security. Furthermore, we combine finite-state machinery with deep learning models in a system that generates poems on any given topic. We introduce a quantitative metric for evaluating the generated poems and build the first interactive poetry generation system, which lets users revise system-generated poems by adjusting style settings such as alliteration, concreteness, and sentiment. The interface also allows users to rate poem quality; we collect ratings for poems with various style settings and use them to tune the style parameters automatically. To improve the coherence of generated poems, we introduce a method that borrows ideas from existing human literature, and we build a poetry translation system. We study how poetry translation differs from the translation of non-creative texts by measuring the language variation added during translation, and we show that humans translate poems much more freely than general texts. Based on this observation, we build a machine translation system specifically for poetry, which exploits language variation to generate rhythmic and rhyming translations.