Automatic Decipherment of Historical Manuscripts

by

Nada Aldarrab

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2022

Copyright 2022 Nada Aldarrab

Dedication

To my mom and dad.

Acknowledgements

Throughout my PhD, I have had amazing people who helped me learn and enjoy the journey. Writing this dissertation would not have been possible without the support of my family, friends, colleagues, and mentors.

I have been blessed with a very loving and supportive family. Thank you for always being there for me even though we are thousands of miles apart. I am totally indebted to you for all the sacrifice, encouragement, and endless support.

I have been very fortunate to have Jonathan May as my PhD advisor. His curiosity and boundless enthusiasm have always made our weekly meetings a true joy. Over the course of my PhD, Jon has provided continuous support and advice while still making sure we have fun together as a research group.

I also want to thank my proposal and dissertation committee members: Aiichiro Nakano, Shrikanth Narayanan, Aram Galstyan, and Greg Ver Steeg. They were such an exceptional committee that made my dissertation defense quite enjoyable.

A big thank you to Professor Beáta Megyesi at Uppsala University, Sweden. This work would not have been possible without the astonishing work and effort that she put into the DECODE project. Thank you to Milena Anfosso for verifying our automatic solution of the IA cipher.

Thank you to my wonderful colleagues at ISI: Mozhdeh Gheini, Thamme Gowda, Kushal Chawla, Justin Cho, Meryem M'hamdi, and Ulf Hermjakob. I have learned a lot from our interactions and discussions as well as our weekly meetings.

Thank you to all my friends who made the past few years so wonderful: Fatma, Barakah, Hafsa, Samaher, Wijdan, Betül, and many, many others.
A special thank you to Lizsl De Leon at the USC Computer Science Department and Peter Zamar at ISI for taking care of all non-research issues. Your prompt help has always made my life much easier.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 The big picture
  1.2 Motivation
    1.2.1 Historical value
    1.2.2 Security
    1.2.3 Lost knowledge
    1.2.4 Beyond historical decipherment
  1.3 Terminology
    1.3.1 Basic terms
    1.3.2 Cipher types
  1.4 Main decipherment challenges
    1.4.1 Transcription
    1.4.2 Cipher type detection
    1.4.3 Plaintext language identification
    1.4.4 Ciphertext segmentation
    1.4.5 Finding the key
  1.5 Contributions of this thesis
Chapter 2: Previous Decipherment Work and Going Further
  2.1 Solving substitution ciphers
  2.2 Plaintext language identification
  2.3 Ciphertext segmentation
  2.4 Transcription
Chapter 3: Cracking Historical Ciphers
  3.1 Introduction
  3.2 Synthetic Ciphers
    3.2.1 Data sets
    3.2.2 Decipherment methods
    3.2.3 Plaintext language identification
    3.2.4 Decipherment results
  3.3 The Borg cipher
    3.3.1 Transcription
    3.3.2 Language ID
    3.3.3 Decipherment
    3.3.4 Translation
    3.3.5 The Key
    3.3.6 What the book is about
  3.4 Conclusion
Chapter 4: Multilingual Decipherment
  4.1 Introduction
  4.2 The Decipherment Problem
  4.3 Decipherment Model
    4.3.1 Decipherment as a Sequence-to-Sequence Translation Problem
    4.3.2 Frequency Analysis
    4.3.3 The Transformer
  4.4 Data
  4.5 Experimental Evaluation
    4.5.1 Cipher Length
    4.5.2 No-Space Ciphers
    4.5.3 Unknown Plaintext Language
    4.5.4 Transcription Noise
    4.5.5 The Borg Cipher
  4.6 Anagram Decryption
  4.7 Conclusion
Chapter 5: Ciphertext Segmentation
  5.1 Introduction
  5.2 Problem Definition
    5.2.1 Substitution types
    5.2.2 Cipher elements
    5.2.3 Fixed and variable-length ciphers
    5.2.4 Ciphertext segmentation
  5.3 Segmenting Ciphers with no Existing Keys
    5.3.1 Baselines
    5.3.2 Byte Pair Encoding (BPE)
    5.3.3 Unigram language model
  5.4 Data
  5.5 Experimental Evaluation
    5.5.1 Monoalphabetic ciphers with word spaces
    5.5.2 Monoalphabetic ciphers without word spaces
    5.5.3 Homophonic ciphers
  5.6 Segmenting Non-Deterministic Ciphers with an Existing Key
    5.6.1 Lattice segmentation
    5.6.2 The IA cipher
  5.7 Conclusion
Chapter 6: Deciphering from Images
  6.1 OCR challenges
  6.2 OCR model
  6.3 Character segmentation
  6.4 Character clustering
  6.5 Decipherment Results
  6.6 Transcription error
  6.7 Conclusion
Chapter 7: Conclusions and Future Directions
  7.1 Conclusions
  7.2 Future directions
    7.2.1 System improvements
    7.2.2 Beyond human language
    7.2.3 Adaptation to other domains
Bibliography
Appendices
  Appendix A: Synthetic Ciphers
  Appendix B: The Borg Cipher
  Appendix C: The IA Cipher

List of Tables

3.1 Summary of the properties of the seven synthetic ciphers we used for our decipherment experiments
3.2 Summary of data sets obtained from Project Gutenberg and English Wikipedia
3.3 Summary of decipherment results on our seven synthetic ciphers
3.4 Top-5 languages according to perplexity scores from the decipherment of Borg
3.5 The transcription scheme and key of the Borg cipher
4.1 SER (%) for solving 1:1 substitution ciphers of various lengths using our decipherment method
4.2 TER (%) for solving monoalphabetic substitution ciphers of length 256 with different spacing conditions
4.3 SER (%) for solving monoalphabetic substitution ciphers using a multilingual model trained on a different number of languages
4.4 TER (%) for solving monoalphabetic substitution ciphers with random insertion, deletion, and substitution noise
5.1 Statistics of the ciphers obtained from the DECRYPT database
5.2 Average F1 % (↑) for segmenting 10 synthetic ciphers using different models
5.3 Average SegER % (↓) for segmenting 100 synthetic ciphers using different models
5.4 F1 % (↑) for segmenting three real homophonic ciphers using different models
5.5 SegER % (↓) for segmenting three real homophonic ciphers using different models
5.6 Corrections/additions to the IA cipher key discovered by our approach
6.1 Seven randomly selected clusters we get from clustering the first three pages of Borg
6.2 Summary of decipherment results from automatic vs. manual transcription
6.3 Summary of decipherment results from automatic vs. manual transcription, with transcription error computation
6.4 Properties of the ciphers that we get from OCR, compared to the gold ciphers

List of Figures

1.1 Historical cipher examples
1.2 The Zodiac-408 Cipher
1.3 The Voynich manuscript
1.4 The IA cipher
3.1 An example synthetic monoalphabetic cipher
3.2 The noisy-channel formulation of the decipherment problem
3.3 The Borg Cipher
3.4 Page 0166v of the Borg Cipher. The image shows signs of degradation of the manuscript
3.5 An excerpt of the Borg Cipher showing a confusing symbol for transcription
3.6 Transcription of the first page of Borg. Square brackets are used to indicate what seems to be cleartext
4.1 Decipherment as a sequence-to-sequence translation problem. (a) shows the original ciphers being fed to the model. (b) shows the same ciphers after frequency encoding
4.2 Example system output for a cipher with 15% random noise. The system recovered 34/40 errors (TER is 5.86%)
4.3 The first 132 characters of the Borg cipher and its decipherment
4.4 Example anagram encryption and decryption process
5.1 The IA cipher (16th century)
5.2 An example homophonic key from the Vatican Secret Archives (16th century)
5.3 Three historical ciphers from the Vatican Secret Archives
5.4 Example BPE segmentation errors
5.5 F1 % (↑) and SegER % (↓) for segmentation of different cipher lengths
5.6 Example segmentation ambiguity for the IA cipher
5.7 Part of the segmentation FST for the IA cipher
5.8 Part of the key FST for the IA cipher
6.1 Two pages from the Borg cipher. Images show many challenges for OCR
6.2 An FSA for generating characters in a row
6.3 An FST for generating the number of black pixels in each column
6.4 Character segmentation results for the first page of Borg
6.5 An example of how edit distance could be computed using our integer program
6.6 An example of computing edit distance line-by-line
C.1 A Sample of the IA Cipher

Abstract

Libraries and archives are filled with enciphered documents from the early modern period. Example documents include encrypted letters, diplomatic correspondences, medical books, and books from secret societies. A collective effort has been put into finding, collecting, and archiving those documents. However, the information hidden in those documents is still unknown to the contemporary age. Decipherment of classical ciphers is an essential step to reveal the contents of those historical documents.

Decipherment is a challenging problem. Given some encrypted text (and nothing else), the task is to find the original text before encryption. This requires recognizing patterns in a usually small piece of data (in text or image format). Learning patterns from data is a major goal in artificial intelligence for many different applications. In this thesis, we take the task of decipherment as an example to improve the ability of computers to recognize and interpret concealed patterns in a small piece of data using unsupervised methods. Decipherment methods have been used in computer science and other fields not only to solve historical ciphers, but also to build more efficient systems in other domains such as machine translation.

This thesis aims to make contributions to three problems:

1. Automatically transcribing degraded handwritten documents that use an arbitrary symbol set (not necessarily alphabetical) without using any labeled data.
2. Deciphering noisy ciphers and ciphers with an unknown plaintext language.
3. Segmenting ciphertext that consists of a sequence of digits to get meaningful segments that represent language units.

We describe our models and algorithms to attack each of these problems. We test our methods by deciphering synthetic and real historical ciphers. Among the results of this work, we automatically crack two real historical ciphers: the Borg cipher and the IA cipher. The contents of these ciphers had not been known until this work. We release new datasets, tools, and models to the research community.

Chapter 1: Introduction

Since the invention of writing systems, the need to hide certain messages from unintended recipients has emerged. As a result, various encryption techniques have been invented to protect the content of those documents. Thousands of encrypted documents are found in archives, with their contents out of reach of historical research. This has attracted the attention of scholars from different fields, including history, philology, computer science, and linguistics. The mysteries surrounding historical ciphers have sparked much interest among amateur enthusiasts and computer scientists. In fact, many advancements in computing were inspired by code-breaking efforts during World War II (Clements, 2013). In this thesis, we target the challenging task of historical decipherment. We discuss different methods for automatically solving historical ciphers and apply those methods to crack real historical ciphers.

1.1 The big picture

Communication is a profound term that has many ramifications. It involves exchanging information by speaking, writing, or using some other medium. A multitude of areas of theory and applications have sprung out of the field of communication.
Furthermore, communication is a term that goes beyond that between humans: interspecies communication takes place between different species of animals, plants, or microorganisms. Human beings, for their part, communicate with each other through human languages, which take different forms including different sounds, tones, and characters. In recent years, computers have been used to analyze, emulate, and solve many problems that arise in communication using human languages. This brought about fields of computer science such as Natural Language Processing (NLP), with tasks such as machine translation, speech recognition, virtual assistants, and autocorrect.

While human languages were developed to provide a way of communication between individuals speaking the same or different languages, there are other languages of a different nature, whose "ciphers" are oral rather than written. For example, whistled languages originated in remote, mountainous villages and dense forests, where environmental conditions required a method of communication that could travel over long distances (Meyer, 2015). These languages and their ciphers have also attracted the attention of many researchers.

Communication in all its forms can be a gateway to further understanding of, and interaction with, the world and creatures around us. From a technical point of view, we need to build technology that extracts patterns from different data modalities and uses them for a purpose. This has been a driving force for unsupervised methods in Artificial Intelligence (AI). In this work, we take historical decipherment as an example task to advance the ability of computers to extract patterns from a small dataset without supervision and perform a real-world task. One motivation for using unsupervised methods is that humans are able to perform similar tasks without supervision. For example, children learn language without an explicit key, and humans have solved difficult ciphers without keys thanks to their intuitions.

Figure 1.1: Historical cipher examples. a) A sample of the Copiale cipher.[1] b) A sample of the Borg cipher.[2]

The contents of thousands of historical documents are still unknown to the contemporary age, even though they are encrypted using classical methods. Example documents include books from secret societies, diplomatic correspondences, and pharmacological books (Figure 1.1). Previous work has been done on collecting historical ciphers from libraries and archives and making them available for researchers (Pettersson and Megyesi, 2019; Megyesi et al., 2020). However, decipherment of classical ciphers is an essential step to reveal the contents of those historical documents.

[1] https://cl.lingfil.uu.se/~bea/copiale/
[2] https://cl.lingfil.uu.se/~bea/borg/

Traditionally, ciphers were solved manually, by time-consuming trial-and-error approaches. With the advent of computers, the general process of deciphering a ciphered text is carried out in two phases. The first phase requires a transcription of the cipher under consideration into computer-readable format. The second phase is to develop algorithms to decipher the transcribed text, including, among other tasks, recognizing the language of the original text and segmenting the ciphertext into individual replacement segments. The goal is to fully automate the process so that someone could take a cellphone camera (or any similar device designed for this purpose) into an archive, point it at a new cipher page, and then see the deciphered plaintext appear on the screen.

1.2 Motivation

The thousands of encrypted documents that we have in libraries and archives are the result of a collective effort. Archaeologists, historians, librarians, and many others are involved in the process of finding, collecting, and archiving historical ciphers.
However, apart from carbon-dating the manuscripts, very little is known about the contents of those documents until they are deciphered. In this section, we present a sample of historical ciphers and different applications of historical decipherment in other domains.

1.2.1 Historical value

In the pre-computer era, people relied on classical encryption methods to protect messages. As a result, the historical ciphers that are found in archives include letters exchanged during wars, diplomatic correspondences, etc. Without solving those ciphers, many event details and secrets are missing. Such information is valuable for historical research.

1.2.2 Security

Some encrypted documents can be connected to crimes. For this reason, solving those ciphers becomes important for security purposes. A famous example is the ciphers sent by the Zodiac Killer, a serial killer active in northern California in the late 1960s and early 1970s. In 1969, the killer sent three letters to the Vallejo Times Herald, the San Francisco Chronicle, and The San Francisco Examiner and demanded that they be printed on each paper's front page or he would kill a dozen people over the weekend. Each letter included a third of a 408-symbol cryptogram (shown in Figure 1.2). The cipher was solved manually one week after its release. The killer later sent another similar-looking 340-symbol cipher, which was not solved until 2020 (i.e., 51 years after its release). Two more ciphers are still unsolved. The identity of the Zodiac Killer remains unknown to date, which makes his ciphers even more intriguing. Unless we build technology that can solve such ciphers in a timely manner, we will remain unable to read those documents, with all the risks associated with them.

1.2.3 Lost knowledge

Looking at the history of humanity, we can see the many civilizations that passed a long time ago. Some of their relics we can witness today.
However, a great deal of knowledge has been lost from those times. An example is Egyptian hieroglyphs, the formal writing system of Ancient Egypt. All the ancient writings remained mysterious until they were finally deciphered in the 1820s with the help of the Rosetta Stone. Deciphering the hieroglyphs was essential to the modern understanding of ancient Egyptian culture. Even so, we are still not totally sure how the Pyramids were built; many mysteries surround their construction.

Figure 1.2: The Zodiac-408 Cipher.

Figure 1.3: The Voynich manuscript (15th century).[3]

Another example is the Voynich manuscript: a 240-page handwritten book, with vivid illustrations, in an unknown writing system (a sample of which is shown in Figure 1.3). Carbon-dated to the early 15th century (1404-1438), it includes many very colorful illustrations covering a wide range of subjects such as plants and herbs, human health, and astronomy. However, it is still mysterious; the language has not been cracked, nor has the question of whether it is language at all been settled.

[3] https://collections.library.yale.edu/catalog/2002046

1.2.4 Beyond historical decipherment

Building technology for decipherment has been useful for other applications. For example, decipherment methods have been used in machine translation (MT). As Warren Weaver wrote in his letter to Norbert Wiener (Weaver, 1947):

    [...] one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say "This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."

Since parallel corpora are expensive and not available for every language, decipherment has been used to leverage the usually more abundant monolingual data to train translation models.
Example applications include low-resource machine translation (Dou and Knight, 2013; Dou et al., 2014, 2015), out-of-domain machine translation (Dou and Knight, 2012), and decipherment of lost languages (Snyder et al., 2010).

Decipherment can also be used to build more efficient models. By efficiency, we refer to time efficiency (i.e., building faster systems) as well as data efficiency (i.e., building better systems without requiring more training data). For example, Pourdamghani et al. (2019) build fast Neural Machine Translation (NMT) systems using word-by-word decoding. Kambhatla et al. (2022) use enciphered text as a data augmentation method to improve NMT performance without requiring more parallel data.

Using these techniques for a language we know nothing about is still difficult. Current unsupervised work assumes some knowledge of the language family and appropriate segmentation of the language to be translated. If truly nothing is known, then existing techniques fail. The techniques we present in this work address these issues.

Figure 1.4: The IA cipher (16th century).[4]

1.3 Terminology

In this section, we introduce the major concepts and terms that are used throughout this dissertation.

1.3.1 Basic terms

Consider the cipher example shown in Figure 1.4. Ciphertext is the encrypted text, shown in the figure as sequences of digits. Cleartext, on the other hand, is non-encrypted text that can be present throughout the enciphered document. Plaintext is the original text that was enciphered to create the ciphertext. Decipherment is the process of recovering the plaintext given the ciphertext.

[4] https://de-crypt.org/decrypt-web/RecordsView/189?showdetail=

1.3.2 Cipher types

Various encipherment methods can be used to create ciphers. Since we are dealing with pre-computer ciphers, we focus our discussion on the encipherment methods that were used in the early modern age. Historical ciphers fall into two main categories: substitution ciphers and transposition ciphers. A substitution cipher is created by substituting each plaintext character with another character according to a substitution table called the key, whereas a transposition cipher is created by rearranging the letters of a message. A cipher can also be created using a mixture of substitution and transposition (like the Zodiac-340 cipher). Substitution ciphers that use a single cipher alphabet, where each plaintext letter type is deterministically replaced with one cipher letter type, are known as monoalphabetic ciphers. A substitution cipher that provides multiple substitutions for some letters (i.e., 1→M substitutions) is called a homophonic cipher.

1.4 Main decipherment challenges

Decipherment conditions vary from one cipher to another. For example, some cleartext might be found along with the encrypted text, which gives a hint about the plaintext language of the cipher. In other cases, called known-plaintext attacks, some decoded material is found, which can be exploited to crack the rest of the encoded script. However, in a ciphertext-only attack, the focus of this thesis, the cryptanalyst has access only to the ciphertext. This means that the encipherment method, the plaintext language, and the key are all unknown. In this section, we describe five main challenges in historical decipherment: transcription, cipher type detection, plaintext language identification, ciphertext segmentation, and finding the key.

1.4.1 Transcription

Real ciphers pose great challenges for automatic decipherment. The challenges begin with turning those ciphers into computer-readable format, a process usually referred to as transcription. To transcribe a historical document, a transcriber must first recognize the cipher alphabet and decide on character boundaries, which is sometimes hard to do, especially if the cipher uses unfamiliar symbols. Then the transcriber should come up with an easy-to-type, easy-to-remember transcription scheme for faster and more accurate transcription.
Moreover, the transcriber usually faces degraded manuscripts, misshapen pages, ink blotches, and low-quality scans, all of which make the transcription process more challenging. In Chapter 6, we discuss automating the transcription process for historical ciphers.

1.4.2 Cipher type detection

In a ciphertext-only attack, the encipherment method that was used to create the cipher is unknown. Various methods have been proposed for cipher type detection. In this work, we focus on solving historical ciphers from the early modern period (16th-18th centuries). Those ciphers are mostly substitution ciphers (monoalphabetic and homophonic ciphers). Thus, we do not address the problem of cipher type detection.

1.4.3 Plaintext language identification

Sometimes, ciphers are accompanied by some cleartext. This is usually a strong clue to the plaintext language of the ciphertext. Most ciphers, however, are completely enciphered, so it is crucial to identify the language of the plaintext in order to find the key. In Chapter 4, we discuss the problem of plaintext language identification.

1.4.4 Ciphertext segmentation

Some historical ciphers have clear substitution units. For example, in the ciphers shown in Figure 1.1, each symbol is an encryption of some plaintext character (or syllable). However, the cipher shown in Figure 1.4 is a continuous sequence of digits that hides character boundaries. It is thus essential to find substitution units in order to solve the cipher. We discuss the problem of ciphertext segmentation in Chapter 5.

1.4.5 Finding the key

Once we have a theory about the cipher type and plaintext language, we can proceed to the next step of finding the key. The key is usually a letter (or word) substitution table. Finding the key is a very challenging task. Throughout history, many methods have been used by cryptographers to challenge cryptanalysts. For example, many ciphers in the early modern ages include nulls (Dooley, 2013).
Nulls are spurious characters added randomly to the cipher to confuse cryptanalysts. Those characters are not part of the enciphered message and are only used to further conceal the message. Methods for finding the key range from manual frequency analysis (800s C.E.) to modern, computer-based methods (Dooley, 2013). We discuss key finding methods in Chapters 3 and 4.

1.5 Contributions of this thesis

The major contributions of this thesis are as follows:

• We present a multilingual sequence-to-sequence decipherment model for solving monoalphabetic ciphers. Our method is able to decipher 700 synthetic ciphers from 14 different languages with less than 1% character errors, and achieves less than 6% character errors on the Borg cipher, a real historical cipher. In addition, our experiments show that our models are robust to different types of noise, and can even recover from many of them (Aldarrab and May, 2021).

• We propose novel unsupervised methods for automatic ciphertext segmentation for ciphers with no existing keys. Our methods are able to segment 100 randomly generated monoalphabetic ciphers with an average segmentation error of less than 3%, while still being robust to removing spaces. We test our methods on 3 real homophonic ciphers and achieve an average segmentation error of 27%, with a segmentation error of 14% on the F283 cipher (Aldarrab and May, 2022).

• We propose a method to segment non-deterministic ciphers with existing keys. Our method unveils the content of the IA cipher, a letter from the 16th century that had not been revealed until this work (Aldarrab and May, 2022).

• We present an unsupervised end-to-end system aimed at deciphering from images and report our results on deciphering real handwritten historical cipher scans (Aldarrab et al., 2017; Yin et al., 2019).

• We develop a generic tool for computing the edit distance between strings that have different vocabularies using Integer Linear Programming (ILP).
We use this tool for evaluating transcription accuracy and release it to the research community (Aldarrab et al., 2017; Yin et al., 2019).

• We release new datasets for the research community. Our released datasets include:

– The Borg cipher: a 408-page book that we automatically cracked. We release the full transcription, solution, and translation from Latin to English (Aldarrab et al., 2017). This dataset has been used for several research projects since we released it, e.g. (Baró et al., 2019; Renfei, 2020; Magnifico, 2021; Souibgui et al., 2021a,b, 2022).

– The IA cipher: a 10-page letter from the 16th century that we fully solved. We release the full solution and the gold segmentation of the ciphertext (Aldarrab and May, 2022).

– Historical datasets for 14 languages (Aldarrab and May, 2021).

Chapter 2
Previous Decipherment Work and Going Further

The major contributions of this thesis fall into four categories: solving substitution ciphers (i.e. finding the key), plaintext language identification, ciphertext segmentation, and the automatic transcription of historical ciphers. This chapter discusses previous work and shows how our work takes it further by improving state-of-the-art systems and/or attacking unsolved problems.

2.1 Solving substitution ciphers

Early methods for attacking ciphers were based on frequency analysis, a method discovered by the great polymath Al-Kindi (801-873 C.E.). The method of frequency analysis was first described in his book on secret messages, A Manuscript on Deciphering Cryptographic Messages, which is considered one of the most important books in the history of cryptology (Dooley, 2013). Frequency analysis has become a fundamental method in cryptanalysis and is usually the first step cryptanalysts take to get an idea of the cipher type and properties. Automatic methods have been developed to solve substitution ciphers, e.g.
(Hart, 1994; Knight et al., 2006a; Olson, 2007; Ravi and Knight, 2008, 2011; Corlett and Penn, 2010; Nuhn et al., 2013; Berg-Kirkpatrick and Klein, 2013; Nuhn and Knight, 2014; Nuhn et al., 2014; Hauer et al., 2014). Many proposed methods search for the substitution table (i.e. cipher key) that leads to a likely target plaintext according to a character n-gram language model. The current state-of-the-art method uses beam search and a neural language model to score candidate plaintext hypotheses from the search space for each cipher, along with a frequency matching heuristic incorporated into the scoring function (Kambhatla et al., 2018). However, all these methods assume prior knowledge of the target plaintext language, which is not the case in ciphertext-only attacks.

When dealing with an unsolved cipher, we face a real ciphertext-only attack: the challenging process of transcription, noise that comes from degraded historical documents, unclear handwriting, an unknown plaintext language, etc. In the end, after coming up with a "possible" solution, we still need to verify that the solution is correct. Not only does this require a native speaker of the language, but it also requires expertise in historical texts, since languages change drastically over time.

In Chapter 3, we present our work on automatically cracking a previously unsolved real historical cipher: the Borg cipher. We discuss the challenges involved in ciphertext-only attacks. We build decipherment models to crack real historical ciphers and test them on both synthetic ciphers and a real historical cipher.

2.2 Plaintext language identification

As mentioned in the previous section, many automatic methods have been used to attack substitution ciphers. However, these methods assume prior knowledge of the target plaintext language, which is not the case in ciphertext-only attacks.
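One obvious workaround is to simply try every candidate language and keep the best-scoring decipherment. A minimal sketch of that loop follows; the `decipher` and `score` arguments are hypothetical stand-ins for a real solver (e.g. EM training plus Viterbi decoding) and a goodness measure (e.g. P(c) or language-model perplexity):

```python
def identify_language(cipher, candidate_lms, decipher, score):
    """Brute-force language ID for a ciphertext-only attack: attempt a
    full decipherment under each candidate language model and rank the
    hypotheses by score. Cost grows linearly with the number of
    candidate languages, which is why this does not scale well."""
    results = []
    for lang, lm in candidate_lms.items():
        plaintext = decipher(cipher, lm)
        results.append((score(plaintext, lm), lang, plaintext))
    results.sort(reverse=True)  # best-scoring hypothesis first
    return results
```

This is the guess-and-check strategy in its simplest form; the ranking is only as good as the scoring function.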
Early work on language identification relies on a brute-force guess-and-check strategy that does not scale well as more languages are considered (Knight et al., 2006b; Hauer and Kondrak, 2016). In this thesis, we ask: Can we build an end-to-end model that deciphers directly without relying on a separate language ID step? Can we build models that recover from historical cipher noise? In Chapter 4, we present a multilingual sequence-to-sequence decipherment model. We show that our model can solve ciphers without explicit language identification while still being robust to noise.

Recent research has looked at applying other neural models to different decipherment problems. Greydanus (2017) finds that an LSTM model can learn the decryption function of polyalphabetic substitution ciphers when trained on a concatenation of <key + ciphertext> as input and plaintext as output. Our work looks at a different problem: we target a ciphertext-only attack for short substitution ciphers. Gomez et al. (2018) propose CipherGAN, which uses a Generative Adversarial Network to find a mapping between the character embedding distributions of plaintext and ciphertext. This method assumes the availability of plenty of ciphertext. Our method, by contrast, does not require a large amount of ciphertext. In fact, all of our experiments were evaluated on ciphers of 256 characters or shorter.

2.3 Ciphertext segmentation

Different automatic methods have been proposed for solving substitution ciphers, e.g. (Hart, 1994; Olson, 2007; Ravi and Knight, 2008; Corlett and Penn, 2010; Nuhn et al., 2013; Nuhn and Knight, 2014; Hauer et al., 2014). However, all of these methods assume that cipher elements are clearly segmented (i.e., that token boundaries are well established). Many historical documents, however, are enciphered as continuous sequences of digits that hide token boundaries (Lasry et al., 2020). An example cipher (the IA cipher) is shown in Figure 1.4 (Megyesi et al., 2020).
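To see why hidden token boundaries are hard, note that even a very short digit string admits many readings. A toy sketch, assuming for illustration only that every cipher token is one or two digits long:

```python
def segmentations(s, max_len=2):
    """Enumerate all ways to split a digit string into tokens of up to
    `max_len` digits. The number of readings grows exponentially with
    the length of the ciphertext."""
    if not s:
        return [[]]
    out = []
    for k in range(1, min(max_len, len(s)) + 1):
        for rest in segmentations(s[k:], max_len):
            out.append([s[:k]] + rest)
    return out

# Even a 4-digit string such as "2815" already has 5 readings
# with 1- or 2-digit units, e.g. ['28', '15'] or ['2', '8', '1', '5'].
```

A real segmenter must pick one of these readings without knowing the key, which is exactly the problem addressed in Chapter 5.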
Solving those ciphers is very challenging since it is not possible to directly search for the key without finding substitution units. In Chapter 5, we present a novel approach that segments ciphertext in an unsupervised way, using only the ciphertext. We test our methods on both synthetic and real historical ciphers.

Lasry et al. (2020) present an extensive study of papal ciphers from the 16th to 18th centuries. Those ciphers are numerical substitution ciphers that need to be segmented. For segmentation, Lasry et al. (2020) create a set of segmenters (called "parsers" in the paper) from a collection of known cipher keys. Then they test the cipher at hand to see if any of the previously created segmenters is a good fit. Our method, by contrast, is not limited to existing keys. In fact, our method is completely unsupervised and uses only ciphertext as input.

2.4 Transcription

Transcribing historical documents is a challenging task. Given the large number of historical ciphers, manual transcription is time-consuming, expensive, and prone to errors. Thus, we would like to automate this process using Optical Character Recognition (OCR).

Handwriting recognition is traditionally divided into two types: on-line and off-line recognition (Liwicki et al., 2012). On-line handwriting recognition systems record a time-ordered sequence of coordinates that represent the movement of the pen-tip. Off-line recognition systems, on the other hand, work only on an image of the text. In our case, we are facing an off-line recognition problem, where we try to decipher historical manuscripts from available scanned images of handwritten text.

Early OCR research proposed unsupervised identification and clustering of characters (Nagy et al., 1987; Fang and Hull, 1995; Ho and Nagy, 2000; Huang et al., 2007). Those unsupervised methods were abandoned in favor of supervised techniques with the abundance of manually transcribed training data.
More recent handwriting recognition systems use supervised or semi-supervised approaches, e.g., (Smith, 2007; Kae et al., 2010; Kluzner et al., 2011; Liwicki et al., 2012). Such supervised methods are clearly not an option for us since we have no training data for our historical cipher collection. Annotating data is very expensive and especially hard for degraded documents. Moreover, ciphers usually use unique symbol sets and show great variance in handwriting styles. This makes it hard to reuse any annotated data, which makes supervised methods an infeasible option.

Other works on OCR target printed, typeset documents (Berg-Kirkpatrick et al., 2013; Berg-Kirkpatrick and Klein, 2014). These methods model different types of noise, including ink bleeds and uneven baselines. However, these methods expect general consistency in font and non-overlapping characters, which is not the case in handwritten historical ciphers. All ciphers in our collection are handwritten, and as mentioned previously, we do not have any labeled training data for linking cipher characters to images when we attack a new cipher. So, we take an unsupervised character segmentation and clustering approach instead. We report our OCR experiments in Chapter 6.

Chapter 3
Cracking Historical Ciphers

As discussed in Chapter 1, real ciphers pose great challenges for automatic decipherment. In this chapter, we would like to start simple and build our way up to attack real historical ciphers. We first establish decipherment methods that work on synthetic ciphers (i.e. ciphers that we create and know the key for). We describe our experiments, datasets, methods, and results. Then we move to real ciphers. We show that our proposed methods can crack real historical ciphers.

3.1 Introduction

Unsupervised learning has played a major role in many advances in natural language processing.
Even though we are witnessing great interest in supervised techniques, encouraged by high computing power and the vastness of training data, we still face many problems where supervised learning is not an option. These include deciphering unknown scripts, like the Voynich manuscript, or enciphered texts such as the Zodiac Killer ciphers. In this chapter, we introduce our first decipherment approach using the noisy-channel model. We apply our methods on synthetic ciphers and a real historical cipher.

kac butnqymkupqmr tckauv ql m tckauv ux rqyzjqlkqb mymrdlql kamk ql jlcv ku lkjvd kcfkl gaqba mpc gpqkkcy qy my jyeyugy rmyzjmzc myv ku lkjvd kac rmyzjmzc qklcrx gacpc kac jyeyugy rmyzjmzc aml yu unhqujl up ipuhcy gcrrjyvcplkuuv brulc pcrmkqhcl myv gacpc kacpc mpc xcg nqrqyzjmr kcfkl gaqba tqzak ukacpgqlc amhc nccy jlcv ku acri jyvcplkmyv kac rmyzjmzc

Figure 3.1: An example synthetic monoalphabetic cipher.

3.2 Synthetic Ciphers

We start our experiments with a set of seven synthetic monoalphabetic ciphers. Our goal is to automatically decipher them. These ciphers range in difficulty from long ciphers with spaces to very short ciphers with spaces removed. For some ciphers, we hide plaintext language information to make the problem more challenging. Figure 3.1 shows an example synthetic cipher. Table 3.1 shows a summary of the lengths, types, and properties of the seven synthetic ciphers. The full set of ciphers is shown in Appendix A.

Cipher No.  # chars  Spaces?  Language
1           353      yes      English
2           150      yes      English
3           653      no       English
4           128      yes      Unspecified
5           107      yes      Unspecified
6           331      no       Unspecified
7           168      no       Unspecified

Table 3.1: Summary of the properties of the seven synthetic ciphers we used for our decipherment experiments. The full set of ciphers is shown in Appendix A.

3.2.1 Datasets

Since we are targeting historical ciphers, we start by collecting historical text for various European languages.
We scrape historical text from Project Gutenberg for 15 languages, namely: Spanish, Latin, Esperanto, Hungarian, Icelandic, Danish, Norwegian, Dutch, Swedish, Catalan, French, German, Italian, Portuguese, and Finnish. We use Wikipedia dumps to get data for English. This completes the dataset that we will be using to build language models. Table 3.2 summarizes our datasets.

Language    # of words   # of characters
Catalan     915,595      4,953,516
Danish      2,077,929    11,205,300
Dutch       30,350,145   177,835,527
English     48,041,703   289,170,305
Esperanto   315,423      2,079,649
Finnish     22,784,172   168,886,663
French      39,400,587   226,310,827
German      3,273,602    20,927,065
Hungarian   497,402      3,145,451
Icelandic   72,629       377,910
Italian     4,587,027    27,786,754
Latin       1,375,804    8,740,808
Norwegian   706,435      3,673,895
Portuguese  10,841,171   62,735,255
Spanish     20,165,731   114,663,957
Swedish     3,008,680    16,993,146

Table 3.2: Summary of data sets obtained from Project Gutenberg and English Wikipedia.

3.2.2 Decipherment methods

Let's assume for now that we know the plaintext language of the cipher. To find the actual plaintext, we use the noisy-channel model (Knight et al., 2006a). Figure 3.2 depicts the noisy-channel formulation of the decipherment problem. In this formulation, we think about a generative story of how the ciphertext was created. To create an English 1:1 substitution cipher, we can imagine that:

(a) Someone first came up with some English text (the plaintext).

(b) Then they enciphered each plaintext character according to a 1:1 substitution table (the key) (Figure 3.2(a)). In the figure, the plaintext "the combinatorial method is a method of linguistic analysis that is used to study texts which are written in an…" passes through a substitution table (a → m, b → n, c → b, …) to yield the ciphertext of Figure 3.1.

(a) A generative story of how the ciphertext was created.
We imagine that someone first came up with some English text (the plaintext). Then they enciphered each plaintext character according to a 1:1 substitution table (the key).

(b) In a ciphertext-only attack, we only have the ciphertext, and we want to find the plaintext. We model the plaintext generation process by an English language model P(p) that assigns a probability to each English string p, and we model the encipherment process by a channel model P(c|p) that assigns a probability to each character substitution.

Figure 3.2: The noisy-channel formulation of the decipherment problem.

In a ciphertext-only attack, we only have the ciphertext, and we want to find the plaintext. We model the plaintext generation process by an English language model P(p) that assigns a probability to each English string p, and we model the encipherment process by a channel model P(c|p) that assigns a probability to each character substitution (Figure 3.2(b)). Our objective is to find the plaintext p that maximizes P(p|c) for a given cipher c. That is:

argmax_p P(p|c)

By Bayes' rule:

argmax_p P(p|c) = argmax_p [P(p) * P(c|p)] / P(c)    (3.1)

Since P(c) does not affect the choice of p (i.e. P(c) is constant), the equation becomes:

argmax_p P(p|c) = argmax_p P(p) * P(c|p)    (3.2)

So, we basically need to build two probabilistic models: a language model P(p) and a channel model P(c|p). Language models can be built and trained independently on any plaintext language data. The channel model explains how plaintext p becomes ciphertext c. This could be a two-dimensional substitution table. To estimate the parameters of the channel model, we use the expectation-maximization algorithm (EM) (Dempster et al., 1977).
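Concretely, one EM iteration re-estimates the substitution table from expected counts computed by the forward-backward algorithm over the hidden plaintext sequence. The following is a minimal sketch of that idea (our experiments use Carmel rather than this code; `lm` is assumed to be a fully specified bigram probability table with a `"<s>"` start symbol, and no smoothing or probability-scaling tricks are included):

```python
from collections import defaultdict

def em_step(cipher, lm, p_alphabet, T):
    """One EM iteration for the channel model T[p][c] of a substitution
    cipher, with a bigram plaintext LM lm[(prev, p)]. Hidden states are
    plaintext letters; emissions are cipher symbols."""
    N = len(cipher)
    # Forward pass: alpha[i][p] = P(c_1..c_i, plaintext_i = p)
    alpha = [dict() for _ in range(N)]
    for p in p_alphabet:
        alpha[0][p] = lm[("<s>", p)] * T[p][cipher[0]]
    for i in range(1, N):
        for p in p_alphabet:
            alpha[i][p] = T[p][cipher[i]] * sum(
                alpha[i - 1][q] * lm[(q, p)] for q in p_alphabet)
    # Backward pass: beta[i][p] = P(c_{i+1}..c_N | plaintext_i = p)
    beta = [dict() for _ in range(N)]
    for p in p_alphabet:
        beta[N - 1][p] = 1.0
    for i in range(N - 2, -1, -1):
        for p in p_alphabet:
            beta[i][p] = sum(
                lm[(p, q)] * T[q][cipher[i + 1]] * beta[i + 1][q]
                for q in p_alphabet)
    Z = sum(alpha[N - 1][p] for p in p_alphabet)  # = P(c)
    # E-step: expected substitution counts; M-step: renormalize rows of T
    counts = defaultdict(lambda: defaultdict(float))
    for i, c in enumerate(cipher):
        for p in p_alphabet:
            counts[p][c] += alpha[i][p] * beta[i][p] / Z
    for p in p_alphabet:
        total = sum(counts[p].values())
        if total > 0:
            for c in counts[p]:
                T[p][c] = counts[p][c] / total
    return Z  # P(c) under the parameters used in this step
```

Iterating this step never decreases P(c), which is exactly the objective discussed next.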
Our objective function is to maximize P(c), which (by the law of total probability) is:

P(c) = Σ_p P(p) * P(c|p)    (3.3)

Once we have the trained models, we can use the Viterbi algorithm to find the plaintext that maximizes P(p|c) as given by Equation 3.2.

We implement all our models as a cascade of finite-state machines (FSMs). We use the finite-state toolkit Carmel to do the EM training of the channel model and the final Viterbi decoding (Graehl, 2010). In summary, we take the following steps to attack ciphers:

1. Build letter language models: We build letter-based language models for the 16 European languages that we collected historical texts for: Catalan, Danish, Dutch, English, Esperanto, Finnish, French, German, Hungarian, Icelandic, Italian, Latin, Norwegian, Portuguese, Spanish, and Swedish. We experiment with different n-gram orders: 2-gram, 3-gram, 4-gram, and 5-gram. We implement these models as finite-state acceptors (FSAs). We use 80% of the data for training, 10% for development, and 10% for testing. We mainly use the development set to smooth our language models. For smoothing, we estimate context-specific backoff parameters by running EM on the development set.

2. Train the channel model to get P(c|p) (EM training): We implement the channel model as a single-state, fully-connected finite-state transducer (FST). Then we use Carmel to run EM training, guided by the pre-trained language model probabilities. To get better decipherment results, we use two techniques described in previous literature: more random restarts and square-rooting language model probabilities during EM training (Ravi and Knight, 2009).

3. Decode the ciphertext to find the plaintext p (Viterbi decoding): Using our trained models, we run the Viterbi algorithm to find the best path (i.e. the plaintext that maximizes P(p|c) as given by Equation 3.2).
As suggested by Knight and Yamada (1999), we also find that cubing channel probabilities before the final decoding results in better decipherment accuracy.

3.2.3 Plaintext language identification

When the plaintext language of the cipher is unknown, decipherment becomes more challenging. To solve this problem, we follow the same decipherment procedure described in Section 3.2.2 using one language model at a time. The procedure is as follows: given ciphertext c, we run our decipherment model against each of the 16 European languages, then we rank candidate languages based on some evaluation metric. Recall from Equation 3.3 that EM's objective function is to maximize P(c). We can use this post-training P(c) to rank candidate plaintext languages.

3.2.4 Decipherment results

We use the decipherment method described in Section 3.2.2 to crack the seven synthetic ciphers in our test set. Table 3.3 shows a summary of our decipherment results. % Error is the percentage of character mistakes in the final decoded message compared to the gold answer.

No.  # chars  Spaces?  Language    LM n-gram  Restarts  % Error
1    353      yes      English     3          10        1.98%
2    150      yes      English     3          25        4.67%
3    653      no       English     4          20        1.53%
4    128      yes      Spanish     4          20        3.91%
5    107      yes      German      5          1         0.00%
6    331      no       Swedish     5          1         0.60%
7    168      no       Portuguese  5          1         1.80%

Table 3.3: Summary of decipherment results on our seven synthetic ciphers. % Error is the percentage of character mistakes in the final decoded message compared to the gold answer.

3.3 The Borg cipher

We now move to a real historical cipher. The Borg cipher is a 400-page book digitized by the Biblioteca Apostolica Vaticana. The official name of the cipher is "Borg.lat.898." We call it the Borg cipher for simplicity. It is believed to date back to the 1600s. The first page of the book seems to be written in Arabic. The rest of the book is completely enciphered in astrological symbols. The book also contains some Latin fragments, and a page of Italian right at the end of the book.
Figure 3.3 shows two pages of the book. The Vatican does not have a key associated with this cipher, nor any deciphered parts of the book.

Figure 3.3: The Borg cipher. 1

3.3.1 Transcription

Transcribing the Borg cipher is challenging. First, the book is old, and it has obvious signs of degradation, which makes it hard to read some characters. Figure 3.4 shows a sample page with background noise and ink blotches. Another challenge is that the characters at the book binding area are cut off in the scan. This leaves us with a lot of incomplete words in the middle of the text. Moreover, it is not always clear whether some symbols represent one or more cipher characters. For example, the symbol shown in Figure 3.5 looks like a 6 and a 9 but could also be one symbol (the Zodiac sign for Cancer). It turned out to be the latter, as we will see in the following sections. Such findings can only be confirmed after deciphering the manuscript. We manually transcribed the first three pages of the cipher. 22 of the 26 cipher letter types appear in the first three pages of the book. Figure 3.6 shows page 0002r of the cipher, along with our transcription.

3.3.2 Language ID

We ran EM decipherment against 16 European languages on the first three pages of the book. Table 3.4 shows the perplexity scores for the top five candidate plaintext languages. The results we got suggested Latin as the plaintext language of the cipher.

1 Images retrieved from the Digital Vatican Library's official website: http://digi.vatlib.it/view/MSS_Borg.lat.898

Figure 3.4: Page 0166v of the Borg cipher. The image shows signs of degradation of the manuscript.

Figure 3.5: An excerpt of the Borg cipher showing a symbol that is confusing to transcribe: is this one symbol or two?

R i6861w9hx hmx1x 0d8wcx, i6qvdx 5w9wvxi„ hx qon6qd1 1w9hmw iq„ xn0w 696 16. I [se.]
69xnx 4w9xid8x obx1x dqhmx iw 69whx [an] Z I [vad] 69cw8x„ iw 4w9xid8x 68hmww nmdx88w xqxvxn hdq5xh w88w : 685x 696 Z Y [Esulam propter :] Z Y 6n6qx Z [vi] c6869„ cw ix961o1x i6861x [aro] [: an] Z [5] x94d9v69hdq hqxh6 o19x6 x9 6iwhx 4oqhxn„ nx1x hx hqxvdo x9 8o„ io i68xvo x9 d6nw dx„ hqwo dw8 hwqqwo dxhqw„ 6ho vwx9vw 5d88x69h

Figure 3.6: Transcription of the first page of Borg. Square brackets are used to indicate what seems to be cleartext.

Language    Per-Character Perplexity
Latin       1.1323
Esperanto   1.1478
English     1.1534
Hungarian   1.1823
Icelandic   1.2170

Table 3.4: Top-5 languages according to perplexity scores from the decipherment of Borg.

3.3.3 Decipherment

We used a 5-gram Latin letter language model to train a fully connected FST (channel model) using EM. We used the trained channel model to get the Viterbi decoding of the first three pages of the book. The decoded Latin script that we got reads:

calamenti thimi pulegi cardui benedicti rosarum menthe crispe anam anisi feniculi obimi urthi ce aneti angeli ce feniculi althee shuille iridis turbit elle albi ana asali galange cinamomi calami infundantur trita omnia in aceti fortissimi ti triduo in loco calido in uase uitreo uel terreo uitre to deinde bulliant in uase fictili uitre ato ad casum medietatis fieri coleture adde sachari mellis despum? ti fiatserupus hui arpmatibetur cum cr cimacis cinamomi bin miberis suspe datur in saculo intus et seruetur usui nos sumus experti h? si tollatur puluis rubeorum et croci et uino bibatur subtili statim tollit tremore cordis magnum secretum indolorem mamillarum pellistalpe superposita milabilis est si permiseris talpam mori in manu tene do oculos irsius con tra radios solis si tetigeris cumilla manu mamillam dolentem messat dolor uxor passa est apostemata mamillarum ush ad mortem et tale adposuit emplastrm tactum lacte rani huod est pingue supernatans lacti posthuam stetelit ad tempus et cum creta communi et superpone...
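The per-character perplexity used to rank candidate languages in Table 3.4 is just the inverse geometric mean of the language-model probabilities of the decipherment. A minimal sketch with a character bigram model (our actual models are higher-order, smoothed FSAs; `lm` here is assumed to be a fully specified, nonzero probability table):

```python
import math

def per_char_perplexity(text, lm):
    """Per-character perplexity of `text` under a character bigram
    model lm[(prev, ch)]. A lower value means the decipherment looks
    more like the candidate language."""
    logprob = 0.0
    prev = "<s>"
    for ch in text:
        logprob += math.log2(lm[(prev, ch)])
        prev = ch
    return 2 ** (-logprob / len(text))
```

Because perplexity normalizes by text length, scores for candidate languages remain comparable even when the decoded texts differ slightly in length.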
3.3.4 Translation

To get a quick translation of the Latin text, we used the online machine translation system Google Translate. The first three pages of the book translate to:

calamenti thimi pulegi artichoke blessed roses menthe cr ispe anam anise fennel, dill angels engage in further ce ce fennel Althea shuille elle white iris turbit ana Asali galange cinnamon infused branches After all these in the three days in a warm place in a dish of vinegar has been very strong and ti glass or glass to frighten and then boil in an earthen vessel fired a given half a chance to add to the coleture Zechariah honey despair? ti fiatserupus hui arpmatibetur with cr cimacis cinnamon bin miber suspe is kept in a purse inside and we use We experienced h? If we take away the dust of red and saffron and wine drinking immediately takes a subtle trembling heart big secret absence of breasts pellistalpe spread milabilis if you allow it to die in the hands of a mole do hold up irsi con tra rays of the sun if you touch with Him the painful breast messar pain she suffered breast abscesses rum ush to death thus giving a plaster floating touch milk fat is a veteran Huod milk Posthumus stetelit common with chalk at a time, and on top...

This initial translation of the text seemed to suggest that this was a medical book from the early modern period. We could see some sentences that describe recipes, like "all these in the three days in a warm place in a dish of vinegar." We sent the text to an expert in pharmacological and medical interpretations of Latin from the 16th-17th centuries and got a refined translation of the first three pages. The refined translation is:

Take lesser calamint, thyme, pennyroyal, St.
Benedict's thistle, roses and wrinkled-leaf mint, one handful of each; aniseed, seed of fennel, basil, nettle and dill, half a drachm of each; roots of angelica, fennel, marsh mallow, sea squill, turbith and white hellebore, two ounces of each; of green spurge <propter> two ounces, of hazelwort six drachms; galangal, cinnamon and calamus, half a drachm of each. Everything is grated, and left for three days, on a warm place, in ten pounds of strong vinegar, in a vessel of glass or glazed earthenware. Then, you boil it down in a glazed earthenware pot, to half its volume. Strain; add sugar and despumated honey, twenty ounces of each; this should become a syrup, which is spiced with saffron, mace, cinnamon and ginger, two drachms of each, which is suspended in a small bag inside the vessel. It is saved for future needs.

For trembling of the heart
We have noticed that if you take a powder of red coral and saffron, and this is drunk in a fine wine, it will immediately stop trembling of the heart.

For boils and pains of the breasts
An important secret (in this context = panacea) for breast pains: application of mole skin is marvellous; if you have let the mole die while holding it in your hand with its eyes towards the sun, and with that same hand covered the ailing breast, the pain goes away. The wife of Hans Stoldis (?) suffered from boils in her breasts so badly, that she <almost> died, and he used a plaster made from <lac ran> - that is fat that floats on the surface of milk which has been left for some time - and common chalk; this was applied...

3.3.5 The Key

The key that we got from the automatic decipherment was almost perfect. There was one character that seldom appeared in the first pages of the book, so it was hard for the machine to decide on how to decipher it. The Latin expert pointed out another interesting feature of the text, which is the use of abbreviations for measurements. For example, floreni is a measurement for the gold florin.
The word "ana" means "of each" in recipes. Table 3.5 shows the key that we got from automatic decipherment, along with the abbreviations interpreted by the Latin expert.

Transcription  Decoded Latin      Transcription  Decoded Latin
a              V                  4              F
b              Z                  5              B
c              G                  6              A
d              U                  8              L
h              T                  9              N
i              C                  O              K
k              Y                  M              Q
m              H                  H              floreni
n              S                  T              libram
o              O                  W              suffix: -n or -m
q              R                  Z              unica or drachmam
v              D                  I              1
w              E                  Y              II?
x              I                  AU             ana
y              X                  ,              ,
0              P                  „              -
1              M                  ?              ?

Table 3.5: The transcription scheme and key of the Borg cipher (the cipher symbol images are omitted here).

3.3.6 What the book is about

With the help of the DECODE team from Uppsala University in Sweden, we transcribe the whole manuscript (408 pages). We also get the full translation from a Latin expert. We release the full transcription, automatic solution, and translation of the 408 pages of the manuscript. 2 Pages 0002r - 0027v discuss pharmacology, symptoms of illness, and treatments for various diseases. Starting from page 0028r, the book discusses warfare with firebombs. Here is a translated excerpt from pages 0028r - 0029r:

Nectanebo says to Alexander: O, Alexander, may you be regarded as a virtuous king, and may you destroy your enemies with fire; I send you various kinds of fire to burn your enemies, whether on land or at sea.

The first kind of fire
Take 1 pound of the purest sandarac, or vernix, one pound <[..]rmo>, liquid, and after pestling, put it in a glazed earthenware pot, and seal with lute, then [..] fire until liquid, [..] this liquid and an equal amount of [..] the sign of liquefaction is, that on a wooden stick, inserted through the opening, the matter should resemble butter. Then, you pour it over an equal amount of greek [..] (this sign should mean 'tar' or 'pitch') or colophonium. This may not be done indoors, because of the danger of fire. When you want to use it, however, take a bag of goat skin, which you inflate, and smear with said oil in- and outside, and tie the bag to a spear, and place on that a piece of wood.
The iron should touch the bag, where the wood, when ignited, [..] the said preparation set on fire, and falls down over the sea, and by the wind is carried towards the enemies, and burns them, and water cannot extinguish this.

The second kind of fire is this: Take one scruple of balsam, one pound of the pith of ferula cane, one scruple of sulphur, one scruple of liquefied duck fat, and mix at the same time, and apply on an artfully made arrow; when this has been ignited, shoot the arrow towards the mountains; and the places where it has fallen, this concoction will set on fire, and water cannot extinguish it. The third kind of fire...

We show more pages and translations of the “Borg” cipher in Appendix B.

² https://cl.lingfil.uu.se/~bea/borg/

3.4 Conclusion

In this chapter, we present our first decipherment experiments using the noisy-channel model. We train 16 language models on historical text and use them for decipherment. We test our method on seven synthetic ciphers and obtain an average decipherment accuracy of 98%. We test our method on the Borg cipher, a real historical cipher, and show that our method is able to crack the Borg cipher using the first three pages of the manuscript. We release the full transcription, automatic solution, and translation of the 408 pages of the manuscript.

Chapter 4
Multilingual Decipherment

Decipherment of historical ciphers is a challenging problem. The language of the target plaintext might be unknown, and ciphertext can have a lot of noise. State-of-the-art decipherment methods use beam search and a neural language model to score candidate plaintext hypotheses for a given cipher, assuming the plaintext language is known. We propose an end-to-end multilingual model for solving simple substitution ciphers. We test our model on synthetic and real historical ciphers and show that our proposed method can decipher text without explicit language identification while still being robust to noise.
4.1 Introduction

In the previous chapter, we used the noisy-channel model to crack substitution ciphers. However, since the noisy-channel model assumes that the plaintext language is known, we had to decipher the text using multiple language models until we found a reasonable decipherment using the Latin language model. More recent decipherment methods use beam search and a neural language model to score candidate plaintext hypotheses for a given cipher (Kambhatla et al., 2018). However, this approach also assumes that the target plaintext language is known. Other work that both identifies language and deciphers relies on a brute-force guess-and-check strategy (Knight et al., 2006b; Hauer and Kondrak, 2016). We ask: Can we build an end-to-end model that deciphers directly without relying on a separate language ID step?

In this chapter:
• We propose an end-to-end multilingual decipherment model that can solve monoalphabetic substitution ciphers without explicit plaintext language identification, which we demonstrate on ciphers of 14 different languages.
• We conduct extensive testing of the proposed method in different realistic decipherment conditions: different cipher lengths, no-space ciphers, and ciphers with noise, and demonstrate that our model is robust to these conditions.
• We apply our model on synthetic ciphers as well as on the Borg cipher. We show that our multilingual model can crack the Borg cipher using the first 256 characters of the cipher.

4.2 The Decipherment Problem

In this chapter, we focus on solving monoalphabetic substitution ciphers. We follow Nuhn et al. (2013) and Kambhatla et al. (2018) and use machine translation notation to formulate our problem. We denote the ciphertext as f_1^N = f_1 ... f_j ... f_N and the plaintext as e_1^M = e_1 ... e_i ... e_M.¹

¹ Unless there is noise or space restoration, N = M; see Sections 4.5.4 and 4.5.2.
In a monoalphabetic substitution cipher, plaintext is encrypted into a ciphertext by replacing each plaintext character with a unique substitute according to a substitution table called the key. For example, the plaintext word “doors” would be enciphered to “KFFML” using the substitution table:

Cipher  Plain
K       d
F       o
M       r
L       s

The decipherment goal is to recover the plaintext given the ciphertext.

4.3 Decipherment Model

Inspired by character-level neural machine translation (NMT), we view decipherment as a sequence-to-sequence translation task. The motivation behind using a sequence-to-sequence model is:
• The model can be trained on multilingual data (Gao et al., 2020), making it potentially possible to obtain end-to-end multilingual decipherment without relying on a separate language ID step.
• Due to transcription challenges of historical ciphers, ciphertext could be noisy. We would like the model to have the ability to recover from that noise by inserting, deleting, or substituting characters while generating plaintext. Sequence-to-sequence models seem to be good candidates for this task.

[Figure 4.1: Decipherment as a sequence-to-sequence translation problem. (a) Input: example ciphers encoded with random keys; output: plaintext in the target language — the original ciphers fed to the model. (b) Input: example ciphers encoded according to frequency ranks in descending order; output: plaintext in the target language — the same ciphers after frequency encoding.]

4.3.1 Decipherment as a Sequence-to-Sequence Translation Problem

To cast decipherment as a supervised translation task, we need training data, i.e. pairs of <f_1^N, e_1^M> to train on. We can create this data using randomly generated substitution keys (Figure 4.1a). We can then train a character-based sequence-to-sequence decipherment model and evaluate it on held-out text which is also encrypted with (different) randomly generated substitution keys.
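A minimal sketch of how such <ciphertext, plaintext> training pairs can be generated; the function and variable names here are illustrative, not taken from the dissertation's released code:

```python
import random
import string

def random_key(alphabet=string.ascii_lowercase):
    """Draw a random 1:1 substitution table over the alphabet."""
    targets = random.sample(alphabet, len(alphabet))
    return dict(zip(alphabet, targets))

def encipher(plaintext, key):
    """Apply the key character by character; spaces pass through unchanged."""
    return "".join(key.get(c, c) for c in plaintext)

# Each training example pairs a cipher (source) with its plaintext (target);
# every example is drawn with a fresh random key.
plain = "the invention of writing systems"
cipher = encipher(plain, random_key())
```

Because each pair uses a different key, a given cipher letter carries no stable meaning across examples, which is exactly the embedding problem discussed next.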
However, if we attempt this experiment using the Transformer model described in Section 4.3.3, we get abysmal results (see Section 4.5.1 for scoring details). Increasing the amount of training data will not help; there are 26! ≈ 4 × 10^26 possible keys for English ciphers, and even if every key is represented, most of the training data will still be encoded with keys that are not used to encode the test data. In fact, since each training example uses a different key, we cannot assume that a character type has any particular meaning. The fundamental assumption behind embeddings is therefore broken. In the next section, we describe one way to overcome these challenges.

4.3.2 Frequency Analysis

To address the aforementioned challenges, we employ a commonly used technique in cryptanalysis called frequency analysis. Frequency analysis is attributed to the great polymath Al-Kindi (801–873 C.E.) (Dooley, 2013). This technique has been used in previous decipherment work (Hauer and Kondrak, 2016; Kambhatla et al., 2018). It is based on the fact that in a given text, letters and letter combinations (n-grams) appear in varying frequencies, and that the character frequency distribution is roughly preserved in any sample drawn from a given language. So, in different pairs of <f_1^N, e_1^M>, we expect the frequency distribution of characters to be similar. To encode that information, we re-map each ciphertext character to a value based on its frequency rank (Figure 4.1b). This way, we convert any ciphertext to a “frequency-encoded” cipher. Intuitively, by frequency encoding, we are reducing the number of possible substitution keys (assuming frequency rank is roughly preserved across all ciphers from a given language). This is only an approximation, but it helps restore the assumption that there is a coherent connection between a symbol and its type embedding.
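Concretely, frequency encoding can be sketched as follows; this is an illustrative implementation rather than the dissertation's code, with spaces left unencoded (as in the with-spaces setting) and frequency ties broken by first occurrence:

```python
from collections import Counter

def frequency_encode(ciphertext):
    """Replace each cipher character with its frequency rank:
    0 for the most frequent character, 1 for the next, and so on."""
    counts = Counter(c for c in ciphertext if c != " ")
    ranked = sorted(counts, key=lambda c: (-counts[c], ciphertext.index(c)))
    rank = {c: str(i) for i, c in enumerate(ranked)}
    # Return a token list, since ranks above 9 need more than one digit.
    return [rank.get(c, c) for c in ciphertext]
```

Two encipherments of the same plaintext under different keys now map to the same frequency-encoded sequence, restoring a stable meaning for each input symbol.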
For example, if the letters e and i are the most frequent characters in English, then in any monoalphabetic substitution cipher, they will be encoded as 0 or 1 instead of a randomly chosen character.

4.3.3 The Transformer

We follow the character-based NMT approach in Gao et al. (2020) and use the Transformer model (Vaswani et al., 2017) for our decipherment problem. The Transformer is an attention-based encoder-decoder model that has been widely used in the NLP community to achieve state-of-the-art performance on many sequence modeling tasks. We use the standard Transformer architecture, which consists of six encoder layers and six decoder layers as described in Gao et al. (2020).

4.4 Data

For training, we create monoalphabetic substitution ciphers for 14 languages using random keys. For English, we use English Gigaword (Parker et al., 2011). We scrape historical text from Project Gutenberg for 13 other languages, namely: Catalan, Danish, Dutch, Finnish, French, German, Hungarian, Italian, Latin, Norwegian, Portuguese, Spanish, and Swedish.² Table 3.2 summarizes our datasets. Following previous literature (Nuhn et al., 2013; Kambhatla et al., 2018), we lowercase all characters and remove all non-alphabetic and non-space symbols. We make sure ciphers do not end in the middle of a word. We strip accents for languages other than English.

4.5 Experimental Evaluation

To make our experiments comparable to previous work (Nuhn et al., 2013; Kambhatla et al., 2018), we create test ciphers from the English Wikipedia article about History.³ We use this text to create ciphers of length 16, 32, 64, 128, and 256 characters. We generate 50 ciphers for each length. We follow the same pre-processing steps to create training data.
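The pre-processing described above (lowercasing, removing non-alphabetic and non-space symbols, and stripping accents) might look like the following sketch; the function name is ours, and the exact normalization details are an assumption:

```python
import re
import unicodedata

def preprocess(text, strip_accents=True):
    """Lowercase, optionally strip accents, and keep only letters and spaces,
    collapsing any resulting runs of whitespace."""
    text = text.lower()
    if strip_accents:
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    text = re.sub(r"[^a-z ]+", " ", text)   # drop digits, punctuation, etc.
    return re.sub(r" +", " ", text).strip()

preprocess("Història de l'edat mitjana, 1492!")  # → "historia de l edat mitjana"
```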
² Our dataset is available at https://github.com/NadaAldarrab/s2s-decipherment
³ https://en.wikipedia.org/wiki/History

We carry out four sets of experiments to study the effect of cipher length, space encipherment/removal, unknown plaintext language, and transcription noise. Finally, we test our models on a real historical cipher, whose plaintext language was not known until recently.

As an evaluation metric, we follow previous literature (Kambhatla et al., 2018) and use Symbol Error Rate (SER). SER is the fraction of incorrect symbols in the deciphered text. For space restoration experiments (Section 4.5.2), we use Translation Edit Rate (TER) (Snover et al., 2006), but on the character level. We define character-level TER as:

TER = (# of edits) / (# of reference characters)    (4.1)

where possible edits include the insertion, deletion, and substitution of single characters. When the ciphertext and plaintext have equal lengths, SER is equal to TER.

We use FAIRSEQ to train our models (Ott et al., 2019). We mostly use the same hyperparameters as Gao et al. (2020) for character NMT, except that we set the maximum batch size to 10K tokens and use half precision floating point computation for faster training. The model has about 44M parameters. Training on a Tesla V100 GPU takes about 110 minutes per epoch. We train for 20 epochs. Decoding takes about 400 character tokens/s. We use a beam size of 100. Unless otherwise stated, we use 2M example ciphers to train, 3K ciphers for tuning, and 50 ciphers for testing in all experiments. We report the average SER on the 50 test ciphers of each experiment.

4.5.1 Cipher Length

We first experiment with ciphers of length 256 using the approach described in Section 4.3.1 (i.e. we train a Transformer model on pairs of <f_1^N, e_1^M> without frequency encoding). As expected,
As expected, 43 CipherLength 16 32 64 128 256 Beam NLM (Kambhatla et al., 2018) 26.80 5.80 0.07 0.01 0.00 Beam (NLM + FreqMatch) (Kambhatla et al., 2018) 31.00 2.90 0.07 0.02 0.00 Transformer + Freq + separate models (this work) 20.62 1.44 0.41 0.02 0.00 Transformer + Freq + single model (this work) 19.38 2.44 1.22 0.02 0.00 Table 4.1: SER (%) for solving 1:1 substitution ciphers of various lengths using our decipherment method. the model is not able to crack the 50 test ciphers, resulting in an SER of 71.75%. For the rest of the experiments in this chapter, we use the frequency encoding method described in Section 4.3.2. Short ciphers are more challenging than longer ones. Following previous literature, we report results on different cipher lengths using our method. Table 4.1 shows decipherment results on ciphers of length 16, 32, 64, 128, and 256. For the 256 length ciphers, we use the aforementioned 2M train and 3K development splits. For ciphers shorter than 256 characters, we increase the number of examples such that the total number of characters remains nearly constant, at about 512M characters. We experiment with training five different models (one for each length) and training a single model on ciphers of mixed lengths. In the latter case, we also use approx. 512M characters, divided equally among different lengths. The results in Table 4.1 show that our model achieves comparable results to the state-of-the-art model of Kambhatla et al. (2018) on longer ciphers, including perfect decipherment for ciphers of length 256. The table also shows that our method is more accurate than Kambhatla et al. (2018) for shorter, more difficult ciphers of lengths 16 and 32. In addition, our method provides the ability to train on multilingual data, which we use to attack ciphers with an unknown plaintext language as described in Section 4.5.3. 
4.5.2 No-Space Ciphers

The inclusion of white space between words makes decipherment easier because word boundaries can give a strong clue to the cryptanalyst. In many historical ciphers, however, spaces are hidden. For example, in the Copiale cipher (Figure 1.1a), spaces are enciphered with special symbols just like other alphabetic characters (Knight et al., 2011). In other ciphers, spaces might be omitted from the plaintext before enciphering, as was done in the Zodiac-408 cipher (Nuhn et al., 2013). We test our method in four scenarios:
1. Ciphers with spaces (comparable to Kambhatla et al. (2018)).
2. Ciphers with enciphered spaces. In this case, we treat space like other cipher characters during frequency encoding as described in Section 4.3.2.
3. No-space ciphers. We omit spaces on both the source and target sides.
4. No-space ciphers with space recovery. We omit spaces from the source but keep them on the target side. The goal here is to train the model to restore spaces along with the decipherment.

Table 4.2 shows results for each of the four scenarios on ciphers of length 256. During decoding, we force the model to generate tokens to match the source length. Results show that the method is robust to both enciphered and omitted spaces. In scenario 4, where the model is expected to generate spaces and thus the output length differs from the input length, we limit the output to exactly 256 characters, but we allow the model freedom to insert spaces where it sees fit. The model generates spaces in accurate positions overall, leading to a TER of 1.88%.

Cipher Type                           TER (%)
Ciphers with spaces                    0.00
Ciphers with enciphered spaces         0.00
No-space ciphers                       0.77
No-space ciphers + generate spaces     1.88

Table 4.2: TER (%) for solving monoalphabetic substitution ciphers of length 256 with different spacing conditions.
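The character-level TER of Equation 4.1 is a plain Levenshtein edit distance normalized by the reference length; a minimal sketch (our own illustrative code, not the evaluation script used in the experiments):

```python
def char_ter(hyp, ref):
    """Character-level TER: minimum number of single-character insertions,
    deletions, and substitutions, divided by the reference length."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1,              # delete h
                           cur[j - 1] + 1,           # insert r
                           prev[j - 1] + (h != r)))  # substitute / match
        prev = cur
    return prev[-1] / len(ref)
```

When hypothesis and reference have the same length and only substitutions occur, this reduces to SER, as noted above.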
# lang    ca    da    nl    en    fi    fr    de    hu    it    la    no    pt    es    sv    avg
3          -     -     -  0.04     -  0.23     -     -     -     -     -     -  0.39     -  0.29
7          -     -     -  0.08     -  0.34  0.30     -  1.23  1.38     -  0.48  0.40     -  0.60
14      0.34  1.29  0.79  0.25  0.20  0.20  0.41  0.64  1.52  1.43  0.41  0.69  0.72  0.70  0.68

Table 4.3: SER (%) for solving monoalphabetic substitution ciphers using a multilingual model trained on a different number of languages. Each language is evaluated on 50 test ciphers generated with random keys.

4.5.3 Unknown Plaintext Language

While combing through libraries and archives, researchers have found many ciphers that are not accompanied by any cleartext or keys, leaving the plaintext language of the cipher unknown (Megyesi et al., 2020). To solve that problem, we train a single multilingual model on the 14 different languages described in Section 4.4. We train on a total of 2.1M random ciphers of length 256 (divided equally among all languages). We report results as the number of training languages increases while keeping the total number of 2.1M training examples fixed (Table 4.3). Increasing the number of languages negatively affects performance, as expected. However, our experiments show that the 14-language model is still able to decipher 700 total test ciphers with an average SER of 0.68%. Since we are testing on 256-character ciphers, this translates to no more than two errors per cipher on average.

4.5.4 Transcription Noise

Real historical ciphers can have a lot of noise. This noise can come from the natural degradation of historical documents, human mistakes during a manual transcription process, or misspelled words by the author, as in the Zodiac-408 cipher. Noise can also come from automatically transcribing historical ciphers using Optical Character Recognition (OCR) techniques (Yin et al., 2019). It is thus crucial to have a robust decipherment model that can still crack ciphers despite the noise. Hauer et al.
(2014) test their proposed method on noisy ciphers created by randomly corrupting log_2(N) of the ciphertext characters. However, automatic transcription of historical documents is very challenging and can introduce more types of noise, including the addition and deletion of some characters during character segmentation (Yin et al., 2019). We test our model on three types of random noise: insertion, deletion, and substitution. We experiment with different noise percentages for ciphers of length 256 (Table 4.4). We report the results of training (and testing) on ciphers with only substitution noise and ciphers that have all three types of noise (divided equally). We experimentally find that training the models with 10% noise gives the best overall accuracy, and we use those models to get the results in Table 4.4. Our method is able to decipher with up to 84% accuracy on ciphers with 20% random insertion, deletion, and substitution noise. Figure 4.2 shows an example output for a cipher with 15% noise. The model recovers most of the errors, resulting in a TER of 5.86%. One of the most challenging noise scenarios, for example, is the deletion of the last two characters from the word “its.” The model output the word “i,” which is a valid English word. Of course, the more noise there is, the harder it is for the model to recover due to error accumulation.

            Noise Type
% Noise     sub      sub, ins, del
   5        1.10         2.87
  10        2.40         5.87
  15        5.28        10.58
  20       11.48        16.17
  25       17.63        27.43

Table 4.4: TER (%) for solving monoalphabetic substitution ciphers with random insertion, deletion, and substitution noise. These models have been trained with 10% noise.

4.5.5 The Borg Cipher

The Borg cipher is a 400-page book digitized by the Biblioteca Apostolica Vaticana (Figure 1.1b).⁴ The first page of the book is written in Arabic script, while the rest of the book is enciphered using astrological symbols. The plaintext language of the book is Latin.
The deciphered book reveals pharmacological knowledge and other information about that time. We train a Latin model on 1M ciphers and use the first 256 characters of the Borg cipher to test our model. Our model is able to decipher the text with an SER of 3.91% (Figure 4.3). We also try our 14-language multilingual model on this cipher, and obtain an SER of 5.47%. This is a readable decipherment and can be easily corrected by Latin scholars who would be interested in such a text.

[Figure 4.3: The first 132 characters of the Borg cipher and its decipherment. Errors are underlined. Correct words are: pulegi, benedicti, crispe, ozimi, and feniculi.]

[Figure 4.2: Example system output for a cipher with 15% random noise. Substitutions, insertions, and deletions are denoted by the letters s, i, and d, respectively. The system recovered 34/40 errors (TER is 5.86%). Highlighted segments show the errors that the system failed to recover from.]

⁴ http://digi.vatlib.it/view/MSS_Borg.lat.898

4.6 Anagram Decryption

To further test the capacity of our model, we experiment with a special type of noise. In this section, we address the challenging problem of solving substitution ciphers in which letters within each word have been randomly shuffled. Anagramming is a technique that can be used to further disguise substitution ciphers by permuting characters. Various theories about the mysterious Voynich Manuscript, for example, suggest that some anagramming scheme was used to encode the manuscript (Reddy and Knight, 2011).

Hauer and Kondrak (2016) propose a two-step approach to solve this problem. First, they use their monoalphabetic substitution cipher solver (Hauer et al., 2014) to decipher the text. The solver is based on tree search for the key, guided by character-level and word-level n-gram language models. They adapt the solver by relaxing the letter order constraint in the key mutation component of the solver. They then re-arrange the resulting deciphered characters using a word trigram language model.

We try a one-step, end-to-end anagram decryption model. In our sequence-to-sequence formulation, randomly shuffled characters can confuse the training. We thus represent an input cipher as a bag of frequency-mapped characters, nominally presented in frequency rank order (Figure 4.4).
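The bag-of-frequencies representation (steps (4)–(5) of Figure 4.4) can be sketched as follows; this is illustrative code rather than the dissertation's implementation, with frequency ties broken by first occurrence:

```python
from collections import Counter

def anagram_encode(ciphertext):
    """Map each character to its frequency rank, then sort the ranks inside
    each word, so the random letter order within words no longer matters."""
    counts = Counter(c for c in ciphertext if c != " ")
    ranked = sorted(counts, key=lambda c: (-counts[c], ciphertext.index(c)))
    rank = {c: i for i, c in enumerate(ranked)}
    return [sorted(rank[c] for c in word) for word in ciphertext.split(" ")]
```

Any anagramming of the letters within a word leaves its encoding unchanged, so the model sees a stable input regardless of the shuffling.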
[Figure 4.4: Example anagram encryption and decryption process:
(1) t h e _ i n v e n t i o n _ o f _ w r i t i n g _ s y s t e m s (original plaintext)
(2) j c z _ m r b z r j m k r _ k f _ w u m j m r e _ a o a j z g a (after applying a 1:1 substitution key)
(3) c j z _ k z m r b r j m r _ f k _ e w u j m m r _ z g o a j a a (after anagramming; this is the ciphertext)
(4) 6 0 3 _ 5 3 1 2 7 2 0 1 2 _ 8 5 _ 11 9 10 0 1 1 2 _ 3 13 12 4 0 4 4 (after frequency encoding)
(5) 0 3 6 _ 0 1 1 2 2 2 3 5 7 _ 5 8 _ 0 1 1 2 9 10 11 _ 0 3 4 4 4 12 13 (after sorting frequencies; this is fed to the Transformer)
(6) t h e _ i n v e n t i o n _ o f _ b r i t a i n _ s y s t e m s (system output; errors are highlighted)]

We use the English Gigaword dataset to train a 256-character model on the sorted frequencies and test on the aforementioned test set of 50 ciphers (after applying random anagramming). Following Hauer and Kondrak (2016), we report word accuracy on this task. Our model achieves a word accuracy of 95.82% on the 50 Wikipedia ciphers. Hauer and Kondrak (2016) report results on a test set of 10 long ciphers extracted from 10 Wikipedia articles about art, Earth, Europe, film, history, language, music, science, technology, and Wikipedia. Ciphers have an average length of 522 characters. They use English Europarl to train their language models (Koehn, 2005). To get comparable results, we trained a model on ciphers of length 525 created from the English side of the Spanish-English Europarl dataset. Our model achieved a word accuracy of 96.05% on Hauer and Kondrak’s test set. Training on English Gigaword gave a word accuracy of 97.16%, comparable to the 97.72% word accuracy reported by
Hauer and Kondrak (2016). This shows that our simple model can crack randomly anagrammed ciphers, which hopefully inspires future work on other cipher types.

4.7 Conclusion

In this chapter, we present an end-to-end decipherment model that is capable of solving monoalphabetic substitution ciphers without the need for explicit language identification. We use frequency analysis to make it possible to train a multilingual Transformer model for decipherment. Our method is able to decipher 700 ciphers from 14 different languages with less than 1% SER. We apply our method on the Borg cipher and achieve 5.47% SER using the multilingual model and 3.91% SER using a monolingual Latin model. In addition, our experiments show that these models are robust to different types of noise, and can even recover from many such errors.

Chapter 5
Ciphertext Segmentation

As discussed in previous chapters, deciphering historical substitution ciphers is a challenging problem. Example problems that have been previously studied include detecting cipher type, detecting plaintext language, and acquiring the substitution key for segmented ciphers. However, attacking unsegmented ciphers is still a challenging task. Segmentation (i.e. finding substitution units) is the first step towards cracking those ciphers. In this chapter, we propose the first automatic methods to segment those ciphers using Byte Pair Encoding (BPE) and unigram language models. Our methods achieve an average segmentation error of 2% on 100 randomly-generated monoalphabetic ciphers and 27% on 3 real homophonic ciphers. We also propose a method for solving non-deterministic ciphers with existing keys using a lattice and a pretrained language model. Our method leads to the full solution of the IA cipher, a real historical cipher that had not been fully solved until this work.

5.1 Introduction

Deciphering historical substitution ciphers has attracted attention from the natural language processing community.
Example work includes (Ravi and Knight, 2008; Corlett and Penn, 2010; Nuhn et al., 2013; Nuhn and Knight, 2014; Hauer et al., 2014; Kambhatla et al., 2018; Aldarrab and May, 2021). However, these methods all assume that cipher elements are clearly segmented (i.e., that token boundaries are well established). Many historical documents, however, are enciphered as continuous sequences of digits that hide token boundaries (Lasry et al., 2020). An example cipher (the IA cipher) is shown in Figure 5.1 (Megyesi et al., 2020). Solving those ciphers is very challenging since it is not possible to directly search for the key without finding substitution units. We use the term numerical ciphers to refer to unsegmented substitution ciphers that use a numerical symbol set.¹

[Figure 5.1: The IA cipher (16th century).²]

In this chapter:
• We propose novel unsupervised methods to segment numerical ciphers with no existing keys using Byte Pair Encoding (BPE) (Gage, 1994) and unigram language models (Kudo, 2018).
• We conduct extensive testing of our methods on different cipher types. We report results on synthetic and real historical ciphers and show how performance varies with cipher type and length. Our methods achieve an average segmentation error of 2% on 100 randomly-generated monoalphabetic ciphers and 27% on 3 real homophonic ciphers.
• We propose the first model to segment non-deterministic numerical ciphers with existing keys using a segmentation lattice and a pretrained language model. Our method unveils the content of the IA cipher, a letter from the 16th century that had not been revealed until this work.

¹ The proposed methods can of course be applied to any unsegmented substitution cipher, regardless of the chosen symbol set.

5.2 Problem Definition

A substitution cipher is a cipher that is created by substituting each plaintext character with another character according to a substitution table called the key. We define major terms in the following subsections.
5.2.1 Substitution types

In this chapter, we focus on two types of substitution ciphers: monoalphabetic and homophonic ciphers. Monoalphabetic ciphers are created by replacing each plaintext character with a unique substitute using a 1→1 substitution key. Homophonic ciphers are created by replacing each plaintext character with one of multiple possible substitutes using a 1→M substitution key.

[Figure 5.2: An example homophonic key from the Vatican Secret Archives (16th century).³ The added highlight in the top section shows that, e.g., i can be substituted with 54 or 74. The bottom section contains nomenclature elements.]

For example, the key shown in Figure 5.2 contains a homophonic substitution table (the top part). As shown in the figure, each plaintext character (e.g. i) can be substituted with one of multiple characters (e.g. 54 or 74). It is common to encipher vowels with more than one character, which makes homophonic ciphers harder to crack.

5.2.2 Cipher elements

Cipher elements are substitution units that correspond to plaintext elements according to a cipher key. There are three main types of cipher elements in historical ciphers:
• Regular elements: These elements usually encode letters, common syllables, or prepositions. In the example key shown in Figure 5.2, the top part defines regular cipher elements.
• Nomenclature elements: This refers to elements in a key that represent whole words (often proper names). In Figure 5.2, the second part defines nomenclature elements.
• Nulls: These are cipher elements that do not correspond to any plaintext word or character. Nulls are usually used in ciphers to confuse cryptanalysts. Sometimes, nulls are used for a purpose. For example, they could be used to mark the beginning of nomenclature elements in numerical ciphers.

² https://de-crypt.org/decrypt-web/RecordsView/189?showdetail=

5.2.3 Fixed and variable-length ciphers

Numerical ciphers can be classified as fixed or variable-length ciphers.
In fixed-length ciphers, regular elements have the same length (i.e. the same number of digits). However, in variable-length ciphers, regular elements can be of different lengths. For example, the letter a might be enciphered as 1, 12, or 121.

5.2.4 Ciphertext segmentation

Numerical ciphers pose a special challenge, in that they hide cipher element boundaries. For example, in the numerical cipher shown in Figure 5.1, it is unclear which digits represent substitution units. Identifying substitution units, which we call segmentation, is a challenging task that is necessary to solve these ciphers. Another challenge in solving numerical ciphers is that the segmentation could be non-deterministic. For example, a cipher can have these substitutions in its key:

Cipher  Plain
2       a
22      n
8       d

which means that the ciphertext 2228 can be segmented as:

Cipher Segmentation   Plain
2 | 2 | 2 | 8         a a a d
2 | 22 | 8            a n d
22 | 2 | 8            n a d

Such ciphers are called non-deterministic ciphers. Deterministic ciphers, on the other hand, only have one possible segmentation according to their keys. In this chapter, we focus on the problem of segmenting numerical ciphers. We look at two cases for numerical cipher segmentation depending on whether or not a key exists for the cipher at hand. The following sections describe our proposed methods for each case.

5.3 Segmenting Ciphers with no Existing Keys

This is the first (and the more challenging) case: segmenting a numerical cipher with no existing key. In this case, all we have is a sequence of digits (e.g. Figure 5.3). To solve the cipher, we first need to segment the ciphertext before trying to find the substitution key. In this section, we describe our proposed methods for ciphertext segmentation.

5.3.1 Baselines

We first try two baselines: 1-digit and 2-digit segmentation. We remove line breaks and consider the text as one long sequence of digits. In 1-digit segmentation, we split ciphertext into individual digits.
In 2-digit segmentation, we split the ciphertext into two-digit elements (except the last digit if the number of digits in the cipher is odd). The latter is a stronger baseline since we notice that most cipher elements in historical ciphers are two digits long.

[3] https://de-crypt.org/decrypt-web/RecordsView/206?showdetail=

5.3.2 Byte Pair Encoding (BPE)

Our first proposed method for cipher segmentation is Byte Pair Encoding (BPE). BPE is a simple compression algorithm that has been used for many natural language processing tasks (Gage, 1994; Sennrich et al., 2016). In BPE, the most frequent pair of bytes is iteratively replaced with a single, unused byte that represents the replaced pair. The motivation behind using BPE for our problem is that digits belonging to the same cipher element have high mutual information, so we would like them to be grouped together.

5.3.3 Unigram language model

One downside of BPE is that it is a greedy algorithm that employs a deterministic symbol replacement strategy. BPE does not provide multiple possible segmentations with probabilities. As we observe in our experiments (Section 5.5.1), the resulting BPE segmentation leaves many singleton digits unpaired. To mitigate this problem, we use the subword segmentation algorithm proposed by Kudo (2018), which is based on a unigram language model. This algorithm provides candidate segmentations with probabilities. The unigram language model assumes that each subword occurs independently; thus, the probability of a subword sequence is the product of the subword probabilities. The probabilities are iteratively estimated using the Expectation Maximization (EM) algorithm. The most probable subword segmentation is then found with the Viterbi algorithm.

We evaluate the two baselines and our proposed methods on synthetic and real historical ciphers. The following sections describe our datasets, experiments, and results.
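To illustrate the Viterbi step of the unigram-language-model segmenter, here is a minimal Python sketch. It is not the SentencePiece implementation used in our experiments, and the piece probabilities below are invented for the example rather than learned by EM:

```python
from math import log

def viterbi_segment(digits, logp, max_len=2):
    """Most probable segmentation of a digit string under a unigram
    model: best[i] = max over j of best[j] + log P(digits[j:i])."""
    n = len(digits)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = digits[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    # Recover the segment boundaries by walking the back-pointers.
    segments, i = [], n
    while i > 0:
        segments.append(digits[back[i]:i])
        i = back[i]
    return list(reversed(segments))

# Invented unigram probabilities over cipher elements (not taken from
# the trained models in our experiments).
probs = {"17": 0.3, "71": 0.2, "77": 0.35, "65": 0.1, "1": 0.03, "7": 0.02}
logp = {k: log(v) for k, v in probs.items()}

print(viterbi_segment("65177771", logp))  # → ['65', '17', '77', '71']
```

Because the search scores whole candidate segmentations rather than committing to the most frequent pair first, it recovers 65 | 17 | 77 | 71 even though 77 is the single most probable piece.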
Cipher  Length  Types  Tokens  1-dig Tokens  2-dig Tokens  1+2-dig Tokens
S304    1,258   82     675     150 (22%)     496 (73%)     646 (96%)
C13     1,879   97     917     0 (0%)        872 (95%)     872 (95%)
F283    2,239   50     1,050   1 (0%)        979 (93%)     980 (93%)

Table 5.1: Statistics of the ciphers obtained from the DECRYPT database. Length is the number of digits in the cipher. Types and tokens are those of cipher elements, not individual cipher digits.

5.4 Data

To evaluate our methods on monoalphabetic numerical ciphers, we create synthetic ciphers from English Wikipedia. We notice that most historical ciphers in the DECRYPT collection are two pages long and contain about 2K characters, so we choose a cipher length of 2,048 for our experiments. We create 100 English ciphers using randomly generated keys. We use the numbers from 0 to 99 as possible cipher elements. This creates variable-length ciphers when single digits are chosen in the key. We report the average scores of the 100 ciphers for each experiment.

For evaluation on real historical ciphers, we use 3 ciphers from the Vatican Secret Archives, retrieved from the DECRYPT database (Megyesi et al., 2020; Lasry et al., 2020). Table 5.1 shows cipher statistics. We use ciphers C13, S304, and F283 (Figure 5.3). For these ciphers, human transcriptions and gold segmentations are available on the DECRYPT database.

[4] https://de-crypt.org/decrypt-web/RecordsList

Figure 5.3: Three historical ciphers from the Vatican Secret Archives.[4] a) The S304 cipher (18th century). b) The C13 cipher (17th century). c) The F283 cipher (16th century).

5.5 Experimental Evaluation

We carry out three types of experiments. First, we start with the simplest cipher type: monoalphabetic ciphers with spaces. The existence of spaces indicates word boundaries, which gives some clues on how to segment the ciphertext. Second, we remove spaces and try to segment the same monoalphabetic ciphers, which is expected to be a harder task.
Third, we experiment with segmenting homophonic ciphers, which is the most challenging case discussed in this chapter. We apply our proposed segmentation methods to both synthetic and real historical ciphers. We also study the effect of cipher length on segmentation quality.

For evaluation, we use two metrics: F1 and Segmentation Error Rate (SegER). We use F1 to measure how good an algorithm is at finding cipher elements. For each algorithm, we compare the learned vocabulary with the gold vocabulary from the gold segmentation and report F1 scores. We use SegER to evaluate the segmented ciphertext produced by each segmentation algorithm. We define SegER as:

SegER = (# of edits) / (# of reference segments)     (5.1)

where possible edits include the insertion, deletion, and substitution of single segments.

We use the SentencePiece implementation of BPE and the unigram language model (Kudo and Richardson, 2018). For homophonic ciphers, we set the vocabulary size to the maximum number found by the unigram language model. For monoalphabetic ciphers, we set the vocabulary size to 36 (26 maximum possible 2-digit elements for 1-1 substitutions + 10 singleton digits). We use the default settings in SentencePiece, but we set character coverage to 100%. We learn subwords from raw unsegmented ciphertext, represented as continuous sequences of digits. We keep line breaks as they appear on the cipher scans since we notice that line breaks usually do not cut through a cipher element in historical ciphers.

Model           w/ spaces  w/o spaces
1-dig baseline  13.62      13.62
2-dig baseline  56.77      47.91
BPE             64.92      63.59
BPE 2           80.95      79.59
Unigram LM      58.41      56.64
Unigram LM 2    80.41      80.71

Table 5.2: Average F1 % (↑) for segmenting 100 synthetic ciphers using different models. In BPE 2 and Unigram LM 2, the maximum piece length is set to 2.

5.5.1 Monoalphabetic ciphers with word spaces

We first experiment with monoalphabetic ciphers with spaces. We test our methods on the 100 synthetic ciphers described in Section 5.4.
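Equation 5.1 is a normalized edit distance over segment sequences. A minimal sketch of the metric (a hypothetical helper, not the evaluation code used in our experiments), applied to one of the BPE 2 errors shown in Figure 5.4:

```python
def seg_er(hyp, ref):
    """Segmentation Error Rate: Levenshtein distance between the
    hypothesis and reference segment sequences, divided by the
    number of reference segments (Equation 5.1)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n] / n

ref = ["86", "1", "17", "77", "65", "39"]        # gold segmentation
hyp = ["86", "1", "1", "77", "7", "65", "39"]    # BPE 2 output (Figure 5.4)
print(f"{seg_er(hyp, ref):.3f}")  # → 0.333 (one substitution + one insertion)
```

Note that edits are counted over whole segments, so a single wrong merge typically costs two edits (the bad piece plus the leftover digit), which is why greedy merge errors are penalized heavily under this metric.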
Table 5.2 (first column) shows F1 scores for all models. As expected, the 2-digit baseline is better than the 1-digit baseline, with an F1 score of 57% as opposed to less than 14% for the 1-digit baseline. BPE performs better than both the 1-digit and 2-digit baselines, with an F1 score of about 65%.

As noted in Section 5.3.1, most cipher elements in historical ciphers are one or two digits long. In our random sample of three historical ciphers, about 95% of cipher tokens are one or two digits long (as shown in Table 5.1). Longer elements appear less often (less than 5% of tokens in our test ciphers). Thus, we limit BPE piece length to a maximum of 2 digits. This improves the F1 score to about 81%, an improvement of about 25% over default BPE. We call this model "BPE 2" in Table 5.2.

We then apply the unigram language model of Kudo (2018). We notice that the default unigram language model performs worse than default BPE. However, after adding the 2-digit heuristic, we get an F1 score of 80% using the unigram language model (called "Unigram LM 2" in Table 5.2), which is comparable to BPE 2.

Model           w/ spaces  w/o spaces
1-dig baseline  181.05     181.05
2-dig baseline  23.20      49.89
BPE             34.32      36.74
BPE 2           10.95      13.72
Unigram LM      40.28      41.61
Unigram LM 2    2.45       2.70

Table 5.3: Average SegER % (↓) for segmenting 100 synthetic ciphers using different models.

We then apply the learned vocabularies to segment the ciphertexts. Table 5.3 (first column) shows SegER scores for all models. Default BPE does not perform better than the 2-digit baseline in terms of the resulting segmentation quality. However, BPE 2 gives a 53% improvement over the 2-digit baseline, with a SegER of 10.95%. The unigram language model with the 2-digit heuristic gives the best result of all models, with a SegER of 2.45%.

To better explain the motivation behind using the unigram language model, we show example BPE 2 errors in Figure 5.4. These two examples come from the same cipher.
For this cipher, BPE learned the right vocabulary elements: 17, 71, and 77. However, since 77 is the most frequent of the three, BPE always prefers to merge the two 7s first. This early merge leaves two unmerged single digits (1 and 7 in the first example, and two 1s in the second example). In this way, BPE 2 misses the correct merges of 17 and 71. The unigram language model, on the other hand, looks at the overall score of segmentation candidates and chooses the most probable one according to unigram frequencies. In both examples, Unigram LM 2 does a better job of segmenting the ciphertext.

Gold          86 1 17 77 65 39
BPE 2         86 1 1 77 7 65 39
Unigram LM 2  86 1 17 77 65 39

Gold          65 17 77 71
BPE 2         65 1 77 77 1
Unigram LM 2  65 17 77 71

Figure 5.4: Example BPE segmentation errors (incorrect merges underlined). In both examples, BPE chooses the wrong merges as it goes greedily from left to right. Unigram LM, on the other hand, looks at candidate segmentations and chooses the highest-scoring candidate based on segmentation probabilities.

To study the effect of cipher length on segmentation quality, we create a monoalphabetic substitution cipher with variable-length cipher elements from English text. The cipher's length is 16,384 characters. We start by testing our model on the first 128 characters of the text, then we increase the length by powers of 2 until we reach 16,384. Figure 5.5 shows segmentation results for different cipher lengths. As expected, segmentation quality improves as cipher length increases. We notice the largest improvement going from 128 to 256 characters. Segmentation quality keeps improving until it almost plateaus after 2,048 characters.

5.5.2 Monoalphabetic ciphers without word spaces

We test our methods on the same set of 100 synthetic ciphers after removing spaces. To resemble real historical ciphers, we break the ciphertext into 43-character lines.
The number of characters per line varies from one cipher to another, but as an approximation, we choose the average number of characters per line in a random sample of real ciphers.

As shown in Tables 5.2 and 5.3 (second column), F1 and SegER scores for no-space monoalphabetic ciphers are generally slightly worse than for ciphers with spaces. Our best-performing model (Unigram LM 2) achieves a SegER of 2.7% on no-space monoalphabetic ciphers, which is very close to the 2.45% on the same ciphers with spaces.

Figure 5.5: F1 % (↑) and SegER % (↓) for segmentation of different cipher lengths.

5.5.3 Homophonic ciphers

We test our segmentation methods on three real homophonic ciphers: S304, C13, and F283 (Table 5.1). Note that S304 is the shortest and F283 is the longest of these ciphers (F283 is almost twice as long as S304). As in previous experiments, we first evaluate how good the learned vocabularies are. Table 5.4 shows F1 scores for different models. We notice that the 2-digit baseline is a strong baseline since most cipher elements in these historical ciphers are 2-digit. Our BPE and unigram models with the 2-digit heuristic give comparable results on ciphers S304 and C13, with an F1 score of 60-65%. BPE 2 and Unigram LM 2 give the highest F1 score on cipher F283. Overall, Unigram LM 2 is the best-performing model, with an average F1 score of 60% over all three ciphers.

Model           S304   C13    F283
1-dig baseline  6.52   0.00   3.28
2-dig baseline  63.95  64.95  41.94
BPE             34.94  51.28  41.24
BPE 2           61.45  64.62  53.61
Unigram LM      18.07  30.77  41.24
Unigram LM 2    60.24  64.62  53.61

Table 5.4: F1 % (↑) for segmenting three real homophonic ciphers using different models.

Model           S304    C13     F283
1-dig baseline  164.15  204.91  213.14
2-dig baseline  60.00   41.11   64.19
BPE             78.07   51.80   50.10
BPE 2           63.11   46.02   38.29
Unigram LM      84.59   72.85   38.19
Unigram LM 2    46.67   20.83   14.95

Table 5.5: SegER % (↓) for segmenting three real homophonic ciphers using different models.
We then evaluate the resulting segmentation for the three real ciphers. Table 5.5 shows SegER scores for our models. The 2-digit baseline is much better than the 1-digit baseline on these historical ciphers, with an average improvement of more than 70%. As we saw in our synthetic, monoalphabetic cipher experiments, restricting piece length to a maximum of 2 improves performance for BPE and Unigram LM. With the 2-digit heuristic, SegER improves by an average of 18% and 58% for BPE and Unigram LM, respectively. While we could not find previously published work on this problem, our best method (Unigram LM 2) achieves an average SegER of 27% on the three real homophonic ciphers, with the best score of 14% on the longest cipher, the 1,050-token F283.

5.6 Segmenting Non-Deterministic Ciphers with an Existing Key

We now consider the second case: suppose we have a cipher and a key, but the cipher is non-deterministic. This case can arise in practice when the key of the cipher is found while combing through historical archives, for example. Alternatively, the key could have been found by a cryptanalyst by solving part of the cipher. Although the cipher key exists in these scenarios, the non-deterministic segmentation makes it impossible to directly apply the key to recover the plaintext (recall the ambiguous segmentation example of the word and from Section 5.2.4). In this case, it is very challenging to manually recover the whole plaintext, especially when the cipher is very long.

5.6.1 Lattice segmentation

We take as an example the IA cipher (Figure 5.1), which we retrieved from the DECRYPT database (Megyesi et al., 2020). The first few lines of this 16th-century cipher were deciphered in 2019. However, since the cipher is non-deterministic, the remaining ciphertext (more than 200 lines) had not yet been deciphered. This is a real use case for our proposed method: a real historical cipher with an existing key but with a non-deterministic segmentation. For example, consider this part of the IA cipher key:
For example, consider this part of the IA cipher key: Cipher Plain Cipher Plain 0 e 22 p 2 o 24 r 4 a 25 t 5 s 67 2 5 4 2 2 0 2 4 25 4 2 2 0 2 4 2 5 4 22 0 2 4 2 5 4 2 2 0 24 25 4 22 0 2 4 25 4 2 2 0 24 2 5 4 22 0 24 25 4 22 0 24 Figure 5.6: Example segmentation ambiguity for the IA cipher. A short 8-digit segment produces 8 possible segmentations according to the key. The number of candidate segmentations increases exponentially with respect to cipher length. Figure 5.6 shows a short 8-digit part of the IA cipher. As shown in the Figure, this part can be segmented in 8 possible ways according to the key. The number of candidate segmentations increases exponentially with respect to cipher length. To solve this problem, we create a lattice to model all possible segmenations of the cipher using the existing key. Then we use a pretrained language model to choose the best possible segmentation (i.e. the segmentation that gives the most probable plaintext according to the language model). For the segmentation lattice, we create a Finite-State Transducer (FST) that models the possible merges of cipher symbols. Figure 5.7 shows part of the FST. The shown transitions model the ambiguity of segmenting the digits2 and4. According to the key, these two digits can be merged to become24 (plaintextr) or stay unmerged (plaintext letterso anda, respectively). We create another FST to model the key (shown in Figure 5.8). We train a 5-gram character Italian language model on the historical data described in Sec- tion 4.4. Composing the language model, key FST, and segmentation FST creates a lattice of all 68 Figure 5.7: Part of the segmentation FST for the IA cipher. This part models the possiblity of merging the digits 2 and 4 to become 24 (corresponds to letter r in the key) vs. keeping them unmerged (letterso anda in the key). *e* is used to indicate the empty string. Figure 5.8: Part of the key FST for the IA cipher. This part models the substitutions: 2→o, 4→a, and 24→r. 
possible decipherments of the text. We use the Carmel finite-state toolkit to find the most probable plaintext according to the language model (Graehl, 2010).

We use character-level Translation Edit Rate (TER) as our evaluation metric. TER is the character-level Levenshtein distance between the system output and the gold solution, divided by the number of characters in the gold solution. A native Italian speaker verified the resulting plaintext; our method achieves a TER of 1.12%, which means that our model's output is almost 99% correct.[5]

[5] Prior to this experiment, we do not believe this plaintext had been known since 1536.

5.6.2 The IA cipher

The resulting plaintext revealed a letter that the bishop of Senigallia sent to the Pope from Lisbon in 1536.

Cipher  Plain
0 or .  e
19      (nomenclature element)
26      x
∴       used after du to mean ducati

Table 5.6: Corrections/additions to the IA cipher key discovered by our approach.

The cipher is 11,026 characters long. The key from DECRYPT included 21 cipher elements. However, decoding the rest of the letter revealed 4 more cipher elements (shown in Table 5.6). We found that the letter e can be enciphered as 0 or . in this cipher. Cipher element 19 seems to encode a nomenclature code. Cipher element 26 encodes the letter x. We also found that the symbol ∴ is used after the letters du to mean ducati, the currency used at that time.

We found that there are human transcription errors in the transcription from DECRYPT. In total, we corrected 30 transcription errors in this cipher. There also seem to be some errors in the original manuscript. Such errors can result from spelling mistakes or substitution mistakes during encipherment, for example. For those errors, we do not change the original ciphertext and consider the text as is. The full solution of the IA cipher is shown in Appendix C.

5.7 Conclusion

In this chapter, we present automatic methods for segmenting numerical substitution ciphers.
We propose a method for solving non-deterministic substitution ciphers with existing keys using a lattice and a pretrained language model. Our method achieves a TER of 1.12% on the IA cipher, a real historical cipher that had not been fully solved until this work.

We also propose a novel approach to segmenting numerical ciphers with no existing keys using subword segmentation algorithms. We use BPE and unigram language models as unsupervised methods to learn substitution units. We add a 2-digit heuristic based on historical cipher analysis. Our best method is able to segment 100 randomly generated monoalphabetic ciphers with an average SegER of less than 3%, while still being robust to the removal of spaces. We test our methods on 3 real homophonic ciphers from the 16th-18th centuries. Our best method achieves an average SegER of 27%, with a SegER of 14% on the F283 cipher. To the best of our knowledge, this is the first work on automatically segmenting numerical substitution ciphers.

Chapter 6
Deciphering from Images

So far, we have been focusing on deciphering manually transcribed historical manuscripts. In this chapter, we discuss decipherment from images. We describe our models, experiments, and results on different ciphers.

6.1 OCR challenges

As we discussed in Chapter 3, the transcription process is very challenging for humans, and it is even more challenging for computers. One challenge we face with ciphers is that they are usually written in an unknown alphabet or in symbols, as opposed to a known alphabet like that of documents written in English or Arabic, for example. It is thus unclear what the end result of a transcription should be. Secondly, we are targeting handwritten documents, not typed historical documents. There are many more irregularities and much more variance in handwriting styles in handwritten documents than in typed ones. In addition, characters are usually not perfectly aligned as they are in typed text.
Characters also have variable sizes throughout handwritten documents, which poses a big challenge for automatic character segmentation. Figure 6.1 shows two pages from the Borg cipher. As the images show, many characters touch and have variable sizes and shapes. Also, there are signs of document degradation and ink blotches, which make image processing even harder.

Figure 6.1: Two pages from the Borg cipher. The images show many challenges for OCR, like different handwriting styles, scratches, variable character sizes, and background noise.

6.2 OCR model

We experiment with an unsupervised model for deciphering from images. To decipher a cipher image, we take the following steps:

1. Character segmentation: clip the original image into smaller images of single characters.

2. Character clustering: cluster the segmented character images based on shape similarity. Then, substitute each character image with its cluster ID.

3. Decipherment: decipher the sequence of cluster IDs using the decipherment methods described in Chapter 3.

The next two sections describe steps 1 and 2. We use the first pages of the Borg cipher to illustrate the results of each step.

6.3 Character segmentation

The goal here is to clip the original image into smaller images of single characters. To find individual characters, we create a generative story for how the image was generated. We represent the image by the number of black pixels in each row/column. We create two stories: one for generating rows and the other for generating characters in each row.

For row separation, our goal is to find separator rows that cut through the smallest number of black pixels. We represent an image by a sequence of integers, one for each row, representing how many black pixels are in that row. Now, we are trying to explain that sequence. So, we create this generative story:

Parameters: mean, stdev, stdev2, p.

1. Pick the number of rows n according to a normal distribution N(mean, stdev).

2.
for i = 1 to n:

(a) Pick a height h_i for row i according to another normal distribution H(mean2, stdev2). Note that mean2 = (total number of pixel rows) / (number of rows).

(b) for j = 1 to h_i: output an integer according to a uniform distribution.

(c) Output an integer according to a geometric distribution G(p).

The integer output in 2(c) is the separator row. No matter how p is set, step 2(c) will prefer to output a small number rather than a large one (i.e. it minimizes the number of black pixels in each separator row).

We create a similar story for segmenting characters in each row. For character separation, our goal is to find separator columns that cut through the smallest number of black pixels. We represent a row image by a sequence of integers, one for each column, representing how many black pixels are in that column. Now, we are trying to explain that sequence. So, we create this generative story:

Parameters: mean, stdev, stdev2, p.

1. Pick the number of characters n according to a normal distribution N(mean, stdev).

2. for i = 1 to n:

(a) Pick a width w_i for character i according to another normal distribution H(mean2, stdev2). Note that mean2 = (total number of columns) / (number of characters).

(b) for j = 1 to w_i: output an integer according to a uniform distribution.

(c) Output an integer according to a geometric distribution G(p).

The integer output in 2(c) is the separator column. No matter how p is set, step 2(c) will prefer to output a small number rather than a large one (i.e. it minimizes the number of black pixels in each separator column). We manually set the values of these parameters.

We implement our generative story as a composition of a finite-state acceptor (FSA) and a finite-state transducer (FST). Figures 6.2 and 6.3 show our FSA and FST for the character generation story. We use the finite-state toolkit Carmel to determine the Viterbi state sequence of maximum probability (Graehl, 2010). From that Viterbi state sequence, we can see which rows are separator rows.
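The row-separation story above can also be searched directly with dynamic programming. The sketch below is a simplified, self-contained stand-in for our FSA/FST composition: it drops the constant uniform term of step (b), and the parameter values and pixel counts are invented for the example, so it is an illustration rather than the Carmel pipeline we actually use:

```python
from math import log, pi, sqrt

def log_normal(x, mean, stdev):
    # Log-density of a normal distribution (height prior, step (a)).
    return -0.5 * ((x - mean) / stdev) ** 2 - log(stdev * sqrt(2 * pi))

def log_geom(k, p):
    # Geometric distribution on k = 0, 1, 2, ...: P(k) = (1-p)^k * p.
    # Penalizes separator rows that cross many black pixels (step (c)).
    return k * log(1 - p) + log(p)

def find_separators(counts, mean_h, stdev_h, p=0.5):
    """Viterbi over row positions: best[i] = best log-score of explaining
    rows[0:i] as text rows, each followed by one separator row."""
    n = len(counts)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            # Rows j..i-2 form a text row of height i-1-j; row i-1 is
            # its separator, scored by the black pixels it crosses.
            h = i - 1 - j
            score = best[j] + log_normal(h, mean_h, stdev_h) + log_geom(counts[i - 1], p)
            if score > best[i]:
                best[i], back[i] = score, j
    seps, i = [], n
    while i > 0:
        seps.append(i - 1)
        i = back[i]
    return sorted(seps)

# Two dense bands of black pixels separated by near-empty rows.
counts = [0, 9, 12, 11, 10, 1, 0, 8, 13, 12, 9, 0]
print(find_separators(counts, mean_h=5, stdev_h=1.5))  # → [6, 11]
```

The geometric term makes the search place separators on the emptiest rows, while the normal term keeps the resulting row heights close to the expected height, which is exactly the trade-off the generative story encodes.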
Figure 6.4 shows character segmentation results for the first page of Borg.

Figure 6.4: Character segmentation results for the first page of Borg.

Figure 6.2: An FSA for generating characters in a row. We generate n characters with probability P(n) (normal distribution).

Figure 6.3: An FST for generating the number of black pixels in each column. For each character, we generate character columns (normal distribution), followed by a separator column (geometric distribution). Edge weights use P(w) (number of columns for a character; normal distribution), P(b) (number of black pixels in a character column; uniform distribution), and P(s) (number of black pixels crossed by a separator; geometric distribution).

6.4 Character clustering

To cluster the segmented character images, we take the following steps:

1. Compute pairwise similarity among character images: we compute the pairwise similarity matrix using the signal tool correlate2d from Scipy (Jones et al., 2001).

2. Run K-means clustering on the similarity matrix: we use the KMeans implementation from Scikit (Pedregosa et al., 2011). The package provides the fit_predict() method, which computes cluster centers and predicts the closest cluster index for each sample.

Table 6.1 shows seven randomly selected clusters that we get from clustering the first three pages of Borg (with K=26). Note that we get some clean clusters (like clusters 14 and 19), but we also get noisy clusters (like clusters 2 and 7). Many of these clustering errors are due to clipping errors from the challenging character segmentation step. In addition, some characters are very rare and thus get assigned to one of the larger character clusters. Other factors like size and inking level also seem to affect cluster assignments for characters with the same shape.
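The two clustering steps above can be sketched end-to-end. This toy replaces Scipy's correlate2d with a simple pixel-agreement similarity and Scikit's KMeans with a small hand-rolled K-means, and the 3x3 "character images" are invented for the example, so it only illustrates the idea of clustering characters by shape similarity:

```python
import random

def similarity(a, b):
    """Fraction of pixels on which two same-sized binary images agree
    (a crude stand-in for 2-D cross-correlation)."""
    matches = sum(x == y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return matches / (len(a) * len(a[0]))

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k), key=lambda c: sum((p - q) ** 2
                      for p, q in zip(pt, centers[c]))) for pt in points]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy 3x3 images: two "vertical bar" shapes and two "full block" shapes.
bar = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
bar2 = [[0, 1, 0], [0, 1, 0], [1, 1, 0]]
block = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
block2 = [[1, 1, 1], [1, 1, 1], [1, 1, 0]]
imgs = [bar, block, bar2, block2]

# As in our pipeline, each image is represented by its row of the
# pairwise similarity matrix before clustering.
sim = [[similarity(a, b) for b in imgs] for a in imgs]
labels = kmeans(sim, k=2)
assert labels[0] == labels[2] and labels[1] == labels[3]
print(labels)
```

Clustering the rows of the similarity matrix (rather than raw pixels) groups the two bar shapes together and the two block shapes together, mirroring how similar character images end up sharing a cluster ID.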
Cluster IDs: 2, 7, 14, 17, 18, 19, 20 (character images omitted)

Table 6.1: Seven randomly selected clusters from clustering the first three pages of Borg (with K=26).

6.5 Decipherment Results

After getting the cluster ID sequence, we can proceed to decipherment. We experiment with two ciphers: cipher #3 from our synthetic cipher collection (a 300 dpi scanned image of the cipher printed in Courier font) and the first three pages of the Borg cipher. Table 6.2 shows decipherment results from automatic transcription, compared to decipherment results from manual transcription (as we have previously shown in chapter 3). In this table, decipherment error is the edit distance between the gold string and the system output.

To further investigate the effect of the character segmentation step, we also experiment with manually segmenting the characters of the first three pages of Borg. We run automatic clustering and decipherment on the manually segmented character images. The result is shown in the last row of Table 6.2.

The results indicate that automatic segmentation and clustering are very successful if the input is clean, but they are very challenging when the input is a noisy handwritten document. Character segmentation seems to be the major bottleneck of the system, since we were able to decipher Borg with 80% accuracy from perfectly segmented character images. Note that deciphering from images hides word divisions and might also turn 1:1 substitution ciphers into homophonic ones as a result of over-clustering. This leads us to the question of how we can evaluate OCR output, which we discuss in the next section.

                         Decipherment Error from   Decipherment Error from
                         Automatic Transcription   Manual Transcription
Cipher #3 (typed)        13 (1.99%)                10 (1.53%)
Borg (3 pages)           863 (72.83%)              36 (4.14%)
Borg (manual char seg.)  227 (20.43%)              36 (4.14%)

Table 6.2: Summary of decipherment results from automatic vs. manual transcription.
Decipherment error is the edit distance between the gold string and the system output. The last row shows the result of deciphering the first three pages of Borg from manually segmented character images.

6.6 Transcription error

So far, we have been using decipherment accuracy to evaluate our system. However, this measure only evaluates end-to-end decipherment results and does not give feedback on the performance of each part of the system (i.e. how much of the error comes from OCR and how much from decipherment). Thus, we need a method to evaluate automatic transcription accuracy.

Our OCR system reads an image and outputs a sequence of cluster IDs, each representing one character. We want to evaluate our system output (automatic transcription) against the gold transcription (manual transcription). For example:

Gold:          t i m i  (manual transcription)
System output: 4 2 2    (cluster ID sequence)

Since our system output does not use the same alphabet as the gold transcription, we cannot directly compute string edit distance. We first need to find a substitution scheme between our system output and the gold alphabet. We are looking for a special substitution scheme: the one that minimizes the edit distance between the two strings. In our previous example, we can substitute 4 with t and 2 with i. Making these substitutions gives an edit distance of 1. We call this substitution scheme an optimal assignment because it is the assignment that gives the minimum edit distance between the two strings (which is 1).

Inspired by the approach described by Spiegler and Monson (2010), we use ILP to compute transcription accuracy. Spiegler and Monson (2010) find the optimal assignment by global counting over <system output, gold> pairs (each representing the morphological analysis of a single word from a test set). In our case, we only have one pair of strings instead of a set of <system output, gold> pairs.
So, we formulate our integer program as follows.

Given two strings:

Gold string:         g_1 g_2 ... g_m  (manual transcription)
Cluster ID sequence: c_1 c_2 ... c_n  (system output)

find the edit distance between the two strings under the optimal assignment.[1]

Variables (binary):

ins_{i,j}    insert character g_j after character c_i
del_{i,j}    delete character c_i
match_{i,j}  match characters c_i and g_j
link_{c,g}   link characters c and g

maximize:  Σ_i Σ_j match_{i,j}

[1] We will refer to c_i and g_j as string characters for simplicity, with the understanding that they could be two-digit cluster IDs, for example (in fact, they could be drawn from any set of symbols that we come up with).
With this ILP formulation, we enforce the optimal assignment to be a 1:1 mapping between system output and gold characters. However, we can relax the constraints to allow for a M:1 84 mapping between system output and gold characters. This is equivalent to turning a simple 1:1 substitution cipher to a homophonic cipher. Since our decipherment method targets both kinds of ciphers, we decide to allow M:1 mappings by removing the constraint given by 6.2. We use the Gurobi Optimizer to solve this integer program (Gurobi Optimization, Inc., 2016). 2 c 1 c 2 c 3 g 1 g 2 g 3 g 4 del 1,0 ins 0,1 match 1,1 4 2 2 t i m i match 2,2 ins 2,3 match 3,4 (Example system output) (Example gold string) Optimal assignment: link 4,t = 1 link 2,i = 1 Figure 6.5: An example of how edit distance could be computed using our integer program. Dotted lines represent the search space for our integer program. The solid path shows the sequence of string edits that we need to perform under the optimal assignment. One challenge we face with ILP is that it is very slow. Comparing two 80-character strings takes more than 24 hours on a 2.2 GHz Intel Core i7 with 16 GB RAM. Computing longer strings might not be possible with this method. To handle this, we make a slight modification to our integer program. Instead of comparing the two strings all at once, we take the output of the OCR 2 Our code is available at: https://github.com/NadaAldarrab/EDist-ILP 85 system and compare it to the gold transcription line-by-line. This reduces the number of variables and the search space (Figure 6.6). We still need to have a large number of variables to ensure consistent matching throughout the whole cipher, but that is a much smaller number of variables than what we need to compare the whole strings all at once. 
[Figure 6.6 depicts line-by-line comparison: line #1 (system output "4 2 2", gold "t i m i") and line #2 (system output "3 2 9", gold "p u l"), scored under the single optimal assignment link_{4,t} = 1, link_{2,i} = 1, link_{3,p} = 1, link_{9,l} = 1.]

Figure 6.6: An example of computing edit distance line-by-line. Dotted lines represent the search space for our integer program. The solid path shows the sequence of string edits that we need to perform under the optimal assignment.

Table 6.3 shows decipherment results, with transcription error computation. Table 6.4 details the properties of the ciphers that we get from OCR, compared to the gold ciphers. Our integer program gives a quantitative measure of whether OCR has introduced some homophonicity to the cipher and/or failed to represent some cipher characters (according to the optimal M:1 mapping).

                     Transcription Error | Decipherment Error from   | Decipherment Error from
                                         | Automatic Transcription   | Manual Transcription
Cipher#3 (typed)     37 (5.67%)          | 13 (1.99%)                | 10 (1.53%)
Borg (3 pages)       57 (56.44%)         | 863 (72.83%)              | 36 (4.14%)
Borg (manual seg.)   347 (28.61%)        | 227 (20.43%)              | 36 (4.14%)

Table 6.3: Summary of decipherment results from automatic vs. manual transcription. Transcription error is the edit distance of the optimal assignment between system output and gold transcription. Decipherment error is the edit distance between the gold string and the system output. The last row shows the result of deciphering the first three pages of Borg from manually segmented character images.

                     Transcription Error | Introduced Homophonic   | Unrepresented Cipher
                                         | Characters              | Characters
Cipher#3 (typed)     37 (5.67%)          | 4                       | 1
Borg (3 pages)       57 (56.44%)         | 6                       | 6
Borg (manual seg.)   347 (28.61%)        | 4                       | 5

Table 6.4: Properties of the ciphers that we get from OCR, compared to the gold ciphers.
6.7 Conclusion

In this chapter, we present an unsupervised end-to-end decipherment pipeline aimed at deciphering from cipher images. We experiment with deciphering printed text images and handwritten historical text images. We present a generic tool for measuring the edit distance between strings of different vocabularies using integer linear programming, and we use this tool to evaluate transcription accuracy. Our experiments show that decipherment is possible for clearly segmented cipher characters, but is still challenging for ciphers with character overlaps. This opens up many directions for future work, which we discuss in the next chapter.

Chapter 7

Conclusions and Future Directions

Deciphering historical manuscripts is an intriguing challenge. Every cipher is a unique story, with a unique combination of language, system, and key. Building general-purpose automatic solvers remains a big goal we strive to achieve. It is our hope that this work has contributed toward this goal by addressing some decipherment problems and applying those methods to real historical ciphers.

7.1 Conclusions

This thesis has mainly focused on developing unsupervised methods that extract patterns from a small piece of data to perform a real-world task. Taking historical decipherment as an example, we have presented decipherment methods and experiments on monoalphabetic, homophonic, and numerical substitution ciphers. We have worked on a real historical cipher collection. Applying our decipherment methods resulted in automatically cracking two historical ciphers: the Borg cipher and the IA cipher. Despite human transcription errors and missing characters in the original ciphers, our automatic decipherment was still robust and could yield almost perfect plaintexts.
We have presented a multilingual sequence-to-sequence decipherment model for solving monoalphabetic ciphers. Our method is able to decipher 700 synthetic ciphers from 14 different languages with less than 1% character error, and achieves less than 6% character error on the Borg cipher. In addition, our experiments have shown that our models are robust to different types of noise, and can even recover from many of them.

We have proposed novel unsupervised methods for automatic ciphertext segmentation for ciphers with no existing keys. Our methods are able to segment 100 randomly generated monoalphabetic ciphers with an average segmentation error of less than 3%, while still being robust to the removal of spaces. Testing our methods on 3 real homophonic ciphers, we have achieved an average segmentation error of 27%, with a segmentation error of 14% on the F283 cipher.

We have presented an unsupervised end-to-end decipherment pipeline aimed at deciphering from cipher images. We have experimented with deciphering printed text images and handwritten historical text images. We have also presented an integer linear programming method for evaluating transcription accuracy. Our experiments have shown that decipherment is possible for clearly segmented cipher characters, but is still challenging for ciphers with character overlaps. This opens up many directions for future work, which we discuss in the next section.

7.2 Future Directions

A major part of the contribution of this work is the new directions it has opened up for future research. In this section, we discuss some of the next steps we think are worth investigating, including system improvements and applications in other domains.

7.2.1 System Improvements

In this work, we have seen that casting decipherment as a sequence-to-sequence translation task has resulted in highly accurate decipherments, overcoming the plaintext language identification problem and yielding robust models that can decipher noisy ciphers.
We hope that this work drives more research in the application of contextual neural models to other cipher types, e.g., homophonic and polyalphabetic ciphers.

Ciphertext segmentation is another promising direction. We have proposed novel unsupervised methods for segmenting ciphers using only a short piece of ciphertext (fewer than 2K characters). Future work could target developing specialized algorithms for homophonic and non-deterministic ciphers, which are very challenging, especially when the ciphertext is very short.

Deciphering from images is still an open problem. There are many ways to further improve our end-to-end system. One is building better models for character segmentation. Since our experiments have shown that character segmentation is a major bottleneck for the system, it should be a prime target for improvements. Another possible direction is trying different algorithms for unsupervised character clustering. Of course, our goal is to minimize automatic transcription error and, ultimately, get more accurate decipherments.

It would also be interesting to try lattice decipherment instead of string decipherment. The idea here is that the OCR system gives the decipherment system a lattice of possible cluster IDs and lets decipherment choose the best sequence, guided by a language model. This might help with edge cases where the OCR component does not have enough information to decide on cluster assignments.

In this work, we have seen that historical ciphers pose a great challenge for OCR. A next step could be a semi-supervised approach for historical OCR. Since we have released annotated data for historical ciphers, this could be helpful for few-shot OCR approaches.

7.2.2 Beyond Human Language

Another interesting direction for future research is expanding the scope of this work to include deciphering non-human languages. Birds, cats, whales, and many other living creatures around us have their own ways of communicating. Yet we know only very little about them.
A major challenge in decoding non-human languages is the difficulty of obtaining data at the scale available for human languages. In addition, it is hard to annotate the data, since we do not know what those communications mean to begin with. This work presents unsupervised methods for decoding enciphered communications, which can be adapted for similar tasks. For example, when dealing with non-human languages, we first need to identify what constitutes meaningful units in those languages, a problem similar to segmenting ciphertext into meaningful units. Extending the reach of scientific research to that end can have potential benefits for biologists, doctors, and humanity at large.

7.2.3 Adaptation to Other Domains

As we have discussed in Chapter 1, decipherment methods have been used to build more efficient machine translation systems, requiring less time and less training data. Applying decipherment methods to build more efficient systems for other tasks is an open direction for research.

The general theme that this work falls into is unsupervised learning of patterns from sequence data. We have taken historical decipherment as an example to build technology that can learn to perform useful tasks from sequences of symbols/digits. One intriguing direction to explore is decoding genomic sequences, which can open many doors for scientific discovery.

Humans will probably continue to discover more undeciphered communications. Or we might be visited by extraterrestrials and need to decipher their language. Who knows? In the end, we already have a lot of data and communications in hand that we cannot interpret. We need to understand existing patterns to make sense of those communications. It is our dream that one day, we will be able to provide humanity with advanced technology that allows them to automatically decode historical manuscripts, non-human languages, genomic sequences, and more.

Bibliography

Nada Aldarrab, Kevin Knight, and Beáta Megyesi. 2017.
The Borg.lat.898 cipher.

Nada Aldarrab and Jonathan May. 2021. Can sequence-to-sequence models crack substitution ciphers? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7226–7235, Online. Association for Computational Linguistics.

Nada Aldarrab and Jonathan May. 2022. Segmenting numerical substitution ciphers.

Arnau Baró, Jialuo Chen, Alicia Fornés, and Beáta Megyesi. 2019. Towards a generic unsupervised method for transcription of encoded manuscripts. In Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, pages 73–78.

Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. 2013. Unsupervised transcription of historical documents. In ACL (1), pages 207–217. The Association for Computer Linguistics.

Taylor Berg-Kirkpatrick and Dan Klein. 2013. Decipherment with a million random restarts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 874–878.

Taylor Berg-Kirkpatrick and Dan Klein. 2014. Improved typesetting models for historical OCR. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 118–123, Baltimore, Maryland. Association for Computational Linguistics.

Alan Clements. 2013. Computer Organization and Architecture: Themes and Variations. Cengage Learning.

Eric Corlett and Gerald Penn. 2010. An exact A* method for deciphering letter-substitution ciphers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1040–1047, Uppsala, Sweden. Association for Computational Linguistics.

Arthur Dempster, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society, Series B, 39(1):1–38.

John F. Dooley. 2013. A Brief History of Cryptology and Cryptographic Algorithms. Springer International Publishing.

Qing Dou and Kevin Knight. 2012. Large scale decipherment for out-of-domain machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 266–275, Stroudsburg, PA, USA. Association for Computational Linguistics.

Qing Dou and Kevin Knight. 2013. Dependency-based decipherment for resource-limited machine translation. In EMNLP, pages 1668–1676. ACL.

Qing Dou, Ashish Vaswani, and Kevin Knight. 2014. Beyond parallel data: Joint word alignment and decipherment improves machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 557–565, Doha, Qatar. Association for Computational Linguistics.

Qing Dou, Ashish Vaswani, Kevin Knight, and Chris Dyer. 2015. Unifying Bayesian inference and vector space models for improved decipherment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 836–845, Beijing, China. Association for Computational Linguistics.

Chi Fang and Jonathan J. Hull. 1995. Modified character-level deciphering algorithm for OCR in degraded documents. In Document Recognition II, San Jose, CA, USA, February 5, 1995, volume 2422 of SPIE Proceedings, pages 76–83. SPIE.

Philip Gage. 1994. A new algorithm for data compression. C Users J., 12(2):23–38.

Yingqiang Gao, Nikola I. Nikolov, Yuhuang Hu, and Richard H.R. Hahnloser. 2020. Character-level translation with self-attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1591–1604, Online. Association for Computational Linguistics.

Aidan N.
Gomez, Sicong Huang, Ivan Zhang, Bryan M. Li, Muhammad Osama, and Lukasz Kaiser. 2018. Unsupervised cipher cracking using discrete GANs. CoRR, abs/1801.04883.

Jonathan Graehl. 2010. Carmel finite-state toolkit.

Sam Greydanus. 2017. Learning the Enigma with recurrent neural networks. CoRR, abs/1708.07576.

Gurobi Optimization, Inc. 2016. Gurobi Optimizer Reference Manual.

George W. Hart. 1994. To decode short cryptograms. Commun. ACM, 37(9):102–108.

Bradley Hauer, Ryan Hayward, and Grzegorz Kondrak. 2014. Solving substitution ciphers with combined language models. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2314–2325, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Bradley Hauer and Grzegorz Kondrak. 2016. Decoding anagrammed texts written in an unknown language and script. TACL, 4:75–86.

Tin Kam Ho and George Nagy. 2000. OCR with no shape training. In Proceedings 15th International Conference on Pattern Recognition, ICPR-2000, volume 4, pages 27–30.

Gary B. Huang, Erik G. Learned-Miller, and Andrew McCallum. 2007. Cryptogram decoding for OCR using numerization strings. In 9th International Conference on Document Analysis and Recognition (ICDAR 2007), 23-26 September, Curitiba, Paraná, Brazil, pages 208–212.

Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001. SciPy: Open source scientific tools for Python.

Andrew Kae, Gary Huang, Carl Doersch, and Erik Learned-Miller. 2010. Improving state-of-the-art OCR through high-precision document-specific modeling. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1935–1942.

Nishant Kambhatla, Logan Born, and Anoop Sarkar. 2022. CipherDAug: Ciphertext based data augmentation for neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 201–218, Dublin, Ireland.
Association for Computational Linguistics.

Nishant Kambhatla, Anahita Mansouri Bigvand, and Anoop Sarkar. 2018. Decipherment of substitution ciphers with neural language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 869–874, Brussels, Belgium. Association for Computational Linguistics.

Vladimir Kluzner, Asaf Tzadok, Dan Chevion, and Eugene Walach. 2011. Hybrid approach to adaptive OCR for historical books. In 2011 International Conference on Document Analysis and Recognition, pages 900–904.

Kevin Knight, Beáta Megyesi, and Christiane Schaefer. 2011. The Copiale cipher. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 2–9, Portland, Oregon. Association for Computational Linguistics.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. 2006a. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, COLING-ACL '06, pages 499–506, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. 2006b. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 499–506, Sydney, Australia. Association for Computational Linguistics.

Kevin Knight and Kenji Yamada. 1999. A computational approach to deciphering unknown scripts. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia.
Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

George Lasry, Beáta Megyesi, and Nils Kopal. 2020. Deciphering papal ciphers from the 16th to the 18th century. Cryptologia, 0(0):1–62.

Marcus Liwicki, Alex Graves, and Horst Bunke. 2012. Neural networks for handwriting recognition. In Marek R. Ogiela and Lakhmi C Jain, editors, Computational Intelligence Paradigms in Advanced Pattern Classification, pages 5–24. Springer Berlin Heidelberg, Berlin, Heidelberg.

Giacomo Magnifico. 2021. Lost in transcription: Evaluating clustering and few-shot learning for transcription of historical ciphers.

Beáta Megyesi, Bernhard Esslinger, Alicia Fornés, Nils Kopal, Benedek Láng, George Lasry, Karl de Leeuw, Eva Pettersson, Arno Wacker, and Michelle Waldispühl. 2020. Decryption of historical manuscripts: the DECRYPT project. Cryptologia, 44(6):545–559.

Julien Meyer. 2015. Whistled Languages: A Worldwide Inquiry on Human Whistled Speech. Springer.

George Nagy, Sharad Seth, and Kent Einspahr. 1987. Decoding substitution ciphers by means of word matching with application to OCR. IEEE Trans. Pattern Anal. Mach. Intell., 9(5):710–715.

Malte Nuhn and Kevin Knight. 2014. Cipher type detection. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1769–1773, Doha, Qatar. Association for Computational Linguistics.

Malte Nuhn, Julian Schamper, and Hermann Ney. 2013. Beam search for solving substitution ciphers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1568–1576, Sofia, Bulgaria. Association for Computational Linguistics.

Malte Nuhn, Julian Schamper, and Hermann Ney.
2014. Improved decipherment of homophonic ciphers. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1764–1768, Doha, Qatar. Association for Computational Linguistics.

Edwin Olson. 2007. Robust dictionary attack of short simple substitution ciphers. Cryptologia, 31(4):332–342.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. FAIRSEQ: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. Gigaword fifth edition LDC2011T07.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Andreas Müller, Joel Nothman, Gilles Louppe, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Eva Pettersson and Beata Megyesi. 2019. Matching keys and encrypted manuscripts. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 253–261, Turku, Finland. Linköping University Electronic Press.

Nima Pourdamghani, Nada Aldarrab, Marjan Ghazvininejad, Kevin Knight, and Jonathan May. 2019. Translating translationese: A two-step approach to unsupervised machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3057–3062, Florence, Italy. Association for Computational Linguistics.

Sujith Ravi and Kevin Knight. 2008. Attacking decipherment problems optimally with low-order n-gram models.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 812–819, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sujith Ravi and Kevin Knight. 2009. Probabilistic methods for a Japanese syllable cipher. In Computer Processing of Oriental Languages: Language Technology for the Knowledge-based Economy, 22nd International Conference, ICCPOL 2009, Hong Kong, March 26-27, 2009, Proceedings, pages 270–281.

Sujith Ravi and Kevin Knight. 2011. Bayesian inference for Zodiac and other homophonic ciphers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 239–247, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sravana Reddy and Kevin Knight. 2011. What we know about the Voynich manuscript. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 78–86, Portland, OR, USA. Association for Computational Linguistics.

Han Renfei. 2020. Using attention-based sequence-to-sequence neural networks for transcription of historical cipher documents.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Ray Smith. 2007. An overview of the Tesseract OCR engine. In ICDAR '07: Proceedings of the Ninth International Conference on Document Analysis and Recognition, pages 629–633, Washington, DC, USA. IEEE Computer Society.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.

Benjamin Snyder, Regina Barzilay, and Kevin Knight. 2010.
A statistical model for lost language decipherment. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 1048–1057, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mohamed Ali Souibgui, Ali Furkan Biten, Sounak Dey, Alicia Fornés, Yousri Kessentini, Lluis Gomez, Dimosthenis Karatzas, and Josep Lladós. 2022. One-shot compositional data generation for low resource handwritten text recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 935–943.

Mohamed Ali Souibgui, Alicia Fornés, Yousri Kessentini, and Beáta Megyesi. 2021a. Few shots is all you need: A progressive few shot learning approach for low resource handwriting recognition. arXiv preprint arXiv:2107.10064.

Mohamed Ali Souibgui, Alicia Fornés, Yousri Kessentini, and Crina Tudor. 2021b. A few-shot learning approach for historical ciphered manuscript recognition. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 5413–5420. IEEE.

Sebastian Spiegler and Christian Monson. 2010. EMMA: A novel evaluation metric for morphological analysis. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1029–1037. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Warren Weaver. 1947. Warren Weaver and Norbert Wiener correspondence 1947.

Xusen Yin, Nada Aldarrab, Beata Megyesi, and Kevin Knight. 2019. Decipherment of historical manuscript images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 78–85.

Appendix A

Synthetic Ciphers

In this appendix, we present the full set of synthetic ciphers that we used for the decipherment experiments presented in Chapter 3.
Cipher #1 (353 characters, from English)

kac butnqymkupqmr tckauv ql m tckauv ux rqyzjqlkqb mymrdlql kamk ql jlcv ku lkjvd kcfkl gaqba mpc gpqkkcy qy my jyeyugy rmyzjmzc myv ku lkjvd kac rmyzjmzc qklcrx gacpc kac jyeyugy rmyzjmzc aml yu unhqujl up ipuhcy gcrrjyvcplkuuv brulc pcrmkqhcl myv gacpc kacpc mpc xcg nqrqyzjmr kcfkl gaqba tqzak ukacpgqlc amhc nccy jlcv ku acri jyvcplkmyv kac rmyzjmzc

Cipher #2 (150 characters, from English)

ldsg obmy ybbujsy zyblu qj jds aysqf bw xqlg rbbf bmj obmy lcgxbl qgx crr as ebgs obmys jds ysqubg ct qjyqksrcge bg amj xbgj jdcgf jlczs cju qrr ycedj

Cipher #3 (653 characters, from English, spaces removed)

egjtvsoztsaqhresozoeavigfsgfitaomtvozingxsfthqhrlttfngxstntavortziteiqhetvghzegjtquqohqhrrg hzaftqlzggagghygszitvittkaazokkohafohqhrzitstahgztkkohuvigziqzozahqjohuygszitkgatshgvvok kwtkqztszgvohygszitzojtazitnqstqeiqhuohuegjtathqzgsaeghustaajthfktqatittrziteqkkrghzazqhro hzitrggsvqnrghzwkgelxfzitiqkkygsitziqzutzaixszvokkwtitvigiqaazqkktrzitstaqwqzzktgxzaortqhr ozoasquohuozkkagghaiqltngxsvohrgvaqhrsqzzktngxsvqkkaygszitzojtazitnqstqeiqhuohuegjtjgzi tsaqhryqzitsazisgxuigxzzitkqhrqhrrghzesozoeomtviqzngxeqhzxhrtsazqhrngxsaghaqhrngxsrqxu iztsaqstwtnghrngxsegjjqhrngxsgkrsgqroasqforknquohufktqatutzgxzgyzithtvghtoyngxeqhzkthr ngxsiqhrygszitzojtazitnqstqeiqhuohu

Cipher #4 (128 characters, unspecified European language)

tjacxlsi tklsjp filip di sitrp ixfcjgr j yrzjy wki xrp eiygrxip erywki pjactrp gi zkiyyj o oj di ixfcjtrp j gjy xkipdyr gipljyzr

Cipher #5 (107 characters, unspecified European language)

xv krs xzc rpuxs deczb vxzc txsy krs vgtkxs vxzc trnmu krs bsrn hxs rslx rpux deczb xs crtl xzcx ancbx isrn

Cipher #6 (331 characters, no spaces, unspecified European language)

lrlpjnvoptucivnsxuhhpjnnvgsvronstcvuxgsnxegstxgstcgirvinhhnvgsoucngjsugourmftroknxtph nvbrcgsfnvgskvpseuoucngjsugoutgkgspllxndhulnanjnusrvvbrhhgspsegvhnjghrmfbrvansnxhnjg
rmffnstthrvnkgsrobvrhhbrcgstnjegtuthrvnplljnkrvrmfbgjrsnegtogenpkpthlvutghnv Cipher#7(168characters,nospaces,unspecifiedEuropeanlanguage) epdrigxhdpxlxrkdrgxrhxtrdwvvxkdrgxhrcxhpiuxpfivvltrdkxvcrpvtxercgvgxvcprwvphxrbxcxdlk vthvpxlmvgxedhxhvgxhvmrpwxvcmrpwxxlvhvhvxuixhrevfivaxgriexvcgxwxlhvjrgnxhvjgxthpv l 103 AppendixB TheBorgCipher In this appendix, we present more pages of the Borg cipher, with decoded Latin and English trans- lation (translation byUrbanÖrneholm). We release the full transcription of the book, deciphered Latin text, and English translation on this website: http://stp.lingfil.uu.se/~bea/borg/ 104 Page0024v ad contractos arteticos sol- uendum R. castorei (uncia-drachmam) [iii] succi saluie (uncia-drachmam) [iii] succi rute (uncia-drachmam) [ii] piperis longi (uncia-drachmam) [ii] olei oliuarum [libram semis] puluerisa castoreum et piper per se coer- ce simul et pone in amp(q-h)oram lapideam bene coopertam, ne fumus ab- eat aut a(q-h)ua, et pone in catcabo, uel olla plena a(q-h)- ua ferfenti, et fac bul- lire per duas (q-h)oras e utere ungento patien- ti ad, ignem approximato et liberabitur in pauc For loosening contracted joints Take three ounces of castoreum, three drachms of juice of sage, two ounces of juice of rue, two drachms of long pepper, and half a pound of olive oil. Pulverize the castoreum and the pepper separately, mix them together and put them in a carefully closed stone jar, so neither fumes might escape nor water, and place it in a clay pot, or jar, full of simmering water, and make it boil for two hours, and use as an ointment with the patient close to the fire, and he will be liberated in a few [..] 105 Page0025r alc(q-h)imicus ad extra(q-h)endum sangui- neam coagulatum, et alios (q-h)umores intra iunctu- ras [R.] clr?ete, saponis [A]lbi greci de cane lu- ?nene. [L]i(q-h)ueritie sem lini, gallitlici omnium [aú]. uini distillati modi- cum, et fiat mixtura- et calide suppontur per [3] dies. (q-h)ec faber in prussia probauit. 
de dolore spi- ne spatularuam (q-h)umerorum

an alchemist on extracting coagulated blood and other fluids inside the joints

Take chalk(?), soap, album graecum from a [..] dog, licorice, linseed, and seed of vervain, equal parts of all; a modest amount of brandy, and it should be a mixture, and it is applied hot for three days. This, a craftsman in Prussia has proved.

On back pain of the shoulder blades

Page 0025v

nos sumus (q-h)uod ad dolo- rem (q-h)ui fit in (q-h)umeri causatum a frigidida aut etiam ab ali(q-h)ua sub- tili materia tollit lon- ga fricatio facta cum oleo et uino, sit autem oleum subtilis substan- tie, et non stipitis ut optime ualet, oleum niperi alia et optima medicina (q-h)ue ad sciaticam mul- tum ualet, et est mira- bilis dare decoctionem centauree minorum, aut eius puluerem dare (uncia- drachmam) et multi certe curantur

We know that, for pain in the shoulders from cold reasons, or even from some subtle matter, a long rub with oil or wine cures this; the oil should, however, be of fine quality, and not made from twigs to be most efficient, juniper oil. Another, and very good, remedy which is very efficient and wonderful in sciatica, is to give a decoction of common centaury, or [..] ounces of a powder thereof, and many are certainly cured.

Page 0031v

septima species ig R.? balsami [E.I.] allitran 1. li(q-h)uida [E.S.] olei ouorumc cis uiue [an] [f. 10.] calc teres, cum oleoillo di temperes et allitrane balsamlm appones. N?ein (q-h)erbas et lapides et n centia regionis pel ges et fimo legionX repones combulemdo mo A anturalis plume lapsu terre succe detur, et totum combur durabit, altem illo igne? [20] annos, nec a(q-h)uau pbit extingui.. octaua species ige? R. calcis uiue [E.I.] gal ni p? ?. felis tortugn [E.I.] omnia confice tep?

The seventh kind of fire

Take one pound of balm, half a pound of alkitran or liquid pitch, ten pounds each of oil of eggs, and quicklime. Grind the lime, then you dilute it with the oil, and add alkitran and balm.
Then you pour it on herbs and stones, and anything that grows in the region, to burn it. And from the first natural rain the earth will be set on fire, and it will all burn. This fire will however last for twenty years, and will not be extinguished by water.

The eighth kind of fire

Take one pound of quicklime, six ounces of galbanum, one pound of tortoise bile. Mix everything, grinding it

Appendix C

The IA Cipher

In this appendix, we present our preliminary solution of the IA cipher in Italian plaintext. A sample of the cipher is shown in Figure C.1.

Figure C.1: A Sample of the IA Cipher.

Our Preliminary Solution:

pratica con ques ta gente circa il uoler atendere le promese del comendatore partisse de lisbona ne hebi molte altre ne fu remedio ne con persuasioni ne con meterli timore seruato il decoro poterne cauar altro se non che fariano quanto si erano per scritto meco obligati non potendo piu eche se l comendato re hauia promeso non era stato de lor commissi one anzi hauia cio fatto per ruinarli con prometendo quello ch era certo non potiano atendere lamentandossi di lui che li ha uia robati e che sapiano hauia quatro milia ducati in banco in roma li quali erano loro ne fa ceuano seruitio a X che se li pigliassi uista tal resolutione li dissi in fauor del comendator quel che mi parse sugiongendoli poiche credi ano hauessi cio fatto per ruinarli non douia no loro adimpir la sua mala uolonta come face uano non atendendo impero che li sui ministri si reputariano inganati uedendo essi restar contenti de l expedition e mancar de li mezi l auiano causata procedendossi di questo modo dubitauo nel fu turo retrouassero tutti l altri fredi quant nque fussi certo non si mancheria de farseli iustitia pero che da uno a un altro modo importaua molto come hauiano prouato nel perdono nel quale non era diferentia nel fine ma nel modo pero per queste e mo
representassero a X le lor necessita che allega uano poi si remetissero a la sua uolonta che crede uo certo X acetarebe la lor scusa sapendo non si moue a pero ore ho da lor conseguir responden domi quando hauessero data la parola acetassi er no constretti di satisfare al che non hauiano modo persuaderli mi redussi pagassero li cin quemilia erano obligati resposero l obligo era da poi de liberati li presi che cio fatto non mancher iano al fatto diou del signor pierluigi altri signori a chi el commendatore hauia dato inten tione de seruitii certificandoli tutto esserssi ot nuto per fauor del signor pierluigi operado al quale non poteuano mancare del promeso molto piu ne li constrinsi meno in questo che nel caso di resposero la promesa aouera fatta per il negotio de fuor o nondimeno che fariano il debito prima de la mia par tita dariano resolutione di quello uoliano fare con ouil signor pierluigi tutti l alt ri in lisbona me ne tornai in euora doue fecci le medeseme instantie con quelli non si trouorno a lisbona ueduto che da li mercanti non poteuo cauar resolutione mandai per tre doctori de li loro con li quali si consegliano nel negotio refertoli quanto hauia passato mi dolsi molto di tal procedere de tanta irre solutione in quel tempo douiano esser piu resol uti maxime se temeuano l andata di cesare a roma non fussi per nuocerli tanto nel presente quan to nel futuro come mi diceuano per il che mi pareua il douer del gioco che subito si spacias si una diligentia che portassi buona resolutione del passato cosI per l altri come per X a cio tutti fussero constreti hauer le lor cose a cor maxime ou che sempre si troua a le ore che di X respose ro parerli molto bene quanto diceua che ne sarebero insieme con l altri per ueder de disponerli furno insie me la resolution fu che partito da la corte doue non si poteua negotiare in santeren uerriano a trouar me dui o tre di loro con la resolution del tutto mandai un mio a lisbona doue erano tornati per la magior parte per
solicitar la resolution e le lettere de li cinque milia poiche li presi erano liberti e mi redussi nel predetto loco doue quando aspecta uo che comparissero mi scrissero un maest ro georgio di euorao e capo principale piu misero che la miseria li hauia detto che per niente non uenissero a trouarmi che sarebe la total ruina del tutto per che molto si era doluta de la pratica teniano meco per questo quelli ch erano in procinto di uenire non ueniano unde col mio resolueriano il tutto mande rian lo ben resoluto la resolutione li detero fu le lettere de li cinqu che con molto trauaglio se li cauo rno di mano de le quale mando la seconda inclusa un alt ra se n e mandata in fiandra con mio ordine che subito si fac cino pagare in roma secondo l ordine d ouil che se non trouaro exequito mi fecero dire che non saliria del regno che mi mande riano resolutione expectando detta resolutione mi han fatto dire che in fiandra trouero la resolution del tutto ch uolessi esser contento de darli le lor scritu re originali ho resposto poiche in fiandra trouero la resolution del tutto li anche da ro le scriture come in effetto faro trouando la dubitando non mi diano parole hauendomi tante uolte mancato benche questo possi nascere dal non poter cosi presto cauar denari dal populo hauendomi detto che per questo uoliano mandar pe r tutto il regno gente con libri ne quali parti cularmente ogniuno scriua quanto uuol dare per non star piu nel incertte cosi e puo essere alloghino il tempo per poterssi meglio resol uere quando cio non sia di tal proce dere se gia non fussi uero l acordo dice il capitan leiosan anzi a quelli doctori a chi X comando andasse ro ad reuocare quanto hauiano recercato al cardinale che disse per quanto mi dicano li dis se quando si fara un altra unione contra di uoi anderite al papa che ui proueda che l timore poco modo del denaro li habbi se in fiandra non mi mandano resolutione di quello uorrebeno nel negotio di futuro come di satis fare in tutto o in parte l
intentioni date per il co mmendatore o siano li magior asini del mondo mi abino qualche sicureza de le cose loro quella non stabilisc acta lcuna prima de la mia arriuata si potra pensar qualche modo se sono asini de farglelo cognoscere se per denari si sono uoluti assi curare da chi non puo il medemo faccino con chi puo cauar la maschara lor procedere naschi d asinaria mostrera di uoler concedere l inquisitione rigorosa non solo farano quanto in lor nome si e pr omeso ma quanto si uorra habino tardata la resolutione per star a uedere se a requisitione di ce sare muta cosa alcuna del perdon e consente l inquisitione cme in castiglia sapendo il re sopra cio hauerli scri tto a cio possino meglio sapere come spendano il de naro non s inouando altro credo far ano il debito sino in cima l andata mia di fia ndra essendo li diego mendez fratello di francesco che mori e l piu rico di loro di magi or auctorita dopo lui la moglie del detto fr ancesco la quale se non fussi stata proncta ad pi gliar una gran parte del peso alle spale del mio bartholomeo castodengo che fu quel mandai a lisbona non era molto non cauar qu esti cinque milia nz uedendo il signor pierluigi eou ben satisfatti che pe r non scriuer questa irresolutione ho tardato tanto credendomi pur di poterli scriuere co sa di magior satisfactione circa il comendatore non penso siano per darli denari per li officii se fa rano alcuna cosa credo sara constituirli alcuna certa prouisione procedera da l opra mia la qual se non l era mi pariano resoluti non negociassi piu co se loro restando di lui molto scandalizati pur mi son sforzato l intertenghino si portino ben di esso non so che farano che lui tamben si compor ti con esso loro sia piu parco nel spendere che si do glian tino al cielo ch abbi gia speso diece milia ducati ne possan pridire quella lettera scrisse a X se lui si aslarga nel prometere creda ou che costor ti st ringono nel atendere le pr omese siano in scritto non in uoce poiche non uogli ano stare
se non alla scritura sopragionsero littere di questa gente per le quali si auisaua al cardinale di notificar il breue processato non se ne contentando per quanto si cre de s s reuerendiss mando per quelli l hauiano presentato lo restitui mi pregauano uolessi far detta notifica tione questa straniessa usata li douessi esser atta per farli resoluere in quel che doueriano mi facessero far tal effetto imperoche se questo procedeua da tanto piu l inciteriano contra di essi pero il douer del gioco mi pareua mi dessero buo na resolutione circa le promissioni e intentioni date dal commendatore cio fatto si mandassi una diligentia e si suplicassi che per un suo corrieri a posta manda si al cardinale che subito facessi tal notificatione o uer si mandassi a ciascheduno ordinario un proces so che io scriuerei d hauer saputa prohibitione la non uoleua si facessi questo ne per me ne per altri non poteuo mancar darne subito aui so a X lor restaua no discolpati se pur uoleuano non mancherei di far l intimatio ne se l mandar a roma pareua buono non si fcessi senza detta resolutione a uessi uoluto mostrar buona uolonta li ministri si tr ueriano tanto refredati ou che si poteua reputar gabata che non oterriano il bisogno pia que a quelli che mi parlauano il conseglio mi pre gorno uolessi soprasedere il despacio per tutto febraro che subito scriueriano e sperauano buona resolutione ho expetato e sl hogi e non ho cosa lcuna se uerra per tutto per mandarlo buono un uescouo di saphe nominato gioan sutil che tiene un monastiero ouer priora to solito gouernarssi per priore sancti salua toris de igreio o de ecclesiola portugalenssis diocesis che ualera presso de duamilia ducati il quale uechio infermo di maniera che non puo durar molto se ben intendo X pensa domandarlo per unirlo con sancta croce di colim bria che uale preso de uintimilia duccati de una uniuersita che li s e cominciata a la quale puo suplir ditta sancta croce puo X darli del suo se non serra la porta a simil unione
altre che si fano ad ca pelle administrate e mangiate da laici con la quantita de le commende diro anche dirssi che procura di ottener in commenda per un suo fratello che si casa con la sorela del duca di braganza il prefa to priorato di sancta croce che hor tiene l infante arciuescouo di braca quantonque X e suoi fratelli siano degni di ogni gratia non tacero che mai piu salira de lai ci frtelli o figli di re
Abstract
Libraries and archives are filled with enciphered documents from the early modern period, including encrypted letters, diplomatic correspondence, medical books, and books from secret societies. A collective effort has been put into finding, collecting, and archiving those documents. However, the information hidden in those documents is still unknown to the contemporary age. Decipherment of classical ciphers is an essential step toward revealing the contents of those historical documents.
Decipherment is a challenging problem. Given some encrypted text (and nothing else), the task is to find the original text before encryption. This requires recognizing patterns in a usually small piece of data (in text or image format). Learning patterns from data is a major goal of artificial intelligence across many different applications. In this thesis, we take the task of decipherment as an example to improve the ability of computers to recognize and interpret concealed patterns in a small piece of data using unsupervised methods. Decipherment methods have been used in computer science and other fields not only to solve historical ciphers, but also to build more effective systems in other domains, such as machine translation.
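To make this search framing concrete, here is a minimal, self-contained sketch of decipherment as unsupervised search: every candidate key is tried, and a crude English letter-frequency model picks the most plausible plaintext. The Caesar cipher and the tiny frequency table are illustrative stand-ins, far simpler than the ciphers and models studied in this thesis.

```python
import string

# Rough relative frequencies of common English letters, used as a
# crude plaintext language model (letters not listed score zero).
ENGLISH_FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
    's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'u': 2.8,
}

def shift(text, k):
    """Shift each letter k places back in the alphabet; keep other chars."""
    out = []
    for c in text.lower():
        if c in string.ascii_lowercase:
            out.append(chr((ord(c) - ord('a') - k) % 26 + ord('a')))
        else:
            out.append(c)
    return ''.join(out)

def score(text):
    """Higher means more English-like under the frequency model."""
    return sum(ENGLISH_FREQ.get(c, 0.0) for c in text)

def decipher(ciphertext):
    """Try every key and return the (key, plaintext) the model prefers."""
    best_k = max(range(26), key=lambda k: score(shift(ciphertext, k)))
    return best_k, shift(ciphertext, best_k)

k, plain = decipher("wkh vhfuhw phhwlqj lv dw gdzq")
print(k, plain)  # prints: 3 the secret meeting is at dawn
```

The same decompose-into-key-search-plus-language-model view carries over to the harder ciphers in this thesis, where the key space is too large to enumerate and stronger language models and search procedures are required.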
This thesis aims to make contributions to three problems:
1. Automatically transcribing degraded handwritten documents that use an arbitrary symbol set (not necessarily alphabetical), without using any labeled data.
2. Deciphering noisy ciphers and ciphers with an unknown plaintext language.
3. Segmenting ciphertext that consists of a sequence of digits into meaningful segments that represent language units.
We describe our models and algorithms for attacking each of these problems and test our methods by deciphering synthetic and real historical ciphers. Among the results of this work, we automatically crack two real historical ciphers from the 17th century: the Borg cipher and the IA cipher. The contents of these ciphers had not been known until this work. We release new datasets, tools, and models to the research community.
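The digit-segmentation problem can be illustrated with a toy dynamic program: given an unsegmented digit stream and a codebook of known units with relative frequencies, recover the most probable split into units. The codebook and probabilities below are invented for illustration and are not taken from any cipher in this work.

```python
import math

# Hypothetical codebook: relative frequency of each known code unit.
CODEBOOK = {"1": 0.05, "7": 0.05, "12": 0.30, "34": 0.25, "5": 0.10, "56": 0.25}

def segment(digits):
    """Return the most probable segmentation of `digits` into codebook
    units (1- or 2-digit tokens here), or None if no segmentation exists."""
    n = len(digits)
    best = [(-math.inf, None)] * (n + 1)  # (log-prob, backpointer)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for size in (1, 2):  # candidate token lengths
            j = i - size
            if j < 0:
                continue
            tok = digits[j:i]
            if tok in CODEBOOK and best[j][0] > -math.inf:
                cand = best[j][0] + math.log(CODEBOOK[tok])
                if cand > best[i][0]:
                    best[i] = (cand, j)
    if best[n][0] == -math.inf:
        return None
    # Walk backpointers to recover the token sequence.
    toks, i = [], n
    while i > 0:
        j = best[i][1]
        toks.append(digits[j:i])
        i = j
    return list(reversed(toks))

print(segment("123456"))  # prints: ['12', '34', '56']
```

The dynamic program runs in time linear in the length of the digit stream; the real segmentation task is harder because the unit inventory and its statistics must themselves be learned without supervision.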
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Decipherment of historical manuscripts
Beyond parallel data: decipherment for better quality machine translation
The inevitable problem of rare phenomena learning in machine translation
Deciphering natural language
Explainable AI architecture for automatic diagnosis of melanoma using skin lesion photographs
Towards social virtual listeners: computational models of human nonverbal behaviors
Improving language understanding and summarization by leveraging auxiliary information through self-supervised or unsupervised learning
Deciphering protein-nucleic acid interactions with artificial intelligence
Generative foundation model assisted privacy-enhancing computing in human-centered machine intelligence
Syntactic alignment models for large-scale statistical machine translation
Responsible artificial intelligence for a complex world
Non-traditional resources and improved tools for low-resource machine translation
Acceleration of deep reinforcement learning: efficient algorithms and hardware mapping
A green learning approach to deepfake detection and camouflage and splicing object localization
Computational models for multidimensional annotations of affect
Learning shared subspaces across multiple views and modalities
Artificial Decision Intelligence: integrating deep learning and combinatorial optimization
Schema evolution for scientific asset management
Creating cross-modal, context-aware representations of music for downstream tasks
Visual representation learning with structural prior
Asset Metadata
Creator
Aldarrab, Nada
(author)
Core Title
Automatic decipherment of historical manuscripts
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2022-12
Publication Date
10/26/2022
Defense Date
06/13/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
artificial intelligence,code breaking,decipherment,machine learning,OAI-PMH Harvest
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
May, Jonathan (committee chair), Nakano, Aiichiro (committee member), Narayanan, Shrikanth (committee member)
Creator Email
nada1324@gmail.com,naldarra@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC112195841
Unique identifier
UC112195841
Identifier
etd-AldarrabNa-11287.pdf (filename)
Legacy Identifier
etd-AldarrabNa-11287
Document Type
Dissertation
Format
theses (aat)
Rights
Aldarrab, Nada
Internet Media Type
application/pdf
Type
texts
Source
20221026-usctheses-batch-988 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu