Constructing an unambiguous user-and-machine-friendly, natural-language protocol
specification system
by
Yu-Chuan Yen
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2023
Copyright 2023 Yu-Chuan Yen
Acknowledgements
As the old saying goes, "It takes a village to raise a child." This dissertation is the child of my research life, and I have been blessed with so much help from others in finishing it. First, I would like to thank my beloved advisors, Prof. Ramesh Govindan and Prof. Barath Raghavan, for guiding me all the way through such a novel and challenging problem. The process was not easy, and we struggled at times to make progress. However, Ramesh and Barath have always been positive and encouraging, providing the most helpful feedback. They taught me the right mindset and thought process for tackling research difficulties. I would not have been able to finish this thesis without their support and guidance.
I would like to thank my committee members, Prof. Murali Annavaram, Prof. Xiang Ren, and Prof. Chao Wang, for their valuable feedback, which shaped this thesis and further expanded the impact of the work.
I would like to thank my collaborators, Tamás Lévai and Qinyuan Ye, who brainstormed with me on each challenge while contributing their expertise in their respective areas to efficiently rule out less challenging items, enabling us to focus on the real difficulties.
I am also grateful for having intelligent, talented, and supportive friends around me who gave me their sincere feedback on my thesis work. Among all my friends, I particularly want to thank Sulagna Mukherjee, who has been the most supportive friend throughout my PhD life. She not only provided her professional advice but also supported me mentally when I was in the PhD doldrums.
Last but not least, I want to give my biggest thanks to my family, including Mom, Elsie, Sung-Han, and my daughters (Ellie and Emeri). They are the ones who sincerely share my happiness in finishing this thesis work, and the ones I love the most.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Explorations from Existing Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Revisions on Specifications and Beyond . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2: Semi-Automated Protocol Disambiguation and Code Generation . . . . . . . . . . . . . 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Background and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Discussion of ICMP Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3 sage Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Semantic Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Disambiguation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.1 Why Ambiguities Arise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.2 Winnowing Ambiguous Logical Forms . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.2 Logical Forms to Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.2 End-to-end Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6.3 Exploring Generality: IGMP and NTP . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6.4 Exploring Generality: BFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6.5 Disambiguation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7 sage Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Chapter 3: A Compact Protocol Specification Configuration: Unambiguous English Specification
Text Generation and Executable Code Generation . . . . . . . . . . . . . . . . . . . . . 43
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Specification Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.2 Presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.3 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4.1 Failed Configuration Could Exist . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4.2 Filtering checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Text Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5.1 Readability guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5.2 English sentence generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.6.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.6.2 Graph to Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.7.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.7.2 Discovery of a failed configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.7.3 Readability discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.7.4 Interoperability test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.8 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Chapter 4: Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.1 Networked systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.1.1 Protocol Languages / Formal Specification Techniques . . . . . . . . . . . . . . . . 74
4.1.2 Protocol Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.1.3 Protocol Model Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2 Natural language processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.1 Natural Language Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.2 Semantic Parsing and Code Generation. . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.3 Pre-trained Language Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2.4 NLP for Log Mining and Parsing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3 Program generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.1 Automatic programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.2 Program Synthesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 5: Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
List of Tables
2.1 Protocol specification components. sage supports those marked with ♦ (fully) and + (partially). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Error types of failed cases and their frequency in 14 faulty student ICMP implementations. 12
2.3 Students’ ICMP checksum range interpretations. . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Logical form with context and resulting code. . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Examples of categorized rewritten text. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Comparison of the number of logical forms (LFs) between good and poor noun phrase labels. 36
2.7 Effect of disabling domain-specific dictionary and noun-phrase labeling on number of
logical forms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.8 Conceptual components in RFCs. sage supports components marked with ♦ (fully) and + (partially). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.9 Syntactic components in RFCs. sage supports parsing the syntax of those marked with ♦ (fully). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.10 NTP peer variable sentence and resulting code. . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.11 Challenging BFD state management sentences. . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Mathematical definition coverage related to Estelle . . . . . . . . . . . . . . . . . . . . . . 53
3.2 Readability goals for English specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Specified nodes and the number of automatically generated English words and lines of code. (* mimics features of TCP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.4 Flesch readability scores and their meaning . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.5 Flesch readability scores of selected protocols and features . . . . . . . . . . . . . . . . . . 73
List of Figures
1.1 Approaches to specifying and implementing network protocols. This thesis aims to achieve
the best of both worlds, with English specification and semi-automated implementation. . 4
2.1 SAGE components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 sage workflow in processing RFC 792. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Example of multiple LFs from CCG parsing of “For computing the checksum, the checksum
should be zero”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 LF Graphs of sentence H. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Number of LFs after Inconsistency Checks on ICMP/IGMP/BFD text: for each ambiguous
sentence, sequentially executing checks on LFs (Base) reduces inconsistencies; after the
last Associativity check, the final output is a single LF. . . . . . . . . . . . . . . . . . . . . 34
2.6 Effect of individual disambiguation checks on RFC 792: Left: average number of LFs
filtered by the check per ambiguous sentence with standard error Right: number of
ambiguous sentences affected out of 42 total. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1 Change of Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 System Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Example of good and poor quality text. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 Example of text generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5 coalescence example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.6 Code Snippet Sort Order. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.7 Code stitching of a line graph example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.8 Cyclic graph example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.9 Illustration of an ambiguous configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Abstract
Protocol specifications have been used for decades to convey the design, and to guide the implementation, of numerous protocols. Although specifications serve as the guideline and foundation of diverse advanced systems, the methods used to compose and process them have changed little despite emerging advanced techniques. The production of specifications remains labor-intensive and involves rigorous discussion to avoid miscommunication through the medium of natural language. A key reason is the existence of ambiguities in natural-language documents. An ambiguity may take the form of an unreasonable sentence, a sentence with multiple meanings, or under-specified behavior. Identifying such ambiguities in a domain-specific context is challenging. In addition, the lack of studies applying advanced natural language processing techniques limits our understanding and practice of improving specification production. Motivated by these observations, this thesis takes the first steps in introducing and building a prototype system that is user- and machine-friendly and able to process natural-language protocol specifications while controlling the level of ambiguity in the specification. The contributions are four-fold. First, the thesis applies a semantic parsing formalism, Combinatory Categorial Grammar, to analyze protocol specification texts and identify ambiguous sentences that could result in buggy implementations. Second, it parses unambiguous English specifications and generates corresponding executable protocol code that can interoperate with well-known third-party code. Third, it defines protocol behaviors with a mathematical definition and introduces unambiguous configurations; a specification configuration is easy for authors to design, and corresponding English specification text and executable code can be generated from it automatically. Lastly, it categorizes a set of verification rules that help filter out unreasonable configurations that cannot be turned into English paragraphs or code blocks.
Chapter 1
Introduction
Protocol specifications such as RFCs and IETF drafts deliver technical detail and organizational notes to the broader network protocol community. Given the familiarity of natural language, English specifications remain the most popular medium for describing network protocols, compared with formal specification languages or intermediate representations. However, producing English specifications (or specifications in any natural language) is labor-intensive and hard on authors. Authors often design and prototype protocols in the programming languages they are most comfortable with, but must switch to a less comfortable language, i.e., English, to specify their ideas. Moreover, authors are usually not natural-language experts and can make mistakes when writing. Introducing ambiguity is one such critical mistake, and it can lead to buggy systems that fail the protocol authors' design intent. Consciously discovering such ambiguity is challenging because it assumes that readers share the background knowledge needed to derive the intended semantics. Such an assumption is impractical, and little previous work systematically assists authors in discovering these hidden ambiguities. As a matter of fact, authors have no choice but to struggle through multiple rounds of human review to ensure the precision of the content.
This thesis takes the first steps towards constructing an unambiguous user- and machine-friendly, natural-language protocol specification system. This vision is particularly inspired by recent techniques, trends, and observations. First, advances in natural language processing enable analysis of domain-specific content, providing tools to systematically process natural-language text and discover ambiguities in individual sentences. Second, efforts to automate the coding process and ease manual labor, such as diverse program synthesis techniques, align well with the scenario of specification production. Lastly, a unified system that processes the same source, i.e., the English specification, can better guarantee the interoperability of code implementations, which reduces the chance of buggy implementations.
Technical challenges. To construct such a system, we encounter a number of challenges. First, to analyze a piece of natural-language text, we must extract its semantic meaning, i.e., perform semantic parsing. However, the available natural language processing tools are designed for general-purpose text; we lack tools that can be applied to network protocol documents. In addition, even given the semantic representations, we must determine which operations to apply: more specifically, we must define what ambiguity means in a domain-specific context and design a mechanism that can efficiently filter out ambiguous sentences. Second, to derive the final executable program, we need a compiler component to convert the semantic representations into correct protocol code. The generated code should also be examined and tested for interoperability with other implementations. Third, we need a technique that lets authors specify in a natural (to them) but guaranteed-unambiguous language/format/configuration, which can be converted into a natural-language specification automatically without forcing authors themselves to write unambiguous English. Lastly, we need a systematic method to examine whether a piece of such a language/format/configuration qualifies to be automatically turned into an unambiguous English specification and corresponding protocol code.
Contributions. To tackle the above challenges, this thesis builds an extensible protocol specification system that is user- and machine-friendly, unambiguous for specifying protocols, and capable of discovering ambiguities in English-specified protocols. To this end, we make the following contributions. We first construct a grammar that recognizes protocol-specific terminology, extracts semantic meaning from individual sentences, and turns it into semantic representations. The system systematically identifies ambiguous sentences with five types of checks and gives authors substantial feedback on which sentences to revise. Our second contribution is a compiler that parses a unique semantic representation into an executable line of code and generates a complete protocol code block, given environmental context and the execution order of each operation. The automatically generated code was verified to interoperate properly with well-known third-party implementations. The third contribution is a mathematically based configuration that describes protocol behaviors from a system viewpoint. Unlike other studies, which use numerous protocol-specific operations/elements to guarantee coverage of protocol operations, we define system behavior with a general set of components (input source, output source, timer, and multiple program states), allowing a configuration to be easily compiled into specification text and executable code. Lastly, our extensible protocol specification system can analyze an invalid configuration and identify why it cannot be turned into either English text or executable code.
1.1 Explorations from Existing Specification
Although English specifications are commonly the first source a protocol implementer, or any member of the network community, consults to understand the detailed design of a protocol, we do not understand them well. People often complain about how difficult it is to implement the protocol behaviors depicted in a specification, even though every English sentence is readable and (most likely) grammatically correct. Previous research provides alternatives such as reference code or specifying in formal specification languages, but little of it helps readers understand what they should actually take away from English specifications, despite their popularity. As such, this thesis takes the first steps to identify ambiguities in an existing RFC and suggests changes to the ambiguous sentences it finds.
[Figure 1.1 arranges the design space along two axes, ease of specification (WHAT) and ease of implementation (HOW): English specification with manual implementation, formal specification with manual implementation, and formal specification with automatic implementation each trade off one axis against the other; this thesis targets the remaining quadrant.]
Figure 1.1: Approaches to specifying and implementing network protocols. This thesis aims to achieve the
best of both worlds, with English specification and semi-automated implementation.
Semi-Automated Protocol Disambiguation and Code Generation. In this thesis, we explore to what extent natural language processing (NLP), an area that has made impressive strides in recent years, can be used to generate protocol implementations. We advocate a semi-automated protocol generation approach, Sage, which synthesizes functional code from natural-language specifications and occupies a unique position in the design space (Figure 1.1). Ambiguous or under-specified sentences in specifications can be fixed iteratively by a human until Sage is able to generate protocol code automatically. Using an implementation of Sage, we discover 5 instances of ambiguity and 6 instances of under-specification in the ICMP RFC; after these are fixed, Sage is able to automatically generate code that interoperates perfectly with Linux implementations. We demonstrate the ability to generalize Sage to parts of IGMP and NTP. We also find that Sage supports half of the conceptual components found in major standards protocols.
1.2 Revisions on Specifications and Beyond
An ambiguous specification requires its author to rewrite sentences so that ambiguities or under-specified content can be corrected. For an author who is not attuned to natural-language mistakes, rewriting a sentence remains difficult because of limited knowledge of what an unambiguous sentence should look like. The same concern applies to any newly composed English specification: it can be challenging to write an unambiguous specification from the beginning. Besides the problem of composing in natural language, we also observe the need to let authors specify in a more natural (to them) and unambiguous language/format/configuration.
Unambiguous English Specification Text Generation and Executable Code Generation. In fact, English specifications remain the most popular representation, but writing one is non-trivial for authors and hard for machines to parse. This thesis proposes to change this model. Without affecting the final representation that readers consume, if an author can specify protocol behaviors in a representation that is easily converted into a corresponding unambiguous English specification and the underlying executable code, we preserve ease of specification for authors, ease of reading for readers, and ease of implementation for protocol implementers. This thesis defines a 6-tuple mathematical definition to describe general protocol behavior and illustrates its ability to cover the kernel operations of an imperative synchronous programming language, Esterel. The proposed system, Sentence, parses a configuration, which is derived from the mathematical definition, to validate that it can be converted into English specification text and corresponding executable code. We show that the readability of the generated English text is vetted with a popular writing-assistant system, and we demonstrate that the generated code is capable of interoperating with a Linux implementation.
1.3 Thesis Outline
This thesis is organized as follows.
• Chapter 2 introduces Sage to systematically discover natural language ambiguities and parse an
unambiguous specification into executable protocol code.
• Chapter 3 introduces Sentence to unambiguously specify protocols with 6-tuples that automatically turn into unambiguous English specifications and corresponding protocol code.
• Chapter 4 summarizes related work.
• Chapter 5 concludes the thesis and points out directions for future work.
Chapter 2
Semi-Automated Protocol Disambiguation and Code Generation
2.1 Introduction
Four decades of Internet protocols have been specified in English and used to create, in Clark’s words,
rough consensus and running code [20]. In that time we have come to depend far more on network
protocols than most imagined. To this day, engineers implement a protocol by reading and interpreting
specifications as described in Request For Comments documents (RFCs). Their challenge is to navigate
easy-to-misinterpret colloquial language while writing not only a bug-free implementation but also one
that interoperates with code written by another person at a different time and place.
Software engineers find it difficult to interpret specifications in large part because natural language can
be ambiguous. Unfortunately, such ambiguity is not rare; the errata alone for RFCs over the years high-
light numerous ambiguities and the problems they have caused [101, 89, 22, 45]. Ambiguity has resulted
in buggy implementations, security vulnerabilities, and has necessitated expensive and time-consuming
software engineering processes, like interoperability bake-offs [94, 44].
To address this, one line of research has sought formal specification of programs and protocols (§4.1), which would enable verifying specification correctness and, potentially, enable automated code generation [15]. However, formal specifications are cumbersome and thus have not been adopted in practice; to date, protocols are specified in natural language.∗

∗ In recent years, attempts have been made to formalize other aspects of network operation, such as network configuration [49, 7] and control plane behavior [71], with varying degrees of success.
In this chapter, we apply NLP to semi-automated generation of protocol implementations from RFCs.
Our main challenge is to understand the semantics of a specification. This task, semantic parsing, has ad-
vanced in recent years with parsing tools such as CCG [5]. Such tools describe natural language with a
lexicon and yield a semantic interpretation for each sentence. Because they are trained on generic prose,
they cannot be expected to work out of the box for idiomatic network protocol specifications, which con-
tain embedded syntactic cues (e.g., structured descriptions of fields), incomplete sentences, and implicit
context from neighboring text or other protocols. More importantly, the richness of natural language will
likely always lead to ambiguity, so we do not expect fully-automated NLP-based systems (§2.2).
Contributions. In this chapter, we describe sage, a semi-automated approach to protocol analysis and
code generation from natural-language specifications. sage reads the natural-language protocol specifica-
tion (e.g., an RFC or Internet Draft) and marks sentences (a) for which it cannot generate unique semantic
interpretations or (b) which fail on the protocol’s unit tests (sage uses test-driven development). The for-
mer sentences are likely semantically ambiguous whereas the latter represent under-specified behaviors.
In either case, the user (e.g., the author of the specification) can then revise the sentences and re-run sage
until the resulting RFC can cleanly be turned into code. sage can be used at various stages in the standard-
ization process (§2.2.3): while drafting, generating reference implementations, or revising a specification.
At the core of sage is an intermediate representation, called a logical form, of the semantics of a natural-
language sentence. Intuitively, a logical form is a predicate expressing relationships between entities in
the sentence. sage uses a logical form as a unifying abstraction underlying several tasks: (a) determining
when a sentence may be fundamentally ambiguous, (b) identifying when to seek human input to expand
its own vocabulary in order to parse the sentence, and (c) generating code.
sage is architected as a pipeline with three extensible stages, each of which makes unique contribu-
tions.
The parsing stage (§2.3) generates logical forms for each input sentence. To do this, sage extends
a pre-existing semantic parser ([98]) with domain-specific constructs necessary to correctly parse IETF
standards. These constructs include networking-specific vocabulary and domain-specific semantics ( e.g.,
the use of the word “is” to specify assignment). sage includes tools that we developed to parse structural
context (e.g., indentation to specify field descriptions) and non-textual elements ( e.g., ASCII art for packet
header representations).
Ideally, the parser should be able to reduce each sentence to a single logical form. In practice, RFCs
contain idiomatic usage that confounds natural language parsers, such as incomplete sentences to describe
protocol header fields and specific uses of verbs like is and prepositions like of. For these sentences, the
parser may emit multiple logical forms. sage’s disambiguation stage contains multiple checks that filter
out logical forms that incorrectly interpret this idiomatic usage. We have developed these filters in the
course of using sage to parse RFCs. Even so, at the end of this stage, a sentence may not result in a single
logical form either (a) because the parser’s vocabulary or the disambiguation stage’s filters are incomplete,
or (b) the sentence may be fundamentally ambiguous. sage prompts the user to extend the vocabulary or
add a filter (for (a)) or rewrite the sentence (for (b)). As users repeatedly extend (“train”) sage’s vocabulary
and filters by parsing RFCs, we expect the level of human involvement to drop significantly (§2.2.3).
Once each sentence has been reduced to a single logical form, sage's code generator converts semantic representations to executable code (§2.5). To do this, the code generator uses contextual information that it has gleaned from the RFC's document structure, as well as static context predefined in sage about lower-layer protocols and the underlying OS. Unit testing on generated code can uncover incompleteness in specifications.

Name | Description
♦ Packet Format | Packet anatomy (i.e., field structure)
♦ Field Descriptions | Packet header field descriptions
♦ Constraints | Constraints on field values
♦ Protocol Behaviors | Reactions to external/internal events
System Architecture | Protocol implementation components
+ State Management | Session information and/or status
Comm. Patterns | Message sequences (e.g., handshakes)
Table 2.1: Protocol specification components. sage supports those marked with ♦ (fully) and + (partially).
sage discovered (§2.6) 5 sentences in the ICMP RFC [82] (of which 3 are unique, the others being variants) that had multiple semantic interpretations even after disambiguation. It also discovered 6 sentences that failed unit tests (all variants of a single sentence). After we rewrote these sentences, sage was able to automatically generate code for ICMP that interoperated perfectly with ping and traceroute. In contrast, graduate students asked to implement ICMP in a networking course made numerous errors (§2.2). Moreover, sage was able to parse sections of BFD [47], IGMP [26], and NTP [70] (but does not yet fully support these protocols) with few additions to the lexicon. It generated packets for the timeout procedure containing both NTP and UDP headers. It also parsed state management text for BFD to determine system actions and update state variables upon reception of control packets. Finally, sage's disambiguation is often very effective, reducing, in some cases, 56 logical forms (an intermediate representation) to 1. We have open-sourced our sage implementation [90].
Toward greater generality. sage is a significant first step toward automated processing of natural-language protocol specifications, but much work remains. Protocol specifications contain many components; Table 2.1 indicates which ones sage supports fully (marked ♦), which it supports partially (marked +), and which it does not support. Some protocols contain complex state machine descriptions (e.g., TCP)
or describe how to process and update state (e.g., BGP); sage can parse state management in a simpler
protocol like BFD.
Other protocols describe software architectures (e.g., OSPF, RTP) and communication patterns (e.g.,
BGP); sage must be extended to parse these descriptions. In §2.7, we break down the prevalence of pro-
tocol components by RFC to contextualize our contributions, and identify future sage extensions. Such
extensions will put sage within reach of parsing large parts of the TCP and BGP RFCs.
Broader implications. We note three broader takeaways from our work on sage. First, we wish to highlight the consequences of ambiguity in specifications and how they can manifest in code. Second, with proper analysis and disambiguation tools (i.e., CCG lexicons and disambiguation checks), sage can highlight ambiguities for RFC authors, editors, protocol developers, and others. Third, sage shows the feasibility of generating code from natural-language specifications, and we hope it can inspire future work to overcome the code generation challenges of diverse natural-language contexts.
Error Type | Frequency
IP header related | 57%
ICMP header related | 57%
Network byte order and host byte order conversion | 29%
Incorrect ICMP payload content | 43%
Incorrect echo reply packet length | 29%
Incorrect checksum or dropped by kernel | 36%
Table 2.2: Error types of failed cases and their frequency in 14 faulty student ICMP implementations.
2.2 Background and Overview
Specification ambiguities can lead to bugs and non-interoperability, which we quantify using implemen-
tations of ICMP [82] by students in a graduate networking course.
2.2.1 Discussion of ICMP Implementations
ICMP, defined in RFC 792 in 1981 and used by core tools like ping and traceroute, is a simple protocol
whose specification should be easy to interpret. To test this assertion, we examined implementations of
ICMP by 39 students in a graduate networking class. Given the ICMP RFC and related RFCs, students built
ICMP message handling for a router.†

† Ethics note: the code artifacts we examined were pre-existing, with no personal or identifying information; the code was not generated for this analysis.
To test whether students implemented echo reply correctly, we used the Linux ping tool to send an
echo message to their router (we tested their code using Mininet [58]). Across the 39 implementations, the
Linux implementation correctly parsed the echo reply only for 24 of them (61.5%). One failed to compile
and the remaining 14 exhibited 6 categories (not mutually exclusive) of implementation errors (Table 2.2):
mistakes in IP or ICMP header operations; byte order conversion errors; incorrectly-generated ICMP pay-
load in the echo reply message; incorrect length for the payload; and wrongly-computed ICMP checksum.
Each error category occurred in at least 4 of the 14 erroneous implementations.
To understand the incorrect checksum better, consider the specification of the ICMP checksum in this sentence: The checksum is the 16-bit one's complement of the one's complement sum of the ICMP message starting with the ICMP Type. This sentence does not specify where the checksum should end, resulting in a potential ambiguity for the echo reply; a developer could checksum some or all of the header, or both the header and the payload. In fact, students came up with seven different interpretations (Table 2.3), including checksumming only the IP header, checksumming the ICMP header together with a few fixed extra bytes, and so on.

Index | ICMP checksum range interpretations
1 | Size of a specific type of ICMP header.
2 | Size of a partial ICMP header.
3 | Size of the ICMP header and payload.
4 | Size of the IP header.
5 | Size of the ICMP header and payload, and any IP options.
6 | Incremental update of the checksum field using whichever checksum range the sender packet chose.
7 | Magic constants (e.g., 2 or 8 or 36).
Table 2.3: Students' ICMP checksum range interpretations.
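To make the consequence concrete, here is a minimal sketch in Python (illustrative names, not sage's generated code) of a checksum routine that adopts interpretation 3 of Table 2.3, checksumming the ICMP header and payload, together with the kind of unit test a specification author could supply; it exploits the one's-complement property that a correctly checksummed message re-checksums to zero.

    # A minimal sketch, not sage's generated code: an ICMP checksum over
    # header plus payload, and a unit test pinning down that range.
    import struct
    import unittest

    def icmp_checksum(message: bytes) -> int:
        """16-bit one's complement of the one's complement sum of the
        entire ICMP message (header and payload)."""
        if len(message) % 2:               # pad odd-length messages
            message += b"\x00"
        total = sum(struct.unpack(f"!{len(message) // 2}H", message))
        while total >> 16:                 # fold carries back into 16 bits
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    class ChecksumRangeTest(unittest.TestCase):
        def test_covers_header_and_payload(self):
            # Echo request: type=8, code=0, checksum=0, identifier=1, seq=1.
            header = struct.pack("!BBHHH", 8, 0, 0, 1, 1)
            payload = b"ab"
            csum = icmp_checksum(header + payload)
            # With the checksum field filled in, re-checksumming the whole
            # message (header and payload) must yield zero.
            filled = struct.pack("!BBHHH", 8, 0, csum, 1, 1) + payload
            self.assertEqual(icmp_checksum(filled), 0)

    if __name__ == "__main__":
        unittest.main()

An implementation that checksums only part of the header, as several student submissions did, fails this test immediately.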
2.2.2 Approach
Dealing with Ambiguity. Students in an early graduate course might be expected to make mistakes in
implementing protocols from specifications, but we were surprised at the prevalence of errors (Table 2.2)
and the range of interpretations of parts of the specification (Table 2.3) in student code. We do not mean to
suggest that seasoned protocol developers would make similar mistakes. However, this exercise highlights
why RFC authors and the IETF community have long relied on manual methods to avoid or eliminate non-
interoperabilities: careful review of standards drafts by participants, development of reference implemen-
tations, and interoperabilitybake-offs [94, 44] at which vendors and developers test their implementations
against each other to discover issues that often arise from incomplete or ambiguous specifications.
Why are there ambiguities in RFCs? RFCs are ambiguous because (a) natural language is expressive
and admits multiple ways to express a single idea; (b) standards authors are technical domain experts who
may not always recognize the nuances of natural language; and (c) context matters in textual descriptions,
and RFCs may omit context.
Can reference implementations alone eliminate ambiguity? Reference implementations are useful
but insufficient. For a reference protocol document to become a standard, a reference implementation is
indeed often written, and this has been the case for many years. A reference implementation is often
written by participants in the standardization process, who may or may not realize that there exist subtle
ambiguities in the text. Meanwhile, vendors write code directly to the specification (often to ensure that
the resulting code has no intellectual property encumbrances), sometimes many years after the specifi-
cation was standardized. This results in subtle incompatibilities in implementations of widely deployed
protocols [78].
Approach: Semi-automated Semantic Parsing of RFCs. Unlike general English text, network protocol
specifications have exploitable structure. The networking community uses a restricted set of words and
operations (i.e., domain-specific terminology) to describe network behaviors. Moreover, RFCs conform to
a uniform style [31] (especially recent RFCs) and all standards-track RFCs are carefully edited for clarity
and style adherence [87].
Motivated by this observation, we leverage recent advances in the NLP area of semantic parsing. Nat-
ural language can have lexical [86, 48] (e.g., the word bat can have many meanings), structural (e.g., the
sentence Alice saw Bob with binoculars) and semantic (e.g., in the sentence I saw her duck) ambiguity. Se-
mantic parsing tools can help identify these ambiguities. However, for the foreseeable future we do not
expect NLP to be able to parse RFCs without some human input. Thus, sage is semi-automated and uses
NLP tools, along with unit tests, to help a human-in-the-loop discover and correct ambiguities, after which
the specification is amenable to automated code generation.
[Figure 2.1 shows the three sage stages and their internal steps. Semantic Parsing: paragraph extraction, header struct extraction from ASCII-art packet layouts (e.g., the ICMP Type/Code/Checksum diagram), and field-description relations (assign, associate, various). Disambiguation: LF-to-graph conversion; internal inconsistency checks (type, argument ordering, predicate ordering, distributivity); an associativity check; and final LF selection. Code Generator: filtering non-executable LFs, LF-to-code conversion, code snippet reordering, and code stitching of dynamic code with a static framework into the final executable code.]
Figure 2.1: SAGE components.
[Figure 2.2 depicts the workflow as a loop: an RFC passes through semantic parsing and disambiguation until each sentence yields one LF, then through code generation and unit tests to produce code; when either checkpoint fails, the user resolves the ambiguity or implicit protocol behavior and re-runs the pipeline.]
Figure 2.2: sage workflow in processing RFC 792.
2.2.3 sage Overview
Figure 2.1 shows the three stages of sage. The parsing stage uses a semantic parser [5] to generate in-
termediate representations, called logical forms (LFs), of sentences. Because parsing is not perfect, it can
output multiple LFs for a sentence. Each LF corresponds to one semantic interpretation of the sentence, so
multiple LFs represent ambiguity. The disambiguation stage aims to automatically eliminate such ambigu-
ities. If, after this, ambiguities remain, sage asks a human to resolve them. The code generator compiles
LFs into executable code, a process that may also uncover ambiguity.
sage Workflow. To clarify how sage works, and when (and for what reason) human involvement is
necessary, we briefly describe the workflow that a sage user (e.g., a specification author) would follow
(Figure 2.2). First, the user extracts actionable sections of a specification and feeds these to the semantic
parsing stage. RFCs contain significant explanatory, non-actionable, material ( e.g., the introduction) that
may not be relevant to the analysis;sage currently requires a human to identify these, but can potentially
identify such sections automatically, which we have left to future work. The parsing stage analyzes each
15
sentence in the input. The output of this stage is a set of logical forms representing semantic interpretations
of the sentence (§2.3). The disambiguation stage (§2.4) winnows these logical forms based on built-in
checks that capture domain-specific usage in protocol specifications.
If this step does not result in a single LF, there are two possibilities: (a) either the sentence is funda-
mentally ambiguous, or (b) the sentence contains terms not present in sage’s lexicon or domain specific
usage not present insage’s built-in checks. At this point,sage presents the sentence to the user, who can,
for case (a), rewrite the sentence to resolve the ambiguity, or, for case (b), extendsage’s lexicon or add to
its built-in checks. This is akin to systems like spell and grammar checkers, which present users with po-
tential errors, and permit users to add entries to local dictionaries as part of a correction step. Adding new
lexical entries is, of course, more difficult than adding entries to a dictionary. Our sage implementation
contains a simple user interface enhancement to suggest additions in order to reduce the cognitive load on
the user. Better user interfaces can further reduce cognitive load, but will require significant user studies
so we leave these to future work.
Over time, as sage is used to analyze RFCs, we expect this manual effort to decline significantly. Our
intuition for this comes from Zipf’s law, first defined in quantitative linguistics, which shows that the
frequency of word usage is heavy-tailed: some are very common while others are rare. Over time, the
lexical entries and checks added tosage may cover most of the text in a new specification, and users need
only add the occasional lexical entry or domain-specific check. Our evaluations (§2.6) corroborate this
intuition.
Once each sentence has been reduced to a single LF, the code generator stage (§2.5) generates protocol
code and runs unit tests on them. These unit tests are to be written by the spec author; sage employs
test-driven development (§2.6.5). If a unit test fails, it is likely that protocol behavior is under-specified.
16
At this point as well sage notifies the user ( e.g., the specification author), who can rewrite the relevant
sentence(s) and re-invoke the entire pipeline.
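Concretely, the loop just described can be sketched as follows; the function and exception names are illustrative stand-ins for the real sage stages, not its API.

    # A minimal sketch of the Figure 2.2 loop, assuming parse, disambiguate,
    # generate_code, and unit_tests are supplied by the pipeline stages.
    class AmbiguousSentence(Exception):
        """The user rewrites the sentence or extends the lexicon/checks."""

    class UnderSpecifiedBehavior(Exception):
        """The user revises the specification and re-runs the pipeline."""

    def run_pipeline(sentences, parse, disambiguate, generate_code, unit_tests):
        lfs = []
        for sentence in sentences:
            candidates = disambiguate(parse(sentence))
            if len(candidates) != 1:       # zero or multiple LFs remain
                raise AmbiguousSentence(sentence)
            lfs.append(candidates[0])
        code = generate_code(lfs)
        if not unit_tests(code):           # failing tests suggest under-specification
            raise UnderSpecifiedBehavior(code)
        return code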
How and when to use sage. A standards document begins its life as an Internet Draft discussed at sev-
eral IETF meetings. At this stage, specification authors can use sage to identify fundamentally ambiguous
sentences. Before the protocol is standardized, participants in the standardization process develop a ref-
erence implementation. During this stage, developers of the reference implementation can test sage’s
auto-generated code against their implementation to identify under-specified behavior (§2.6.5). Finally,
when a vendor decides to implement the protocol on their platform, they can use sage’s generated code
as a starting point for their implementation.
sage can also help to revise specifications in two ways. Its disambiguation stage can eliminate am-
biguity introduced during the revision. Moreover, it can generate code for two different versions of a
specification, and with the help of analysis tools (e.g., static analysis, control flow analysis), a future ver-
sion of sage could help protocol implementers to develop backward compatibility mechanisms between
the two versions.
2.3 Semantic Parsing
Semantic parsing is the task of extracting meaning from a document. Tools for semantic parsing formally
specify natural language grammars and extract parse trees from text. More recently, deep-learning based
approaches have proved effective in semantic parsing [30, 112, 55] and certain types of automatic code
generation [111, 61, 84]. However, such methods do not directly apply to our task. First, deep-learning models are typically trained as a "black box"; since we aim to identify ambiguity in specifications, we need to interpret intermediate steps in the parsing process and maintain all valid parsings. Second, such
methods require large-scale annotated datasets; collecting high-quality data that maps network protocol
specifications to expert-annotated logical forms (for supervised learning) is impractical.
For these reasons, we use the Combinatory Categorial Grammar (CCG [5]) formalism, which (a) couples syntax and semantics in the parsing process and (b) is well suited to handling domain-specific terminology via a small hand-crafted lexicon that encapsulates domain knowledge. CCG has been
used to parse natural language explanations into labeling rules in several contexts [97, 104].
CCG background. A CCG takes as input a description of the language syntax and semantics. It describes the syntax of words and phrases using primitive categories such as noun (N), noun phrase (NP), or sentence (S), and complex categories composed of primitive categories, such as S\NP (to express a constituent that combines with a noun phrase on its left to form a sentence). It describes semantics with lambda expressions such as λx.λy. @Is(y,x) and λx. @Compute(x).
CCG employs a lexicon, which users can extend to capture domain-specific knowledge. For example,
we added the following lexical entries to the lexicon to represent constructs found in networking standards
documents:
1. checksum → NP: "checksum"
2. is → {(S\NP)/NP: λx.λy. @Is(y,x)}
3. zero → {NP: @Num(0)}
This expresses that (a) "checksum" is a special word in networking, (b) "is" can denote assignment, and (c) zero can be a number. CCG can use this lexicon to generate a logical form (LF) that completely captures the semantics of a phrase such as "checksum is zero": {S:@Is("checksum",@Num(0))}. Our code generator (§2.5) produces code from these.
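For illustration, here is a minimal sketch of such lexical entries expressed in NLTK's CCG module (the parser sage builds on, per §2.3); the lexicon string below is illustrative, not sage's actual lexicon, and for simplicity it maps zero to a plain constant where sage's entry would produce @Num(0).

    # A minimal sketch, not sage's lexicon: parse "checksum is zero" with a
    # three-entry CCG lexicon whose {...} annotations carry the semantics.
    from nltk.ccg import chart, lexicon

    lex = lexicon.fromstring(r"""
        :- S, NP
        checksum => NP {checksum}
        is => (S\NP)/NP {\x y.Is(y,x)}
        zero => NP {zero}
        """, True)  # True: interpret the {...} annotations as semantics

    parser = chart.CCGChartParser(lex, chart.DefaultRuleSet)
    for derivation in parser.parse("checksum is zero".split()):
        # The derivation ends in category S with semantics Is(checksum,zero).
        chart.printCCGDerivation(derivation)
        break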
Challenges. sage must surmount three challenges before using CCG: (a) specifying domain-specific syntax, (b) specifying domain-specific semantics, and (c) extracting structural and non-textual elements in standards documents (described below). Next we describe how we address these challenges.
Specifying domain-specific syntax. Lexical entry (1) above specifies that checksum is a keyword in
the vocabulary. Rather than having a person specify such syntactic lexical entries, sage creates a term
dictionary of domain-specific nouns and noun-phrases using the index of a standard networking textbook.
This reduces human effort. Before we run the semantic parser, we also need to identify nouns and noun-
phrases that occur generally in English, for which we use an NLP tool called SpaCy [41].
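A minimal sketch of this generic noun-phrase pass, assuming spaCy and its small English model en_core_web_sm are installed, might look like the following; the input sentence is an example, not sage's input format.

    # A minimal sketch: collect generic English noun phrases with spaCy so
    # the parser can treat them as NP tokens alongside the term dictionary.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The checksum field of the ICMP message should be zero.")
    noun_phrases = [chunk.text for chunk in doc.noun_chunks]
    print(noun_phrases)  # e.g., ['The checksum field', 'the ICMP message']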
Specifying domain-specific semantics. NLTK’s CCG [63] has a built-in lexicon that captures the se-
mantics of written English. Even so, we have found it important to add domain-specific lexical entries.
For example, the lexical entry (2) above shows that the verb is can represent the assignment of a value to
a protocol field. In sage, we manually generate these domain-specific entries, with the intent that these
semantics will generalize to many RFCs (see also §2.6). Beyond capturing domain-specific uses of words
(like is), domain-specific semantics capture idiomatic usage common to RFCs. For example, RFCs have
field descriptions (like version numbers, packet types) that are often followed by a single sentence that has
the (fixed) value of the field. For a CCG to parse this, it must know that the value should be assigned to the
field. Similarly, RFCs sometimes represent descriptions for different code values of a type field using an
idiom of the form “0 = Echo Reply”. §2.6 quantifies the work involved in generating the domain-specific
lexicon.
Extracting structural and non-textual elements. Finally, RFCs contain stylized elements, for which
we wrote pre-processors. RFCs use descriptive lists (e.g., field names and their values) and indentation to
note content hierarchy. Our pre-processor extracts these relationships to aid in disambiguation (§2.4) and
code generation (§2.5). RFCs also represent header fields (and field widths) with ASCII art; we extract field
names and widths and generate data structures (specifically, structs in C) to represent headers to enable
automated code generation (§2.5). Some RFCs [70] also contain pseudo-code, which we represent as logical
forms to facilitate code generation.
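As an illustration of the header extraction, here is a minimal sketch, not sage's actual pre-processor, that recovers field names and bit widths from one row of RFC-style ASCII art (where each bit column spans two characters) and prints a C-style struct fragment; since C bit-field layout is implementation-defined, the printed struct stands in for whatever header representation a generator would emit.

    # A minimal sketch: derive (field name, bit width) pairs from one row of
    # RFC ASCII art. A cell of n characters between '|' markers spans
    # (n + 1) // 2 bits, since each bit column is two characters wide.
    row = "|     Type      |     Code      |          Checksum             |"

    fields = [(cell.strip().lower(), (len(cell) + 1) // 2)
              for cell in row.strip("|").split("|")]

    print("struct icmp_header {")
    for name, bits in fields:
        print(f"    uint32_t {name} : {bits};")
    print("};")
    # Prints fields type:8, code:8, and checksum:16.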
Running a CCG. After pre-processing, we run a CCG on each sentence of an RFC. Ideally, a CCG should output exactly one logical form for a sentence. In practice, it outputs zero or more logical forms, some of which arise from CCG limitations, and some from ambiguities inherent in the sentence.
2.4 Disambiguation
Next we describe how sage leverages domain knowledge to automatically resolve some ambiguities, where
semantic parsing resulted in either 0 or more than 1 logical forms.
2.4.1 Why Ambiguities Arise
To show how we automatically resolve ambiguities, we take examples from the ICMP RFC [82] for which
our semantic parser returned either 0 or more than 1 logical forms.
Zero logical forms. Several sentences in the ICMP RFC resulted in zero logical forms after semantic
parsing, all of which were grammatically incomplete, lacking a subject:
A The source network and address from the original datagram’s data
B The internet header plus the first 64 bits of the original datagram’s data
C If code = 0, identifies the octet where an error was detected
D Address of the gateway to which traffic for the network specified in the internet destination network field
of the original datagram’s data should be sent
Such sentences are common in protocol header field descriptions. The last sentence is difficult even for a
human to parse.
More than 1 logical form. Several sentences resulted in more than one logical form after semantic
parsing. The following two sentences are grammatically incorrect:
E If code = 0, an identifier to aid in matching timestamp and replies, may be zero
F If code = 0, a sequence number to aid in matching timestamp and replies, may be zero
The following example needs additional context, and contains imprecise language:
G To form a information reply message, the source and destination addresses are simply reversed, the type
code changed to 16, and the checksum recomputed
A machine parser does not realize that source and destination addresses refer to fields in the IP header.
Similarly, it is unclear from this sentence whether the checksum refers to the IP checksum or the ICMP
checksum. Moreover, the term type code is confusing, even to a (lay) human reader, since the ICMP header
contains both a type field and a code field.
Finally, this sentence, discussed earlier (§2.2.1), is under-specified, since it does not describe which byte
the checksum computation should end at:
H The checksum is the 16-bit ones’s complement of the one’s complement sum of the ICMP message starting
with the ICMP Type
While sentences G and H are grammatically correct and should have resulted in a single logical form, the
CCG parser considers them ambiguous as we explain next.
Causes of ambiguities: zero logical forms. Examples A through C are missing a subject. In the common
case when these sentences describe a header field, that header field is usually the subject of the sentence.
This information is available to sage when it extracts structural information from the RFC (§2.3). When
a sentence that is part of a field description has zero logical forms, sage can re-parse that sentence by
supplying the header. This approach does not work for D; this is an incomplete sentence, but CCG is
unable to parse it even with the supplied header context. Ultimately, we had to re-write that sentence to
successfully parse it.
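A minimal sketch of this re-parse strategy, using an illustrative helper rather than sage's code, is:

    # A minimal sketch: when a field-description sentence yields zero LFs,
    # retry with the enclosing header field supplied as the missing subject
    # (e.g., prepend the field's name to example A's description).
    def parse_with_field_context(parse, sentence, field_name):
        lfs = parse(sentence)
        if not lfs and field_name is not None:
            lfs = parse(f"{field_name} is {sentence}")
        return lfs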
Causes of ambiguities: more than one logical form. Multiple logical forms arise from more funda-
mental limitations in machine parsing. Consider Figure 2.3, which shows multiple logical forms arising
for a single sentence. Each logical form consists ofnested predicates (similar to a statement in a functional
language), where each predicate has one or more arguments. A predicate represents a logical relationship
22
Sentence For computing the checksum, the checksum field should be zero
LF1 @AdvBefore(@Action(’compute’,’0’),@Is(@And(’checksum_field’,’checksum’),’0’))
LF2 @AdvBefore(@Action(’compute’,’checksum’),@Is(’checksum_field’,’0’))
LF3 @AdvBefore(’0’,@Is(@Action(’compute’,@And(’checksum_field’,’checksum’)),’0’))
LF4 @AdvBefore(’0’,@Is(@And(’checksum_field’, @Action(’compute’,’checksum’)),’0’))
LF2:
@AdvBefore
@Action @Is
’compute ’checksum’ ’checksum_field’ ’0’
Figure 2.3: Example of multiple LFs from CCG parsing of “For computing the checksum, the checksum
should be zero”.
(@And), an assignment (@Is), a conditional (@If), or an action (@Action) whose first argument is the
name of a function, and subsequent arguments are function parameters. Finally, Figure 2.3 illustrates that
a logical form can be naturally represented as a tree, where the internal nodes are predicates and leaves
are (scalar) arguments to predicates.
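As a concrete rendering of this tree structure, a logical form can be modeled as a small recursive data type; the class below is an illustrative sketch, not sage's actual representation:

class LF:
    """A logical-form tree node: internal nodes are predicates,
    leaves are scalar (string) arguments."""
    def __init__(self, predicate, *args):
        self.predicate = predicate    # e.g., '@Is', '@Action', '@AdvBefore'
        self.args = list(args)        # children: LF nodes or scalar strings

# LF2 from Figure 2.3, built as a tree:
lf2 = LF('@AdvBefore',
         LF('@Action', 'compute', 'checksum'),
         LF('@Is', 'checksum_field', '0'))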
Inconsistent argument types. In some logical forms, arguments are incorrectly typed, so those forms are obviously wrong. For example, in LF1 in Figure 2.3, the second argument of the compute action must be the name of a function, not a numeric constant. CCG's lexical rules do not support type systems, so they cannot eliminate badly-typed logical forms.
Order-sensitive predicate arguments. The parser generates multiple logical forms for sentence E. Among these, in one logical form, code is assigned zero, but in the others, code is tested for zero. Sentence E has the form “If A, (then) B”, and CCG generates two different logical forms: @If(A,B) and @If(B,A). This is not a mistake humans would make, since the condition and action are clear from the sentence. However, CCG's flexibility and expressive power may cause over-generation of semantic interpretations in this circumstance. This unintended behavior is well-known [40, 105].
Predicate order-sensitivity. Consider a sentence of the form “A of B is C”. For this sentence, CCG generates two distinct logical forms. In one, the @Of predicate is at the root of the tree; in the other, @Is is at the root of the tree. The first corresponds to the grouping “(A of B) is C” and the second to the grouping “A of (B is C)”. For sentences of this form, the latter is incorrect, but CCG is unable to disambiguate between the two.
[Figure content: the two LF trees for sentence H, which differ only in how @Of is grouped:
#1: @StartsWith(@Is('checksum', @Of(@Of(Ones, OnesSum), 'icmp_message')), 'icmp_type')
#2: @StartsWith(@Is('checksum', @Of(Ones, @Of(OnesSum, 'icmp_message'))), 'icmp_type')]
Figure 2.4: LF graphs of sentence H.
Predicate distributivity. Consider a sentence of the form “A and B is C”. This sentence exemplifies a grammatical structure called coordination [98].‡ For such a sentence, CCG will generate two logical forms, corresponding to: “(A and B) is C” and “(A is C) and (B is C)” (in the latter form, “C” distributes over “A” and “B”). In general, both forms are equally correct. However, CCG sometimes chooses to distribute predicates when it should not. This occurs because CCG is unable to distinguish between two uses of the comma: one as a conjunction, and the other to separate a dependent clause from an independent clause. In sentences with a comma, CCG generates logical forms for both interpretations. RFCs contain some sentences of the form “A, B is C”.§ When CCG interprets the comma to mean a conjunction, it generates a logical form corresponding to “A is C and B is C”, which, for this sentence, is clearly incorrect.
Predicate associativity. Consider sentence H, which has the form “A of B of C”, where each of A, B, and C is a predicate (e.g., A is the predicate @Action("16-bit-ones-complement")). In this example, the CCG parser generates two semantic interpretations corresponding to two different groupings of operations (one that groups A and B, the other that groups B and C; Figure 2.4). In this case, the @Of predicate is associative, so the two logical forms are equivalent, but the parser does not know this.
‡ For example: Alice sees and Bob says he likes Ice Cream.
§ If a higher-level protocol uses port numbers, they are assumed to be in the first 64 data bits of the original datagram's data.
2.4.2 Winnowing Ambiguous Logical Forms
We define the following checks to address each of the above types of ambiguities (§2.4.1), which sage
applies to sentences with multiple logical forms, winnowing them down (often) to one logical form (§2.6).
These checks apply broadly because of the restricted way in which specifications use natural language.
While we derived these by analyzing ICMP, we show that these checks also help disambiguate text in other
RFCs. At the end of this process, if a sentence is still left with multiple logical forms, it is fundamentally
ambiguous, so sage prompts the user to re-write it.
Type. For each predicate, sage defines one or more type checks: action predicates have function name
arguments, assignments cannot have constants on the left hand side, conditionals must be well-formed,
and so on.
Argument ordering. For each predicate for which the order of arguments is important, sage defines
checks that remove logical forms that violate the order.
Predicate ordering. For each pair of predicates where one predicate cannot be nested within another,
sage defines checks that remove order-violating logical forms.
Distributivity. To avoid semantic errors due to comma ambiguity, sage always selects the non-
distributive logical form version (in our example, “(A and B) is C”).
Associativity. If predicates are associative, their logical form trees (Figure 2.4) will be isomorphic. sage
detects associativity using a standard graph isomorphism algorithm.
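The sketch below illustrates, in simplified form and reusing the LF class from the earlier sketch, how a type check and the associativity check might operate; the actual checks in sage are more extensive:

def is_constant(x):
    return isinstance(x, str) and x.isdigit()

def passes_type_check(lf):
    """Example checks: @Action arguments may not be bare numeric
    constants, and @Is may not assign to a constant."""
    if isinstance(lf, str):
        return True
    if lf.predicate == '@Action' and any(is_constant(a) for a in lf.args):
        return False                      # rejects LF1 in Figure 2.3
    if lf.predicate == '@Is' and is_constant(lf.args[0]):
        return False
    return all(passes_type_check(a) for a in lf.args)

ASSOCIATIVE = {'@Of'}

def flatten(lf):
    """Flatten nested uses of associative predicates so that the two
    groupings of 'A of B of C' (Figure 2.4) become the same tree."""
    if isinstance(lf, str):
        return lf
    args = [flatten(a) for a in lf.args]
    if lf.predicate in ASSOCIATIVE:
        flat = []
        for a in args:
            if isinstance(a, LF) and a.predicate == lf.predicate:
                flat.extend(a.args)       # merge same-predicate child
            else:
                flat.append(a)
        args = flat
    return LF(lf.predicate, *args)

def key(lf):
    """Hashable canonical form used to detect isomorphic LF trees."""
    if isinstance(lf, str):
        return lf
    return (lf.predicate, tuple(key(a) for a in lf.args))

def winnow(lfs, checks):
    survivors = [lf for lf in lfs if all(check(lf) for check in checks)]
    # keep one representative per isomorphism class
    return list({key(flatten(lf)): lf for lf in survivors}.values())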
2.5 CodeGeneration
Next we discuss how we convert the intermediate representation of disambiguated logical forms to code.
2.5.1 Challenges
We faced two main challenges in code generation: (a) representing implicit knowledge about dependencies
between two protocols or a protocol and the OS and (b) converting a functional logical form into imperative
code.
Encoding protocol and environment dependencies. Networked systems rely upon protocol stacks, where protocols higher in the stack use protocols below them. For example, ICMP specifies what operations to perform on IP header fields (e.g., sentence G in §2.4), and does not specify but assumes an implementation of one's complement. Similarly, standards descriptions do not explicitly specify what abstract functionality they require of the underlying operating system (e.g., the ability to read interface addresses). To address this challenge, sage requires a pre-defined static framework that provides such functionality along with an API to access and manipulate headers of other protocols, and to interface with the OS. sage's generated code (discussed below) uses the static framework. The framework may either contain a complete implementation of the protocols it abstracts, or, more likely, invoke existing implementations of these protocols and services provided by the OS.
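A sketch of what the static framework's API surface might look like (method names here are illustrative, not sage's actual interface):

class StaticFramework:
    """Illustrative API for the static framework: access to
    lower-layer protocol headers and OS services."""

    def get_ip_field(self, packet, name):
        """Read a field of the encapsulating IP header,
        e.g., 'source_address'."""
        raise NotImplementedError

    def set_ip_field(self, packet, name, value):
        """Write a field of the encapsulating IP header."""
        raise NotImplementedError

    def ones_complement_sum(self, data):
        """One's complement sum, which the ICMP RFC assumes rather
        than specifies."""
        raise NotImplementedError

    def interface_address(self, interface):
        """OS service: the address assigned to an interface."""
        raise NotImplementedError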
Logical Forms as an Intermediate Representation. The parser generates an LF to represent a sentence. For code generation, these sentences (or fragments thereof) fall into two categories: actionable and non-actionable sentences. Actionable sentences result in executable code: they describe value assignments to fields, operations on headers, and computations (e.g., checksum). Non-actionable sentences do not specify executable code, but specify a future intent such as “The checksum may be replaced in the future” or behavior intended for other protocols such as “If a higher level protocol uses port numbers, port numbers are assumed to be in the first 64 data bits of the original datagram's data”. Humans may intervene to identify non-actionable sentences; sage tags their logical forms with a special predicate @AdvComment.
The second challenge is that parsers generate logical forms for individual sentences, but the ordering of code generated from these logical forms is not usually explicitly specified. Often the order in which sentences occur matches the order in which to generate code for those sentences. For example, an RFC specifies how to set field values, and it is safe to generate code for these fields in the order in which they appear. There are, however, exceptions to this. Consider the sentence in Figure 2.3, which specifies that, when computing the checksum, the checksum field must be zero. This sentence occurs in the RFC after the sentence that describes how to compute the checksum, but its executable code must occur before it. To address this, sage contains a lexical entry that identifies, and appropriately tags (using a special predicate @AdvBefore), sentences that describe such advice (as used in functional and aspect-oriented languages).¶
2.5.2 Logical Forms to Code
Pre-processing and contextual information. The process of converting logical forms to code is multi-stage, as shown in the right block of Figure 2.1. Code generation begins with pre-processing actions. First, sage filters out logical forms with the @AdvComment predicate. Then, it prepares logical forms for code conversion by adding contextual information. A logical form does not, by itself, have sufficient information to auto-generate code. For example, from a logical form that says 'Set (message) type to 3' (@Is(type, 3)), it is not clear what “type” means; this must be inferred from the context in which that sentence occurs. In RFCs, this context is usually implicit from the document structure (the section, paragraph heading, or indentation of text). sage auto-generates a context dictionary for each logical form (or sentence) to aid code generation (Table 2.4).
¶ Advice covers statements associated with a function that must be executed before, after, or instead of that function. Here, the checksum must be set to zero before computing the checksum.
LF: @Is('type', '3')
context: {"protocol": "ICMP", "message": "Destination Unreachable Message", "field": "type", "role": ""}
code: hdr->type = 3;
Table 2.4: Logical form with context and resulting code.
In addition to this dynamic context, sage also has a pre-defined static context dictionary that encapsulates static information. This contains field names used in lower-level protocols (e.g., the table maps the terms source and destination addresses to corresponding fields in the IP header, or the term “one's complement sum” to a function that implements that term). During code generation, sage first searches the dynamic context, then the static context.
Code generation. After preprocessing, sage generates code for a logical form using a post-order traversal of the single logical form obtained after disambiguation. For each predicate, sage converts the predicate to a code snippet using both a dictionary of predicate-code snippet mappings and contextual information; concatenating these code snippets results in executable code for the logical form. For corner cases, sage applies user-defined conversions to fine-tune the resulting code.
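As an illustration of this traversal (a simplified sketch, reusing the LF class from §2.4; handler names and the context interface are hypothetical), consider converting the LF of Table 2.4 to code:

def lf_to_code(lf, context, handlers):
    """Post-order traversal: convert children first, then apply the
    predicate's handler to the child snippets."""
    if isinstance(lf, str):
        # resolve a leaf against the context (dynamic, then static)
        return context.get(lf, lf)
    snippets = [lf_to_code(a, context, handlers) for a in lf.args]
    return handlers[lf.predicate](snippets, context)

def handle_is(args, context):
    """Predicate-to-code mapping for @Is: emit an assignment to a
    header field of the message being generated."""
    lhs, rhs = args
    return f"hdr->{lhs} = {rhs};"

handlers = {'@Is': handle_is}
context = {"protocol": "ICMP", "field": "type"}
print(lf_to_code(LF('@Is', 'type', '3'), context, handlers))
# prints: hdr->type = 3;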
sage then concatenates code snippets for all the logical forms in a message into a packet handling function.∥ In general, for a given message, it is important to distinguish between code executed at the sender versus at the receiver, and to generate two functions, one for the sender and one for the receiver. Whether a logical form applies to the sender or the receiver is also encoded in the context dictionary (Table 2.4). Also, sage uses the context to generate unique names for the functions, based on the protocol, the message type, and the role, all of which it obtains from the context dictionaries.
Finally, sage processes advice at this stage to decide on the order of the generated executable code. In its current implementation, it only supports @AdvBefore, which inserts code before the invocation of a function.
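A minimal sketch of this advice-processing step (assuming each snippet was tagged during parsing; the generated C statements inside the strings, including the icmp_checksum call, are hypothetical):

def order_snippets(tagged_snippets):
    """tagged_snippets: list of (tag, code) pairs, where tag is
    '@AdvBefore' or None. Advised snippets are emitted ahead of the
    body they modify."""
    before = [code for tag, code in tagged_snippets if tag == '@AdvBefore']
    body = [code for tag, code in tagged_snippets if tag is None]
    return "\n".join(before + body)

# Zeroing the checksum field is moved ahead of the checksum
# computation, even though the RFC states it afterwards:
print(order_snippets([
    (None, "hdr->checksum = icmp_checksum(hdr, len);"),
    ("@AdvBefore", "hdr->checksum = 0;"),
]))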
∥ sage-generated code examples are available at [90].
These functions are inserted into a static framework at code stitching (Figure 2.1). This framework
provides required networking functions such as I/O handling involving socket management or, for testing
purposes, PCAP read/write and helper functions (e.g., parity checks, checksum calculation).
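For instance, a checksum helper in the framework might compute the 16-bit one's complement checksum as follows (a Python rendering purely for illustration; the framework's actual helper accompanies the generated code):

def ones_complement_checksum(message: bytes) -> int:
    """16-bit one's complement of the one's complement sum, computed
    over the message with the checksum field pre-zeroed."""
    if len(message) % 2:
        message += b"\x00"                           # pad odd-length messages
    total = 0
    for i in range(0, len(message), 2):
        total += (message[i] << 8) | message[i + 1]
        total = (total & 0xFFFF) + (total >> 16)     # end-around carry
    return ~total & 0xFFFF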
Adapting the code generator to new protocol packet handling functions might require some human ef-
fort in updating the conversion tables. Additionally, new predicates need to be added to the predicate-code
snippet mapping when they are first introduced. We found these steps require no deep protocol knowledge
since most of the rules are general. Significant engineering effort is only required for implementing helper
functions for the static framework, which we expect will be rare after a larger library of these is developed.
Iterative discovery of non-actionable sentences. Non-actionable sentences are those for which sage
should not generate code. Rather than assume that a human annotates each RFC with such sentences before
sage can execute, sage provides support for iterative discovery of such sentences, using the observation
that a non-actionable sentence will usually result in a failure during code generation. So, to discover such
sentences, a user runs the RFC through sage repeatedly. When it fails to generate code for a sentence,
it alerts the user to confirm whether this was a non-actionable sentence or not, and annotates the RFC
accordingly. During subsequent passes, it tags the sentence’s logical forms with @AdvComment, which
the code generator ignores.
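In outline, one pass of this discovery loop looks like the following sketch (the failure signal and the user prompt are hypothetical names, not sage's actual interface):

class CodeGenerationError(Exception):
    """Raised when a sentence's LF cannot be converted to code."""

def discover_non_actionable(sentences, generate_code, user_confirms, annotations):
    """Attempt code generation for each sentence; on failure, ask the
    user whether the sentence is non-actionable and record the answer."""
    for s in sentences:
        if annotations.get(s) == "@AdvComment":
            continue                         # tagged earlier; codegen skips it
        try:
            generate_code(s)
        except CodeGenerationError:
            if user_confirms(s):             # "is this non-actionable?"
                annotations[s] = "@AdvComment"
    return annotations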
In ICMP, for example, there are 35 such sentences. Among the RFCs we evaluated, sage can automatically tag such code generation failures as @AdvComment without human intervention (i.e., there were no cases of an actionable sentence that failed code generation once we defined the context).
2.6 Evaluation
Next we quantify sage’s ability to find specification ambiguities, its generality across RFCs, and the im-
portance of disambiguation and of our parsing and code generation extensions.
2.6.1 Methodology
Implementation. sage includes a networking dictionary, new CCG-parsable lexicon entries, a set of inconsistency checks, and LF-to-code predicate handler functions. We used the index of [56] to create a dictionary of about 400 terms. sage adds 71 lexical entries to an NLTK-based CCG parser [63].∗∗ Overall, sage consists of 7,128 lines of code. In addition, the static framework is 1,478 lines of code; this framework is reused across all protocols.
To winnow ambiguous logical forms for ICMP (§2.4.2), we defined 32 type checks, 7 argument ordering
checks, 4 predicate ordering checks, and 1 distributivity check. Argument ordering and predicate ordering
checks maintain a blocklist. Type checks use an allowlist and are thus the most prevalent. The distributivity
check has a single implicit rule. For code generation, we defined 25 predicate handler functions to convert
LFs to code snippets. As we analyzed additional protocols (IGMP, NTP and BFD), we manually added more
lexical entries and type checks, using the workflow described in §2.2.3; we quantify the overhead of these
in §2.6.3 and §2.6.4. Across all of these protocols, sage auto-generated 554 lines of protocol code after
disambiguation.
Test Scenarios. First we examine the ICMP RFC, which defines 8 ICMP message types.†† Like the student assignments we analyzed earlier, we generated code for each ICMP message type. To test this for each message, as with the student projects, the client sends test messages to the router, which then responds
∗∗ NLTK is a popular general-purpose NLP toolkit: over 100k GitHub repositories depend on it [75]. We are aware of limitations of NLTK's CCG parser; other tools such as SPF [4] may address these limitations. We leave the comparison of the two toolkits and possible migration to SPF to future work.
†† ICMP message types include destination unreachable, time exceeded, parameter problem, source quench, redirect, echo/echo reply, timestamp/timestamp reply, and information request/reply.
with the appropriate ICMP message. For each scenario, we captured both sender and receiver packets and
verified correctness with tcpdump. We include details of each scenario in the Appendix. To demonstrate
the generality of sage, we also evaluated IGMP, NTP, and BFD.
2.6.2 End-to-end Evaluation
Next we verify that ICMP code generated by sage produces packets that interoperate correctly with Linux tools.
Packet capture based verification. In the first experiment, we examined the packets emitted by a sage-generated ICMP implementation with tcpdump [99], to verify that tcpdump can read packet contents correctly without warnings or errors. Specifically, for each message type, for both the sender and receiver side, we use the static framework in sage-generated code to generate and store the packet in a pcap file and verify it using tcpdump. tcpdump output lists packet types (e.g., an IP packet with a time-exceeded ICMP message) and warns if a packet is truncated or corrupted. In all of our experiments we found that sage-generated code produces correct packets with no warnings or errors.
Interoperation with existing tools. Here we test whether a sage-generated ICMP implementation interoperates with tools like ping and traceroute. To do so, we integrated our static framework code and the sage-generated code into a Mininet-based framework used for the course described in §2.2. With this framework, we verified, with four Linux commands (testing echo, destination unreachable, time exceeded, and traceroute behavior), that a sage-generated receiver or router correctly processes echo request packets sent by ping and TTL-limited data packets or packets to non-existent destinations sent by traceroute, and that its responses are correctly interpreted by those programs. For all these commands, the generated code interoperates correctly with these tools. We also conducted interoperability experiments on real machines. To do so, we extended our static framework to send and receive ICMP packets on raw sockets. The results were identical to our Mininet experiments.
2.6.3 Exploring Generality: IGMP and NTP
To understand the degree to which sage generalizes to other protocols, we ran it on two other protocols:
parts of IGMP v1 as specified in RFC 1112 [26] and NTP [70]. These RFCs contain conceptual elements
such as architecture description and behavior not specific to network protocols (e.g., NTP strata). These
are currently not supported by sage. In §2.7, we discuss what it will take to extend sage to completely
parse these RFCs and generalize it to a larger class of protocols.
IGMP. In RFC 1112 [26], we parsed the packet header description in Appendix I of the RFC. To do this, we added to sage 8 lexical entries (beyond the 71 we had added for ICMP), 4 predicate function handlers (beyond the 21 for ICMP), and 1 predicate ordering check (beyond the 7 for ICMP). For IGMP, sage generates the sending of host membership and query messages. We also verified interoperability of the generated code. In our test, our generated code sends a host membership query to a commodity switch. We verified, using packet captures, that the switch's response is correct, indicating that it interoperates with the sender code.
NTP. For NTP [70], we parsed Appendices A and B: these describe, respectively, how to encapsulate NTP
messages in UDP, and the NTP packet header format and field descriptions. To parse these, we added
only 5 additional lexical entries and 1 predicate ordering check beyond what we already had for IGMP and
ICMP.
2.6.4 Exploring Generality: BFD
Thus far, we have discussed how sage supports headers, field descriptions, constraints, and basic behaviors. We now explore applying sage to BFD [47], a recent protocol whose specification contains sentences that describe how to initiate/update state variables. We have used sage to parse such state management sentences (§6.8.6 in RFC 5880). The RFC contains additional components that sage currently cannot handle. These include algorithms (e.g., timing calculation) and complex communication patterns (e.g., authentication). In §2.7, we discuss what it will take to extend sage to completely parse BFD.
BFD Introduction. BFD is used to detect faults between two nodes. Each node maintains multiple state variables for both protocol and connection state. Connection state is represented by a 3-state machine and represents the status (e.g., established, being established, or being torn down) of the session between nodes. Protocol state variables are used to track local and remote configuration.‡‡
State Management Dictionary. A state management sentence describes how to use or modify protocol
or connection state in terms of state management variables. For example, bfd.SessionState is a connection
state variable; Up is a permitted value. We extend our term dictionary to include these state variables and
values as noun phrases.
Parsing. We focus on explaining our analysis of such state management sentences. sage is also able to
parse the BFD packet header described in §4.1 of RFC 5880. We analyzed 22 state management sentences in
§6.8.6 of RFC 5880 which involve a greater diversity of operations than pure packet generation. To support
these, we added 15 lexical entries, 10 predicates, and 8 function handlers.
2.6.5 Disambiguation
Revising a specification inevitably requires some degree of manual inspection and disambiguation. sage
makes this systematic: it identifies and fixes ambiguities when it can, alerts specification authors or devel-
opers when it cannot, and can help iteratively verify re-written parts of the specification.
Ambiguous sentences. When we began to analyze RFC 792 with sage, we immediately found many of the ambiguities we highlighted throughout this chapter; these result in more than one logical form even after manual disambiguation.
‡‡ This is common across protocols: for example, TCP keeps track of protocol state regarding ACK reception.
Category            Example                                                       Count
More than 1 LF      "To form an echo reply message, the source and destination
                    addresses are simply reversed, the type code changed to 0,
                    and the checksum recomputed."                                 4
0 LF                "Address of the gateway to which traffic for the network
                    specified in the internet destination network field of the
                    original datagram's data should be sent."                     1
Imprecise sentence  "If code = 0, an identifier to aid in matching echos and
                    replies, may be zero."                                        6
Table 2.5: Examples of categorized rewritten text.
[Figure content: three plots, (a) ICMP, (b) IGMP, and (c) BFD, each showing the max, avg, and min number of logical forms per ambiguous sentence as the checks are applied in sequence: Base, Type, Arg. Order, Pred. Order, Distrib., Assoc.]
Figure 2.5: Number of LFs after Inconsistency Checks on ICMP/IGMP/BFD text: for each ambiguous sentence, sequentially executing checks on LFs (Base) reduces inconsistencies; after the last Associativity check, the final output is a single LF.
We also encountered ostensibly disambiguated text that yields zero logical forms; this is caused by
incomplete sentences. For example, “If code = 0, identifies the octet where an error was detected” fails
CCG parsing due to lack of subject in the sentence, and indeed it may not be parseable for a human
lacking context regarding the referent. Such sentence fragments require human guesswork, but, as we
have observed in §2.4, we can leverage structural context in the RFC in cases where the referent of these
sentences is a field name. In these cases, sage is able to correctly parse the sentence by supplying the
parser with the subject.
Among 87 instances in RFC 792, we found 4 that result in more than 1 logical form and 1 that results in 0 logical forms (Table 2.5). We rewrote these 5 ambiguous sentences (of which only 3 are unique) to enable automated protocol generation. These ambiguous sentences were found after sage had applied its checks (§2.4.2); these are in a sense true ambiguities in the ICMP RFC. In sage, we require the user to revise
[Figure content: per-check bars showing the average number of LFs filtered per ambiguous sentence (Type: 4.23, Argument Ordering: 4.92, Predicate Ordering: 2.26, Distributivity: 0.23) and the number of affected sentences (Type: 18, Argument Ordering: 7, Predicate Ordering: 15, Distributivity: 5).]
Figure 2.6: Effect of individual disambiguation checks on RFC 792. Left: average number of LFs filtered by the check per ambiguous sentence, with standard error. Right: number of ambiguous sentences affected out of 42 total.
such sentences, according to the feedback loop as shown in Figure 2.2. sage keeps the resulting LFs from
an ambiguous sentence after applying the disambiguation checks; comparing these LFs can help users
identify where the ambiguity lies, thus guiding their revisions. In our end-to-end experiments (§2.6.2), we
evaluated sage using the modified RFC with these ambiguities fixed.
Under-specified behavior. sage can also discover under-specified behavior through unit testing; gen-
erated code can be applied to unit tests to see if the protocol implementation is complete. In this process,
we discovered 6 sentences that are variants of this sentence: “If code = 0, an identifier to aid in matching
echos and replies, may be zero”. This sentence does not specify whether the sender or the receiver or both
can (potentially) set the identifier. The correct behavior is only for the sender to follow this instruction; a
sender may generate a non-zero identifier, and the receiver should set the identifier to be zero in the reply.
Not doing so results in non-interoperability with Linux's ping implementation.
Efficacy of logical form winnowing. sage winnows logical forms so it can automatically disambiguate
text when possible, reducing manual labor in disambiguation. To show why winnowing is necessary, and
how effective each of its checks can be, we collect text fragments that could lead to multiple logical forms,
and calculate how many are generated before and after we perform inconsistency checks along with the
isomorphism check. We show the extent to which each check is effective in reducing logical forms: in
Figure 2.5a, the max line shows the description that leads to the highest count of generated logical forms
and shows how the value goes down to one after all checks are completed. Similarly, the min line represents
Sentence                                                            Label  #LFs
The 'address' of the 'source' in an 'echo message' will be the
'destination' of the 'echo reply' 'message'.                        Poor   16
The 'address' of the 'source' in an 'echo message' will be the
'destination' of the 'echo reply message'.                          Good   6
Table 2.6: Comparison of the number of logical forms (LFs) between good and poor noun phrase labels.
the situation for the text that generates the fewest logical forms before applying checks. Between the min
and max lines, we also show the average trend among all sentences.
Figure 2.5a shows that all sentences resulted in 2–46 LFs, but sage's winnowing reduces this to 1 (after
human-in-the-loop rewriting of true ambiguities). Of these, type, argument ordering and the associativity
checks are the most effective. We apply the same analysis to IGMP (Figure 2.5b). In IGMP, the distributivity
check is also important. This analysis shows the cumulative effect of applying checks in the order shown in
the figure. We also apply the same analysis to BFD state management sentences (Figure 2.5c). We discover
some longer sentences could result in up to 56 LFs.
A more direct way to understand the efficacy of checks is shown in Figure 2.6 (for ICMP). To generate
this figure, for each sentence, we apply only one check on the base set of logical forms and measure how
many LFs the check can reduce. The graphs show the mean and standard deviation of this number across
sentences, and the number of sentences to which a check applies. For ICMP, as before, type and predicate
ordering checks reduced LFs for the largest number of sentences, but argument ordering reduced the most
logical forms. For IGMP (omitted for brevity), the distributivity checks were also effective, reducing one
LF every 2 sentences.
Figure 2.5 does not include NTP; for the parts of this RFC that sage analyzes, the base semantic parser produces at most 2 LFs (after adding a small number of lexical entries and checks; §2.6.3), and the additional checks winnow these down to 1 LF.
                         Increase  Decrease  Zero
Domain-specific Dict.    17        0         0
Noun-phrase Labeling     0         8         54
Table 2.7: Effect of disabling the domain-specific dictionary and noun-phrase labeling on the number of logical forms.
Importance of Noun Phrase Labeling. sage requires careful labeling of noun-phrases using SpaCy
based on a domain-specific dictionary (§2.3). This is an important step that can significantly reduce the
number of LFs for a sentence. To understand why, consider the example in Table 2.6, which shows two
different noun-phrase labels, which differ in the way sage labels the fragment “echo reply message”. When
the entire fragment is not labeled as a single noun phrase, CCG outputs many more logical forms, making
it harder to disambiguate the sentence. In the limit, when sage does not use careful noun phrase labeling,
CCG is unable to parse some sentences at all (resulting in 0 LFs).
Table 2.7 quantifies the importance of these components. Removing the domain-specific dictionary
increases the number of logical forms (before winnowing) for 17 of the 87 sentences in the ICMP RFC.
Completely removing noun-phrase labeling using SpaCy has more serious consequences: 54 sentences
result in 0 LF. Eight other sentences result in fewer LFs, but these reduce to 0 after winnowing.
                          IPv4  TCP  UDP  ICMP  NTP  OSPF2  BGP4  RTP  BFD
♦ Packet Format           x x x x x x x x x
♦ Interoperation          x x x x x x x x
♦ Pseudo Code             x x x x x x x x x
+ State/Session Mngmt.    x x x x x
Comm. Patterns            x x x x x x
Architecture              x x x
Table 2.8: Conceptual components in RFCs. sage supports components marked with ♦ (fully) and + (partially).
                          IPv4  TCP  UDP  ICMP  NTP  OSPF2  BGP4  RTP  BFD
♦ Header Diagram          x x x x x x x x x
♦ Listing                 x x x x x x x x x
Table                     x x x x x x x
Algorithm Description     x x x x x x
Other Figures             x x x x x
Seq./Comm. Diagram        x x x x x
State Machine Diagram     x x
Table 2.9: Syntactic components in RFCs. sage supports parsing the syntax of those marked with ♦ (fully).
2.7 sage Limitations
While sage takes a significant step toward automated specification processing, much work remains.
Specification components. To understand this, we have manually inspected several protocol specifica-
tions and categorized components of specifications into two categories: syntactic and conceptual. Concep-
tual components (Table 2.8) describe protocol structure and behavior: these include header field semantic
descriptions, specification of sender and receiver behavior, who should communicate with whom, how
sessions should be managed, and how protocol implementations should be architected.
RFC authors augment conceptual text with syntactic components (Table 2.9). These include forms that
provide better understanding of a given idea (e.g., header diagrams, tables, state machine descriptions,
sentence: The timeout procedure is called in client mode and symmetric mode when the peer timer reaches the value of the timer threshold variable.
code:
if (peer.timer >= peer.threshold) {
    if (symmetric_mode || client_mode) {
        timeout_procedure();
    }
}
Table 2.10: NTP peer variable sentence and resulting code.
Nested code
  Original: If the Your Discriminator field is nonzero, it MUST be used to select the session with which this BFD packet is associated. If no session is found, the packet MUST be discarded.
  Rewritten: If the Your Discriminator field is nonzero, it MUST be used to select the session with which this BFD packet is associated. If the Your Discriminator field is nonzero and no session is found, the packet MUST be discarded.
Rephrasing
  Original: If bfd.RemoteDemandMode is 1, bfd.SessionState is Up, and bfd.RemoteSessionState is Up, Demand mode is active on the remote system and the local system MUST cease the periodic transmission of BFD Control packets.
  Rewritten: If bfd.RemoteDemandMode is 1, bfd.SessionState is Up, and bfd.RemoteSessionState is Up, the local system MUST cease the periodic transmission of BFD Control packets.
Table 2.11: Challenging BFD state management sentences.
communication diagrams, and algorithm descriptions). sage includes support for two of these elements;
adding support for others is not conceptually difficult, but may require significant programming effort.
Conceptual components may require significant additional research. Most popular standards have
many, if not all, of these elements. sage supports parsing of 3 of the 6 conceptual elements in Table 2.8,
for ICMP and parts of IGMP, NTP, and BFD. Our results (§2.6.2) show that extending these elements to
other protocols can, in some cases, require marginal extensions at each step. In addition, sage is already
able to parse state management for some protocols. However, much work remains to achieve complete
generality, of which state and session management is a significant piece.
BFD state management. When we performed CCG parsing and code generation on state management sentences, we found two types of sentences that could not be parsed correctly (Table 2.11). Both of these sentence types reveal limitations in the underlying NLP approach we use.
The CCG parser treats each sentence independently, but the first example in Table 2.11 illustrates dependencies across sentences. Specifically, sage must infer that the reference to no session in the second sentence must be matched to the session in the first sentence. This is an instance of the general problem of co-reference resolution [38], which can resolve identical noun phrases across sentences. To our knowledge, semantic parsers cannot yet resolve such references. To get sage to parse the text, we rewrote the second sentence to clarify the co-reference, as shown in Table 2.11.
The second sentence contains three conditionals, followed by a non-actionable fragment that rephrases one of the conditionals. Specifically, the first condition, if bfd.RemoteDemandMode is 1, is rephrased, in English, immediately afterwards (Demand mode is active on the remote system). To our knowledge, current NLP techniques cannot easily identify rephrased sentence fragments. sage relies on human annotation to identify this fragment as non-actionable; after removing the fragment, it is able to generate code correctly for this sentence.
NTP state management. The NTP RFC has complex sentences on maintaining peer and system variables, to decide when each procedure should be called and when variables should be updated. One example sentence, shown in Table 2.10, concerns when to trigger a timeout. sage is able to parse the sentence into an LF and turn it into a code snippet. However, NTP requires more complex co-reference resolution, as other protocols may too [43, 38]: in NTP, context for state management is spread throughout the RFC, and sage will need to associate these conceptual references. For instance, the word “and” in the example (Table 2.10) could be equivalent to a logical AND or a logical OR operator depending on whether symmetric mode and client mode are mutually exclusive. A separate section clarifies that the correct semantics is OR.
Reducing Human Effort. An important direction for future work is to minimize the manual effort cur-
rently required for disambiguation and code generation. Our winnowing reduces the number of instances
where users have to supply new lexical entries or checks (§2.4.2); we cannot quantify the number of such
new entries required for RFC text we have yet to examine, but expect that it will decrease over time as
more protocols are supported and more entries are in sage’s entry database. We also cannot state defini-
tively the generality of our current code generation approach. When users have to intervene, sage re-
duces cognitive load for the user by suggesting possible lexical entry additions. Future work will need to
explore similar usability enhancements for other human input tasks: adding new predicate checks, spec-
ifying cross-references (references to other protocols in a specification), and identifying non-actionable
sentences. Future work can also explore tools that automate some of these steps entirely (e.g., identifying
cross-references, or identifying non-actionable sentences) as well as techniques that improve the readabil-
ity of generated code. Finally, we have attempted to make sage’s auto-generated code clear by ensuring
that we adopt naming conventions for variables from RFCs and automatically emit context (i.e., add an
original sentence from an RFC as a comment) for each snippet of generated code. Future work can explore
how to auto-generate truly elegant code.
2.8 Conclusions
This chapter describes sage, which introduces semi-automated protocol processing across multiple protocol specifications. sage includes domain-specific extensions to semantic parsing, automatically discovers ambiguities, and enables disambiguation; sage can also convert specifications to code. Future work can extend sage to parse more specification elements, and devise better methods to involve humans in the loop to detect and fix ambiguities and guide the search for bugs.
Chapter 3
A Compact Protocol Specification Configuration: Unambiguous English Specification Text Generation and Executable Code Generation
3.1 Introduction
Natural language specifications are well known to be ambiguous and can lead to buggy or erroneous implementations. To reduce the chance of ambiguity, such specifications undergo a variety of human-supervised processes to ensure that the conveyed message is less likely to be misinterpreted or mis-implemented. These processes involve significant labor and manual proofreading to disambiguate each sentence and to clarify stated protocol behavior. In a recent study [109], a semi-automated protocol disambiguation system was proposed that effectively identifies ambiguous sentences in a protocol and requests the user to rewrite each sentence until it can be interpreted by a unique semantic representation. While this reduces the guesswork of interpreting a specification and guarantees the final quality of the specification, it does not prevent the protocol author from writing ambiguous sentences in the first place.
Writing English sentences that are simultaneously unambiguous, precise, and readable is challenging, and specifying protocols in other forms can be more approachable. Two threads of studies are often discussed. One thread [16, 14] uses formal specification languages to fully achieve precision in protocol details, but the readability of the language itself can be poor for a first-time learner of the formal specification language. The other thread considers diverse intermediate representations that capture protocol behaviors and preserve a certain level of readability, but the readability of these representations is difficult to measure.
In this work, to achieve both precision in protocol details and readability of the language, we take the approach of using a mathematical definition to specify the specification configuration, and we build sentence to automatically turn the configuration into unambiguous, limited English sentences. Our configuration makes it easy for authors to specify protocol behaviors and straightforward to translate them into English descriptions. Revisiting the question of which elements inside a protocol are significant, we use only those elements in the mathematical definition and configuration to represent protocol behaviors. A compact configuration allows us to specify only the necessary changes for any protocol operation. By connecting the status of protocol elements at different moments (referred to as nodes in what follows), we are able to express a series of delicate protocol operations and subroutines. Conceptually, we can relate the configuration to a control graph of significant elements, including the timer, last received packet, output packet, and multiple program states.
Contribution. In this chapter, we present sentence, a mathematics-based compact protocol specification configuration that enables the generation of unambiguous, readable English specification text and the corresponding executable protocol code. sentence allows specification authors to input element values and to specify the connections between them; parses the corresponding graph in a specific topological order, which aligns with how humans usually read and comprehend a piece of text, to generate unambiguous textual protocol sentences and paragraphs in a restricted English sentence structure; and converts the graph into low-level executable code.
After receiving the configuration of elements at each step, a significant operation of sentence is to analyze whether the configuration is eligible to be transformed into English text and executable code. sentence can also identify ambiguities at this phase. We present four types of analysis checks to confirm whether a protocol execution can violate any restrictions (e.g., no node connects to a specific node, which means the status of those protocol elements can never be reached in practice) and/or whether a protocol behavior can be ambiguous (e.g., an output packet variable is specified twice without clarifying whether the two mentions are identical or different, which can lead to at least two different protocol implementations). After all checks are processed, we consider the protocol specification precise and unambiguous for text generation. We demonstrate that sentence is able to identify an ambiguous ICMP Echo message (shown to be ambiguous in sage) with an intentionally poorly-designed configuration that follows the original RFC 792.
For text generation, we present a number of guidelines that humans commonly follow when reading a piece of text, and customize a topological graph traversal that aligns with these readability guidelines to generate text. When transforming the specification configuration into English sentences, we use limited English sentence structures to avoid creating ambiguities. We use a well-known online typing assistant, Grammarly, to verify the readability level. Grammarly shows that none of the sentence-generated English texts require readers to reread any sentences to understand them.
For code generation, we compile the specification graph into executable C++ code and use the automatically generated code to interoperate with third-party code. We are able to generate ICMP code that interoperates with the Linux ping program.
Figure 3.1: Change of Model.
3.2 Overview
Separation of specifying language and specification presentation. Maintaining both the readability of the specification itself and the ease of specifying protocol behavior is challenging. Several methods have been proposed to handle this issue, but it remains difficult to justify a language that is more readable than English and simultaneously unambiguous. We observe that we might not have to deal with both readability and ease of specifying protocol behaviors at the same time. Ideally, we would like to allow protocol authors to specify protocol behaviors in the language/format they are comfortable with. Simultaneously, we would like to present the specification to the broader community using the most readable language for humans by far (i.e., English sentences). However, it is not necessary to use the same language for specifying protocol behaviors and for specification presentation, as long as we are able to enable automatic conversion between the two languages.
Expectations of a candidate language. Based on the above discussion, we are motivated to decide what language/format can be used by authors to specify protocol features, and then to construct a conversion system that turns such a language/format into readable English sentences. Hence, we first reached out to some specification authors and learned that they would like a language that is intuitive to engineers and makes it easy to explain functions/logic. In addition, the language/representation itself should be expressive enough to cover most or all existing and new protocols. Last but not least, it should leave little or no room for ambiguity in specifying behaviors.
Why not use formal specification languages? Indeed, formal specification languages can be trusted to be unambiguous and can specify various kinds of protocols. However, there is no consensus within the protocol specification community on which formal specification language should be used. Each engineer might choose a different formal specification language, according to their expertise, to design their protocols. From the perspective of the whole specification community, if the community had to agree on one formal specification language, the pros and cons of these languages would have to be thoroughly discussed. In some cases, there is no method to relate a proof in one language to a proof of the same feature in another language. That is, it can be unclear how two (or more) languages are equivalent or different, and it is therefore challenging to select a particular formal specification language as the standard medium. Besides, we would like a language/representation that is even more accessible to most engineers, so that the requirement to be able to design in a particular formal specification language is relaxed.
Our approach. In this work, we allow authors to use a mathematical definition to specify the configuration, which is easy to write and capable of capturing the critical elements of a general protocol.
Figure 3.2: System Model.
sentence (Figure 3.2) takes the specification configuration as its input and parses it with multiple verification methods to examine whether sentence can use the specification configuration to generate corresponding readable English sentence descriptions and executable protocol code. In the following sections, we explain the design of sentence in terms of four major components: (a) user-specified configuration, (b) configuration verification, (c) text generation, and (d) code generation.
3.3 Specification Configuration
From our observation and analysis of general protocols, we first use a mathematical definition, in which each state is termed a 'node' in what follows, to describe a general protocol. Then, according to the mathematical definition, we present a protocol as a graph of nodes and show how expressive this mathematical definition is by relating it to a well-known protocol language, Esterel [16].
3.3.1 Definition
A general protocol behavior can be expressed with a graph where each individual node is a 6-tuple (Q, B, C, D, E, δ):
1. Q is the set of nodes
2. B is the timer value
3. C is the last received packet
4. D is the output packet
5. E is the program states
6. δ : Bϵ × Cϵ × Eϵ → P(Bϵ × Cϵ × Dϵ × Eϵ) is a subroutine process
We use the above mathematical definition (Definition 3.3.1) to represent a protocol. For general protocols, we particularly care about the snapshot of the timer value, the last received packet, and the output packet at any given point of the protocol execution. A timer is either a software or hardware clock that keeps track of the elapsed time from a predefined time point. The last received packet represents an event of receiving a packet and storing it in a buffer for subsequent operations to access; its value remains the same until another packet-receiving event occurs. The output packet represents an immediate action of sending a packet out of the device. We capture the rest of the memory that a protocol might touch as program states. The program states can describe diverse aspects of the protocol, including connection state, protocol state, and local variables. We consider any transition between two snapshots of the mentioned values to be a subroutine process that the protocol is executing. The above mathematical definition does not restrict a subroutine process to a single operation or a batch of operations.
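For illustration, one could render the per-node items of this definition as a small data structure (a sketch; sentence's internal representation may differ):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Snapshot of the four per-node items; the subroutine process
    (delta) is captured by the arrows between nodes."""
    timer: str = ""                   # B: timer value
    last_received_packet: str = ""    # C: last received packet
    output_packet: str = ""           # D: output packet
    program_states: List[str] = field(default_factory=list)  # E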
3.3.2 Presentation
{
  "arrows": [
    "start -> x1",
    "x1 -> x2"
  ],
  "nodes": {
    "start": {
      "timer": "",
      "last received packet": "",
      "output packet": "",
      "program states": [""]
    },
    "x1": {
      "timer": "",
      "last received packet": "pkt",
      "output packet": "",
      "program states": [
        "pkt.version == 1",
        "local_count = 1"
      ]
    },
    "x2": {
      "timer": "",
      "last received packet": "",
      "output packet": "outpkt",
      "program states": [
        "outpkt.recv = local_count"
      ]
    }
  }
}
Listing 3.1: Three-node graph protocol example
Based on the mathematical definition of a general protocol, sentence considers four items: (1) the timer value, (2) the last received packet content, (3) the ready-to-send packet, and (4) other program states. Each item is expressible as a pair of an attribute name and corresponding value(s). This suggests leveraging the JSON format, which is commonly used to represent name/value pairs, to store the four items. The JSON format is handy both for sentence to parse and for specification authors to write. The four items at any given point in the protocol execution are termed a specification node (or simply 'node') in the specification configuration. To specify the transitions among different nodes, the authors specify arrows that connect the related nodes. Therefore, the whole configuration essentially forms a graph of how the protocol executes. An example of a three-node graph is illustrated in Listing 3.1.
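A sketch of how such a configuration could be loaded into a node table and adjacency lists (illustrative code, not sentence's actual parser):

import json

def load_configuration(path):
    """Read a configuration such as Listing 3.1 and return the node
    definitions plus an adjacency-list view of the arrows."""
    with open(path) as f:
        config = json.load(f)
    edges = {}
    for arrow in config["arrows"]:
        src, dst = (end.strip() for end in arrow.split("->"))
        edges.setdefault(src, []).append(dst)
    return config["nodes"], edges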
3.3.3 Coverage
Esterel is an imperative synchronous programming language used to specify reactive systems. Esterel's syntax has 16 elements [28]; among these, 11 statements are kernel statements [9] and the others are derived statements. We focused on whether sentence is able to express the 11 kernel statements. We relate sentence's 6-tuple representation to Esterel's kernel statements (Table 3.1) to demonstrate how it can be used to represent the systems that are expressible with Esterel.
Kernel Esterel statements, their operational semantics, and their sentence representations:

nothing
  Semantics: terminates immediately with no other effect.
  sentence: (Q, ϵ, ϵ, NULL, ϵ) → (Q, ϵ, ϵ, NULL, ϵ_end)

pause
  Semantics: blocks control flow in the current cycle for resumption in the next cycle.
  sentence: (Q, t_x, ϵ, NULL, ϵ) → (Q, t_{x+1}, ϵ, NULL, ϵ)

p; q
  Semantics: runs p until it terminates and then, in the same reaction, starts q.
  sentence: (Q, ϵ_p, ϵ_p, NULL, ϵ_p) → (Q, ϵ_q, ϵ_q, NULL, ϵ_q)

p || q
  Semantics: runs p and q in parallel.
  sentence: (Q_{p+q}, ϵ_{p+q}, ϵ_{p+q}, NULL, ϵ_{p+q})

loop p end
  Semantics: restarts the body p as soon as it terminates. Every path through the loop body must contain at least one pause statement to avoid unbounded looping within a single reaction.
  sentence: (Q, ϵ, ϵ, NULL, ϵ_p) → (Q, ϵ, ϵ, NULL, ϵ_0) → (Q, ϵ, ϵ, NULL, ϵ_p)

signal S in p end
  Semantics: declares a local signal.
  sentence: (Q_p, ϵ_p, ϵ_p, NULL, S)

emit S
  Semantics: makes signal S present in the current instant. A signal is absent unless it is emitted.
  sentence: (Q_any, ϵ, ϵ, ϵ, S); (Q_any, ϵ_any, ϵ_any, ϵ_any, ∀x(x ∈ E ∧ x ≠ S))

present S then p else q
  Semantics: if signal S is present in the current instant, immediately run p; otherwise run q.
  sentence: (Q, ϵ_any, ϵ_any, ϵ_any, S) → (Q, ϵ_p, ϵ_p, ϵ_p, ϵ_p); (Q_any, ϵ_any, ϵ_any, ϵ_any, ∀x(x ∈ E ∧ x ≠ S)) → (Q, ϵ_q, ϵ_q, ϵ_q, ϵ_q)

suspend p when S
  Semantics: suspends the execution of the body in instants where S is present.
  sentence: (Q, ϵ, ϵ, ϵ, ϵ) → (Q, NULL, NULL, NULL, S)

trap T in p end
  Semantics: declares a labeled escape block.
  sentence: (Q, ϵ_x, ϵ_x, ϵ_x, T) → (Q, ϵ_y, ϵ_y, ϵ_y, ϵ_y)

exit T
  Semantics: jumps to the end of the innermost T-labeled escape block.
  sentence: (Q, ϵ_x, ϵ_x, ϵ_x, ϵ_x) → (Q, ϵ_y, ϵ_y, ϵ_y, T)

Table 3.1: Mathematical definition coverage related to Esterel
3.4 Verification
In this section, we describe how sentence leverages four types of checks to filter out any configurations
that cannot successfully generate corresponding English specification and executable protocol code.
3.4.1 Failed Configurations Could Exist
sentence is designed to automatically generate a corresponding readable English specification and C++ executable protocol code. Therefore, a configuration itself must ensure that the required elements exist, and that the algorithms used in the later phases can proceed without raising errors. For any missing element or unreasonable design, sentence should alert the authors to how the specification configuration could fail this goal. We summarize four types of situations that can lead to failed text generation and/or code generation.
Mismatched attributes and unknown domain-specific terms. As in any human-composed artifact, including sentence's configuration, there can be typos or mismatches in defining or linking nodes. In those situations, sentence alerts the authors if any referenced node cannot be matched with a node definition. Without a match, the text generator will not be able to translate the node into a reasonable sentence in the later phase, and the code generator cannot produce executable lines of code. In addition, sentence must understand how to parse a packet, so authors should also provide a pre-registered packet format for sentence to read. In other words, if a received or sent packet's format is not pre-registered in sentence, an error will be raised because sentence cannot access certain packet fields, which would lead to unreasonable generated English sentences and non-executable code.
Unreachable node. In some cases, as in our own experience with configuration generation, it is possible to mis-define a few arrows in the protocol graph. As a result, a protocol can never reach a node, or the protocol idles at a point without knowing which operations should be done next. From a graph view, sentence needs to analyze whether the graph contains only one connected component. From the text generator's perspective, a node or an isolated connected component that can never take effect is useless, and a specification should not mention such operations. From the code generator's viewpoint, it is useless to generate subroutines that will never be called; the protocol can be optimized further without those subroutines.
Missing logic / existence of ambiguities. Unlike unreachable-node cases, which consider the relations among node connections, missing-logic cases consider the relations between the four attributes of the nodes. For example, suppose a node has two successor nodes, where only one successor specifies a conditional check on a local variable A (e.g., A == 0). This specification graph creates confusion about which operations should be processed next when the protocol executes to that node: should it execute operations from both successor nodes, or should it select one of the paths to proceed? From the text generation view, the generated text might create ambiguities when describing the protocol behavior. From the code generation view, it could create multiple versions of code that cannot interoperate with each other. In either case, sentence should alert the author to revise the protocol so that no such ambiguous design exists.
Conflicting nodes. Conflicting nodes represent cases in which an earlier-executed node has operations that conflict with a later-executed node. For example, the earlier node may assign local variable A to 0, but the following node specifies operations to perform if A equals 1, which can never happen. Conflicting nodes can be hard to discover when the protocol is complicated and involves many nodes. For text generation, it might not be obvious to readers that the design is conflicting without experimenting with the real protocol execution. Similarly, for code generation, it could be difficult to discover that a certain code block serves no meaningful purpose, and it might lead to crashes.
3.4.2 Filtering Checks
We define the following checks/tests to alert authors to potential sources of errors so that the design/debug process can be carried out more effectively.
Validity checks of attribute values/entries. For each node specified in the configuration, sentence checks whether it matches at least one entity in any arrow specification. If no entity is found, sentence reports this to the user interface (i.e., the specification authors). Similarly, when an entity in an arrow cannot be matched with a node definition, sentence warns that there is no node definition for the entity. For all variables used in the attribute values, sentence pre-processes the operations in program states to isolate the variable names, and memorizes the variable names for the last received packet and output packet, if any. For any access to packet fields, sentence maintains a dictionary of registered packet field names and verifies whether the access is legitimate. For an unknown field access, sentence alerts the specification authors that the mentioned field is not registered. For other domain-specific or self-customized functions, sentence also maintains a dictionary for authors to register them. All other unrecognized variables are handled as local variables.
DFS checks. sentence's goal is to determine whether the graph can be traversed to cover all nodes from its starting node, which represents the start of protocol execution. A straightforward method is to leverage the depth-first search (DFS) algorithm, with sentence starting the search from the start node. Since sentence is aware of all registered nodes, it is straightforward to compare whether the set of DFS-discovered nodes is the same as the set of registered nodes. When there is a discrepancy, sentence flags the nodes not visited in the DFS result.
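A minimal sketch of this reachability check, assuming the node table and adjacency lists produced by the loader sketched in §3.3.2:

def unreachable_nodes(nodes, edges, start="start"):
    """Depth-first search from the start node; any registered node
    not visited is flagged as unreachable."""
    visited, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n in visited:
            continue
        visited.add(n)
        stack.extend(edges.get(n, []))
    return set(nodes) - visited       # unreachable nodes, if any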
Missing logic checks Iterating over each node, sentence checks and memorizes whether the node has multiple successor nodes. For each such node, sentence explicitly compares the program states, because all conditional checks on variables are specified in that attribute. For each conditional check, sentence ensures the conditional checks are complementary among all the successor nodes. If any node fails the check, sentence sends feedback to the author and lists the involved successor nodes.
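General complementarity checking amounts to deciding whether the successor conditions partition the variable's value space, which in full generality calls for an expression parser or a solver. The narrow C++ sketch below illustrates only the common two-successor case, where one branch tests var == v and the other must test var != v; the string format "var OP value" is an assumption of this sketch.

#include <iostream>
#include <string>

struct Cond { std::string var, op, val; };

// Split "var OP value" on its two spaces; no error handling, by assumption.
Cond parse(const std::string& s) {
    auto a = s.find(' '), b = s.rfind(' ');
    return {s.substr(0, a), s.substr(a + 1, b - a - 1), s.substr(b + 1)};
}

// Two conditions are complementary when they test the same variable and
// value, with == on one side and != on the other.
bool complementary(const std::string& c1, const std::string& c2) {
    Cond x = parse(c1), y = parse(c2);
    if (x.var != y.var || x.val != y.val) return false;
    return (x.op == "==" && y.op == "!=") || (x.op == "!=" && y.op == "==");
}

int main() {
    std::cout << complementary("A == 0", "A != 0") << "\n";  // 1: covered
    std::cout << complementary("A == 0", "B == 1") << "\n";  // 0: missing logic
}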
Unit-test checks sentence first extracts all the variables involved in the specification configuration and allows the user to specify a set of values for them. Then, sentence traverses the graph based on the given values to find whether any conflict can happen or whether the traversal terminates before its standard end of execution. For troublesome cases, sentence shows the variable values used, so that authors can examine the protocol design. In addition, if the author supplies an expected result, sentence can compare it against the executed result.
Discussion: what about model checking? Model checking is a verification technique that determines whether a formal model of a system satisfies given desired properties, and there are abundant methods for diverse scenarios. We agree that sentence can benefit from model-checking techniques to filter out configurations that cannot satisfy the author's expected properties. However, we leave the exploration of model checking on sentence to future study, for two reasons. One is that authors have the flexibility to define a finer protocol design and may each have individual expected properties to satisfy; sentence is designed to introduce a system structure that can be extended further, rather than imposing specific checks that could limit protocol design at the system-structure level. The other is that sentence's verification phase mainly focuses on which failed configurations could hinder sentence from generating readable English protocol descriptions and executable protocol code; the goal of model checking is only partially aligned with this purpose.
Index   Readability goal
1       Reading order shall follow execution flow.
2       Every operation is specified once, and all mentions of an operation should appear before the statement of the operation.
3       The narrative should clearly specify whether any packet field is from the incoming or the outgoing packet.
4       The narrative should avoid awkwardly assembled sentences.

Table 3.2: Readability goals for English specifications
3.5 Text Generation
In this section, sentence uses the text generator component to extract information from the configuration file and, based on pre-determined readability guidelines, present the semantics using a limited set of English sentence structures.
3.5.1 Readability guidelines
To make a specification easy to read, we list four goals (Table 3.2) that sentence is designed to achieve. These four guidelines are not, and will never be, a complete list of English writing advice, but they are a reasonable set of guidelines for an easy-to-read article. These readability guidelines can be extended further, in which case sentence's text generator should change accordingly, but we focus only on the four listed goals in this paper.
Narrative order While there are multiple ways to describe a topic, chronological order is a more comfortable way to describe how a protocol execution should work. A flashback narrative could lead to confusion about how certain values, if any, are acquired.
Mentions Repetitive words or sentences are often unwelcome. Therefore, if a protocol has multiple callers of a subroutine, sentence shall avoid repeating the same English description of the callee subroutine.
Domain-specific terms We have seen a number of protocols that use similarly structured packets for both the request and the reply ends. That means a certain packet field name could exist in both the sender and the receiver packets.
{
  if (pkt.a == 0) {
    if (pkt.b == 1) {
      counter = 1;
    }
  }
}
Code example

Good (desired English text): "If sender packet field a equals to 0 and sender packet field b equals to 1, local system assigns 1 to counter."

Poor text: "If sender packet field a equals to 0, if sender packet field b equals to 1, local system assigns 1 to counter"

Figure 3.3: Example of good and poor quality text.
To minimize the chance of creating confusion, the specification should clearly state whether a mentioned packet field belongs to the receiver packet or the sender packet.
Awkwardly assembled sentences We noticed that in some cases, a non-optimized piece of code can be syntactically correct in the programming language yet cannot be presented the same way in English. For example, a nested if-statement like the one in Figure 3.3 shall not produce a sentence that separates the two conditions and concatenates them with a comma. That poorly assembled sentence is hard for humans to read even though the code execution is completely valid. For cases in which the conditions can be combined, the text generator should integrate the information and generate an easy-to-read sentence.
3.5.2 English sentence generation
sentence's text generator follows the process of Figure 3.4 to generate English sentences for the whole protocol execution. Given the configuration, the text generator parses it into its corresponding graph. With a proper graph traversal, matched with the correct context information, the text generator produces readable English sentences.
Context information Relating to the design in §3.4.2, sentence asks authors to register packet format information and leverages the resulting stored dictionary to provide context when generating text. When a packet field is mentioned in a node's program state attribute, the text generator should leverage this context to clearly specify which field is referenced in the last received packet or in an outgoing packet. For example, a mention of pkt.ttl == 0, together with the packet field dictionary and the knowledge that the last received packet is called pkt, can be translated into the English segment "when the time-to-live field in the received packet equals to 0".
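A small illustration of that lookup, with a hypothetical field dictionary; the pieces of "pkt.ttl == 0" are assumed to be split upstream, and the names are invented for this sketch.

#include <iostream>
#include <map>
#include <string>

int main() {
    // Registered packet-field names mapped to their readable descriptions.
    std::map<std::string, std::string> field_names = {
        {"ttl", "time-to-live"}, {"code", "code"}};

    // Parsed pieces of the mention "pkt.ttl == 0"; "pkt" is registered
    // as the last received packet.
    std::string field = "ttl", op = "equals to", value = "0";

    std::cout << "when the " << field_names[field]
              << " field in the received packet " << op << " " << value << "\n";
}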
Node-level text generation Node-level text generation is the process of comparing the attribute values of two connected nodes to identify the changes and generate the corresponding sentences. For example, suppose predecessor node A has no value for the last-received-packet attribute, while its successor node B has value pkt for that attribute. When comparing the two nodes, sentence will generate the sentence "Local system receives packet pkt." Building on this example, if node B has a successor node C and C also has value pkt for the attribute, sentence will not generate any sentence, because from node B to node C they refer to the same last received packet and there is no new event of receiving another packet. In yet another example, node D has var_a == 0 and var_b = 1 in its program state attribute, and neither is present in node D's predecessor. sentence will generate the sentence "If var_a equals to 0, local system sets var_b to 1."
Information integration To reach goal #4 in Table 3.2, sentence iterates over all nodes and examines each node's relations with its connected nodes. If some nodes can be combined according to their attribute values, sentence reduces the number of nodes presented in the graph and leverages the information from the combined node to generate English sentences (e.g., the good-quality text in Figure 3.3).
Node label for multiple mentions When a node has multiple predecessors, the operations of this node could be repeatedly presented in the generated text if sentence directly compared attribute values between each predecessor and the node itself. However, such repetition is unwelcome, as discussed for goal #2 in Table 3.2. To handle such nodes, sentence iterates over all nodes before generating text, labels any node that has multiple predecessors, and also labels all the predecessors of such nodes. When sentence compares node attributes, it identifies the labels and adds extra explanation. For each labeled predecessor node, sentence adds an explanatory sentence at the end of the node-level text: "Please read section/paragraph/node XYZ (depending on the satisfying conditions)." The bracketed words are used when the predecessor itself has multiple successors. For a labeled node reachable from multiple predecessors, sentence generates its text only after all its predecessors' texts have been generated, and at the beginning of that node's text, sentence adds the corresponding section/paragraph/node label to indicate the start of the description.
Graph-level text generation When assembling the texts from all nodes, the order of each piece of text matters. To satisfy goals #1 and #2 in Table 3.2, sentence uses reverse postordering to put together all the sentences. Reverse postordering has the property of visiting a node before all its successors, and thus follows the protocol execution order.
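The ordering itself is standard: record each node at its depth-first finish time, then reverse the list. A brief sketch with an illustrative graph:

#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

using Graph = std::map<std::string, std::vector<std::string>>;

// Record a node after all its successors (postorder), then reverse.
void post(const std::string& n, const Graph& adj,
          std::set<std::string>& seen, std::vector<std::string>& order) {
    if (!seen.insert(n).second) return;
    if (auto it = adj.find(n); it != adj.end())
        for (const auto& s : it->second) post(s, adj, seen, order);
    order.push_back(n);
}

int main() {
    Graph adj = {{"A", {"B", "C"}}, {"B", {"D"}}, {"C", {"D"}}};
    std::set<std::string> seen;
    std::vector<std::string> order;
    post("A", adj, seen, order);
    std::reverse(order.begin(), order.end());   // reverse postorder
    for (const auto& n : order) std::cout << n << " ";  // A C B D
    std::cout << "\n";  // every node appears before all of its successors
}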
{
  "arrows": [
    "start -> x1",
    "x1 -> x2"
  ],
  "nodes": {
    "start": {
      "timer": "",
      "last received packet": "",
      "output packet": "",
      "program states": [""]
    },
    "x1": {
      "timer": "",
      "last received packet": "pkt",
      "output packet": "",
      "program states": [
        "pkt.version == 1",
        "local_count = 1"
      ]
    },
    "x2": {
      "timer": "",
      "last received packet": "",
      "output packet": "outpkt",
      "program states": [
        "outpkt.recv = local_count"
      ]
    }
  }
}
Configuration

start -> x1 -> x2
Parsed Graph

Local system receives packet pkt.
If version in the received packet is 1, local system sets local_count to 1.
Local system generates output packet outpkt.
Local system sets recv of the output packet to local_count.
Generated English Texts

Figure 3.4: Example of text generation.
before: A -> B -> C -> D
after:  A -> C -> D
Figure 3.5: Coalescence example.
3.6 Code Generation
In sentence, code generation is the process of leveraging information from the specification configuration to generate executable C++ code. Based on the specification configuration, sentence forms a directed graph by connecting nodes with arcs, which internally represent subroutine processes. By traversing the graph, sentence converts per-node information into executable code snippets, identifying whether sentence receives a new packet, sends a packet, or performs actions with or without specified conditions. Ultimately, sentence concatenates the code snippets to form the final executable C++ code.
3.6.1 Challenges
We confront two key challenges in code generation: one is to correctly coalesce nodes, and the other is to guarantee that the operational execution follows the specification.
Node Coalescence Node coalescence is an enhancement step that presents a protocol behavior in a graph with fewer nodes than the original graph formed from the specification configuration. sentence's specification configuration allows authors to flexibly define a dummy node, which contains no distinguished value for any attribute; a conditional node, which only represents the protocol satisfying a certain set of conditions; or an operational node, which states what actions are performed, with or without specified conditions. Without coalescing nodes, the code generator could generate redundant or blank lines of code, or easily generate syntactically erroneous code. In sentence, we particularly analyze when a conditional node can pass its condition set to its successor nodes. When a node can coalesce with its successor, sentence removes the node and adds links connecting its predecessor and its successors (Figure 3.5).
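As a toy illustration of the transformation (assumed structures, not sentence's implementation), coalescing a purely conditional node passes its condition set down to its successor before the node is spliced out:

#include <iostream>
#include <string>
#include <vector>

struct Node {
    std::string name;
    std::vector<std::string> conditions;  // empty for a dummy node
    std::vector<std::string> operations;  // empty for a conditional node
};

int main() {
    // before: A -> B -> C, where B is conditional and C is operational
    Node b{"B", {"pkt.version == 1"}, {}};
    Node c{"C", {}, {"local_count = 1"}};

    // coalesce: B's conditions migrate to C; B is removed and A links to C
    c.conditions.insert(c.conditions.end(),
                        b.conditions.begin(), b.conditions.end());
    std::cout << "after: A -> " << c.name << ", guarded by";
    for (const auto& cd : c.conditions) std::cout << " " << cd;
    std::cout << "\n";
}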
Execution: Graph Analysis Graph traversal order has an immediate effect on code execution; it is the critical part of placing per-node code snippets in the correct order within the overall protocol code. Besides traversal order, it is also important to consider whether a node will be traversed repeatedly while executing the protocol. If the operation of a node can be executed more than once, the protocol code should be shaped to reflect this property. This leads us to categorize graphs into two kinds, acyclic and cyclic, for better analysis.
Acyclic and cyclic graphs must be handled differently in code generation. For example, a protocol behavior involving a timer countdown that triggers a certain operation can form a cyclic graph, owing to its circling back to the conditional-check node to test whether the condition is met at a specific moment. One common implementation of such a timer countdown leverages a while-loop statement. On the other hand, if a graph is an acyclic line graph, the final code can directly stitch the per-node code together and does not involve any loop statement.
3.6.2 Graph to Code
Node Configuration to Code Snippet The attributes of a node are the timer, the last received packet, the outgoing packet, and the program states. To form a code snippet for a node: the timer attribute is treated as a set operation; the last-received-packet attribute triggers a function (if the attribute value is first observed in the graph) that receives data from a connected socket and stores it in a buffer named after the attribute value; the outgoing-packet attribute triggers a function that sends data from a buffer named after the attribute value; and the program states attribute can contain comparable statements and/or operational statements. Combining all this information, sentence first identifies whether the last-received-packet attribute is newly set and/or whether the program states attribute contains any comparable statement. When either holds, sentence prepares an if-else statement template, places the packet-reception function call and/or the comparable statement in the conditional part of the template, and then aggregates the remaining operations from the other attributes into the consequence part of the template.
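A compact sketch of this template assembly; the function name receive() and the overall structure are assumptions for illustration, not sentence's actual API.

#include <iostream>
#include <string>
#include <vector>

// Conditional pieces (a first packet reception, comparable statements)
// fill the `if` part; operational statements fill the consequence part.
std::string snippet(const std::vector<std::string>& conditions,
                    const std::vector<std::string>& operations) {
    std::string cond;
    for (std::size_t i = 0; i < conditions.size(); ++i)
        cond += (i ? " && " : "") + conditions[i];
    std::string body;
    for (const auto& op : operations) body += "  " + op + ";\n";
    if (cond.empty()) return body;        // purely operational node
    return "if (" + cond + ") {\n" + body + "}\n";
}

int main() {
    std::cout << snippet({"receive(sock, pkt)", "pkt.version == 1"},
                         {"local_count = 1"});
}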
    A
   / \
  B   C
   \ /
    D
Figure 3.6: Code Snippet Sort Order.
Code Snippet Order Once the per-node snippets are ready, sentence has to decide the order of code snippet placement, which must simultaneously consider serial sequences and parallel conditions. Figure 3.6 shows an example in which sentence is supposed to pick code piece A as the head of the final code, followed by B or C, and finally D as the end. This highlights that the expected order should place all the predecessors of a node before the node itself, and within any sub-line-graph, the order should follow the serial order from the beginning of that sub-line-graph. These requirements can be satisfied by applying reverse postorder traversal, which is one kind of topological sort, to retrieve the desired order of code snippets.
Code Stitching Given that the code snippet order guarantees that execution follows how the protocol is specified, it is equally important to consider how the code snippets are stitched together and what dependencies should be considered when forming the final code. Figure 3.7 is a line graph in which all nodes are conditional nodes. The execution order is straightforwardly A, B, C, and finally D. However, stitching the code together does not mean sentence simply concatenates the per-node code sequentially. We realize that B happens not only based on its own condition but also contingent on A having already happened. In other words, when sentence prepares the code snippet for A, it is expected to place its successors' code snippets in the consequence part of A's if-else statement. Due to this contingency, sentence leverages the postorder traversal order, instead of the reverse postorder traversal order, to process the leaf nodes of the protocol graph first and store the generated code in a dictionary. When sentence gradually traverses the graph back to the begin node, it is already aware of all the generated code snippets of the successor nodes and can attach them to the consequence part of each node's if-else condition.

A -> B -> C -> D
Figure 3.7: Code stitching of a line graph example.

A -> B -> C -> D -> E (plus a back edge forming a cycle)
Figure 3.8: Cyclic graph example.
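The sketch below shows this bottom-up nesting for the line graph of Figure 3.7; plain recursion stands in for the dictionary of stored snippets, and the graph and conditions are illustrative.

#include <iostream>
#include <map>
#include <string>
#include <vector>

std::map<std::string, std::vector<std::string>> adj = {
    {"A", {"B"}}, {"B", {"C"}}, {"C", {"D"}}, {"D", {}}};
std::map<std::string, std::string> cond = {
    {"A", "a"}, {"B", "b"}, {"C", "c"}, {"D", "d"}};

// Postorder stitching: successors are generated first, so each node can
// nest its successors' code inside its own consequence part.
std::string stitch(const std::string& n) {
    std::string inner;
    for (const auto& s : adj[n]) inner += stitch(s);
    return "if (" + cond[n] + ") { " + inner + "} ";
}

int main() {
    std::cout << stitch("A") << "\n";
    // prints: if (a) { if (b) { if (c) { if (d) { } } } }
}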
Cyclic graph and go-to operation A cyclic graph (e.g., Figure 3.8) contains at least one back edge, i.e., one graph cycle, which means the program can jump back from one node to another to process subroutines according to its design. The ability to jump to any node in a graph adds difficulty to code implementation: without further restrictions on the graph, sentence cannot easily determine the start point of a subroutine when a convoluted graph is presented, and it is therefore hard to implement the whole code with common loop structures. sentence leverages the goto statement to handle such feedback loops on top of the standard code stitching process. In implementation, sentence performs additional steps before stitching code together. It first identifies the nodes targeted by back edges and maintains a list of which nodes (i.e., goto nodes) should present goto labels at the beginning of their individual code snippets. Then, sentence inserts goto statements into the corresponding nodes that route to the goto nodes. If the predecessor of a goto node has multiple successor nodes, sentence also needs to copy the conditions of the goto node to correctly insert the goto statement in the program state attribute.
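A hypothetical fragment with the shape such generated code could take for a small timer cycle; the variable names are invented for illustration.

#include <iostream>

int main() {
    int timer = 3;
check_timer:                  // label on the goto node (head of the cycle)
    if (timer > 0) {
        timer = timer - 1;    // per-node operations along the cycle
        goto check_timer;     // back edge re-enters the conditional check
    }
    std::cout << "timer expired; protocol continues\n";
}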
3.7 Evaluation
Next, we evaluate how our methods enable us to reach our goals of finding failed protocol configurations, generating reasonably readable English text, and generating executable code that interoperates with other implementations.
3.7.1 Methodology
We explore a number of protocols and well-known or common features and implement their specification configurations. For each, we use the verification component to revise our specification until all checks pass, the text generator component to generate English sentences/paragraphs and test their readability, and the code generator to generate C++ code. An experiment with a selected protocol message demonstrates that the produced code is executable and interoperates with another implementer's code.
Selected protocols and features We implement ICMP [82], BFD [47], IGMP [26], and significant TCP [83] features, including congestion control, connection establishment and termination, flow control, and reliable transfer. For each protocol, Table 3.3 presents the number of nodes in the specification configuration, the number of generated English words, and the lines of code. The English word count is based on the Grammarly [65] online editor, and the lines of code are counted with the command-line tool CLOC [23].
From the statistics, the automatically generated English texts contain one to two orders of magnitude more words than the number of specified nodes. Meanwhile, the lines of code are mostly of the same order as the number of specified nodes, except in the BFD experiment. The reason is that the BFD design involves multiple layers of nested if-statements; as the nesting depth increases, the explicit conditional parts grow, which greatly increases the generated lines of code.
3.7.2 Discovery of a failed configuration
Protocol/feature       #nodes   #Eng. words   LoC
ICMP                   21       1346          140
BFD                    53       1375          2562
IGMP (host)            9        204           35
IGMP (router)          4        103           27
*Congestion control    8        203           50
*Connection            19       408           265
*Flow control          3        45            10
*Reliable transfer     11       148           31

Table 3.3: Specified nodes and the number of automatically generated English words and lines of code. (* mimics features of TCP)
start

{ "program states": ["send_echo_message == True"] }

{ "program states": ["send_echo_message != True"] }

{ "output packet": "outpkt",
  "program states": ["outpkt.code = 8", "outpkt.identifier = 25"] }

{ "last received packet": "pkt",
  "program states": ["pkt.code == 8"] }

{ "last received packet": "pkt",
  "program states": ["pkt.code == 0"] }

{ "output packet": "outpkt",
  "program states": ["outpkt.code = 0", "outpkt.identifier = 0"] }

{ "output packet": "outpkt",
  "program states": ["outpkt.code == 0", "outpkt.identifier == 25"] }

Figure 3.9: Illustration of an ambiguous configuration (node contents only; edges between nodes omitted).
While sentence's verification component assisted us in revising our naively specified configurations, it is subtle to reason about how each failure case could arise in other authors' configurations. Hence, we use the already proven-ambiguous, under-specified example discussed in Sage [109] to test whether sentence can identify such a case. In Sage [109], an ICMP echo message description is ambiguous because its RFC specifies, for the identifier field, "If code = 0, an identifier to aid in matching echos and replies, may be zero." This sentence is ambiguous not only because it does not specify whether both sender and receiver should set the field, as explained in Sage, but also because it does not explain how a reply message is matched to an echo message when the identifier is zero in the reply but non-zero in the echo. For the latter case, Sage showed that this setting would fail the interoperability test with Linux PING.
To experiment with this case in sentence, we manually craft a specification whose graph lacks the node to match a non-zero identifier in the echo message against a zero identifier in the reply message. Figure 3.9 illustrates this manually crafted example, omitting other field assignments to focus on the issue under discussion. We first generate a set of variables and packet values for an echo message. Then, we process the echo message to generate a reply message. Finally, we feed the reply message to sentence, and the verification component alerts us that there is no available path/node to process this message. This demonstrates the case of ambiguity in the specification design.
3.7.3 Readability discussion
There are multiple perspectives on defining readability; we have learned that it considers not only sentence lengths but also word usage, among other factors. sentence does not guarantee, or aim, to generate the best reading quality, but we observe that all the generated texts for our selected protocols and features are considered fine for reading. While Grammarly is not the only typing assistant, it is a well-known and widely adopted one. We used Grammarly's automatic analysis, whose readability check flags sentences that readers might have to reread to understand. All texts generated by sentence passed this readability check.
3.7.4 Interoperability test
Inspired by Sage, we developed the specification configuration for an ICMP message and used sentence to generate the corresponding C++ code. Then, we leveraged the same Mininet-based network environment used in Sage to test whether the generated code interoperates with the Linux PING command. The automatically generated code correctly interoperates with these tools.
Differences from Sage First, Sage produces the packet formulation code for ICMP itself, while sentence's generated code can specify a whole packet involving Ethernet, IP, ICMP, and the payload. To a certain degree,
we indirectly show that sentence can specify more diverse protocol behaviors, which aligns with our expectations of sentence's configuration. Second, although sentence still has to assume some APIs are known in advance, it relies less on a static framework than Sage does: sentence can call and specify many more subroutines through its program state attributes, while in Sage every function other than packet formulation itself must be given. Last, sentence relies more on the author's specification of packet structures. Sage can derive the packet structure from its ASCII art, while sentence needs an additional specification method for the packet structure (unless the author specifies a memory space and defines pointers for each packet field from scratch in the configuration).
Score       Readability comment   School level
90 to 100   very easy             5th grade
80 to 90    easy                  6th grade
70 to 80    fairly easy           7th grade
60 to 70    plain English         8th and 9th grade
50 to 60    fairly difficult      10th to 12th grade (high school)
30 to 50    difficult             college
0 to 30     very difficult        college graduate

Table 3.4: Flesch readability scores and their meaning
3.8 Limitations
Explanatory sentences sentence is able to process protocol execution and generate corresponding English text. However, in many specifications, we notice that authors not only describe the execution steps but also explain the motivation of the design. Such explanatory sentences are not actionable or executable, but they allow readers to understand the assumptions of the protocol execution and the framework of the system. While sentence could easily be extended with an extra configuration attribute serving as an explanatory comment for each node's operations, it remains challenging to assemble explanatory sentences with executable sentences without interrupting the narrative or creating more confusion. Unlike the sentences generated by sentence, explanatory sentences need not follow a specific order, so long as the concepts conveyed do not hurt readability.
Syntactic components It is common for specification authors to leverage diverse syntactic components, e.g., listings, diagrams, tables, and figures, to deliver messages to readers. Undeniably, in some cases English text is not the best way to present an idea. Although generating syntactic components is not part of sentence's goals, a future extension of sentence could consider automatically generating syntactic components at proper locations inside a specification.
Further improvement on text quality sentence presents a prototype system that can automatically verify a valid configuration and parse it into a corresponding English specification and executable C++ code. However, there are still some areas that can be explored.
Protocol/feature       Readability score
ICMP                   32.84
BFD                    28.65
IGMP (host)            39.1
IGMP (router)          44.42
*Congestion control    50.48
*Connection            34.91
*Flow control          23.10
*Reliable transfer     70.38

Table 3.5: Flesch readability scores of selected protocols and features
In particular, the readability and text quality can be further improved. A well-known readability metric is the Flesch reading-ease score [52], calculated as:

score = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)
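As a quick sanity check, the formula can be computed directly from the three counts; the helper below is illustrative only, not the Grammarly-based pipeline used in the evaluation.

#include <iostream>

// Flesch reading-ease score from word, sentence, and syllable counts.
double flesch(double words, double sentences, double syllables) {
    return 206.835 - 1.015 * (words / sentences)
                   - 84.6 * (syllables / words);
}

int main() {
    // e.g., 100 words in 5 sentences with 150 syllables:
    std::cout << flesch(100, 5, 150) << "\n";  // 59.635: fairly difficult
}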
An explanation of the scores [34] gives the readability levels presented in Table 3.4. Grammarly judged that sentence's generated texts reach a readability level at which readers need not reread any sentence to understand its meaning. We also calculate the Flesch readability scores of the selected protocols and features and present them in Table 3.5. The best Flesch readability level that sentence reaches is plain English, and the worst is the level a college graduate can understand. Although we assume specification readers are professionals, for whom college-graduate knowledge is a reasonable prerequisite, the text quality could indeed be improved to an easier readability level. We leave the exploration of additional unambiguous English sentence structures and further readability guidelines to extensions of this work.
Chapter 4
Related Work
In this thesis, our work draws on three domains related to our inspirations and design: networked systems, natural language processing, and program generation.
4.1 Networked systems
4.1.1 Protocol Languages / Formal Specification Techniques
Numerous protocol languages have been proposed over the years. In the '80s, Estelle [16] and LOTOS [14] provided formal descriptions for OSI protocol suites. Although these formal techniques can specify precise protocol behavior, they are hard for people to understand and thus to use for specification or implementation. Estelle used finite-state-machine specifications to depict how protocols communicate in parallel, passing on complexity, unreadability, and rigidity to follow-up work [15, 93, 13]. Other research, such as RTAG [3], x-kernel [42], Morpheus [1], Prolac [53], Network Packet Representation [69], and NCT [68], gradually improved the readability, structure, and performance of protocols, spanning specification, testing, and implementation. However, we find, and the networking community has found through experience, that English-language specifications are more readable than such protocol languages.
4.1.2 Protocol Analysis
Past research [11, 12, 14] developed techniques to reason about protocol behaviors in an effort to minimize bugs. Such techniques used finite state machines, higher-order logic, or domain-specific languages to verify protocols. Another thread of work [50, 51, 59] explored the use of explicit-state model checkers to find bugs in protocol implementations. This thread also inspired work (e.g., [78]) on discovering non-interoperabilities in protocol implementations. While our aims are similar, our focus is end-to-end, from specification to implementation, and on identifying where specification ambiguity leads to bugs.
4.1.3 Protocol Model Checking
Model checking was developed to check whether a system model satisfies its specifications. A typical model checker consists of three components: (i) a method that describes the state transitions of the system to be verified; (ii) a specification that describes the system with temporal logic formulas; and (iii) a checking procedure that checks whether the system satisfies the desired invariants [21]. Model checking has been widely used to find errors in software and hardware systems, including network verification and testing [72, 17, 91, 64, 74, 92]. Musuvathi et al. used model checking to find four errors in the Linux TCP/IP implementation. NICE [17] combines symbolic execution and model checking to address the scalability issue of exploring states when testing SDN applications. FlowChecker [92] used model checking to identify network configuration bugs within a single FlowTable. Sethi et al. [91] presented another approach to testing SDN controllers based on model checking.
4.2 Natural language processing
4.2.1 Natural Language Generation
Natural language generation (NLG) is the process of generating natural human language from non-linguistic representations of information. Based on the input representation, research is categorized into data-to-text and text-to-text generation [37]. sentence falls into the category of data-to-text generation.
While studies can be categorized into data-to-text and text-to-text generation, approaches are commonly applicable to both types; for example, some approaches to text-to-text generation (e.g., abstractive summarization systems) can be applied to data-to-text generation systems, and vice versa. One of the earliest approaches uses template-based generation [76], where data is filled into a predefined template to produce natural text. The limitation of such approaches lies in their inflexibility and inability to handle diverse data inputs. More recent studies [29] tackle these challenges using machine learning techniques to generate more diverse and nuanced language. For instance, generative models such as GPT-2 [85] and BERT [27] are used for summarization, paraphrasing, and dialogue generation. Another thread of studies [25] uses rule-based approaches to generate texts according to linguistic rules and constraints, commonly employed in domain-specific systems; such systems can be tailored to specific contexts and audiences. Our aim is to generate networking protocol specification texts, and our approach uses a rule-based method to generate restricted English texts that follow linguistic rules and avoid creating ambiguities.
4.2.2 Semantic Parsing and Code Generation
Semantic parsing is a fundamental task in NLP that aims to transform unstructured text into structured logical forms (LFs) for subsequent execution [8]. For example, to answer the question "Which team does Frank Hoffman play for?", a semantic parser generates the structured query "SELECT TEAM from table where PLAYER=Frank Hoffman" following the SQL Standard Grammar [24]. A SQL interpreter can execute this query on a database and give the correct answer [46]. Apart from question answering, semantic parsing has also been successful in navigating robots [100], understanding instructions [18], and playing language games [103]. Research on generating code from natural language goes beyond LFs to output concrete implementations in high-level general-purpose programming languages [62]. This problem is usually formulated as syntax-constrained sequence generation [110, 60]. Both topics are closely related to our work, since implementing network protocols from RFCs requires the ability to understand and execute instructions.
4.2.3 Pre-trained Language Models
Recently, high-capacity pre-trained language models [79, 27, 108, 57] have dramatically improved NLP in question answering, natural language inference, text classification, etc. The general approach is to first train a model on a huge corpus with unsupervised learning (i.e., pre-training), then reuse these weights to initialize a task-specific model that is later trained with labeled data (i.e., fine-tuning). In the context of sentence, such pre-trained models improve semantic parsing [114, 113]. Recent work [32] also pre-trains on programming and natural languages simultaneously and achieves state-of-the-art performance in code search and code documentation generation. However, direct code generation using pre-trained language models is an open research area that requires massive datasets; the best model for the related problem of natural language generation, GPT [85], requires 8 million web pages for training.
4.2.4 NLP for Log Mining and Parsing
Log mining and parsing are techniques that leverage log files to discover and classify system events (e.g., 'information', 'warning', and 'error'). Past studies have explored principal component analysis [107], rule-based analysis [35], statistical analysis [73, 102], and ML-based methods [95] to solve log analysis problems. Recent work [6, 10] has applied NLP to extract semantic meanings from log files for event categorization. sage is complementary to this line of work: while that work uses NLP to categorize sender/receiver roles, sage takes the additional step of generating code.
4.3 Program generation
4.3.1 Automatic programming
Automatic programming refers to an abstraction that allows the production of computer program code [54, 19]. As the meaning has changed over time, research has diverged into multiple significant topics such as genetic programming [106], program synthesis [39], and low-code applications [88]. sentence fits most closely into the scope of program synthesis, which generates a full or partial program from a specification. The specification could be a partial or full formal logical statement, or even a natural-language description.
4.3.2 Program Synthesis
To automatically generate code, prior work has explored program synthesis. Reactive synthesis [81, 80] relies on interaction with users to read input for generating output programs. Inductive synthesis [2] recursively learns logic or functions from incomplete specifications. Proof-based synthesis (e.g., [96]) takes a correct-by-construction approach, developing inductive proofs from which programs are extracted. Type-based synthesis [77, 33] takes advantage of the types provided in specifications to refine the output. In networking, program synthesis techniques can automate the updating of network configurations (e.g., [66, 67]) and the generation of programmable switch code [36]. It may be possible to use program synthesis in sage to generate protocol fragments.
Chapter 5
Conclusion and Future Work
5.1 Conclusion
Observing the need to improve general natural-language protocol specification systems, this thesis describes two subsystems, sage and sentence. sage constructs an extensible three-stage protocol disambiguation and code generation system. It enables specification authors to systematically analyze the text quality of an English specification and effectively disambiguate unreasonable semantic representations with five types of checks. sage generates code for multiple protocol specifications and demonstrates its interoperability through insertion into a Mininet-framework environment and interaction with the built-in Linux PING and TRACEROUTE without raising any warnings or errors.
sentence identifies the challenge for specification authors of writing an ambiguity-free English specification without professional natural-language expertise. It advocates a model change: relying on automatic transformation between an English specification and a specifying language/configuration to relieve authors from fussing with a less-familiar language, i.e., English. sentence is composed of four extensible system components: specification configuration, verification, text generator, and code generator. sentence uses a formal specification definition to capture the most essential elements of general protocol behavior, and bases the corresponding configuration for later-stage processing on this mathematical definition. It leverages four types of verification rules to filter out unreasonable configurations. We demonstrate that sentence is capable of discovering a proven-ambiguous example in the ICMP specification with its verification rules. sentence generates readable English specification texts across multiple protocols, categorized as requiring no rewriting for a reader to understand. The code generated from the configuration is capable of interacting with an interoperability test environment similar to sage's.
To this end, this thesis has taken first steps in building a user-and-machine-friendly, unambiguous protocol specification system. Many future directions are opened up and can be explored alongside recent advances; more discussion follows in the next section.
5.2 Future Work
There are a few directions in which to push this research in future work.
Paragraph analysis instead of per-sentence analysis. Our system mostly parses sentence by sentence to determine a sentence's ambiguity and generates code according to the order of descriptions. Therefore, our work has parsing limitations in some cases. For example, a sentence might use a pronoun with multiple candidate referents, and the candidates could come from multiple neighboring sentences within the same paragraph. If more than one candidate exists, this can be considered a kind of ambiguity that can confuse RFC readers.
When a compiler compiles a paragraph of text, it is possible that the order of descriptions does not follow the sequential execution order. For example, a protocol RFC can describe handshakes that resemble a state machine, with all the involved connection states and their transitions described in one large paragraph. The compiler should no longer assume that the generated code should follow the description order, converting sentence by sentence. Instead, the compiler should be aware that all connection states are equally important to execute, and the generated code should be able to switch to any state for execution given the current state. In other words, the compiler has to learn the context of the whole paragraph, determine a correct code-block template, and decide whether a variable value will be reused for the next event.
Semantic Meaning and Classifications. Our system has evaluated a number of protocol RFCs with successful code generation. However, the types of generated code remain limited. Sage can parse descriptions that assign, associate, and rewrite values, as well as simple if-else statements. While this set of operations covers a large portion of system code, other types of code should be considered, such as code for asserting values, adding constraints, logging events, etc.
Although we can use Sage to disambiguate sentences and generate exactly one logical form, the conversion from logical form to code should not be limited to exactly one kind of code. We have to identify when and whether a description can be interpreted as more than one type of code, and determine the execution order when different types of code coexist. In other words, the classification of semantic meanings also plays an important role in code generation.
Mis-matched/mis-captured behaviors. Among RFC components, many (such as packet formulation or state machine context) are presented with both text and syntactical components, where syntactical components can be diagrams, listings, tables, or figures. In some cases, textual sentences and paragraphs explain what the syntactical components represent, or extend what the syntactical components cover; thus, text and syntactical components are complementary. In other cases, due to mistakes or other reasons, texts and syntactical components can be inconsistent. This causes confusion for the RFC reader about whether to believe the textual descriptions or the meaning of the syntactical components.
When it comes to code generation in either case, the process has to (1) correctly associate the same semantic meanings parsed from textual descriptions and syntactical components, (2) identify any semantic meanings missing from either representation, and (3) identify discrepancies between texts and syntactical components. The generated code should not repeat the same semantic-meaning output, nor miss any mentioned semantic meanings. The code generator should also report discrepancies to RFC drafters to reduce confusion.
Alternative code representations. The aim of a comprehensive compiler is to accept natural-language specifications and turn them into any structured representation, including pseudocode, formal specification language text, and implementations in a variety of programming languages. Thus far, our system has illustrated the possibility of turning natural-language text into working C++ code. While every kind of representation has its advantages, conversion among different representations faces different challenges. For example, RFC drafters commonly put pseudocode snippets in their drafts to better explain their ideas to readers, but pseudocode is not executable and maintains some flexibility in its expressions. If pseudocode is generated by the compiler and a drafter would like to compare its logical behavior with a manually written pseudocode implementation, what are the criteria for determining whether the functionalities are equivalent? As another example, some specifications express the protocol's design in TLA+; the TLA+ language itself has different expressions than logical form and C++, and future work is required to identify the limitations of compiling natural language to TLA+.
Standalone RFC or multiple RFCs. Ideally, every aspect of a protocol can be completely expressed in a single RFC. In practice, a protocol can be explained over multiple RFCs for ease of reading by topic. For example, one RFC may describe the functionality of the protocol and explain the design of packets, while a separate complementary RFC may explain what each value represents for the fields present in the protocol and what additional constraints apply under different scenarios.
When multiple RFCs are given, we need to discover how to merge the content by concept without missing or violating any constraints that may exist. From the perspective of an RFC drafter, the drafter must confirm consistency not only within a standalone RFC but also across all relevant RFCs. Any discrepancy among RFCs could similarly lead to interoperability issues. An ideal compiler should thus take inconsistency into consideration and automatically compare, label, and flag where discrepancies happen.
Single protocol or stack of protocols. While RFC specifications are usually limited to the discussion of a single protocol, that protocol has to interact and stack cleanly with other protocols when applied in a networked system. For example, the ICMP RFC describes ICMP's design, but ICMP has to be built on top of IP. In such cases, a compiler considering both protocols together first has to identify the dependency between the two protocols and/or the constraints to be considered. When the compiler is stacking the two protocols together, it should be aware, for example, that the general IP header description must select the exact value for ICMP as its next-protocol field, and that the total-length field in the IP header needs to account for the ICMP header and the payload.
Logic vs. performance. A protocol RFC usually focuses on describing its logical functionality and leaves implementation flexibility to any reader of the RFC (unless the RFC provides a reference implementation and expects future protocol implementers to apply it directly in all contexts). In other words, we can reasonably expect the code generated by the compiler to be valued for its correct logic rather than its performance. If a drafter suggests a performance-oriented implementation, some mechanism could be supported to identify which natural-language statements convey performance considerations and which convey logical considerations. Moreover, the compiler could optionally output different versions of generated code according to the user's needs.
Unified compiler platform. In this thesis, sentence advocates a simple, straightforward, and unambiguous configuration for authors to specify protocols. At the same time, many existing approaches have great properties for specifying protocols but cannot yet integrate with, or convert to and from, our suggested configuration style. A challenging future direction is constructing a unified compiler platform that enables conversion among different formal languages or intermediate representations. This line of work would not only improve compatibility among different languages but also encourage comparisons among them. Such a compiler system could engage more authors and incentivize them to contribute, because they would be comfortable using their most familiar languages while benefiting from analyses even when the original specification language is not their most familiar one.
Usage of other advances. More and more popular artificial-intelligence-based chatbots are being introduced, encouraging studies that improve their correctness and robustness. While much work remains in perfecting such systems, we can expect their advances to assist our work in rephrasing and restructuring the generated sentences while guaranteeing the correctness of the specification. Our current system uses a limited set of structured sentences to avoid introducing ambiguities when generating English text. Although the readability is considered good enough to avoid rewriting by authors, the available sentence structures are undeniably limited, which makes the generated English paragraphs less natural than those a human would compose. An integration with NLP and/or AI advances could benefit readability even more.
References
[1] Mark B Abbott and Larry L Peterson. “A language-based approach to protocol implementation”.
In: IEEE/ACM transactions on networking (1993).
[2] R Alur, R Bodik, E Dallal, D Fisman, P Garg, G Juniwal, H Kress-Gazit, P Madusudan, M Martin,
M Raghothman, et al. “Syntax-Guided Synthesis. Dependable Software Systems Engineering”. In:
NATO Science for Peace and Security Series (2014). http://sygus.seas.upenn.edu/files/sygus_extended.pdf (2014).
[3] David P. Anderson. “Automated protocol implementation with RTAG”. In: IEEE Transactions on
Software Engineering 14.3 (1988), pp. 291–300.
[4] Yoav Artzi. Cornell SPF: Cornell Semantic Parsing Framework. 2016. eprint: arXiv:1311.3011.
[5] Yoav Artzi, Nicholas FitzGerald, and Luke S Zettlemoyer. “Semantic Parsing with Combinatory
Categorial Grammars.” In: ACL (Tutorial Abstracts) 3 (2013).
[6] Nicolas Aussel, Yohan Petetin, and Sophie Chabridon. “Improving performances of log mining for
anomaly prediction through nlp-based log parsing”. In: 2018 IEEE 26th International Symposium
on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).
IEEE. 2018, pp. 237–243.
[7] Ryan Beckett, Ratul Mahajan, Todd Millstein, Jitendra Padhye, and David Walker. “Network
Configuration Synthesis with Abstract Topologies”. In: Proceedings of the 38th ACM SIGPLAN
Conference on Programming Language Design and Implementation. PLDI 2017. Barcelona, Spain:
Association for Computing Machinery, 2017, pp. 437–451.isbn: 9781450349888.doi:
10.1145/3062341.3062367.
[8] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. “Semantic parsing on freebase
from question-answer pairs”. In: Proceedings of the 2013 conference on empirical methods in natural
language processing. 2013, pp. 1533–1544.
[9] Gérard Berry. “The constructive semantics of pure Esterel”. In: http://www.inria.fr/meije/esterel/esterel-eng.html (1999).
[10] Christophe Bertero, Matthieu Roy, Carla Sauvanaud, and Gilles Trédan. “Experience report: Log
mining using natural language processing and application to anomaly detection”. In: 2017 IEEE
28th International Symposium on Software Reliability Engineering (ISSRE). IEEE. 2017, pp. 351–360.
[11] Karthikeyan Bhargavan, Davor Obradovic, and Carl A Gunter. “Formal verification of standards
for distance vector routing protocols”. In: Journal of the ACM (JACM) 49.4 (2002), pp. 538–576.
[12] Steve Bishop, Matthew Fairbairn, Michael Norrish, Peter Sewell, Michael Smith, and
Keith Wansbrough. “Rigorous specification and conformance testing techniques for network
protocols, as applied to TCP, UDP, and sockets”. In: Proceedings of the 2005 conference on
Applications, technologies, architectures, and protocols for computer communications. 2005,
pp. 265–276.
[13] Gregor von Bochmann. Methods and tools for the design and validation of protocol specifications and implementations. Université de Montréal, Département d’informatique et de recherche . . ., 1987.
[14] Tommaso Bolognesi and Ed Brinksma. “Introduction to the ISO specification language LOTOS”.
In: Computer Networks and ISDN systems 14.1 (1987).
[15] Frédéric Boussinot and Robert De Simone. “The ESTEREL language”. In: Proceedings of the IEEE
79.9 (1991), pp. 1293–1304.
[16] Stanislaw Budkowski and Piotr Dembinski. “An introduction to Estelle: a specification language
for distributed systems”. In: Computer Networks and ISDN systems 14.1 (1987), pp. 3–23.
[17] Marco Canini, Daniele Venzano, Peter Perešíni, Dejan Kostić, and Jennifer Rexford. “A NICE way
to test OpenFlow applications”. In: 9th USENIX Symposium on Networked Systems Design and
Implementation. 2012.
[18] David L Chen and Raymond J Mooney. “Learning to interpret natural language navigation
instructions from observations”. In: Twenty-Fifth AAAI Conference on Artificial Intelligence . 2011.
[19] Wendy Hui Kyong Chun. “On software, or the persistence of visual knowledge”. In: grey room
(2005), pp. 26–51.
[20] David D Clark. “A cloudy crystal ball: visions of the future”. In: Proceedings of the Twenty-Fourth
Internet Engineering Task Force (1992), pp. 539–544.
[21] Edmund M Clarke. “The birth of model checking”. In: 25 Years of model checking: history,
achievements, perspectives (2008), pp. 1–26.
[22] Ed. D. Harkins. Secure Password Ciphersuites for Transport Layer Security (TLS). RFC 8492. 2019.
doi: 10.17487/RFC8492.
[23] Albert Danial. cloc. https://github.com/AlDanial/cloc. Version 1.90. 2021.
[24] C. J. Date. A Guide to the SQL Standard: A User’s Guide to the Standard Relational Language SQL.
USA: Addison-Wesley Longman Publishing Co., Inc., 1987.isbn: 0201057778.
[25] Kees Van Deemter, Mariët Theune, and Emiel Krahmer. “Real versus template-based natural
language generation: A false opposition?” In: Computational linguistics 31.1 (2005), pp. 15–24.
[26] Dr. Steve E. Deering. Host extensions for IP multicasting. RFC 1112. 1989.doi: 10.17487/RFC1112.
[27] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep
bidirectional transformers for language understanding”. In:arXivpreprintarXiv:1810.04805 (2018).
[28] Christophe Diot, Robert de Simone, and Christian Huitema. “Communication protocols
development using ESTEREL”. In: First International HIPPARCH workshop. INRIA Sophia Antipolis.
Citeseer. 1994, pp. 15–16.
[29] Chenhe Dong, Yinghui Li, Haifan Gong, Miaoxin Chen, Junxin Li, Ying Shen, and Min Yang. “A
survey of natural language generation”. In: ACM Computing Surveys 55.8 (2022), pp. 1–38.
[30] Li Dong and Mirella Lapata. “Coarse-to-fine decoding for neural semantic parsing”. In: arXiv
preprint arXiv:1805.04793 (2018).
[31] RFC Editor and Heather Flanagan. RFC Style Guide. RFC 7322. Sept. 2014.doi: 10.17487/RFC7322.
[32] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou,
Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. “CodeBERT: A Pre-Trained Model for
Programming and Natural Languages”. In: ArXiv abs/2002.08155 (2020).
[33] John K Feser, Swarat Chaudhuri, and Isil Dillig. “Synthesizing data structure transformations
from input-output examples”. In: ACM SIGPLAN Notices 50.6 (2015), pp. 229–239.
[34] Rudolf Flesch. How to Write Plain English. https://web.archive.org/web/20160712094308/http://www.mang.canterbury.ac.nz/writing_guide/writing/flesch.shtml. Mar. 2023.
[35] Qiang Fu, Jian-Guang Lou, Yi Wang, and Jiang Li. “Execution anomaly detection in distributed
systems through unstructured log analysis”. In: 2009 ninth IEEE international conference on data
mining. IEEE. 2009, pp. 149–158.
[36] Xiangyu Gao, Taegyun Kim, Michael D. Wong, Divya Raghunathan, Aatish Kishan Varma,
Pravein Govindan Kannan, Anirudh Sivaraman, Srinivas Narayana, and Aarti Gupta. “Switch
Code Generation Using Program Synthesis”. In: Proceedings of the Annual Conference of the ACM
Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and
Protocols for Computer Communication. SIGCOMM ’20. Virtual Event, USA: Association for
Computing Machinery, 2020, pp. 44–61.isbn: 9781450379557.doi: 10.1145/3387514.3405852.
[37] Albert Gatt and Emiel Krahmer. “Survey of the state of the art in natural language generation:
Core tasks, applications and evaluation”. In: Journal of Artificial Intelligence Research 61 (2018),
pp. 65–170.
[38] Stanford NLP Group. CoreNLP Coreference Resolution.
https://stanfordnlp.github.io/CoreNLP/coref.html.
[39] Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. “Program synthesis”. In: Foundations
and Trends® in Programming Languages 4.1-2 (2017), pp. 1–119.
[40] Julia Hockenmaier and Yonatan Bisk. “Normal-form parsing for Combinatory Categorial
Grammars with generalized composition and type-raising”. In: Proceedings of the 23rd
International Conference on Computational Linguistics (Coling 2010). Beijing, China: Coling 2010
Organizing Committee, Aug. 2010, pp. 465–473.url: https://www.aclweb.org/anthology/C10-1053.
[41] Matthew Honnibal and Ines Montani. “spaCy 2: Natural language understanding with Bloom
embeddings, convolutional neural networks and incremental parsing”. To appear. 2017.
[42] Norman C Hutchinson and Larry L Peterson. “The x-kernel: An architecture for implementing
network protocols”. In: IEEE Transactions on Software engineering 1 (1991), pp. 64–76.
[43] Allen AI Institute. AllenNLP Coreference Resolution.
https://demo.allennlp.org/coreference-resolution.
[44] IPP Interoperability Testing Event #2. http://www.pwg.org/ipp/testing/bake2.html.
[45] M. Jethanandani, S. Agarwal, L. Huang, and D. Blair. YANG Data Model for Network Access Control
Lists (ACLs). RFC 8519. 2019.doi: 10.17487/RFC8519.
[46] Aishwarya Kamath and Rajarshi Das. “A survey on semantic parsing”. In: arXiv preprint
arXiv:1812.00978 (2018).
[47] Dave Katz and David Ward. Bidirectional Forwarding Detection (BFD). RFC 5880. 2010.doi:
10.17487/RFC5880.
[48] Ruth M Kempson and Annabel Cormack. “Ambiguity and quantification”. In: Linguistics and
Philosophy 4.2 (1981), pp. 259–309.
[49] David Kessens, Tony J. Bates, Cengiz Alaettinoglu, David Meyer, Curtis Villamizar,
Marten Terpstra, Daniel Karrenberg, and Elise P. Gerich. Routing Policy Specification Language
(RPSL). RFC 2622. June 1999.doi: 10.17487/RFC2622.
[50] Charles Killian, James W. Anderson, Ranjit Jhala, and Amin Vahdat. “Life, Death, and the Critical
Transition: Finding Liveness Bugs in Systems Code”. In: 4th USENIX Symposium on Networked
Systems Design & Implementation (NSDI 07). NSDI. USENIX Association, 2007.
[51] Charles Edwin Killian, James W Anderson, Ryan Braud, Ranjit Jhala, and Amin M Vahdat. “Mace:
language support for building distributed systems”. In: ACM SIGPLAN Notices 42.6 (2007),
pp. 179–188.
[52] J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. Derivation of new
readability formulas (automated readability index, fog count and flesch reading ease formula) for
navy enlisted personnel. Tech. rep. Naval Technical Training Command Millington TN Research
Branch, 1975.
89
[53] Eddie Kohler, M Frans Kaashoek, and David R Montgomery. “A readable TCP in the Prolac
protocol language”. In: Proceedings of the conference on Applications, technologies, architectures,
and protocols for computer communication. 1999, pp. 3–13.
[54] Adele Mildred Koss. “Programming on the Univac 1: a woman’s account”. In: IEEE Annals of the
History of Computing 25.1 (2003), pp. 48–59.
[55] Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. “Neural semantic parsing with type
constraints for semi-structured tables”. In: Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing. 2017, pp. 1516–1526.
[56] Jim Kurose and Keith Ross. Computer Networking: A Top Down Approach, 2012.
[57] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. “Albert: A lite bert for self-supervised learning of language representations”. In:
arXiv preprint arXiv:1909.11942 (2019).
[58] Bob Lantz, Brandon Heller, and Nick McKeown. “A network in a laptop: rapid prototyping for
software-defined networks”. In: Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in
Networks. 2010, pp. 1–6.
[59] Hyojeong Lee, Jeff Seibert, Charles Edwin Killian, and Cristina Nita-Rotaru. “Gatling: Automatic
Attack Discovery in Large-Scale Distributed Systems.” In: NDSS. 2012.
[60] Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. “Neural symbolic
machines: Learning semantic parsers on freebase with weak supervision”. In: arXiv preprint
arXiv:1611.00020 (2016).
[61] Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, and Michael D Ernst. “Program synthesis
from natural language using recurrent neural networks”. In: University of Washington Department
of Computer Science and Engineering, Seattle, WA, USA, Tech. Rep. UW-CSE-17-03-01 (2017).
[62] Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočisk` y, Andrew Senior,
Fumin Wang, and Phil Blunsom. “Latent predictor networks for code generation”. In: arXiv
preprint arXiv:1603.06744 (2016).
[63] Edward Loper and Steven Bird. “NLTK: the natural language toolkit”. In: arXiv preprint cs/0205028
(2002).
[64] Rupak Majumdar, Sai Deep Tetali, and Zilong Wang. “Kuai: A model checker for software-defined
networks”. In: 2014 Formal Methods in Computer-Aided Design (FMCAD). IEEE. 2014, pp. 163–170.
[65] Alex Shevchenko Max Lytvyn and Dmytro Lider. Grammarly. https://www.grammarly.com/. Mar.
2023.
[66] Jedidiah McClurg, Hossein Hojjat, Pavol Čern` y, and Nate Foster. “Efficient synthesis of network
updates”. In: Acm Sigplan Notices 50.6 (2015), pp. 196–207.
90
[67] Jedidiah McClurg, Hossein Hojjat, Nate Foster, and Pavol Čern` y. “Event-driven network
programming”. In: ACM SIGPLAN Notices 51.6 (2016), pp. 369–385.
[68] Kenneth L McMillan and Lenore D Zuck. “Formal specification and testing of QUIC”. In:
Proceedings of ACM SIGCOMM. 2019.
[69] Stephen McQuistin, Vivian Band, Dejice Jacob, and Colin Perkins. “Parsing Protocol Standards to
Parse Standard Protocols”. In: Proceedings of the Applied Networking Research Workshop. ANRW
’20. Virtual Event, Spain: Association for Computing Machinery, 2020, pp. 25–31.isbn:
9781450380393.doi: 10.1145/3404868.3406671.
[70] D. Mills. Network Time Protocol (version 1) specification and implementation . RFC 1059. 1988.doi:
10.17487/RFC1059.
[71] Christopher Monsanto, Joshua Reich, Nate Foster, Jennifer Rexford, and David Walker.
“Composing Software Defined Networks”. In: 10th USENIX Symposium on Networked Systems
Design and Implementation (NSDI 13). Lombard, IL: USENIX Association, Apr. 2013, pp. 1–13.
isbn: 978-1-931971-00-3.url:
https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/monsanto.
[72] Madanlal Musuvathi, Dawson R Engler, et al. “Model Checking Large Network Protocol
Implementations.” In: NSDI. Vol. 4. 2004, pp. 12–12.
[73] Karthik Nagaraj, Charles Killian, and Jennifer Neville. “Structured comparative analysis of
systems logs to diagnose performance problems”. In: Presented as part of the 9th{USENIX}
Symposium on Networked Systems Design and Implementation ({NSDI} 12). 2012, pp. 353–366.
[74] Tim Nelson, Andrew D Ferguson, Michael JG Scheer, and Shriram Krishnamurthi. “Tierless
programming and reasoning for software-defined networks”. In: 11th{USENIX} Symposium on
Networked Systems Design and Implementation ({NSDI} 14). 2014, pp. 519–531.
[75] List of NLTK dependents. https://github.com/nltk/nltk/network/dependents.
[76] Will Oremus. “The first news report on the LA earthquake was written by a robot”. In: Slate. com
17 (2014).
[77] Peter-Michael Osera and Steve Zdancewic. “Type-and-example-directed program synthesis”. In:
ACM SIGPLAN Notices 50.6 (2015), pp. 619–630.
[78] Luis Pedrosa, Ari Fogel, Nupur Kothari, Ramesh Govindan, Ratul Mahajan, and Todd Millstein.
“Analyzing protocol implementations for interoperability”. In: 12th{USENIX} Symposium on
Networked Systems Design and Implementation ({NSDI} 15). 2015, pp. 485–498.
[79] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. “Deep contextualized word representations”. In: arXiv preprint
arXiv:1802.05365 (2018).
[80] Nir Piterman, Amir Pnueli, and Yaniv Sa’ar. “Synthesis of reactive (1) designs”. In: International
Workshop on Verification, Model Checking, and Abstract Interpretation . Springer. 2006, pp. 364–380.
91
[81] Amir Pnueli and Roni Rosner. “On the synthesis of a reactive module”. In: Proceedings of the 16th
ACM SIGPLAN-SIGACT symposium on Principles of programming languages. 1989, pp. 179–190.
[82] J. Postel. Internet Control Message Protocol. RFC 792. 1981.doi: 10.17487/RFC0792.
[83] J. Postel. TRANSMISSION CONTROL PROTOCOL. RFC 793. 1981.doi: 10.17487/RFC0793.
[84] Maxim Rabinovich, Mitchell Stern, and Dan Klein. “Abstract syntax networks for code generation
and semantic parsing”. In: arXiv preprint arXiv:1704.07535 (2017).
[85] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
“Language models are unsupervised multitask learners”. In: OpenAI Blog 1.8 (2019), p. 9.
[86] Keith Rayner and Susan A Duffy. “Lexical complexity and fixation times in reading: Effects of
word frequency, verb complexity, and lexical ambiguity”. In: Memory & cognition 14.3 (1986),
pp. 191–201.
[87] RFC Editor. http://www.rfc-editor.org/.
[88] Clay Richardson and John R Rymer. “The Forrester Wave™: low-code development platforms, Q2
2016”. In: Forrester, Washington DC (2016).
[89] Y. Lindell S. Gueron A. Langley. AES-GCM-SIV: Nonce Misuse-Resistant Authenticated Encryption.
RFC 8452. 2019.doi: 10.17487/RFC8452.
[90] SAGE. https://github.com/USC-NSL/sage.
[91] Divjyot Sethi, Srinivas Narayana, and Sharad Malik. “Abstractions for model checking SDN
controllers”. In: 2013 Formal Methods in Computer-Aided Design. IEEE. 2013, pp. 145–148.
[92] Ehab Al-Shaer and Saeed Al-Haj. “FlowChecker: Configuration analysis and verification of
federated OpenFlow infrastructures”. In: Proceedings of the 3rd ACM workshop on Assurable and
usable security configuration . 2010, pp. 37–44.
[93] Deepinder Sidhu and Anthony Chung. A formal description technique for protocol engineering.
University of Maryland at College Park, 1990.
[94] First SIP Interoperability Test Event. https://www.cs.columbia.edu/sip/sipit/1/. 2008.
[95] Ruben Sipos, Dmitriy Fradkin, Fabian Moerchen, and Zhuang Wang. “Log-based predictive
maintenance”. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge
discovery and data mining. 2014, pp. 1867–1876.
[96] Saurabh Srivastava, Sumit Gulwani, and Jeffrey S Foster. “From program verification to program
synthesis”. In: Proceedings of the 37th annual ACM SIGPLAN-SIGACT symposium on Principles of
programming languages. 2010, pp. 313–326.
92
[97] Shashank Srivastava, Igor Labutov, and Tom Mitchell. “Joint concept learning and semantic
parsing from natural language explanations”. In: Proceedings of the 2017 conference on empirical
methods in natural language processing. 2017, pp. 1527–1536.
[98] Mark Steedman and Jason Baldridge. “Combinatory categorial grammar”. In:
Non-Transformational Syntax: Formal and explicit models of grammar (2011), pp. 181–224.
[99] TCPDUMP & LIBPCAP Public Repository. https://www.tcpdump.org/. Accessed: 2020-05-22.
[100] Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R Walter, Ashis Gopal Banerjee,
Seth Teller, and Nicholas Roy. “Understanding natural language commands for robotic navigation
and mobile manipulation”. In: Twenty-fifth AAAI conference on artificial intelligence . 2011.
[101] M. Thomson. Example Handshake Traces for TLS 1.3. RFC 8448. 2019.doi: 10.17487/RFC8448.
[102] Risto Vaarandi. “A data clustering algorithm for mining patterns from event logs”. In: Proceedings
of the 3rd IEEE Workshop on IP Operations & Management (IPOM 2003)(IEEE Cat. No. 03EX764).
IEEE. 2003, pp. 119–126.
[103] Sida I Wang, Percy Liang, and Christopher D Manning. “Learning language games through
interaction”. In: arXiv preprint arXiv:1606.02447 (2016).
[104] Ziqi Wang, Yujia Qin, Wenxuan Zhou, Jun Yan, Qinyuan Ye, Leonardo Neves, Zhiyuan Liu, and
Xiang Ren. “Learning from Explanations with Neural Execution Tree”. In: International
Conference on Learning Representations. 2020.url: https://openreview.net/forum?id=rJlUt0EYwS.
[105] Michael White and Rajakrishnan Rajkumar. “A more precise analysis of punctuation for
broad-coverage surface realization with CCG”. In: Coling 2008: Proceedings of the workshop on
Grammar Engineering Across Frameworks. 2008, pp. 17–24.
[106] M.-J. Willis, H.G. Hiden, P. Marenbach, B. McKay, and G.A. Montague. “Genetic programming: an
introduction and survey of applications”. In: Second International Conference On Genetic
Algorithms In Engineering Systems: Innovations And Applications. 1997, pp. 314–319.doi:
10.1049/cp:19971199.
[107] Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael I Jordan. “Detecting large-scale
system problems by mining console logs”. In: Proceedings of the ACM SIGOPS 22nd symposium on
Operating systems principles. 2009, pp. 117–132.
[108] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le.
“Xlnet: Generalized autoregressive pretraining for language understanding”. In: Advances in
neural information processing systems. 2019, pp. 5754–5764.
[109] Jane Yen, Tamás Lévai, Qinyuan Ye, Xiang Ren, Ramesh Govindan, and Barath Raghavan.
“Semi-automated protocol disambiguation and code generation”. In: Proceedings of the 2021 ACM
SIGCOMM 2021 Conference. 2021, pp. 272–286.
93
[110] Pengcheng Yin and Graham Neubig. “A Syntactic Neural Model for General-Purpose Code
Generation”. In: Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational
Linguistics, July 2017.doi: 10.18653/v1/P17-1041.
[111] Pengcheng Yin and Graham Neubig. “A syntactic neural model for general-purpose code
generation”. In: arXiv preprint arXiv:1704.01696 (2017).
[112] Pengcheng Yin, Chunting Zhou, Junxian He, and Graham Neubig. “StructVAE: Tree-structured
latent variable models for semi-supervised semantic parsing”. In: arXiv preprint arXiv:1806.07832
(2018).
[113] Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. “AMR Parsing as
Sequence-to-Graph Transduction”. In: ArXiv abs/1905.08704 (2019).
[114] Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. “Broad-Coverage Semantic
Parsing as Transduction”. In: EMNLP/IJCNLP. 2019.
94
Abstract
Protocol specifications have guided the design and implementation of numerous protocols for decades. Yet despite serving as the foundation of diverse advanced systems, the methods used to compose and process specifications have changed little even as new techniques have emerged. Producing a specification remains labor-intensive and requires rigorous discussion to avoid miscommunication, because specifications are written in natural language. A key reason is ambiguity in natural-language text: an ambiguous passage may be a nonsensical sentence, a sentence with multiple possible meanings, or an under-specified behavior. Identifying such ambiguities is especially challenging in a domain-specific context, and the lack of studies applying modern natural language processing techniques has limited our understanding of how specification production can be improved. Motivated by these observations, this thesis takes the first steps toward a prototype system that is both user- and machine-friendly: it processes natural-language protocol specifications while bounding their level of ambiguity. The contributions are four-fold. First, it applies a natural language processing technique, Combinatory Categorial Grammar (CCG), to analyze protocol specification text and identify ambiguous sentences that could lead to buggy implementations. Second, it parses unambiguous English specifications and generates corresponding executable protocol code that interoperates with well-known third-party implementations. Third, it defines protocol behaviors mathematically and introduces unambiguous configurations; these configurations are easy for authors to write and can be automatically translated into both English specification text and executable code. Last, it categorizes a set of verification rules that filter out unreasonable configurations, i.e., those that cannot be turned into English paragraphs or code blocks.
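To make the first contribution concrete, below is a minimal sketch of how CCG parsing can surface ambiguity in a specification sentence. It is an illustration rather than the thesis prototype itself: it assumes NLTK is installed, and the toy lexicon and example sentence are invented for this sketch. The word "with" is given two lexical categories, so the prepositional phrase can attach either to the noun or to the verb; more than one derivation signals an ambiguous sentence that should be flagged for the author.

# Minimal CCG ambiguity check (illustrative sketch; toy lexicon is invented).
from nltk.ccg import chart, lexicon

# "with" is deliberately ambiguous: it can modify a noun phrase (NP\NP)
# or a verb phrase ((S\NP)\(S\NP)), producing two attachment readings.
lex = lexicon.fromstring(r'''
    :- S, NP, N
    the => NP/N
    endpoint => N
    packet => N
    flag => N
    sends => (S\NP)/NP
    with => (NP\NP)/NP
    with => ((S\NP)\(S\NP))/NP
''')

# Restricting to function application avoids spurious extra derivations
# in this toy grammar.
parser = chart.CCGChartParser(lex, chart.ApplicationRuleSet)

sentence = "the endpoint sends the packet with the flag".split()
parses = list(parser.parse(sentence))

print(f"{len(parses)} derivation(s) found")
if len(parses) > 1:
    print("Ambiguous: does the prepositional phrase modify the verb or the noun?")
for p in parses:
    chart.printCCGDerivation(p)

Run as written, the sketch reports two derivations: one in which the endpoint uses the flag to send the packet, and one in which the packet carries the flag. Either reading could be implemented, which is exactly the kind of divergence that leads to interoperability bugs.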
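The third and fourth contributions can likewise be illustrated with a small, self-contained sketch. The rule schema, field names, and verification check below are hypothetical, invented purely for illustration; they are not the thesis's actual configuration format. The idea is that an author writes a structured configuration instead of free-form prose, a verification rule rejects configurations that cannot be realized, and accepted configurations are rendered mechanically into a single unambiguous English sentence (executable code could be emitted the same way).

# Hypothetical configuration-to-English sketch (schema and names invented).
VALID_ACTIONS = {"discard", "acknowledge", "retransmit"}

rule = {
    "role": "receiver",          # which protocol participant acts
    "event": "checksum_invalid", # triggering condition
    "action": "discard",         # behavior to perform
    "object": "segment",         # what the behavior applies to
}

def verify(r):
    # Example verification rule: the action must be a known behavior;
    # unknown actions cannot be turned into prose or code, so reject them.
    return r["action"] in VALID_ACTIONS

def to_english(r):
    # Render the configuration as one unambiguous sentence.
    return (f"If the {r['event'].replace('_', ' ')} event occurs, "
            f"the {r['role']} MUST {r['action']} the {r['object']}.")

if verify(rule):
    print(to_english(rule))  # -> If the checksum invalid event occurs, ...
else:
    raise ValueError(f"unreasonable configuration: {rule}")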