Conversation-Level Syntax Similarity Metric
by
Reihane Boghrati
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2018
Copyright 2018 Reihane Boghrati
Acknowledgments
First and foremost, I would like to express my deepest appreciation to my advisor, Pro-
fessor Morteza Dehghani, for his support, guidance, and encouragement throughout my
graduate studies. I am extremely glad to call Morteza my advisor, who cared so much
about my work as well as my graduate life. He made sure I enjoyed my work in every stage
and taught me innumerable lessons and insights on the workings of academic research.
I thank my committee members, Prof. Andrew Gordon, Elsi Kaiser, Paul Rosenbloom,
Cyrus Shahabi, and Jason Zevin, whose insightful comments, ideas, and feedback were
absolutely invaluable.
I am very thankful to the members of CSSL, who are certainly more than lab mates
to me, for their support and genuine friendship. Justin, Joe, Aida, Mohammad, Drew,
Brendan, Katie, Kate, Matt, and Eunkyung, thank you for the stimulating discussions,
valuable collaborations, and for all the fun and laughter we had in the past four years.
I also deeply thank my friends in different states of the U.S., Canada, Europe, and Iran for being there for me through my ups and downs all these years. I am especially grateful to Zahra, Negar, Roohy, Aida, Mohammad, Nazanin, and Nazanin, my very close friends whom I have considered my family in the past several years while living thousands of miles away from my family in Iran.
Last but not least, my deepest and most sincere thanks go to my family. I am forever grateful for their unconditional love and support, and I certainly would not have made it this far without them. I thank my mom, who has always inspired me to follow what I care about despite any difficulty; my dad, whose advice has enlightened every step of my life; my brothers, Hamzeh and Hamid, with whom I shared my happiest childhood memories and whom I can always count on; and my sisters-in-law, Atefeh and Sara, who are definitely the kindest and closest sisters I could wish for. Finally, I would like to wholeheartedly thank Alireza, my husband, who supported me through all the stressful and worrisome moments and whose endless love, kindness, generosity, and understanding always brighten up my day.
Table of Contents
Acknowledgments ii
List Of Tables vii
List of Figures ix
Chapter 1 Introduction 1
Chapter 2 Conversation Level Syntax Similarity Metric 5
2.1 Introduction 6
2.2 CASSIM 15
2.3 Study 1: Method Validation 21
2.3.1 Study 1A: Mechanical Turk 23
2.3.1.1 Method 23
2.3.1.2 Procedure 24
2.3.1.3 Results 26
2.3.2 Study 1B: Negotiation 27
2.3.2.1 Method 27
2.3.2.2 Procedure 28
2.3.2.3 Results 29
2.4 Study 2: Investigating Communication Accommodation in Social Media 29
2.4.1 Method 32
2.4.2 Analyses 34
2.4.3 1. Post to Comment Syntactic Similarity 34
2.4.3.1 Results 37
2.4.4 2. Linguistic Adjustment Across Posts 38
2.4.4.1 Results 40
2.4.5 3.a. Sentence Order Affects Syntax Accommodation 41
2.4.5.1 Results 42
2.4.6 3.b. Syntactic Similarity Removing First and Last Sentences 43
2.4.6.1 Results 44
2.4.7 Limitations 44
2.5 General Discussion 45
2.5.1 Limitations 50
2.5.2 Future Work 50
Chapter 3 Follow my Language!
Effect of Power Relations on Syntactic Alignment 52
3.1 Introduction 53
3.2 Method 55
3.3 Studies 57
3.3.1 The U.S. Supreme Court Study 58
3.3.1.1 Data 58
3.3.1.2 Analysis 59
3.3.1.3 Results 59
3.3.1.4 Discussion 61
3.3.2 Wikipedia Study 62
3.3.2.1 Data 62
3.3.2.2 Analysis 63
3.3.2.3 Results 63
3.3.2.4 Discussion 64
3.4 General Discussion and Future Work 64
Chapter 4 Generalized Representation of Syntactic Structures 67
4.1 Introduction 68
4.2 CASSIM-GR 71
4.3 Experiments 73
4.3.1 Experiment One 75
4.3.1.1 Data 75
4.3.1.2 Analysis 76
4.3.1.3 Results 76
4.3.2 Experiment Two 77
4.3.2.1 Data 78
4.3.2.2 Analysis 78
4.3.2.3 Results 78
4.3.3 Experiment Three 79
4.3.3.1 Data 79
4.3.3.2 Analysis 79
4.3.3.3 Results 80
4.4 Discussion and Future Work 80
Chapter 5 Effect of Syntactic Similarity on Recall and Agreement 84
5.1 Introduction 85
5.2 Experiment 1 88
5.2.1 Method 89
5.2.1.1 Syntax Prompt Production 89
5.2.1.2 Study 92
5.2.2 Analysis 93
5.2.2.1 Participants 95
5.2.2.2 Results 95
5.2.3 Discussion 96
5.3 Experiment 2 97
5.3.1 Method 97
5.3.1.1 Stimuli Selection 98
5.3.1.2 Study 99
5.3.2 Analysis 101
5.3.2.1 Participants 101
5.3.2.2 Results 102
5.3.3 Discussion 103
5.4 General Discussion 104
Appendix A 107
A.1 Hungarian Algorithm 107
A.2 Instructions for Experiment 1 109
A.3 Prompts 111
A.3.1 Prompt 1 111
A.3.2 Prompt 2 111
A.3.3 Prompt 3 111
A.3.4 Prompt 4 111
A.4 Detailed Results of Null and Alternative Models 112
A.5 Comparison of Processing Time 114
Bibliography 115
List Of Tables
2.1 The Four Questions Which Were Used in the Corpus Collection Questionnaire. 25
2.2 Mturk corpus statistics 26
2.3 Results of the first validation task. 27
2.4 Results of the second validation task. 30
2.5 Reddit corpus statistics 34
2.6 Syntax Accommodation Study, Analysis 1 37
2.7 Syntax Accommodation Study, Analysis 2 40
2.8 Syntax Accommodation Study, Analysis 3.a 43
3.1 U.S. Supreme Court dialogues example 57
3.2 Supreme Court Results 59
3.3 Wikipedia Result 62
4.1 Corpora Overview. 74
4.2 Accuracy of approaches in three experiments. 74
4.3 Comparison of approaches in three experiments. 74
5.1 Cluster Center Prompt for Recall 91
5.2 Generated Sentence Prompt for Recall 92
5.3 Statement Prompt for Agreement 100
A.1 Validation Study, Analysis 1 112
A.2 CAT Study, Analysis 1 112
A.3 CAT Study, Analysis 2 113
A.4 Processing Time Corpus Statistics 114
A.5 Processing Time 114
List of Figures
2.1 A constituency parse tree of the sentence "John hit the ball". S represents the sentence "John hit the ball". The two nodes at the next level represent the noun "John" and the verb phrase "hit the ball". The verb phrase "hit the ball" is decomposed into two additional nodes, representing the verb "hit" and the noun phrase "the ball". The noun phrase is then represented at the lowest level as a determiner, "the," and a noun, "ball". 13
2.2 Edit Distance algorithm. The three possible operations are shown here:
adding node f, deleting node c and renaming node d to g. 17
2.3 An example of a complete weighted bipartite graph. The yellow nodes are
considered as set A and the blue nodes are considered as set B. 19
2.4 Syntax similarity calculation process. The Edit Distance calculator module calculates the similarity of each pair of constituency parse trees generated by the parse tree generator module. In the last step, the Hungarian algorithm module finds the minimum weight perfect matching of the graph of sentences' parse trees. The bold edges are the ones selected by the Hungarian algorithm. The overall syntax similarity of the two documents is the summation of the selected edges' weights divided by the number of edges. 22
2.5 Schema for the first analysis in Study 2. Step 1- Syntax similarity of post1 and comment1, which is written on post1, is calculated. Step 2- Syntax similarity of post1 and a random comment from a random post is calculated. Then the outputs of the two steps are used to find the most similar pairs as shown in Equation 2.3. 36
2.6 Schema for the second analysis in Study 2. Step 1- Syntax similarity of post1 and comment1, which is written on post1, is calculated. Step 2- Syntax similarity of a random post from User1's pool of previous posts and comment1 is calculated. Then the outputs of the two steps are used to find the most similar pairs as shown in Equation 2.4. 39
2.7 Schema for the third analysis in Study 2. Similarity of first and last sentences of a post and first and last sentences of a comment on the post is computed. Then the outputs are used to find the most similar sentences. 42
4.1 CASSIM-Group Representation Process. 69
5.1 Recall Experiment Process. 94
A.1 An example of the Hungarian Algorithm process. 1- Convert the complete bipartite graph to a matrix, with cells representing edges' weights. 2- In each row, subtract the minimum number from all elements of the row. For example, in the first row, the smallest number is 1, and therefore all elements of the first row are subtracted by 1. 3- In each column, subtract the minimum number from all elements of the column. 4- Cover all the zeros in the matrix with the minimum number of lines. 5- If the minimum number of lines is less than the number of rows (or columns), find the minimum uncovered number in the matrix (here 2). Then subtract this number from all the cells which are not highlighted and add it to the cells which are highlighted twice, in our example cell (a2, b1). Then repeat from step 4 until the number of lines is equal to the number of rows (or columns); then go to step 6. 6- From each row-column, choose a zero. 7- The cells that are chosen in the previous step are the edges in the optimal solution for the minimum weight perfect matching problem. The sum of the actual weights of the highlighted edges is the minimum weight, in our example 12. 108
Chapter 1
Introduction
It has long been known that language both affects and encodes our psychological states
(Freud, 1966; Rorschach, 1921; Murray, 1943). Adopting computational text analysis
tools, many studies have explored the relationship between language use and human
behavior (Ramirez-Esparza et al., 2008; Maass et al., 2006; Dehghani et al., 2013a; Mehl
et al., 2012). Two major components of language that have been of interest are semantics
and syntax. Analysis of semantics is the study of identification and extraction of meaning
from words, phrases, and sentences. On the other hand, analysis of syntax refers to the
study of the structure of language and the rules and principles which regulate how we
put the substructure of language together to form sentences.
The syntax and semantics of human language can illuminate many individual psycho-
logical differences and important dimensions of social interaction. Accordingly, psycholog-
ical and psycholinguistic research has begun incorporating sophisticated representations
of semantic content to better understand the connection between semantics and psycho-
logical processes (Sagi and Dehghani, 2014; Pennebaker, 2011; Garcia and Sikström, 2014;
Dehghani et al., 2016a). However, syntactical analysis of language has been for the most part overlooked when it comes to analysis of human behavior. Yet, automated analysis
of syntax does play an important role in other computer science applications, including
question-answering systems (Pasca and Harabagiu, 2001), machine translation (Yamada
and Knight, 2001), and sentence compression (Lin and Wilbur, 2007).
To fill the gap between syntax and behavioral analysis, the main emphasis in my thesis
is to design, build, and validate a novel method which computes the syntactic similarity
between documents. This method combines parse tree generators and minimum weight
perfect matching techniques to obtain the syntax similarity. Adopting this method, I
demonstrate how syntactic structures may reveal crucial aspects of human behavior and
beliefs.
In Chapter 2, I introduce the ConversAtion level Syntax SImilarity Metric (CASSIM), a novel method for calculating conversation-level syntax similarity. CASSIM relies on
constituency parse trees and uses edit distance as a measure of similarity between the
parse trees. Edit distance between two trees is defined as the number of nodes that need to be added, deleted, or renamed to transform one tree into another. Then, to match
the most similar sentences across the two documents, CASSIM performs a minimum
weight perfect matching algorithm, here the Hungarian algorithm, and finally reports a single
syntactic similarity score.
After validating CASSIM, I exhibit its applicability in the domain of psycho-linguistics.
Communication Accommodation (Giles, 2008) is a widely accepted theory about the re-
lationship between language use and human behavior. It suggests that individuals adjust
their verbal and non-verbal behavior to their interlocutor. I conduct a series of analyses
with CASSIM to investigate syntax accommodation in social media discourse. I run the
same experiments using two well-known existing syntactic metrics, LSM and Coh-Metrix,
and compare their results to CASSIM. Overall, the results indicate that CASSIM is able
to reliably measure syntax similarity and to provide robust evidence of syntax accommo-
dation within social media discourse.
In a subsequent study, in Chapter 3, I employ CASSIM to study a component of communication accommodation related to research on power and dominance relations, which suggests that language use depends on power position. There are different linguistic markers which signal people's power standing. For example, when high-power individuals interact with people in low-power positions, the language of the interaction tends to follow the language of the high-power individuals. While previous studies have mostly focused on word-level features of language during this process, in two different experiments I apply CASSIM to show that not only do people in low-power positions mirror the word usage of people in high power, but they also adjust their syntactic structures to those in high power.
Later, in Chapter 4, I propose the Conversation Level Syntax Similarity Metric-Group Representation (CASSIM-GR), which is an extension of CASSIM. This tool builds generalized representations of the syntactic structures of documents, thus allowing researchers to distinguish between people and groups based on syntactic differences. CASSIM-GR builds on CASSIM by applying spectral clustering (Shi and Malik, 2000) to syntactic similarity matrices and calculating the center of each cluster of documents. The resulting cluster centroid then represents the syntactic structure of the group of documents. To examine the effectiveness of CASSIM-GR, I conduct three experiments across three unique corpora. In each experiment, I calculate the clustering accuracy and compare the proposed technique to a bag-of-words approach. The results provide evidence for the effectiveness of CASSIM-GR and demonstrate that combining syntactic similarity and tf-idf semantic information improves the total accuracy of group classification.
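For illustration, a minimal sketch of this grouping step is shown below, assuming a precomputed matrix of CASSIM similarities; the function and variable names are placeholders, and the centroid heuristic (the member most similar, on average, to its cluster mates) is one plausible reading of the description rather than CASSIM-GR's actual implementation.

```python
# Minimal sketch (not CASSIM-GR itself): spectral clustering over a CASSIM similarity
# matrix, then one representative "centroid" document per cluster.
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_documents(similarity_matrix, n_groups=2):
    """similarity_matrix[i, j]: CASSIM similarity between documents i and j, in [0, 1]."""
    labels = SpectralClustering(n_clusters=n_groups,
                                affinity="precomputed",
                                random_state=0).fit_predict(similarity_matrix)
    centers = {}
    for g in range(n_groups):
        members = np.where(labels == g)[0]
        # Pick the member that is, on average, most similar to the rest of its cluster.
        mean_sim = similarity_matrix[np.ix_(members, members)].mean(axis=1)
        centers[g] = int(members[np.argmax(mean_sim)])
    return labels, centers
```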
Finally, in Chapter 5, I utilize both CASSIM and CASSIM-GR to examine the hypothesis that familiar syntactic structures enhance both recall and the persuasiveness of arguments. In two separate experiments, I demonstrate that there is a positive relationship between syntax familiarity and individuals' 1. recall of sentences and 2. acceptance of arguments to which they would normally be opposed. The first study shows that an individual's writing style has an impact on their ability to recall sentences. By obtaining written samples from individuals prior to evaluating them on a sentence recall task, I show that higher similarity between individuals' writing styles and sentences' syntactic structure results in higher recall. The second study provides evidence that individuals also exhibit higher levels of agreement with statements whose syntactic structures are familiar, even when the statements disagree with their stated views. The results of these two studies support the main hypothesis that an individual's increased familiarity with a syntactic structure renders that structure easier to recall and more agreeable.
Chapter 2
Conversation Level Syntax Similarity Metric
The syntax and semantics of human language can illuminate many individual psychologi-
cal differences and important dimensions of social interaction. Accordingly, psychological
and psycholinguistic research has begun incorporating sophisticated representations of
semantic content to better understand the connection between word choice and psycho-
logical processes. In this work we introduce ConversAtion level Syntax SImilarity Metric
(CASSIM), a novel method for calculating conversation-level syntax similarity. CAS-
SIM estimates the syntax similarity between conversations by automatically generating
syntactical representations of the sentences in conversation, estimating the structural dif-
ferences between them, and calculating an optimized estimate of the conversation-level
syntax similarity. After introducing and explaining this method, we report results from
two method validation experiments (Study 1) and conduct a series of analyses with CAS-
SIM to investigate syntax accommodation in social media discourse (Study 2). We run
the same experiments using two well-known existing syntactic metrics, LSM and Coh-
Metrix, and compare their results to CASSIM. Overall, our results indicate that CASSIM
is able to reliably measure syntax similarity and to provide robust evidence of syntax ac-
commodation within social media discourse.
2.1 Introduction
Language serves many purposes. A single sentence or utterance can be used to convey
information, make a promise, signal a threat, or share emotions (Searle, 1975). In the
domain of social dynamics, one of the more important functions of language is to signal
solidarity and build rapport. Such signals can be as simple as a statement of agreement
or as sophisticated as a Shakespearean drama. However, no matter the form, linguistic
choices are critical components of a range of social interactions from group formation
and maintenance (Nguyen et al., 2012) to sparking romance (Ireland et al., 2011) and
successful negotiations (Taylor and Thomas, 2008).
Accordingly, many studies have shown the relationship between language use and
various psychological dimensions. For example, language can be a marker of age (Pen-
nebaker and Stone, 2003), gender (Groom and Pennebaker, 2005; Laserna et al., 2014),
political orientations (Dehghani et al., 2014) and even eating habits (Skoyen et al., 2014).
Further, it can help us better understand various aspects of depression (Ramirez-Esparza
et al., 2008), moral values (Graham et al., 2009; Dehghani et al., 2016a), neuroticism and
extraversion (Mehl et al., 2012) and cultural backgrounds (Maass et al., 2006; Dehghani
et al., 2013a).
Notably, however, much of the research on language and psychology has relied on methods
for measuring word-level semantic similarity (e.g. whether participants use similar words
as one another). While word choice captures many aspects of social behavior, there is
more to language than just words, and examinations of words alone may fail to capture
important differences in language use. For example, although they share the same words, the sentences "dog bites man" and "man bites dog" mean very different things. Needless to say, the rules that govern how words can be fit together to form meaningful utterances,
called syntax, are essential for sophisticated communication. Indeed, syntax is one of the
fundamental components distinguishing human language from many animal calls (e.g.
Berwick et al., 2013).
Given the importance of syntax for structuring human communication, it is perhaps
unsurprising that much can be learned about individual differences from the syntax that
they use. For example, even when the basic facts conveyed in an utterance are similar,
differences in syntactic patterns can signal a variety of underlying demographic and psy-
chological factors such as educational or regional background (Bresnan and Hay, 2008),
gender (Vigliocco and Franck, 1999), socio-economics (Jahr, 1992), and emotional states
and personality (Gawda, 2010). Syntactic structure can also signal a speaker's assessment
of their listener, such as the way that adults simplify their sentences when communicating
with children (Snow, 1977).
A number of theories have been developed to assess the role of language in social
signaling, including Communication Accommodation Theory (CAT; Giles, 2008) and the
interactive alignment model (Pickering and Garrod, 2004). These and other related the-
ories posit that we adjust our verbal and non-verbal behaviors to maximize similarities
between ourselves and others when we want to signal solidarity, and we maximize lin-
guistic differences when trying to push others away (Shepard et al., 2001). Related to
these theories, Bock (1986b) demonstrated that not only are people sensitive to syntactic
form, but also that they tend to replicate it in their own linguistic constructions un-
der certain conditions. They exposed participants to a syntactic form and then asked
them to describe a picture in one sentence. Their results demonstrated the activation
process of syntactic alignment, whereby exposure to a syntactic structure leads to a subse-
quent alignment or mirroring of the syntactic structure of future linguistic constructions.
Branigan et al. (2010) described the mechanism underlying alignment and focused on the
linguistic alignment of computers and humans. They proposed that people align more
with computers because they believe computers do not have communication skills as strong as humans do. Further, Branigan et al. (2011) concluded that linguistic alignment is related to participants' perception of their partner and of the partner's linguistic communication skills. In their study, they asked participants to select a picture based on their partner's (a human or a computer) description or to name a picture themselves. Even though the scripts in both the human and computer conditions were identical, participants showed higher
linguistic alignment with computer partners.
In the past decade, researchers have increasingly focused on investigating the benefits and consequences of syntactic alignment between speakers (Fusaroli et al., 2012; Fusaroli and Tylén, 2016; Reitter et al., 2006; Healey et al., 2014; del Prado Martín and Du Bois, 2015; Schoot et al., 2016; Branigan et al., 2000). For example, Reitter and Moore (2014) found a positive correlation between long-term linguistic alignment adaptation between participants in a task and task success. Further, del Prado Martín and Du Bois (2015) found a positive relation between syntactic alignment and affective alignment, which they measured using information theory and aggregated measures of the affective valences of the words, respectively. Lastly, this effect has even been shown to influence second language learners (L2 learners) in producing passive sentences (Kim and McDonough, 2008), dative constructions (McDonough, 2006), and wh-questions (McDonough and Chaikitmongkol, 2010).
Recent studies, however, suggest that language alignment might not be as simple as
previously claimed and is a complex cognitive process (Riley et al., 2011). For example,
Fusaroli et al. (2012) showed that higher performance of participants in a perceptual
task had a positive connection to their task-relevant vocabulary alignment but not to their overall verbal behavior alignment. Also, Schoot et al. (2016) examined whether syntactic alignment influences the interlocutor's perception of the speaker, but the results did not provide strong evidence of an effect of syntactic alignment on perceived likability.
Although studies have recently started using automated methods for extracting and coding syntactic features, the majority of earlier studies rely on hand-coded assessment of syntax similarity. While hand-coding is typically very accurate and effective, one of the major drawbacks of relying on human coders alone is inefficiency: analyzing thousands, or millions, of social media posts, for example, will simply not scale up using human coders. Unfortunately, while parsing the syntax of a sentence is a relatively simple task for people with relevant training, it has proven to be a challenging task for computers due to the potential for syntactical ambiguity in language.[1] Part of the challenge in measuring and assessing syntax comes from the complexity of syntax itself. As a generative process, language can be shaped in nearly infinite ways. Even the simplest process can be described by a vast range of sentences with various syntactic structures.
[1] Sentence parsing has been shown to be an NP-complete problem and is therefore a theoretically difficult task (Koller and Striegnitz, 2002).
"She gave the dog a ball," "The dog was given a ball by her," and "She was the one who gave the dog a ball" all convey the same general information, but the emphasis shifts
based on the sentence structure. Given such diversity, how do we automatically measure
and compare syntax? How do we consider whether someone is adjusting their syntax in
a given context? Nonetheless, research in computational linguistics has produced several
methods that demonstrate high parsing accuracy (Tomita, 1984; Earley, 1970; Brill, 1993;
Fernández-González and Martins, 2015; Zhang and McDonald, 2012; Weiss et al., 2015;
Andor et al., 2016).
Recently, a number of tools have been developed to measure specific components
of syntactic complexity using automated processing. Lu's (2010) system analyzes the
syntactic complexity of a document based on fourteen different measures including the
ratio of verb phrases, number of dependent clauses, and T-units. TAALES is yet another
tool which measures lexical sophistication based on several features such as frequency,
range, academic language, and psycholinguistic word information (Kyle and Crossley,
Coh-Metrix was developed to measure over 200 different facets of syntax (Graesser
et al., 2004), and several of these facets deal with syntax complexity (e.g. mean number
of modiers per noun phrase, mean number of high-level constituents per word, and
the incidence of word classes that signal logical or analytical difficulty). Coh-Metrix's
SYNMEDPOS and SYNSTRUT indices can also calculate part of speech and constituency
parse tree similarities, and some of the facets capture text difficulty (Crossley et al., 2008).
A variation of Coh-Metrix called Coh-Metrix Common Core Text Ease and Readabil-
ity Assessor (T.E.R.A.) provides information about text difficulty and readability. One
component of this tool is dedicated to syntactic simplicity and measures average number
of clauses per sentence, the number of words before the main verb of the main clause in a sentence, and the syntactic structure similarity throughout the document (Crossley et al., 2016). Kyle (2016) recently introduced TAASSC, which is a syntactic analysis tool. It calculates a number of indices related to syntax such as mean length of T-unit, number of adjectives per noun phrase, and number of adverbials per clause. Last but not
least, Linguistic Style Matching (LSM; Niederhoffer and Pennebaker, 2002; Ireland and Pennebaker, 2010) measures syntax similarity based on function word use, which may not directly reflect syntax matching, but it indirectly determines a dimension of syntax similarity. LSM calculates the syntax similarity score using the weighted absolute difference score of use of pre-specified categories of function words in LIWC (Pennebaker et al., 2001) between two documents.
While these tools have proven to be useful across many different domains, they are limited by their focus on particular syntactic features which may or may not be present in different sentences or situations. Most of the discussed tools are based on fixed operationalizations of specific syntactic features. This top-down approach is valuable for analyses that aim to examine variations of those specific features, but it necessarily restricts the coverage of the analysis. One reason why previous approaches rely on measurements of pre-specified syntactic features is likely that generating unconstrained representations of sentences' syntactic structure is a challenging computational task. Also, relying only on word categories results in language dependency. When studying word patterns in text, it is vital to use a complete list of words in the desired categories, which may or may not be available in many languages/sociolects. However, it is relatively easy to compile such a list only for closed categories, where a fixed set of words covers the category thoroughly, e.g., first-person pronouns.
To address these concerns, we propose a different approach to capturing syntactic similarity called ConversAtion level Syntax SImilarity Metric (CASSIM).[2] CASSIM aims to provide researchers the opportunity to study the relationship between communication styles, psychological factors, or group affiliations by investigating dynamics in syntactic patterns. CASSIM was developed to extend the boundaries of syntactic analysis by enabling direct quantitative comparisons of the structure of sentences or documents[3] to each other. The foundation of our method involves the generation of constituency parse trees, or tree-shaped representations of the syntactic structure of sentences (e.g. Figure 2.1). Through constituency parse trees, the hierarchical structure that characterizes syntactic patterns can be represented by a series of nested components. For example, a constituency parse tree of the sentence "John hit the ball" might represent the complete sentence as the highest node in the tree. At the next highest level of the tree, two nodes might represent the noun "John" and the verb phrase "hit the ball" as two separate nodes. The verb phrase "hit the ball" might then be decomposed into two additional nodes, representing the verb "hit" and the noun phrase "the ball". Finally, the noun phrase "the ball" could be represented at a lower level as a determiner, "the", and a noun, "ball". Accordingly, constituency parse trees are able to represent the entire syntactic structure of a sentence, because they capture syntactic relations between words and phrases at multiple levels of depth.
[2] Available as open-source software at https://github.com/USC-CSSL/CASSIM/
[3] We refer to conversations with more than one sentence as documents.
Figure 2.1: A constituency parse tree of the sentence "John hit the ball". S represents the sentence "John hit the ball". The two nodes at the next level represent the noun "John" and the verb phrase "hit the ball". The verb phrase "hit the ball" is decomposed into two additional nodes, representing the verb "hit" and the noun phrase "the ball". The noun phrase is then represented at the lowest level as a determiner, "the," and a noun, "ball".
For any two documents being compared, CASSIM operates as follows: first, constituency parse trees representing the sentences contained in each document are generated. Then, the syntactic difference for each between-document pair of constituency parse trees is calculated. Next, using a minimization algorithm, the set of between-document sentence pairs with the least differences is identified, a process called minimum weight perfect matching. Finally, the syntax difference scores for the set of minimally different, between-document sentence pairs are averaged to create a single point estimate of document syntax similarity.
Overall, CASSIM has several advantages over the existing systems. First, it is language-
independent and modular. This means that CASSIM can be used to investigate syntax
similarity in any language as long as a syntax parser for that language can be provided
to CASSIM. In addition, researchers can use the syntax parser of their choice and are not
confined to one specific parser built into the system. Thus, CASSIM can and will con-
tinue to accommodate state-of-the-art computational syntax parsing algorithms. Second,
CASSIM does not rely on any specific syntactic features (e.g. noun phrases), but rather
it uses the entire syntactic structure of sentences (i.e. constituency parse trees). Third,
CASSIM is open-source. Users can download CASSIM's source code and make additions
to it, or can just download the binary of the program and simply use it to analyze data.[4]
Lastly, while CASSIM relies on a seemingly complex algorithmic procedure, quantifying
syntactic similarity via overlap of syntax trees, it is highly efficient compared to tools
that perform a similar operation. Thus, CASSIM is particularly well suited for research
that requires analyzing large corpora.
The remainder of this paper is organized as follows: First, we explain the algorithm
underlying CASSIM in detail. Second, in two different analyses, we validate CASSIM's
ability to capture syntax similarity and compare it to LSM, SYNMEDPOS, and SYN-
STRUT (Study 1). Third, we conduct three tests using CASSIM to investigate whether
the word-level effects identified in communication accommodation research generalize to
syntactical patterns in social media discourse (Study 2). These analyses demonstrate how
CASSIM might be used as a tool for psychological and psycholinguistic research. Finally,
we discuss our findings and potential future directions. We note that the primary focus of this paper is on the proposed method. Other than the first experiment, which is used to validate the method, the other experiments are designed to both demonstrate how CASSIM can be used to address psychological questions, and to compare its performance to other available tools.
[4] Available as open-source software at https://github.com/USC-CSSL/CASSIM/
2.2 CASSIM
As discussed earlier, a large body of research has identified syntax as an important in-
dicator of various psychological and social variables. Moreover, in the past few years,
several computational tools have been developed for automatic analysis of syntax. The
development of CASSIM and the execution of the studies reported in this paper are in-
tended to further advance this area of study. We start by discussing the algorithm used
in CASSIM in detail.
CASSIM executes three general steps when estimating the syntax similarity of two
documents. First, the algorithm builds a constituency parse tree for each of the sentences
in the two documents to be compared. As our goal is to compare the syntax similarity of
the two documents and not their semantic similarity, CASSIM then removes the actual
words (called leaves) from the parse trees leaving only nodes representing the syntactic
features (e.g., word order, parts of speech) intact. Word removal eliminates the effect
of using similar words in the two sentences on the similarity estimates produced by our
method. To generate constituency parse trees, we use an unlexicalized parser developed
by Klein and Manning (2003). This parser is time and resource efficient while also being acceptably accurate.[5] After completing this step, each of the documents being compared is represented by a set of parse trees that indicate the syntactical structure of the original sentences in the documents.
[5] This parser achieves an F1 score of 86.32% (Klein and Manning, 2003) on the Wall Street Journal section of the Penn Treebank corpus (see Marcus et al., 1999).
Next, CASSIM calculates the syntax similarity for each possible pair of sentences
between the two documents (one from document A, and one from document B). To do
this, CASSIM uses an algorithm called Edit Distance, a well-known algorithm in graph
theory which calculates the minimum number of operations (i.e. adding, deleting, or
renaming a node) needed to transform one graph into the other (Navarro, 2001). Because
trees are a special case of graphs, CASSIM can estimate the syntax similarity between
two documents' sentences by calculating the Edit Distance for each between-document
pair of sentences' parse trees. Thus, if document A has two sentences and document B
has three sentences, the Edit Distance would be calculated for six sentence pairs.
For example, in Figure 2.2 we have two trees, S1 and S2. Edit Distance can be used
to find the number of operations needed to transform the first sentence's tree into the other. If we start with tree S1, we first need to add node f, then delete node e, and finally rename node d to node g. This means that three operations are needed to transform the syntactic structure of S1 to that of S2.
Once the Edit Distance for each sentence pair between the two documents is calcu-
lated, CASSIM normalizes the Edit Distance scores. Normalization is necessary because
Edit Distance is a positively biased function of the number of nodes in the parse trees
being compared. Parse trees that have a greater number of nodes (e.g. trees for longer
sentences) tend to require a greater number of Edit Distance operations. Therefore, CAS-
SIM normalizes Edit Distance scores in order to control for the length of parse trees. To
normalize, we divide the output of Edit Distance by the average number of nodes in the
Figure 2.2: Edit Distance algorithm. The three possible operations are shown here: adding node f, deleting node c and renaming node d to g.
two parse trees. For example, in Figure 2.2, both sentences have 5 nodes, so CASSIM
divides the Edit Distance output by 5. This division prevents the syntax similarity of the
documents from being affected by the number of words in the sentences.
The output of the normalization process is a syntax dissimilarity score for each pair
of sentences in the two documents. Syntactic dissimilarity scores range from 0 to 1,
where smaller output values indicate higher syntactic similarity between sentences. For
instance, the normalized Edit Distance of the two trees in Figure 2.2 is 0.6 (3 divided by
5), and this value is used as the measure of syntactic dissimilarity between the two sentences.
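As a concrete illustration of these two steps, the sketch below computes the tree edit distance for a pair of five-node trees shaped like the Figure 2.2 example and normalizes it by the average node count. It uses the zss package's Zhang-Shasha implementation rather than CASSIM's own code, and the exact tree topology is assumed for illustration (only the node labels and operation counts follow the example).

```python
# Illustrative sketch: edit distance between two small trees, then normalization.
# Requires the zss package (Zhang-Shasha tree edit distance): pip install zss
from zss import Node, simple_distance

# Five-node trees matching the Figure 2.2 operations (add f, delete e, rename d -> g).
s1 = Node("a").addkid(Node("b").addkid(Node("c"))).addkid(Node("d").addkid(Node("e")))
s2 = Node("a").addkid(Node("b").addkid(Node("c")).addkid(Node("f"))).addkid(Node("g"))

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in Node.get_children(node))

distance = simple_distance(s1, s2)                     # 3 operations for these trees
avg_nodes = (count_nodes(s1) + count_nodes(s2)) / 2.0  # (5 + 5) / 2 = 5
dissimilarity = distance / avg_nodes                   # 3 / 5 = 0.6
print(distance, dissimilarity)
```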
Finally, in the third step, CASSIM calculates the syntactic similarity at the document
level. One approach to calculating the syntax similarity of two documents is to simply
average over the Edit Distance of the pairs of sentences between them. One advantage
of this approach is that it maintains the temporal structure of the interaction, such that the syntactic similarity score will reflect the similarity between sentences that occur at
adjacent times. However, a notable drawback to this approach is that in some cases it can lead to biased representations of syntactic similarity. Consider two documents, A and B, each having two sentences, S1, S2 and S3, S4. Further, suppose the two sentences in each document are significantly different in terms of syntax (i.e. S1 is different from S2, and S3 is different from S4), but also that each has a very syntactically similar sentence pair in the other document (i.e. S1 is similar to S3, and S2 is similar to S4). Averaging the syntax similarity of all the sentence pairs would wash away this similarity, and would, from one perspective, fail to accurately indicate matching between the documents. However, in our view, different empirical questions might influence whether researchers are better off operationalizing syntactic similarity as the similarity between sentences that occur at matching points in a sequence (i.e. at adjacent time points) or as the maximal matching across all sentences.[6]
[6] Accordingly, even though maximal matching is the default mechanism in CASSIM, an option for switching to average matching is available for calculating document-level similarity.
More specifically, step three avoids potentially underestimating document similarity by identifying the parse tree pairing for each parse tree in a document that has the minimum Edit Distance and then dropping the Edit Distances for all the other pairs. In the example above, this matching process would match S1 to S3 and S2 to S4, and it would drop the Edit Distances between S1 and S4 and between S2 and S3.
CASSIM implements the matching process in two steps: First, it constructs a com-
plete weighted bipartite graph, with nodes representing parse trees and weighted edges
representing the Edit Distance between every two parse trees in the documents. A com-
plete bipartite graph is defined as a graph which is composed of two independent sets of
nodes, A and B. That is, no two nodes within the same set are connected by an edge,
but each node in one set shares an edge with every node in the other set.
Figure 2.3: An example of a complete weighted bipartite graph. The yellow nodes are considered as set A and the blue nodes are considered as set B.
For example,
in Figure 2.3, set A's nodes (yellow nodes) represent one document's parse trees and set B's nodes (blue nodes) represent the other's. There is no edge between document A's nodes, nor is there one between document B's, while every yellow node is connected to every blue node.
The second step in the matching process is to identify the optimal pattern of node
pairings (or sentence constituency parse trees) that minimizes the edge weights (or differ-
ences) between them. As discussed above, this involves identifying, for each parse tree in
a given document, the parse tree in the comparison document to which it is most similar.
This process is called minimum weight perfect matching, and it refers to finding a set of
pairwise non-adjacent edges, with minimum weights, in which every node should appear
in exactly one matching (Brent, 1999). Thus, the outcome of minimum weight perfect matching when applied to the example above would be a graph in which S1 is matched to S3 and S2 is matched to S4. Because every node can appear in only one matching, the edges between S1 and S4, and between S2 and S3, would be dropped.
Given nodes i ∈ A and j ∈ B, the weight function w(i, j) refers to the weight of the edge between the two nodes i and j. The goal in the minimum weight perfect matching problem is to minimize the sum of the edge weights. Note that in the context of our method, an edge weight is a measure of the syntax similarity between the nodes (constituency parse trees) linked by an edge. Thus, as discussed above, the goal of the algorithm is to minimize the sum of similarity scores (recall that lower values indicate greater similarity). This is accomplished by minimizing the following equation:

\sum_{i \in A,\; j \in B} w(i, j) \qquad (2.1)
In order to conduct minimum weight perfect matching, CASSIM uses the Hungarian
algorithm (Kuhn, 1955). This algorithm finds the pattern of node pairings that minimizes
the weights of all edges. For our purposes, this pairing translates to an optimized measure
of similarity between two documents. The Hungarian algorithm matches the most similar
nodes from the two sets of A and B until each of the nodes in one (or both) of the sets
participates in exactly one matching (see appendix A for details about the Hungarian
algorithm). In cases where the number of sentences in document A is not the same as
in document B, each of the sentences in the shorter document is matched with one
sentence in the longer document, while some of the sentences in the longer document
are not matched to any sentence in the shorter document. For the sentences in the
longer document that are unmatched, CASSIM then finds the most similar sentence in the shorter document. To avoid the effect of the number of sentences on similarity, similar to
the Edit Distance algorithm for sentence-level similarity, CASSIM normalizes the output
of the Hungarian algorithm by dividing the output by the number of edges which are
selected.
Conversation-level syntactic similarity scores also range from 0 to 1. CASSIM subtracts these scores from 1 so that larger values indicate greater similarity between the two documents. In
Figure 2.4, we provide an overview of the process for calculating document-level syntactic
similarity just outlined.
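As a rough sketch of this document-level step, the code below takes a matrix of normalized sentence-pair edit distances (rows for document A, columns for document B), finds the minimum-weight matching with SciPy's Hungarian-algorithm implementation, matches any leftover sentences of the longer document to their most similar counterpart, and converts the averaged distance into a similarity score. The function name and the exact handling of unmatched sentences reflect my reading of the description above, not CASSIM's actual code.

```python
# Minimal sketch of the document-level matching step (assumed reading of the text).
import numpy as np
from scipy.optimize import linear_sum_assignment

def document_similarity(dist):
    """dist[i, j] = normalized edit distance between sentence i of doc A and j of doc B."""
    dist = np.asarray(dist, dtype=float)
    # Hungarian algorithm: minimum-weight matching over the bipartite graph of sentences.
    rows, cols = linear_sum_assignment(dist)
    selected = list(dist[rows, cols])
    # Sentences of the longer document left unmatched are paired with their
    # most similar sentence in the shorter document.
    if dist.shape[0] > dist.shape[1]:
        for i in set(range(dist.shape[0])) - set(rows):
            selected.append(dist[i].min())
    elif dist.shape[1] > dist.shape[0]:
        for j in set(range(dist.shape[1])) - set(cols):
            selected.append(dist[:, j].min())
    # Average the selected edge weights and flip so that 1 means identical syntax.
    return 1.0 - float(np.mean(selected))

# Pairwise distances consistent with the Figure 2.4 illustration:
# print(document_similarity([[0.27, 0.40], [0.39, 0.15]]))  # -> 0.79
```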
2.3 Study 1: Method Validation
To validate the proposed method, we conducted two separate analyses. In the first
analysis, we compiled and validated a corpus of grammatically similar sentences generated
by Amazon Mechanical Turk participants, and tested our system against it. In the second
analysis, we used CASSIM to analyze a corpus of conversations about negotiations used
in Ireland and Henderson (2014). To establish a broader performance comparison, along with
CASSIM, we also analyzed these corpora using Coh-Metrix and LSM.
Because Coh-Metrix has been developed to measure cohesion within documents (and
not syntax similarity across documents), we implemented SYNMEDPOS and SYNSTRUT, which deal with facets of syntax similarity, to measure between-document similarity. SYN-
MEDPOS is a syntax dissimilarity metric (i.e. smaller numbers signal higher syntactic
Figure 2.4: Syntax similarity calculation process. The Edit Distance calculator module calculates the similarity of each pair of constituency parse trees generated by the parse tree generator module. In the last step, the Hungarian algorithm module finds the minimum weight perfect matching of the graph of sentences' parse trees. The bold edges are the ones that are selected by the Hungarian algorithm. The overall syntax similarity of the two documents is the summation of the selected edges' weights divided by the number of edges. (In the figure's example, the selected edges have weights 0.27 and 0.15, giving 1 - (0.27 + 0.15)/2 = 0.79.)
similarity) which measures the minimal edit distance of POS between two sentences (Mc-
Namara et al., 2014). SYNSTRUT finds the largest common subtree between two sen-
tences' constituency parse trees and divides the number of nodes in the common subtree
by the total number of nodes in each sentence's parse tree. We extracted the common
subtree as noted in McNamara et al. (2014). Then we calculated the SYNSTRUT score
of two documents by averaging over the SYNSTRUT scores of each pair of sentences
between the two documents.
We used Text Analysis, Crawling and Interpretation Tool (TACIT; Dehghani et al.,
2016b) to obtain the percentage of word usage for the word categories used by LSM, which are
identical to the categories of function words in LIWC (Pennebaker et al., 2007). We then
calculated the LSM score between two documents using the following formula described
in Ireland et al. (2011):
LSM_{preps} = 1 - \frac{|preps_1 - preps_2|}{preps_1 + preps_2 + 0.0001} \qquad (2.2)
We then averaged over category-level LSM scores to yield a total LSM score for the two
documents.
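The short sketch below spells out this computation for a handful of function-word categories, following Equation 2.2; the category names and the per-category usage percentages are placeholders for illustration.

```python
# Minimal sketch of the LSM score from per-category function-word percentages.
# `doc1` and `doc2` map category names to usage percentages (illustrative values).

def lsm_score(doc1, doc2):
    scores = []
    for cat in doc1:
        p1, p2 = doc1[cat], doc2[cat]
        # Equation 2.2, per category; the 0.0001 avoids division by zero.
        scores.append(1 - abs(p1 - p2) / (p1 + p2 + 0.0001))
    return sum(scores) / len(scores)   # average over categories

doc1 = {"preps": 12.1, "pronouns": 9.4, "articles": 6.5}
doc2 = {"preps": 10.8, "pronouns": 11.0, "articles": 7.2}
print(round(lsm_score(doc1, doc2), 3))
```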
2.3.1 Study 1A: Mechanical Turk
2.3.1.1 Method
For our first analysis, we compiled multiple small corpora containing syntactically similar
documents. To create these corpora, we asked participants to generate sets of sentences
that match the grammar rules in an example sentence. We then used CASSIM to calculate
the syntactic similarity both within and between corpora. If our method is able to
capture syntactic similarity, sentences generated to match the grammar of one sentence
should have higher syntactic similarity scores with the matching sentence than with other
sentence prompts or sentences generated from other prompts. In summary, the goal of this
analysis is to test whether CASSIM could accurately indicate that sentences generated
by participants to match the syntax of a sentence prompt are more syntactically similar
to that prompt compared to other non-related sentences.
2.3.1.2 Procedure
120 Amazon Mechanical Turk participants from the United States (no other demographic
information was collected) completed a set of four tasks in which we asked them to com-
pose sentences that are grammatically similar to a set of sentence prompts. Accordingly,
this study had a repeated-measures design. Participants were first given detailed instructions about the task. We explained what we mean by grammar rules by providing detailed examples; however, we assured them that they would not be asked about grammar rules.
We then provided two sets of examples which were similar to the task they were supposed
to complete (see appendix A). For each of the two example sentences, three possible re-
sponses (sentences with similar syntactic structures) were presented to the participants.
Then, the participants were presented with four composition tasks in randomized
order. For these tasks, the sentence prompt length ranged between one and four sentences
(Table 2.1), and the sentences provided in each task were syntactically different from
sentences in the other tasks. For each set of sentences, participants were asked to create
new sentences that were grammatically similar to the original. We specifically asked them
Table 2.1: The Four Questions Which Were Used in the Corpus Collection Questionnaire.
Question id Question sentences
Question 1 The two most important days in your life are the day you are born and
the day you find out why. The nice thing about being a celebrity is that
you bore people and they think it's their fault.
Question 2 I am enough of an artist to draw freely upon my imagination. Imagina-
tion is more important than knowledge. Knowledge is limited. Imagi-
nation encircles the world.
Question 3 When we love, we always strive to become better than we are. When
we strive to become better than we are, everything around us becomes
better too.
Question 4 What is the point of being alive if you don't at least try to do something
remarkable?
to use similar grammatical rules as the ones used in the question sentences and not to
use the same exact words when creating new sentences.
Two of the responses were dropped for having failed the attention task, in which participants were asked to recall the number of sentences in the previous task. In the end we were left with 118 participants and four responses per participant. The descriptive statistics of the corpus are provided in Table 2.2.
After collecting the data, we asked two independent coders to code whether a response is syntactically similar to its prompt or not. They were also instructed to exclude responses which used the exact same words as the prompt. The coders had acceptable inter-rater reliability (Kappa = 0.53). To resolve coding conflicts, we asked a third coder to code the cases on which the first two coders did not agree. We then removed the responses which were coded as being syntactically different
from their prompts. Finally, we used CASSIM to measure the syntax similarity of each
document in the corpus.
Table 2.2: Mturk corpus statistics
Question    Sentences (mean, SD, range)    Words (mean, SD, range)
Question1 1.97, 0.17, [1, 2] 33.30, 6.78, [20, 51]
Question2 3.61, 1.31, [1, 8] 26.68, 7.60, [17, 60]
Question3 2.068, 0.43, [1, 4] 24.51, 5.28, [13, 52]
Question4 1.03, 0.16, [1, 2] 14.60, 3.08, [8, 31]
2.3.1.3 Results
Our results demonstrate that CASSIM correctly calculated higher syntactic similarity for the documents containing sentences generated in response to the same prompt compared to documents containing sentences generated in response to other prompts. Specifically, a maximal structured linear mixed effects model (Barr et al., 2013) with comparison type (corresponding to the same prompt vs. corresponding to a different prompt) as an independent variable, document id as a random effect, and the CASSIM-calculated syntactic similarity as the dependent variable, revealed that the documents within each task were judged to be significantly more similar to the same corpus class (M = 0.7838, SD = 0.0850) compared to the other corpora (M = 0.6331, SD = 0.0417), χ²(1) = 331.84, p < .001. Results were obtained by standardizing similarity scores and performing an ANOVA test of the full model with comparison type as a fixed effect against the model without the fixed effect (see Table 1 in appendix A for precise estimates of the models). Dividing the fixed effect parameter by the residual standard error resulted in an effect size of 1.7541.
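A rough sketch of this model-comparison recipe is shown below, using a simplified random-intercept model in statsmodels rather than the maximal lme4-style model actually reported; the synthetic data, column names, and effect-size formula (fixed-effect estimate over residual SD) are assumptions made for illustration.

```python
# Minimal sketch: likelihood-ratio test of comparison type, with document as a random effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Tiny synthetic dataset: each document compared to same-prompt and different-prompt documents.
rng = np.random.default_rng(0)
n = 80
data = pd.DataFrame({
    "document_id": np.repeat(np.arange(n // 2), 2),
    "comparison_type": ["same", "different"] * (n // 2),
})
data["similarity"] = np.where(data["comparison_type"] == "same", 0.78, 0.63) + rng.normal(0, 0.04, n)
data["similarity_z"] = (data["similarity"] - data["similarity"].mean()) / data["similarity"].std()

null = smf.mixedlm("similarity_z ~ 1", data, groups=data["document_id"]).fit(reml=False)
full = smf.mixedlm("similarity_z ~ comparison_type", data, groups=data["document_id"]).fit(reml=False)

lr = 2 * (full.llf - null.llf)            # likelihood-ratio (chi-square) statistic, df = 1
p_value = stats.chi2.sf(lr, df=1)
coef = full.params["comparison_type[T.same]"]
effect_size = coef / np.sqrt(full.scale)  # fixed effect over residual SD
```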
Table 2.3 reports the full result of running the same linear mixed effects model on the results of the LSM, SYNMEDPOS, and SYNSTRUT metrics. All four metrics successfully categorized the responses written to a prompt as syntactically more similar to that prompt than to the other prompts. Notably, however, as shown in Table 2.3,
Table 2.3: Results of the first validation task.
Method Name    χ²       df    Effect Size [95% CI]        p
CASSIM         331.84   1     1.7541 [1.617, 1.891]       <.001
LSM            201.85   1     1.2396 [1.098, 1.381]       <.001
SYNMEDPOS      278.28   1     -1.4161 [-1.544, -1.288]    <.001
SYNSTRUT       240.08   1     1.4589 [1.312, 1.606]       <.001
the effect size achieved using CASSIM is much higher compared to the other techniques. The negative effect size of SYNMEDPOS reflects the fact that SYNMEDPOS is inherently a syntax dissimilarity metric.
This result provides evidence for CASSIM's ability to identify syntactically similar
documents and verifies its applicability for investigating the role of syntax in different
domains.
2.3.2 Study 1B: Negotiation
2.3.2.1 Method
In the second analysis, we sought to revisit the analysis of Ireland and Henderson
(2014). In their study, Ireland and Henderson (2014) aimed to examine whether the
language style matching of participants during a negotiation task is correlated with par-
ticipants' final agreement. LSM (Niederhoffer and Pennebaker, 2002) focuses on the role
of function words and suggests that people adapt their linguistic style in dyadic conver-
sations to their conversation partners. Function words, or style words, include pronouns,
prepositions, articles, conjunctions, auxiliary verbs, among others. These words occur
frequently in speech, and their meaning is mostly defined by the context in which they are used. LSM measures the similarity in function word use between two documents
(Ireland and Pennebaker, 2010), and this similarity has been shown to predict many so-
cial outcomes. For instance, the degree to which two participants use similar function words in a speed-dating interaction indicates their willingness to contact one another in the
future, and this similarity predicts whether the date will result in a match better than
the participants' own perceptions of their match likelihood (Ireland et al., 2011). The
same effect has been found when using a couple's instant messages to predict whether
they will still be together three months later (Slatcher et al., 2008). Function word use
can also be useful in the social media domain. For example, users participating in the
same conversation on Twitter tend to use words from the same function word categories
in their tweets compared to those who are not (Danescu-Niculescu-Mizil et al., 2011).
In this analysis, we used CASSIM and Coh-Metrix on the corpus described in Ireland
and Henderson (2014) and followed their exact analysis procedure to evaluate the results
using two additional measures.
2.3.2.2 Procedure
Ireland and Henderson (2014) collected 60 sets of conversations that took place during
negotiation dyads on an instant messenger. The participants were supposed to reach
an agreement over four issues within 20 minutes. The negotiation transcripts were then checked for spelling and typographical errors and aggregated into one text file per partic-
ipant. See Ireland and Henderson (2014) for more details about the data-set collection
procedure.
We analyzed the negotiations and agreement correlation with a focus on early and
late stages of the conversation. As suggested in Ireland and Henderson (2014), for the
early and late stages, we used the first and last 100 words of the negotiation transcripts,
respectively.
2.3.2.3 Results
To analyze the results of CASSIM, LSM, and Coh-Metrix, we first calculated the z-scores of the methods' outcomes. To analyze the correlation between syntactic similarity and negotiation outcome, we followed the procedure detailed in Ireland and Henderson (2014). Specifically, we ran a logistic regression and regressed the agreement variable on the result of each of the techniques.
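A small sketch of that regression step is shown below; the placeholder values and array names are illustrative, and the z-scoring plus logit fit simply mirror the procedure just described.

```python
# Minimal sketch: regress binary agreement on a z-scored similarity score.
import numpy as np
import statsmodels.api as sm
from scipy.stats import zscore

scores = np.array([0.71, 0.64, 0.80, 0.58, 0.77, 0.69])   # placeholder similarity scores
agreement = np.array([1, 0, 1, 1, 0, 1])                  # placeholder agreement outcomes

X = sm.add_constant(zscore(scores))
fit = sm.Logit(agreement, X).fit(disp=0)
beta = fit.params[1]                  # coefficient on the standardized score
odds_ratio = np.exp(beta)
ci_low, ci_high = fit.conf_int()[1]   # 95% CI for that coefficient
```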
Table 2.4 presents the analysis results for all three methods in both stages of the negotiation. The results clearly indicate an overall agreement between CASSIM and SYNMEDPOS: there is less syntactic similarity in the conversations in the early stages of the negotiations, and the similarity in syntax between the players increases in the later stages. We would like to note that we do not have "ground truth" in this validation experiment. However, the fact that we see the same general trend using two different methods provides further validation of the performance of CASSIM.
2.4 Study 2: Investigating Communication Accommodation
in Social Media
After validating CASSIM in Study 1, we designed Study 2 to examine how our syntax similarity measure can be used to investigate a particular psychological
theory. One of the most compelling examples of the relationship between language style
29
Table 2.4: Results of the second validation task.

Method Name        β [95% CI]                   SE      OR      p
Early CASSIM       0.4867 [-0.0595, 1.1107]     0.2931  1.6269  =.0968
Early LSM (1)      -0.2008 [-0.7481, 0.3263]    0.2708  0.8181  =.4585
Early SYNMEDPOS    0.2540 [-0.2714, 0.7996]     0.2683  1.2891  =.3438
Early SYNSTRUT     0.4830 [-0.0676, 1.1016]     0.2945  1.6209  =.1009
Late CASSIM        0.6860 [0.1197, 1.3308]      0.3048  1.9857  =.0244
Late LSM (2)       -0.5088 [-1.1007, 0.0324]    0.2857  0.6012  =.0749
Late SYNMEDPOS     1.5336 [0.7753, 2.4991]      0.4349  4.6349  <.001
Late SYNSTRUT      0.4882 [-0.0712, 1.1342]     0.3030  1.6294  =.1071

(1) Even though we used the same corpus and procedure as described in Ireland and Henderson (2014), our analysis yielded slightly different results from what was reported in the original paper. We contacted the authors of the paper, and it was concluded that there might have been some modification made to the final corpus used in the original paper. We report the results from Ireland and Henderson (2014) for the sake of comparison. Stats reported in Ireland and Henderson (2014) for early LSM: p = .366
(2) Stats reported in Ireland and Henderson (2014) for late LSM: β = -0.65, 95% CI [-1.25, -0.05], SE = 0.30, OR = 0.52, p = .031
and psychological and social factors is the phenomenon of communication accommoda-
tion, which involves a speaker's dynamic adjustment of communication styles in order to
mimic, or deviate from, another person or group.
There is considerable research on communication accommodation, and this research has led to the development of several theories such as Communication Accommodation Theory (CAT). CAT is a well-known theory in communication (Giles, 2008) which posits that people adjust their verbal and non-verbal behavior to be more or less similar to others' in order to minimize or maximize their social difference (Shepard et al., 2001). Research has provided evidence of communication accommodation in a variety of everyday interactions (Jacob et al., 2011; Guéguen, 2009). For example, a study by Tanner et al. (2008) showed that the final rating of a product in a product-review scenario is influenced by whether or not the interviewer mimics the participant's verbal and non-verbal behavior. Participants in the mimicking condition gave higher ratings to the product being discussed compared to the participants who were not mimicked. Similarly, Van Baaren et al. (2003) found that when a waitress repeats customers' orders back to them, the customers are more likely to feel socially close to the waitress and to subsequently leave a higher tip. At the same time, some recent studies suggest that language alignment is a more complicated process than previously proposed (Riley et al., 2011; Fusaroli et al., 2012; Schoot et al., 2016).
While this research provides strong evidence for the relationship between word-level patterns and psychological and behavioral phenomena, it does not examine the relationship between such phenomena and higher-order syntactic dynamics. We designed Study 2 to investigate the relationship between syntactic language structure and discussion participation on social media using CASSIM. Specifically, we use CASSIM to determine whether individuals adapt their use of syntactic structures while interacting with one another on the social media platform Reddit.com. The contributions of this study are two-fold: First, we demonstrate how CASSIM can be used to perform syntactic analysis of conversations, and second, we test whether effects of word-level linguistic style apply to syntax. Specifically, we sought to test three hypotheses:
1. When people comment on a post, they use a syntax structure that is similar to the syntax structure of that post. We operationalized this as: the syntax similarity between a given post and a comment on that post is greater than the similarity between that post and a comment from a different post.
2. When people comment on a post, they adjust their syntax to match the syntax of the post. We operationalized this as: the syntax structure of a comment is more similar to the post than to the commenter's own previous posts.
3. People adjust the first or last sentence of a comment to match the first or last sentence of the original post. We operationalized this as: the first or last sentence of a comment is syntactically more similar to the first or last sentence of the post than to the post's other sentences.
Hypothesis 3 is exploratory; therefore, we did not repeat that experiment using LSM and Coh-Metrix.
2.4.1 Method
We collected our data from existing, naturally-occurring posts on Reddit.com. Reddit is a social networking service in which users can post content and other users may comment on the created content. Content on Reddit is divided into subreddits, with each subreddit devoted to a specific topic or group of interest (e.g., gaming, soccer, liberal, conservative). Additionally, each subreddit has a set of moderators whose responsibility is to remove posts and comments that are off-topic for the assigned subreddit. Importantly, users mostly express their thoughts, beliefs, and opinions about a particular topic within each subreddit, and moderators help keep the platform clean of off-topic conversations. This structure makes it suitable for investigating syntax accommodation in social media conversations. The special-interest subreddit structure of the social network on Reddit.com makes it a good fit for our experiment, because it naturally leads to the formation of loose groups (Weninger et al., 2013). As per our hypotheses, we would expect that discussions between users who post in the same special-interest forum (e.g., a liberal forum) would have greater syntax similarity compared to what might be expected as an average value of syntax similarity.
For the current study, we first collected all the posts and top-level comments (that is, the comments written directly in response to the post and not in response to other comments) from two subreddits: /r/liberal and /r/conservative. These two subreddits contain users' opinions and discussions about specific issues (as opposed to, for example, posted photos), which makes them an appropriate corpus for studying syntax accommodation. To collect the data, we used the Reddit Crawler in TACIT (Dehghani et al., 2016b).
To facilitate syntax comparison between a comment's text and the original post's text, we removed all posts solely comprised of links to other webpages or images with no text content. We also removed all posts with no text from the posting users' historical dataset. Additionally, some comments quoted one or more sentences from the original post. To avoid inflation of syntactic similarity due to repetition of the exact same sentences from the post in a comment, we removed all sentences in comments which directly quoted the post's sentences. Finally, we removed all posts and comments with only one word.
This data collection resulted in a corpus of 167 posts from the /r/liberal subreddit and 146 posts from the /r/conservative subreddit (with a total of 7,256 comments). Additionally, where available, we collected historical data for all non-anonymous users who had commented on the /r/liberal and /r/conservative subreddits across all of Reddit. We were able to collect historical post data for 84% of our sample, resulting in a dataset
Table 2.5: Reddit corpus statistics

Document type            Sentences (mean, SD, range)   Words (mean, SD, range)
Posts                    11.37, 16.59, [1, 191]        178.51, 233.30, [6, 2540]
Top-level comments       3.95, 5.16, [1, 75]           59.84, 92.75, [2, 1499]
Users' historical posts  7.87, 11.43, [1, 861]         123.18, 187.29, [2, 6109]
with 2,846 unique users and 86,368 posts. We checked for identical posts, and there were no repeated posts in our corpus. Table 2.5 shows descriptive statistics of our Reddit dataset.
2.4.2 Analyses
In this section, we report results from three analyses performed using CASSIM to examine the presence of CAT in social media. The first two analyses compare posts and comments in the same conversation to posts and comments in different conversations. The last analysis investigates the most similar sentences between a post and its comments.
2.4.3 1. Post to Comment Syntactic Similarity
First, we used CASSIM to investigate whether there is higher syntactic similarity between a post and the comments written in response to it than between a post and comments written in response to other posts in the same subreddit. We ensured that the post, comment, and random comment in each analysis were all written within the same community in order to exclude the effects of homophily in syntax accommodation. One may argue that because the post and comment are written by the same group, and people in the same group are known to share similar characteristics, they are similar; however, since the random comment is also from the same community, our experimental design controls for that objection. We also excluded comments with only one word. Lastly, the random comment was chosen with respect to two additional criteria: (1) its number of sentences falls within the range of the average comment's number of sentences, and (2) its number of words falls within the range of the average comment's number of words. We define this range as the mean number of sentences (or words) plus or minus one standard deviation.
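As an illustration, a minimal sketch of this length-matching criterion in Python is given below; the data frame and the n_sentences / n_words columns are hypothetical stand-ins for the comment statistics, not code from the original study.

import pandas as pd

def pick_random_comment(candidates: pd.DataFrame, seed: int = 0) -> pd.Series:
    """Pick a random comment whose length falls within one standard deviation
    of the mean comment length, both in sentences and in words."""
    s_mean, s_sd = candidates["n_sentences"].mean(), candidates["n_sentences"].std()
    w_mean, w_sd = candidates["n_words"].mean(), candidates["n_words"].std()
    in_range = candidates[
        candidates["n_sentences"].between(s_mean - s_sd, s_mean + s_sd)
        & candidates["n_words"].between(w_mean - w_sd, w_mean + w_sd)
    ]
    # Sample one comment from the length-matched pool.
    return in_range.sample(n=1, random_state=seed).iloc[0]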
Equation 2.3 models the aforementioned hypothesis. Comment C0 is written on post P0, and post P0 is posted in subreddit S0. Comment C1 is a random comment written on a random post, P1, which is in the same subreddit as post P0 (i.e., S0). Using CASSIM, we calculated the syntax similarity of C0 and P0 as well as the syntax similarity of C1 and P0, where C1 is a randomly selected comment from the same subreddit community; we repeated this calculation for each of the comments in subreddit S0. If the syntax similarity of C0 and P0 is significantly higher than the syntax similarity of C1 and P0, we may infer that comments on a post are more likely than random comments from other posts to follow a syntactic structure similar to that of the original reference post. We also repeat this analysis with LSM and Coh-Metrix (Figure 2.5).

Syntax Similarity(C0, P0) > Syntax Similarity(C1, P0)    (2.3)
For each of the 6,882 comments in the /r/liberal and /r/conservative subreddits, we calculated the syntax similarity between the comment, C0, and its original post, P0, and also the syntax similarity of a random comment, C1, to the same post, P0. As mentioned earlier, scores range from 0 to 1, with larger scores indicating higher syntax similarity.
Figure 2.5: Schema for the first analysis in Study 2. Step 1: the syntax similarity of post1 and comment1, which is written on post1, is calculated. Step 2: the syntax similarity of post1 and a random comment from a random post is calculated. The outputs of the two steps are then used to find the most similar pairs, as shown in Equation 2.3.
Table 2.6: Syntax Accommodation Study, Analysis 1

Method Name   χ2     df   Effect Size [95% CI]           p
CASSIM        40.7   1    0.1694 [0.1179, 0.2213]        <.001
LSM           17.3   1    0.1287 [0.0684, 0.1893]        <.001
SYNMEDPOS     0.36   1    -0.01283 [-0.05497, 0.02892]   =.55
SYNSTRUT      20     1    0.0585 [0.0148, 0.1018]        <.001
2.4.3.1 Results
We used a maximal-structure linear mixed effects model with the CASSIM syntactic similarity score as the dependent variable, comparison type (comparing the post to its own comments or to random comments) as the independent variable, and the users who wrote the post, comment, and random comment, as well as the subreddit name, as random effects. We standardized similarity scores and performed an ANOVA test of the full model with comparison type as a fixed effect against the model without the fixed effect (see Table 2 in Appendix A for precise estimates of the models). The results of this analysis support our hypothesis that a comment C0 and its original post P0 (M = 0.6406, SD = 0.0680) are syntactically more similar to each other than a random comment C1 and the same post P0 (M = 0.6352, SD = 0.0677), χ2(1) = 40.7, p < .001. Dividing the fixed effect parameter by the residual standard error resulted in an effect size of 0.1694.
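For illustration, a minimal sketch of this model comparison in Python with statsmodels is shown below. The column names are hypothetical, the original analysis was presumably run with a dedicated mixed-models package (e.g., lme4 in R), and the variance-components workaround only approximates fully crossed random effects.

import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def fixed_effect_lrt(df: pd.DataFrame):
    """Likelihood-ratio test of the comparison-type fixed effect.

    df is assumed to have 'similarity' (standardized CASSIM score),
    'comparison' (own comment vs. random comment), and grouping columns
    'subreddit', 'post_author', and 'comment_author'."""
    data = df.copy()
    data["const_group"] = 1   # single group so the terms below act as crossed random effects
    vc = {"subreddit": "0 + C(subreddit)",
          "post_author": "0 + C(post_author)",
          "comment_author": "0 + C(comment_author)"}
    full = smf.mixedlm("similarity ~ comparison", data, groups="const_group",
                       vc_formula=vc).fit(reml=False)
    null = smf.mixedlm("similarity ~ 1", data, groups="const_group",
                       vc_formula=vc).fit(reml=False)
    # Chi-square statistic analogous to the values reported in Table 2.6.
    lr = 2 * (full.llf - null.llf)
    p = stats.chi2.sf(lr, df=1)   # one fixed-effect term
    return lr, p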
The same linear mixed effects model was applied to the LSM, SYNMEDPOS, and SYNSTRUT results. As demonstrated in Table 2.6, the LSM and SYNSTRUT measures show the same significant trend as CASSIM, i.e., the comments written in response to a post are syntactically more similar to the post compared to random comments, while SYNMEDPOS does not.
2.4.4 2. Linguistic Adjustment Across Posts
Second, we hypothesized that users adjust the syntax structure of their comments to be more similar to the original post being referenced. To test this hypothesis, we determined whether a user's comment written in response to another user's post is syntactically more similar to that post than to the user's own previously written posts.
Equation 2.4 models the above hypothesis. Comment C0 is written on post P0 from subreddit S0, by user U0. P1 is a random post which is also written by user U0 in another subreddit, S1. We measure the syntax similarity of C0 and P0 and also the syntax similarity of C0 and the randomly selected post, P1 (a post written by the same user in a different subreddit). If the syntax similarity of C0 and P0 is significantly higher than the syntax similarity of C0 and P1, we may conclude that the syntax structure of users' comments is more affected by the original post's syntax, compared to their syntax use in previous posts (Figure 2.6).

Syntax Similarity(C0, P0) > Syntax Similarity(C0, P1)    (2.4)
To test hypothesis 2, we used the corpus of 6,882 comments from the first analysis and also the entire corpus of 86,368 historical posts made by the 2,846 users who had commented on the /r/liberal or /r/conservative subreddits. For each comment in the corpus of the two subreddits, if the comment was written by a user from the users' corpus, the syntax similarity of C0 and the original post, P0, and also the syntax similarity of C0 and a random post from the user's historical data, P1, were calculated.
Figure 2.6: Schema for the second analysis in Study 2. Step 1: the syntax similarity of post1 and comment1, which is written on post1, is calculated. Step 2: the syntax similarity of comment1 and a random post from User1's pool of previous posts is calculated. The outputs of the two steps are then used to find the most similar pairs, as shown in Equation 2.4.
Table 2.7: Syntax Accommodation Study, Analysis 2

Method Name   χ2     df   Effect Size [95% CI]           p
CASSIM        21     1    0.0825 [0.0473, 0.1177]        <.001
LSM           46.9   1    0.1102 [0.0787, 0.1417]        <.001
SYNMEDPOS     6.3    1    -0.0462 [-0.08259, -0.0101]    <.05
SYNSTRUT      2.24   1    -0.0295 [-0.06789, 0.009165]   =.13
2.4.4.1 Results
We used a maximal-structure linear mixed effects model with the CASSIM-calculated syntactic similarity as the dependent variable and the comparison type (comment being compared to its original post vs. comment being compared to a random post by the commenter) as the independent variable. We also entered the subreddit name and users' names as random effects in our model. We standardized similarity scores and performed an ANOVA test of the full model with comparison type as a fixed effect against the model without the fixed effect (see Table 3 in Appendix A for precise estimates of the models). The result of this analysis supported our hypothesis that a comment, C0, is syntactically more similar to its original post, P0 (M = 0.6470, SD = 0.0604), than to a random post, P1, from the writer of the comment (M = 0.6420, SD = 0.0593), χ2(1) = 21, p < .001. Dividing the fixed effect parameter by the residual standard error resulted in an effect size of 0.0825. The same model was applied to the LSM, SYNMEDPOS, and SYNSTRUT scores. As shown in Table 2.7, LSM shows the same trend as CASSIM with a higher effect size (0.1102). SYNMEDPOS also demonstrates the same trend but with a lower effect size (0.0462), while SYNSTRUT does not show any significant effect. The negative effect size of SYNMEDPOS reflects the fact that SYNMEDPOS is a syntax dissimilarity metric.
2.4.5 3.a. Sentence Order Affects Syntax Accommodation
The results of the previous two analyses provide evidence for syntax accommodation in social media conversations. In the third analysis, we conduct an exploratory test of our hypothesis that the order of sentences also affects syntax accommodation. Specifically, we are interested in the potential role of primacy effects in syntax accommodation. For example, it could be the case that syntax accommodation is primarily driven by the modification of the first sentence of a post and that other sentences in a post do not show syntax accommodation. Accordingly, in a third analysis, we investigated which sentences in a post and comment pair tend to drive syntax accommodation effects.
To conduct this analysis, for all the comments in the two subreddits /r/liberal and /r/conservative, we calculated the syntax similarity of the first and last sentences of the comment to the first and last sentences of the original post. All comments or posts with only one sentence were removed for this analysis, resulting in 300 posts and 4,775 comments.
Equations 2.5 through 2.8 show the analyses performed (Figure 2.7).

Syntax Similarity(post_first_sentence, comment_first_sentence)    (2.5)
Syntax Similarity(post_first_sentence, comment_last_sentence)     (2.6)
Syntax Similarity(post_last_sentence, comment_first_sentence)     (2.7)
Syntax Similarity(post_last_sentence, comment_last_sentence)      (2.8)
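A minimal sketch of these four comparisons is given below; the sentence-level helper cassim_similarity is a hypothetical stand-in for CASSIM's parse-tree comparison, and the sentence lists are assumed to be pre-split.

def first_last_similarities(post_sentences, comment_sentences, cassim_similarity):
    """Return the four comparisons of Equations 2.5-2.8 for one post/comment pair.

    Both inputs are lists of sentences with at least two elements
    (single-sentence posts and comments were excluded from this analysis)."""
    p_first, p_last = post_sentences[0], post_sentences[-1]
    c_first, c_last = comment_sentences[0], comment_sentences[-1]
    return {
        "post_first~comment_first": cassim_similarity(p_first, c_first),  # Eq. 2.5
        "post_first~comment_last":  cassim_similarity(p_first, c_last),   # Eq. 2.6
        "post_last~comment_first":  cassim_similarity(p_last, c_first),   # Eq. 2.7
        "post_last~comment_last":   cassim_similarity(p_last, c_last),    # Eq. 2.8
    }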
Figure 2.7: Schema for the third analysis in Study 2. The similarity of the first and last sentences of a post to the first and last sentences of a comment on that post is computed. The outputs are then used to find the most similar sentences.
For each comment, we calculated the syntax similarity of its first and last sentences to the first and last sentences of the original post (as shown in Equations 2.5 to 2.8).
2.4.5.1 Results
We fit a maximal-structure linear mixed effects model with comparison type (post's first and last sentences to comment's first and last sentences) as the independent variable and CASSIM syntactic similarity as the dependent variable. Writers of comments and posts, as well as the subreddit name (either liberal or conservative), were entered as random effects in the model.
The result of this analysis indicated that the first sentence in a post has higher syntactic similarity to the first sentence in a comment (M = 0.6291, SD = 0.0832) (Equation 2.5) compared to the syntactic similarity between the first sentence in a post and the last sentence in the comment (M = 0.6151, SD = 0.0857) (Equation 2.6) and the syntactic similarity between the last sentence in a post and the first sentence in a comment
Table 2.8: Syntax Accommodation Study, Analysis 3.a

Comparison Type                                 Estimate   Std. Error   df      t       p
post_first_sentence & comment_first_sentence    0.1791     0.0297       252     6.04    <.001
post_first_sentence & comment_last_sentence     -0.1419    0.0227       217     -6.26   <.001
post_last_sentence & comment_first_sentence     -0.2679    0.0507       256.9   -5.28   <.001
post_last_sentence & comment_last_sentence      -0.3618    0.0523       266     -6.92   <.001
(M = 0.6066, SD = 0.0962) (Equation 2.7). The first sentences in the post and comment were also syntactically more similar than the last sentences in the post and comment (M = 0.5961, SD = 0.1007) (Equation 2.8). As demonstrated in Table 2.8, if we consider the comparison of the post's and comment's first sentences as a reference point, comparing the post's first sentence to the comment's last sentence lowers similarity by 0.1388 ± 0.0266. Additionally, comparing the last sentence of the post to the first and last sentences of the comment lowers the similarity by 0.2714 ± 0.0659 and 0.3632 ± 0.0753, respectively.
Our analysis confirms that the structure of the first sentence in a post affects the syntax structure of the first sentence in its following comments. Table 2.8 shows the differences among comparison types.
2.4.6 3.b. Syntactic Similarity Removing First and Last Sentences
The results of the third analysis suggest that the sentences at the beginning of a post and a comment follow similar syntax structures. As a follow-up test, we sought to determine whether the syntax accommodation effect between posts and comments identified in the first two analyses is driven solely by similarities between the first sentence of a post and the first sentence of a comment. To test for these effects, we re-ran the first and second analyses to validate their results after removing both the first sentence of the comment and the first sentence of the post.
2.4.6.1 Results
The results of re-running the first analysis with the first sentences removed show the same trend as the one reported in Analysis 1. Even after removing the first sentence of the comment and the post, the syntax similarity between comments that are written in response to a post and the original post (M = 0.6490, SD = 0.0633) is higher than the syntax similarity between random comments and the original post (M = 0.6443, SD = 0.0608), χ2 = 5.0657, p < .05.
Further, we also replicated the results of Analysis 2. After removing the first sentences of the post and the comment, the comments which were written on a post were still syntactically more similar to the original post (M = 0.6574, SD = 0.0647) compared to the previous posts written by the author of the comment (M = 0.6461, SD = 0.0686), χ2 = 4.9603, p < .05.
2.4.7 Limitations
A major limitation of our analyses is that we do not consider comment threads (i.e., comments on comments), which might carry important social signals. Further, another important source of information in these forums is the stance of commenters toward a post (for or against), which was not available in our corpus. This information may be important in the analysis of syntax priming.
2.5 General Discussion
While semantics and word choice have been extensively used to study human behavior, less emphasis has been put on the role of syntax and whether the way people put their words together can help to convey their intentions. Although no one can deny the importance of semantics in revealing various aspects of human psychology, the results of our analyses, along with previous findings in the field, provide evidence that syntax can also provide important linguistic information about social interactions. Aside from the methods used in a small number of previous studies (e.g., Healey et al., 2014; Reitter et al., 2006), large-scale analysis of syntax has often been constrained by the available methods to specific facets of syntax. We have developed, and provided evidence for the effectiveness of, CASSIM for comparing syntax structures between documents. We also compared CASSIM to two well-known existing methods, LSM and Coh-Metrix, and showed its applications and advantages.
In order to validate CASSIM's ability to capture within- and across-corpus syntactic similarity, we tested it on a corpus of syntactically similar documents generated by MTurk users. The results of this test provided strong evidence that CASSIM is able to reliably measure document-level syntax similarity. Additionally, both LSM and Coh-Metrix confirmed the direction of the results; however, the effect size reported by CASSIM is higher than that of the other two methods. It is worth mentioning that this is the only analysis in our work for which "ground truth" exists, and, as a result, it provides an important validation test-bed for the different algorithms. We also reanalyzed the results of a negotiation study previously performed by Ireland and Henderson (2014). Namely, we reanalyzed their corpus using CASSIM and Coh-Metrix, and compared the results with the findings of LSM (which was used by the authors of the original paper). Next, we used CASSIM to investigate syntax accommodation in social media conversations to demonstrate how CASSIM might be applied to psychological research as well as to further validate the method. Using a corpus of naturally-generated conversations on Reddit.com, we provided evidence that users tend to follow the syntax of their conversation partners on social media. Specifically, we found that comments which are written in response to a post are likely to follow the original post's syntax. Additionally, users adjusted their syntax use in comments to be more similar to the original post, and their comments were more syntactically similar to that post compared to a random post they had previously written. While in the former analysis the CASSIM effect size was higher, unexpectedly, LSM had a higher effect size in the latter experiment.
Finally, we found that the first sentence of a post and the first sentence of its following comments are the most similar sentences in syntactic structure, but that this first-sentence similarity does not completely drive the effects found in our first two analyses. It should be noted, however, that while these results correspond to a primacy effect, the role of primacy in syntax accommodation does not necessarily imply that syntax accommodation can be reduced to mere cognitive priming. As research in CAT has demonstrated, syntax accommodation occurs to varying degrees as a function of context and goal orientation. Thus, while, as our results suggest, the structure of syntax accommodation patterns might reflect basic cognitive tendencies (such as increased recall for early elements in a series), this result does not indicate that syntax accommodation is merely an arbitrary consequence of these tendencies.
These findings support our hypothesis that a post's syntax affects the syntax of the comments that follow it. Furthermore, a user's comments are syntactically more similar to the original post than to his or her previous posts, indicating that when users write comments on a post, they tend to use a syntactic structure similar to the post's, rather than their own previous writing style. Finally, when a user writes a comment on a post, he or she starts the comment with a syntax similar to that of the opening sentence of the post. These results provide support for a significant effect of syntactic accommodation in social media conversations.
In the current research, we show that syntactic structures are psychologically relevant
by investigating whether measuring the similarity of the syntactical structures of docu-
ments can yield insight into social dynamics. We believe the development of CASSIM and
other formal tools for analysis of syntax can pave the way for further investigation of this
important aspect of language. Capturing syntax similarity with methods such as CAS-
SIM may also help researchers explore a wide variety of novel and existing psychological
questions. For instance, we can examine whether group affiliation increases mirroring of
others' syntax to signal group cohesion or agreement (Giles et al., 1991).
Additionally, when compared to existing methods, we find that CASSIM, LSM, and Coh-Metrix produce similar trends of results in the majority of our studies. However, we believe that CASSIM has several advantages over these existing measures. First, unlike Coh-Metrix, which was developed for measuring syntactic coherence, CASSIM is specifically designed to measure syntactic similarity between documents. Second, unlike LSM, CASSIM is language-independent. In other words, if a parser for a particular language exists, then CASSIM can be applied to that language. Third, CASSIM does not rely on any specific syntactic features, but rather uses the entire syntactic structure of documents, allowing for greater flexibility in analyses. Thus, while LSM is faster than CASSIM (and SYNSTRUT and SYNMEDPOS) and has linear time complexity (in terms of the words in the document), the increased computational cost of CASSIM purchases considerable flexibility. Also, CASSIM and SYNMEDPOS are based on edit distance and therefore have polynomial time complexity, while SYNSTRUT needs to find the largest common subtree, an operation which is exponential in the number of parse-tree nodes. For a more precise comparison of processing time, see Appendix A. Finally, CASSIM is open-source, whereas for LSM the LIWC dictionary needs to be purchased, and public access to Coh-Metrix is only available through a web interface.
Our goal is for CASSIM to make it easier for researchers interested in the relationship
between communication styles and other forms of social dynamics to begin exploring
patterns in syntax. One promising example is research on power and dominance relations.
Previous research has identified a range of linguistic markers that indicate whether a speaker is speaking to a superior or a subordinate (Kacewicz et al., 2013); however,
this work has focused only on word-level patterns. By using CASSIM to represent and
compare sentence and document level patterns in syntactical structure, researchers could
investigate the relationship between syntax and power dynamics.
Further, as more data is gathered on the relationship between syntax usage and psychological factors of interest, we can begin using comparative syntax patterns as indicators and measures of these factors. For example, if a reliable model of how and when syntax accommodation is used as an association or dissociation signal is developed, it could be used to detect subtle and implicit instances of these phenomena. Without having to rely on content, it could be possible to infer whether a person feels affiliation with, or wants to distance themselves from, another speaker or group (Brewer, 1991). Similarly, it could be possible to infer whether a speaker agrees or disagrees with a communication partner, even if they don't use the same words. Importantly, these kinds of measurement models could be applied to any domain that is reliably associated with a specific pattern of syntax usage. Beyond being important findings in and of themselves, the development of such models could potentially provide researchers with new ways to operationalize and test hypotheses across content domains.
Yet another area where CASSIM might be used is to explore the effects of comparative syntax patterns on situational outcomes. For example, there might be instances where deviating from one's conversation partner's syntactic structure leads to one being viewed more positively. Understanding how syntax usage and, more specifically, how dynamic syntax convergence and divergence patterns relate to social outcomes has the potential to illuminate a heretofore unexplored area of social dynamics. Finally, CASSIM could also be used to investigate the relationship between individual differences on dimensions of interest and syntax usage. For example, different sets of syntax patterns might be associated with different populations. Further, people who differ on a given dimension (e.g., intelligence, abstract vs. concrete thinking, working memory, self-talk) might tend to employ different sets of syntactic structures (Kross et al., 2014; Semin and Fiedler, 1991). Alternatively, various psychological situations (e.g., psychological distance, and the use of abstract versus concrete language) could be evident as changes in how people put sentences together (Trope and Liberman, 2010; Förster et al., 2004).
2.5.1 Limitations
CASSIM is based on constituency parse trees and their similarity; accordingly, its processing time is determined by how optimized constituency parse tree generators are. Thus, it is not as fast as methods such as LSM which rely solely on word-count techniques. Additionally, the accuracy of the constituency parse tree generator used in CASSIM directly affects CASSIM's accuracy.
Further, CASSIM assumes that there are clear boundaries between sentences and that documents follow accurate grammar rules. Yet, there are some cases in which these assumptions might not hold. For example, in spoken language, people tend to connect their sentences with conjunctions during a conversation, leading to unnecessarily long sentences. Another example is that some age groups do not follow conventional grammar rules in their text messages. However, there are constituency parse tree generators which are specifically designed to address these scenarios (e.g., the caseless English model by Manning et al. (2014)).
2.5.2 Future Work
In future research, we hope to extend CASSIM so that it can be used to represent the average or general syntax structure used by a group of people. Specifically, this will be accomplished by estimating an average representation of the syntax structures used in a set of documents generated by a group. Such group-level representations will be useful for a variety of tasks. For example, new document representations could be compared to the group-level representation, and this might provide insight into the relation between the author of the new document and the group. Further, by developing a method for group-level representation, we can begin developing a better understanding of between-group variations in syntax usage.
Additionally, we aim to extend CASSIM to compare not only constituency parse trees but also dependency parse trees. While constituency parse trees carry useful information about the relationships among words' part-of-speech tags, dependency parse trees exhibit the connections between the words and how they are related to one another. We believe adding this feature will help researchers study human language in finer-grained detail.
While previous studies have mostly emphasized semantics or word usage in language, our results, along with the results of a handful of other studies, provide evidence for the importance of syntax as a lens for examining social cognition. We believe that our method for measuring the syntax similarity of documents expedites the process of syntax analysis and will further encourage researchers to incorporate syntax along with individual words and semantics when assessing psychological phenomena.
Chapter 3
Follow my Language!
Effect of Power Relations on Syntactic Alignment
Communication accommodation (or language alignment) is a phenomenon in social interactions in which people adjust their language to that of their interlocutor. One line of research related to communication accommodation concerns power and dominance relations and suggests that language use depends on power position. There are different linguistic markers that signal a person's power standing. For example, when high-power individuals interact with people in low-power positions, the language of the interaction tends to follow the language of the high-power individuals. While previous studies have mostly focused on word-level features of language during this process, in two different experiments we show that not only do people in low power mirror the word usage of people in high power, but they also adjust their syntactic structures to those in high power.
3.1 Introduction
Human language is an important tool which both captures and is affected by underlying psychological states and social interactions. Semantic analysis has received extensive attention over the years, and the majority of studies have focused on the role of semantics, and more specifically word usage, in exploring psychological factors evident in language, such as political orientations and moral concerns (Pennebaker and Stone, 2003; Dehghani et al., 2014; Graham et al., 2009; Dehghani et al., 2016a; Mehl et al., 2012; Maass et al., 2006; Dehghani et al., 2013a; Ramirez-Esparza et al., 2008). At the same time, several studies have shown that syntactic features also carry vital information about individuals' and groups' characteristics, such as emotional states and personality (Bresnan and Hay, 2008; Vigliocco and Franck, 1999; Jahr, 1992; Gawda, 2010; Boghrati et al., 2017c).
In the past decade, the development of various automatic tools for measuring syntactic features and capturing syntactic similarity (Lu, 2010; Kyle and Crossley, 2015; Graesser et al., 2004; Kyle, 2016; Niederhoffer and Pennebaker, 2002; Ireland and Pennebaker, 2010) has paved the way for researchers to explore a wider variety of novel and existing psychological questions by focusing not only on word usage but also on syntactic features. A recent example of these tools is the ConversAtion level Syntax SImilarity Metric (CASSIM; Boghrati et al., 2017a), which measures the syntactic similarity between two documents based on their sentences' constituency parse tree similarity. Employing CASSIM, Boghrati et al. (2017a) studied Communication Accommodation Theory (CAT; Giles, 2008) and the interactive alignment model (Pickering and Garrod, 2004) in social media conversations. Specifically, they demonstrated the presence of syntactic alignment in social media.
These and other related theories propose that people adjust features of their communication dynamics, such as vocal patterns and gestures, while interacting with others in order to maximize or minimize their social differences (Shepard et al., 2001). When producing an utterance, people often face multiple syntactic structure choices that convey roughly the same meaning. Syntactic alignment suggests that the availability of these choices depends on what people have recently heard or produced (Bock, 1986a). For example, Boghrati et al. (2017a) showed that when people write a comment in response to a post, they tend to follow the syntactic structures used in that post compared to their own previous writing style.
Several models have been introduced to explain the basic cognitive mechanisms underlying syntactic priming (Reitter et al., 2011; Chang et al., 2006), while other theories have focused on higher social-cognitive explanations of this phenomenon. For example, the social exchange process theory, a component of communication accommodation theory, states that people assess the utilities and costs of their actions and choose accordingly. Although priming might decrease personal identity, it is also a mechanism for becoming similar to others and thus attracting their attention (Giles, 1979). For example, research shows that people who are in low power positions tend to adapt their language to the language of their superiors (Danescu-Niculescu-Mizil et al., 2012). In their study, Danescu-Niculescu-Mizil et al. (2012) examined function word classes and demonstrated linguistic coordination among people with different power status. However, this work focused only on word-level patterns, i.e., what words people choose rather than how they put the words together. In a different study, Kacewicz et al. (2013) identified a range of linguistic markers which indicate whether a speaker is speaking to a superior or to a subordinate.
Building off of the results of capturing syntactic alignment in social media using CASSIM, in the current study we aim to investigate the relationship between syntactic features and power dynamics. Our goal is to employ a tool for measuring syntactic features and to examine whether syntactic alignment can indicate whether a person is in a high-power or a low-power position. In particular, we apply CASSIM to two real-life situations in which people in different power positions have verbal communications: the U.S. Supreme Court dialogues among lawyers and justices, and Wikipedia conversations among editors with administrative and non-administrative access (Danescu-Niculescu-Mizil et al., 2012). These two corpora are especially suitable for our purpose because the conversations occur between interlocutors who hold different power positions and who interact to achieve a goal.
In the following sections, we first describe the method used in this study to measure syntactic similarity in conversations. Next, we explain the two studies we conducted to explore the relationship between syntax and power. For each study, we first describe the dataset and our approach, then we demonstrate and explain the results. Finally, we discuss our results and conclusions.
3.2 Method
In this section, we explain our approach for measuring the syntactic structures of conversations. Notably, our goal is to compute a syntactic similarity score for verbal conversations between two people. These scores will help us determine whether syntactic alignment can be a marker of power status and dominance.
To measure syntactic similarity scores, we used the ConversAtion level Syntax SImilarity Metric (CASSIM; Boghrati et al., 2017a). CASSIM relies on the edit distance between constituency parse trees to evaluate the syntactic similarity of documents or conversations. Given two documents, CASSIM first generates constituency parse trees for the sentences in each document. Second, it calculates the edit distance between each pair of sentences' constituency parse trees. Edit distance captures the number of operations (adding, removing, or replacing nodes) needed to transform one tree into another. Next, CASSIM matches the most syntactically similar sentences across the two documents using the Hungarian algorithm. Finally, it provides a score between 0 and 1, where higher numbers indicate higher similarity between the two documents. For more details on how CASSIM works, see Boghrati et al. (2017a).
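For illustration, the sketch below shows the matching step of this pipeline in Python. It is a simplification, not the released CASSIM implementation: the tree_edit_distance helper (e.g., a Zhang-Shasha style tree edit distance normalized to [0, 1]) is assumed, and details such as distance normalization and handling of unmatched sentences are omitted.

import numpy as np
from scipy.optimize import linear_sum_assignment

def document_syntax_similarity(trees_a, trees_b, tree_edit_distance):
    """Pair the most syntactically similar sentences across two documents
    and turn the average matched distance into a similarity score.

    trees_a and trees_b are lists of constituency parse trees (one per
    sentence); tree_edit_distance is an assumed helper returning a
    normalized edit distance in [0, 1] between two trees."""
    # Pairwise sentence-level distances form the cost matrix.
    cost = np.array([[tree_edit_distance(ta, tb) for tb in trees_b]
                     for ta in trees_a])
    # Hungarian algorithm: optimal one-to-one matching of sentences.
    rows, cols = linear_sum_assignment(cost)
    mean_distance = cost[rows, cols].mean()
    return 1.0 - mean_distance   # similarity in [0, 1]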
As mentioned earlier, in the current paper we use two corpora for studying the relationship between syntactic alignment and power status: the U.S. Supreme Court dialogues and Wikipedia conversations. These two corpora include conversations among people who are in different power positions and interact to achieve a desirable goal. In the following two analyses, we employ CASSIM to compare each pair of consecutive turns in a conversation between two persons with different power status and to assign a syntactic similarity score. We then treat these scores as the degree to which interlocutors coordinate with one another in terms of syntactic structure use. For instance, in the example shown in Table 3.1, we measure the syntactic similarity of the lawyer's response to what the justice has just uttered and also the syntactic similarity of the justice's reply to the lawyer. The first score is the syntactic coordination score of the lawyer toward the justice, while the second score serves as the syntactic coordination score of the justice toward the lawyer.
Table 3.1: U.S. Supreme Court dialogues example

Justice O'Connor: Would you mind explaining to us how these two cases relate? The Court of Appeals for the Federal Circuit decision went one way and the Tenth Circuit went another. And are the claims at all overlapping? How are they differentiated?
Mr. Miller: No, Justice O'Connor. They're -- they're not overlapping. The claims in the Federal Circuit case involved three contracts covering fiscal years 1994, 1995, and 1996. And the Cherokee contract at issue in the case that went through the Tenth Circuit is fiscal year 1997 contract and funding agreement. The section -- remedial section of the act, section 110
Justice O'Connor: But they're certainly at odds on the legal theory.
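For illustration, a minimal sketch of this turn-pair scoring is shown below; the data frame layout, the column names, and the similarity helper (standing in for CASSIM's document comparison) are all hypothetical rather than part of the released corpora or code.

import pandas as pd

def turn_coordination_scores(turns: pd.DataFrame, similarity) -> pd.DataFrame:
    """Score each reply by its syntactic similarity to the preceding turn.

    turns is assumed to be ordered within one conversation and to have
    'text' and 'role' columns (e.g., 'justice' / 'lawyer'); similarity is a
    function taking two texts and returning a CASSIM-style score."""
    records = []
    for prev, reply in zip(turns.itertuples(), turns.iloc[1:].itertuples()):
        records.append({
            # e.g. 'lawyer_toward_justice' when a lawyer replies to a justice
            "direction": f"{reply.role}_toward_{prev.role}",
            "similarity": similarity(reply.text, prev.text),
        })
    return pd.DataFrame(records)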
3.3 Studies
As noted earlier, Communication Accommodation Theory (Giles, 2008) suggests that people in low power positions adjust their language to that of people in higher power positions. In other words, the language used by people during a communication is likely to reflect the language of the person in the higher power role (West and Turner, 2013).
The main hypothesis of our analyses draws upon the above-mentioned theories (Giles, 2008; West and Turner, 2013). We are primarily interested in examining the following hypothesis:
People in low power tend to accommodate their syntactic structures toward people in high power, while people in high power generally do not converge toward people in low power.
In the following subsections, for each study, we first introduce the corpus, then describe the process, and finally report and discuss the results.
3.3.1 The U.S. Supreme Court Study
For the first analysis, we used the U.S. Supreme Court dialogue corpus collected by Hawes et al. (2009) and later expanded by Danescu-Niculescu-Mizil et al. (2012) to include the final votes. We applied CASSIM to the conversations among the justices and lawyers in each case. Then, we conducted independent t-tests to examine our hypotheses. In the following, we first describe the dataset and our procedure. Then, we report and discuss the results.
3.3.1.1 Data
The U.S. Supreme Court corpus includes oral arguments among justices and lawyers. During a case, the lawyers have thirty minutes to defend their party. The justices, a group of nine individuals, may interrupt the lawyers to ask questions or request clarifications, which often leads to interactions between the lawyers and the justices. After the arguments for each case, the final decision is made by the majority vote of the justices.
The oral arguments include 204 cases with a total of 50,389 verbal exchanges among 11 justices and 311 lawyers. For more details about the corpus, see Danescu-Niculescu-Mizil et al. (2012).
3.3.1.2 Analysis
To explore the relationship between power status and syntactic structure alignment, we re-framed our hypothesis stated earlier for the U.S. Supreme Court corpus as follows:
1. Lawyers align their syntactic structure use toward justices more than justices coordinate toward lawyers.
2. Lawyers align their syntactic structure use toward Chief justices more than they do toward Associate justices, because Chief justices are in a higher power position compared to Associate justices.
3. Lawyers align their syntactic structure use toward justices who eventually vote against them (i.e., whom they are more dependent on) more than they do toward justices who voted for them.
Table 3.2: Supreme Court Results

Hypothesis                                                    t      df      Effect Size [95% CI]   p
Lawyers' coordination toward justices                         7.69   48012   0.07 [0.05, 0.09]      <.001
Lawyers' coordination toward Chief justices                   7.71   3048.9  0.16 [0.12, 0.2]       <.001
Lawyers' coordination toward justices on the opposite side    3.35   20329   0.04 [0.02, 0.07]      <.001
3.3.1.3 Results
We used CASSIM to measure the syntactic similarity score between each pair of consecutive turns in a conversation between a lawyer and a justice. We then labeled the turns where a lawyer speaks to a justice as low-to-high and the turns where a justice speaks to a lawyer as high-to-low. As Table 3.2 shows, applying a t-test with the CASSIM score as the dependent variable and comparison type (low-to-high condition or high-to-low condition) as the independent variable demonstrated that lawyers adjust their syntactic structure toward justices more than justices do toward lawyers, t(48012) = 7.69, p < 0.001, d [95% CI] = 0.07 [0.05, 0.09].
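For illustration, a minimal sketch of this comparison is given below, using turn-level coordination scores such as those sketched in Section 3.2; the column names are hypothetical, and since the fractional degrees of freedom in Table 3.2 suggest an unequal-variance test, Welch's t-test is used here as an assumption.

import pandas as pd
from scipy import stats

def compare_directions(scores: pd.DataFrame):
    """Independent t-test of coordination scores by direction.

    scores is assumed to have a 'similarity' column and a 'direction' column
    coded 'low_to_high' (lawyer replying to a justice) or 'high_to_low'
    (justice replying to a lawyer)."""
    low_to_high = scores.loc[scores["direction"] == "low_to_high", "similarity"]
    high_to_low = scores.loc[scores["direction"] == "high_to_low", "similarity"]
    t, p = stats.ttest_ind(low_to_high, high_to_low, equal_var=False)
    # Cohen's d with a pooled standard deviation as a simple effect size.
    pooled_sd = ((low_to_high.var(ddof=1) + high_to_low.var(ddof=1)) / 2) ** 0.5
    d = (low_to_high.mean() - high_to_low.mean()) / pooled_sd
    return t, p, d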
Further, turning to the second hypothesis, we examined whether lawyers coordinate more toward Chief justices or toward Associate justices. Because Chief justices are in a higher power position compared to Associate justices, we hypothesized that syntactic structure alignment between lawyers and Chief justices is stronger than between lawyers and Associate justices. Similar to the previous hypothesis, we used CASSIM to compare lawyers' syntactic structure use toward Chief justices and their syntactic structure use toward Associate justices. As Table 3.2 shows, an independent t-test with comparison type (Chief justices or Associate justices) as the independent variable and CASSIM scores as the dependent variable showed that lawyers coordinate their syntactic structure use toward Chief justices more than they do toward Associate justices, t(3048.9) = 7.71, p < 0.001, d [95% CI] = 0.16 [0.12, 0.2].
Finally, we used CASSIM to compare the syntactic structure alignment of lawyers toward justices who voted against them and justices who voted for them. We hypothesized that lawyers coordinate their syntactic structure toward justices who lean against them more than toward justices who may be in favor of them. The reason is that lawyers need to convince justices who are against them; as a result, they are more dependent on them (Emerson, 1962). The results confirm our hypothesis: a t-test with comparison type (opposite side or same side) as the independent variable and CASSIM scores as the dependent variable showed that lawyers tend to mimic the syntactic structure of justices who voted for the opposite side more closely than that of justices who voted for their side, t(20329) = 3.35, p < 0.001, d [95% CI] = 0.04 [0.02, 0.07] (see Table 3.2).
3.3.1.4 Discussion
In this study we examined the relationship between power and syntactic structures in the U.S. Supreme Court oral arguments among lawyers and justices. We used CASSIM to measure syntactic structure alignment in the arguments. As Table 3.2 demonstrates, our results showed that lawyers, who are in the lower power position, adapt their syntactic structures toward justices. Further, lawyers tend to use language that is syntactically more similar to that of Chief justices compared to Associate justices. This difference emerges because Chief justices hold a higher power status than Associate justices, and we therefore expect more syntactic alignment.
Finally, we showed the effect of dependency within power differences on syntactic alignment. The need to convince another person in a conversation creates a form of dependency (Emerson, 1962). Our results show that lawyers align their syntactic structures toward justices who eventually voted against them more than toward justices who eventually voted for them. Because lawyers desire to alter justices' votes toward their own side, they feel more dependency toward justices on the opposite side, which leads to a greater imbalance in power positions. As a result, they tend to mimic the language style of those justices more closely.
Table 3.3: Wikipedia Results

Hypothesis                                             t      df      Effect Size [95% CI]   p
Non-admin editors' coordination toward admin editors   2.59   66678   0.02 [0, 0.03]         <.001
3.3.2 Wikipedia Study
In the second study, we used a Wikipedia conversations corpus introduced by Danescu-Niculescu-Mizil et al. (2012). We applied CASSIM to the conversations among editors with administrative and non-administrative access to compare their syntactic structures and to examine whether non-admin editors align their syntactic structures toward admin editors. In the following subsections, we first describe the corpus, then explain the procedure, and finally report and discuss the results.
3.3.2.1 Data
The Wikipedia corpus includes conversations about changes to different articles among editors with either administrative or non-administrative access. Generally, these interactions are collaborative discussions aimed at achieving a common goal. Some Wikipedia editors have administrative roles which give them permission to perform certain functions (such as page deletion, page protection, or blocking and unblocking) and therefore a higher status compared to editors with non-administrative access.
This corpus includes 240,436 conversational exchanges on the talk pages about changes to articles among editors with known status, that is, either an administrative or a non-administrative role. For more details, see Danescu-Niculescu-Mizil et al. (2012).
3.3.2.2 Analysis
In this analysis, to investigate the effect of power status on syntactic alignment, we directly examined the hypothesis stated in Section 3.3: people in low power coordinate toward people in high power, while people in high power do not. In the Wikipedia corpus, administrative editors hold high power compared to non-administrative editors. Notably, we are interested in investigating the following hypothesis in this study:
Editors with a non-administrative role adapt their syntactic structures toward editors with an administrative role, while editors with an administrative role do not converge toward editors with a non-administrative role.
3.3.2.3 Results
We used CASSIM to compare the syntactic structures used by administrative and non-administrative editors in conversational exchanges. We then labeled the exchanges as admin-to-nonadmin when an editor with an administrative role replies to an editor with a non-administrative role, and nonadmin-to-admin when an editor with a non-administrative role replies to an editor with an administrative role. As Table 3.3 shows, an independent t-test with comparison type (either admin-to-nonadmin or nonadmin-to-admin) as the independent variable and CASSIM scores as the dependent variable demonstrated that editors with a non-administrative role coordinate their syntactic structure use toward editors with an administrative role more than administrative editors coordinate toward non-administrative editors, t(66678) = 2.59, p < 0.001, d [95% CI] = 0.02 [0, 0.03].
3.3.2.4 Discussion
In the second study, we used a corpus of conversations among Wikipedia editors with either an administrative or a non-administrative role to explore the relationship between power dynamics and syntactic alignment. Wikipedia editors with administrative roles have access to certain functions and are, therefore, in a higher power position compared to editors who do not have an administrative role. While the results of a study by Danescu-Niculescu-Mizil et al. (2012) showed that users coordinate toward administrative editors more than toward non-administrative editors, and also that administrative editors coordinate toward users more than non-administrative editors do, in our study we focused only on the interactions between administrative and non-administrative editors. Our analysis showed that editors with a non-administrative role adjust their syntactic structure use toward editors with an administrative role more than administrative editors do toward non-administrative editors. The results support our main hypothesis that people in low power positions adapt their syntactic structures toward people in high power positions.
3.4 General Discussion and Future Work
The two analyses in this study provided evidence for the relationship between power status and syntactic alignment. Our results demonstrate that individuals in low power positions accommodate their syntactic structure toward those in high power positions more than high power individuals do toward low power people.
Notably, in the first analysis, we used a corpus of the U.S. Supreme Court oral arguments and showed that lawyers (who are in low power) coordinate their syntactic structure toward justices (who are in high power) more than justices coordinate toward lawyers. Further, we also showed that lawyers tend to adapt their syntactic structure toward Chief justices more than toward Associate justices, which can be explained by the difference in the power positions of Chief justices and Associate justices. Finally, we investigated the effect of dependency on syntactic alignment. Our results showed that lawyers adjust their syntactic structure toward justices who ultimately voted against them more than toward justices who were on their side and voted for them.
In the second analysis, we examined our main hypothesis in a corpus of Wikipedia
conversations among editors with administrative and non-administrative access. The
same effect held in this analysis, that is, non-administrative editors (who are in low
power) coordinate their syntactic structure toward administrative editors (who are in
high power) more than administrative editors do toward non-administrative editors.
As stated in social exchange theory (Giles, 1979), people assess the utilities and costs of their actions prior to acting. For example, language alignment may attract others at the cost of decreasing personal identity. Therefore, drawing from the results of our analyses, when people in different power positions communicate, those who are in low power try to converge toward people in higher power, as doing so brings them greater utility in the form of establishing rapport and becoming closer to those in power. However, people in high power see no such utility in converging toward people in low power.
Building on the results of our analyses, we aim to study syntactic structures in finer-grained detail and explore syntax categories which are more common among people in low or high power. In other words, our goal is to study whether there are syntactic structures that can serve as linguistic markers of people in different power positions. Further, we intend to investigate which syntactic structures are more likely to be mirrored by people in lower power positions.
In summary, our results support our hypothesis that the relationship between language
alignment and power is not limited to word-level features. The same effect may be found in the syntactic structure use of people in different power positions, that is, low power
people align their syntactic structure toward high power people more than high power
people do toward low power people.
Chapter 4
Generalized Representation of Syntactic Structures
Analysis of language provides important insights into the underlying psychological prop-
erties of individuals and groups. While the majority of language analysis work in psy-
chology has focused on semantics, psychological information is encoded not just in what
people say, but how they say it. In the current work, we propose Conversation Level
Syntax Similarity Metric-Group Representations (CASSIM-GR). This tool builds gen-
eralized representations of syntactic structures of documents, thus allowing researchers
to distinguish between people and groups based on syntactic differences. CASSIM-GR builds off of Conversation Level Syntax Similarity Metric by applying spectral clustering to syntactic similarity matrices and calculating the center of each cluster of documents. The resulting cluster centroid then represents the syntactic structure of that group of documents. To examine the effectiveness of CASSIM-GR, we conduct three experiments across three unique corpora. In each experiment, we calculate the clustering accuracy and compare our proposed technique to a bag-of-words approach. Our results provide evidence for the effectiveness of CASSIM-GR and demonstrate that combining syntactic similarity and tf-idf semantic information improves the total accuracy of group classification.
4.1 Introduction
Language lies at the heart of human communication, and analysis of language has been
shown to be an essential lens for investigating and understanding many different psy-
chological properties. Language analysis has provided insight into depression Ramirez-
Esparza et al. (2008), moral values Graham et al. (2009); Dehghani et al. (2016a), neuroti-
cism and extraversion Mehl et al. (2012), political orientations Dehghani et al. (2014), and
cultural backgrounds Maass et al. (2006); Dehghani et al. (2013a) among many others.
Most of these studies, however, focus on quantifying word choice or semantics. While
semantics undoubtedly play an important role in capturing psychological properties, it is
vital to also include analysis of syntax in this process. Prior research has shown that syntactic structures also capture individual and group differences for various demographic
and psychological factors such as educational or regional background Bresnan and Hay
(2008), gender Vigliocco and Franck (1999), socio-economics Jahr (1992), and emotional
states and personality Gawda (2010).
Recently, several tools have been developed for automated analysis of syntactic struc-
tures. For example, Lu's (2010) system analyzes fourteen different measures, includ-
ing the ratio of verb phrases, number of dependent clauses, and T-units to calculate
documents' syntactic complexity. Similarly, TAALES relies on several features such as
frequency, range, academic language, and psycholinguistic word information to measure
lexical sophistication Kyle and Crossley (2015). By comparison, Coh-Metrix is a tool which provides measurements for over 200 different facets of syntax (e.g., mean number of modifiers per noun phrase, mean number of high-level constituents per word, and the incidence of word classes that signal logical or analytical difficulty) Graesser et al. (2004).

Figure 4.1: CASSIM-Group Representation Process.
While each of these tools provides different mechanisms for measuring various syntactic features, they all rely on previously identified features of interest. More recently, we introduced the ConversAtion Level Syntax Similarity Metric (CASSIM) to incorporate constituency parse trees when calculating the syntactic similarity of documents Boghrati et al. (2017a). CASSIM compares groups of documents based on their underlying syntactic differences.
There are some situations, however, where hypothesis testing about predefined fea-
tures or groups may not be the only aim. Instead, researchers may wish to identify new
groupings of documents and the features which tie them together. These group-level
linguistic representations can lead to important, novel discoveries about how a group
communicates. Clustering techniques are widely used for this type of analysis. There is
an extensive literature studying various text clustering approaches and their applications
Song et al. (2009); Sasaki and Shinnou (2005); Lin et al. (2014). This literature demon-
strates that many linguistic features facilitate improvements in text clustering Liu et al.
(2003, 2005), some of which address the effect of synonymy, hypernymy, syntax, and part
of speech tags on text clustering methods Sedding and Kazakov (2004); Lewis and Croft
(1989); Lewis (1992); Zheng et al. (2009).
In the current paper, we introduce ConversAtion Level Syntax Similarity Metric-
Group Representations (CASSIM-GR), a tool that captures the generalized representa-
tion of syntactic structure used by individuals in a certain group. CASSIM-GR groups
documents into separate clusters based on their syntactic similarity scores, and uses the
centroid of a cluster as a generalized representation of the syntactic structures used in
that cluster. These centroid syntax representations can then be used to understand within-group syntax similarities and between-group syntax variations. As we will show, these generalizations of syntactic structures can be useful when analyzing differences between documents written by different individuals or groups.
This paper is structured as follows: First, we describe our proposed approach, CASSIM-
GR, in more detail. Next, we validate the approach with a corpus of syntactically similar
documents. Then, we apply CASSIM-GR to two other corpora: documents marked as
dogmatic and non-dogmatic (Fast and Horvitz, 2016) and documents from conservative
and liberal weblogs (Dehghani et al., 2013b), and evaluate the classification accuracy of CASSIM-GR compared to a tf-idf approach and to a combination of the two approaches. Finally, we discuss limitations and future directions of our work.
4.2 CASSIM-GR
In this section we describe CASSIM-GR for clustering groups of documents with simi-
lar syntactic structures. CASSIM-GR includes four general steps: 1. constructing the
syntactic similarity matrix, 2. applying spectral clustering, 3. calculating the center of
clusters, and 4. classification. Figure 4.1 demonstrates the steps involved in CASSIM-GR to
compute the generalized representation of syntactic structures.
First, we use CASSIM Boghrati et al. (2017a) to calculate the syntactic similarity between each pair of documents. CASSIM relies on the edit distance between constituency parse trees. It first generates parse trees for the sentences in each document. Next, it calculates the edit distance between each pair of sentences' constituency parse trees and matches the most syntactically similar sentences using the Hungarian algorithm. Finally, it provides a score between 0 and 1, where higher numbers indicate higher similarity between the two documents. Using the syntactic similarity scores measured by CASSIM, we build a syntactic similarity matrix. With N documents in our corpus, the syntax similarity matrix is A_{N×N}, where A_{i,j} is the syntactic similarity of the two documents i and j.
Next, spectral clustering Shi and Malik (2000) is used to cluster documents into a pre-defined number of groups. It has been shown that spectral clustering often outperforms traditional clustering algorithms Von Luxburg (2007). The general idea behind spectral clustering is to apply k-means clustering to the eigenvectors of the Laplacian matrix of A. The syntactic similarity matrix A, constructed in the previous step, and the number of clusters are provided as inputs to the spectral clustering method.
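Assuming the similarity matrix A from the previous sketch, the clustering step can be expressed with scikit-learn's spectral clustering on a precomputed affinity matrix; the use of scikit-learn here is an assumption for illustration, not necessarily the implementation used in CASSIM-GR.

from sklearn.cluster import SpectralClustering

def cluster_documents(A, k, seed=0):
    """Return a cluster label in 0..k-1 for each of the N documents,
    using A as a precomputed affinity (similarity) matrix."""
    model = SpectralClustering(n_clusters=k, affinity="precomputed",
                               random_state=seed)
    return model.fit_predict(A)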
Clustering documents leads to an essential next step: extracting a general attribute or representation of each cluster. One way to address this is to calculate a centroid for each cluster. Cluster centers allow researchers to better understand and analyze the syntactic structures used by a group of people, or under certain situations, by analyzing only the center documents rather than going through hundreds of documents. Hence, the third step in CASSIM-GR is calculating a centroid for each cluster. We define a cluster's center as the document which has the highest syntactic similarity to the other documents in its cluster. To identify a cluster's center, we calculate the average syntactic similarity of each document to the other documents in its cluster and return the document with the highest average similarity. Additionally, we may return the top n documents with the highest average syntactic similarity to the other documents in a cluster as representative samples of that cluster.
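A minimal sketch of the centroid step, under the same assumptions as above (a symmetric 0-1 similarity matrix A with ones on the diagonal, and the cluster labels from the previous sketch):

import numpy as np

def cluster_centroids(A, labels):
    """Return {cluster_label: index of the centroid document}, where the
    centroid is the member with the highest average similarity to the
    other members of its cluster."""
    centroids = {}
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        sub = A[np.ix_(members, members)]
        # drop the self-similarity of 1 on the diagonal before averaging
        avg = (sub.sum(axis=1) - 1.0) / max(len(members) - 1, 1)
        centroids[c] = int(members[np.argmax(avg)])
    return centroids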
Finally, we use cross-validation to test the accuracy and representativeness of the
clusters' centers. To cross-validate, our approach uses CASSIM to calculate the syntactic
similarity of the left-out document to each centroid and assigns the document to the cluster with the highest similarity. This process is repeated N times, and a classification accuracy is reported by the method. In the following sections, we evaluate CASSIM-GR by performing classification experiments on three different corpora.
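The leave-one-out evaluation described above might look as follows; the sketch reuses the hypothetical helpers from the previous sketches and is illustrative rather than a reproduction of the original implementation.

import numpy as np

def loo_accuracy(documents, groups, k, cassim_similarity):
    """Hold out each document in turn, cluster the rest, name each cluster
    after the majority group of its members, and check the prediction."""
    correct = 0
    for i, held_out in enumerate(documents):
        rest = documents[:i] + documents[i + 1:]
        rest_groups = groups[:i] + groups[i + 1:]
        A = build_similarity_matrix(rest, cassim_similarity)
        labels = cluster_documents(A, k)
        centroids = cluster_centroids(A, labels)
        # name each cluster after the majority group of its members
        names = {}
        for c in centroids:
            member_groups = [rest_groups[m] for m in np.where(labels == c)[0]]
            names[c] = max(set(member_groups), key=member_groups.count)
        # assign the held-out document to its most similar cluster center
        scores = {c: cassim_similarity(held_out, rest[idx])
                  for c, idx in centroids.items()}
        predicted = names[max(scores, key=scores.get)]
        correct += int(predicted == groups[i])
    return correct / len(documents)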
4.3 Experiments
We conducted three experiments to validate CASSIM-GR and to examine the represen-
tativeness of the cluster centroids. Additionally, we examined how well documents with
similar syntactic structures cluster together and demonstrate the importance of syntactic
similarity in classication. Further, we compare the accuracy of syntactic clustering to
bag-of-words clustering. For this purpose, we use the tf-idf similarity matrix as input
to spectral clustering. Lastly, we combined tf-idf and CASSIM-GR to see how including
both sets of information aect the classication accuracy. Below, we discuss the three
experiments in detail.
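For reference, the bag-of-words baseline can be sketched with scikit-learn; the resulting tf-idf cosine-similarity matrix is then fed to the same spectral clustering step in place of the CASSIM matrix. The default vectorizer settings shown here are an assumption.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity_matrix(documents):
    """Return an N x N cosine-similarity matrix over tf-idf vectors."""
    tfidf = TfidfVectorizer().fit_transform(documents)
    return cosine_similarity(tfidf)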
Table 4.1: Corpora Overview.

                      Experiment One                   Experiment Two               Experiment Three
Corpus                Syntactically Similar Sentences  Dogmatism in New York Times  Political Weblog Posts
Number of Groups      4                                2                            2
Number of Documents   272                              500                          452
Table 4.2: Accuracy of approaches in three experiments.
Experiment One Experiment Two Experiment Three
CASSIM-GR 95% 54.8% 69.9%
TF-IDF Approach 84.5% 61% 64.4%
Combined Approach 97.8% 66.6% 71.9%
Table 4.3: Comparison of approaches in three experiments.

                                        Experiment One            Experiment Two            Experiment Three
CASSIM-GR vs. TF-IDF Approach           χ2(1) = 17.01, p < .001   χ2(1) = 3.94, p < .05     χ2(1) = 3.13, p = .07
TF-IDF Approach vs. Combined Approach   χ2(1) = 29.61, p < .001   χ2(1) = 3.39, p = .06     χ2(1) = 5.89, p < .05
CASSIM-GR vs. Combined Approach         χ2(1) = 2.67, p = .10     χ2(1) = 14.59, p < .001   χ2(1) = .43, p = .51
4.3.1 Experiment One
Experiment one was conducted on a corpus of syntactically similar documents. The
corpus was generated by Amazon Mechanical Turk participants and consists of four groups
of documents; each has high within-group syntactic similarity and low between-group
syntactic similarity.
We used CASSIM-GR, along with tf-idf, to group documents into clusters. Further, we combined these two approaches and calculated the overall accuracy. We first introduce the dataset and then report the results.
4.3.1.1 Data
118 MTurk participants answered a set of four questions. In each question they were asked
to generate sentences with similar grammar rules to the sentence prompts in the question.
Each of the four prompts had a different syntactic structure. Later, two independent coders coded whether a sentence generated by a participant was grammatically similar to its prompt. Sentences which were identified as dissimilar by both coders were excluded
from the dataset. Finally, a total of 272 documents, 68 documents in each group, were
collected. See Boghrati et al. (2017a) for more details.
Since participants were asked to write sentences similar to four different sets of prompts, the corpus is therefore divided into four separate groups, each associated with a
question and its responses. Documents which are in the same group are considered to
have similar syntactic structures.
4.3.1.2 Analysis
We performed leave-one-out cross-validation for both of the clustering techniques. Namely,
we ran the analysis on all the documents except for document i. Next, we labeled the
clusters with the name of the group to which most of the documents belong. Then, we
calculated the similarity of document i to each cluster's center. Finally, document i was assigned to the cluster with which it had the highest syntactic similarity. The classification
was considered successful if the assigned cluster's label and the document's group were
identical.
We used the following approach to combine tf-idf and CASSIM-GR: First, we used CASSIM-GR and the tf-idf approach separately to cluster documents into k clusters. Cluster j, j ∈ [1, k], in the tf-idf approach and cluster j', j' ∈ [1, k], in CASSIM-GR were labeled with the same name when the majority of documents in cluster j and the majority of documents in cluster j' were from the same group (e.g., `liberals'). We averaged the similarity of document i to the center of cluster j and the similarity of document i to the center of cluster j'. We repeated this procedure k times to measure the similarity of document i to all k clusters and assigned document i to the cluster with the highest similarity score. If the cluster's label and document i's label were the same, we concluded that the prediction was successful.
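A minimal sketch of this combined prediction rule follows. It assumes that, for a held-out document, one similarity score per labeled cluster center is already available from each approach (e.g., a CASSIM score for the syntactic center and a cosine score for the tf-idf center carrying the same label), which is one plausible reading of the averaging step described above.

def combined_prediction(cassim_scores, tfidf_scores):
    """Both arguments map a cluster label (e.g. 'liberal') to the held-out
    document's similarity to that label's cluster center; return the label
    with the highest averaged score."""
    averaged = {label: (cassim_scores[label] + tfidf_scores[label]) / 2.0
                for label in cassim_scores}
    return max(averaged, key=averaged.get)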
4.3.1.3 Results
Our results demonstrate that CASSIM-GR is able to accurately cluster the corpus. Fol-
lowing the procedure discussed above, we performed leave-one-out cross-validation on 272 documents. In each step, 271 documents were clustered into four groups, and then
the left-out document was assigned to one of the four clusters based on its similarity to
the centers of the clusters.
Following this mechanism, CASSIM-GR yielded 95% accuracy while the tf-idf approach was only 84.5% accurate. A chi-squared test demonstrates that CASSIM-GR results in significantly higher accuracy than tf-idf, χ2(1) = 17.01, p < .001. Since the dataset consists of groups of syntactically similar documents, it is not surprising that clustering based on syntactic structures surpasses the word-based approach and achieves a higher accuracy.
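One plausible way to run such an accuracy comparison is a chi-squared test on a 2 x 2 table of correct versus incorrect predictions per method, as sketched below; the dissertation does not spell out the exact test construction, so this sketch is illustrative and may not reproduce the reported statistics exactly.

from scipy.stats import chi2_contingency

def compare_accuracies(correct_a, correct_b, n):
    """Chi-squared test on a 2 x 2 table of correct/incorrect counts
    for two methods evaluated on the same n documents."""
    table = [[correct_a, n - correct_a],
             [correct_b, n - correct_b]]
    chi2, p, dof, _ = chi2_contingency(table, correction=False)
    return chi2, p, dof

# Illustrative call with Experiment One's reported accuracies (95% vs. 84.5% of 272)
chi2, p, dof = compare_accuracies(round(0.95 * 272), round(0.845 * 272), 272)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")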
Next, we combined the two approaches and obtained an accuracy of 97.8%. While this result is not significantly higher than the CASSIM-GR accuracy, χ2(1) = 2.67, p = .10, we may conclude that incorporating syntactic and semantic information together could potentially improve clustering accuracy.
4.3.2 Experiment Two
In the second experiment, we used the Dogmatism Dataset collected by Fast and Horvitz
(2016). This dataset includes comments from the New York Times which are rated based on their level of dogmatism. As explained below, we first categorized the documents as dogmatic or non-dogmatic based on these ratings. Next, we followed the procedure explained in the first experiment and clustered the documents using CASSIM-GR and the tf-idf approach. In the following subsections, we first introduce the dataset and
then report the results.
4.3.2.1 Data
The Dogmatism Dataset includes comments from the New York Times. Amazon Mechanical Turk participants were asked to rate the level of dogmatism of each of the collected comments on a 5-point Likert scale. More details on the dataset and the annotation process are available in Fast and Horvitz (2016).
4.3.2.2 Analysis
Dogmatism is subjective, and consequently inter-annotator agreement is higher for comments at both extremes of the spectrum. In other words, human coders tend to agree
more on posts rated as very high in dogmatism and posts rated as very low in dogmatism
Fast and Horvitz (2016). Following the method used by Fast and Horvitz (2016), to
have a representative and balanced dataset, we selected the top 250 and the bottom 250
documents based on the dogmatism rating. We labeled the top 250 posts as dogmatic
and the bottom 250 as non-dogmatic; hence the final dataset contained 500 posts with
250 in each group.
4.3.2.3 Results
Following the procedure in Experiment 1, we performed leave-one-out cross-validation; we ran the clustering algorithm with 499 documents and left document i, i ∈ [1, 500], out. Then, we predicted to which cluster document i belonged. The CASSIM-GR and tf-idf approaches resulted in 55% and 61% accuracy, respectively. Even though the tf-idf approach significantly outperformed our approach, χ2(1) = 3.94, p < .05, combining these two approaches resulted in a higher accuracy of 66.6%, which is a marginally significant improvement over the tf-idf accuracy, χ2(1) = 3.39, p = .06.
This result provides evidence for the importance of syntactic structure similarity in
clustering documents. It demonstrates that not only what different groups of people say, but also how they say it, provides important information about the characteristics of the group. This is evident from the fact that adding syntactic similarity to word-level similarity can improve the clustering accuracy.
4.3.3 Experiment Three
In this experiment, we applied CASSIM-GR to a corpus of political discussions taken from a set of conservative and liberal weblogs, and focused on the discussion about the
Ground Zero Mosque Dehghani et al. (2013b).
4.3.3.1 Data
The top five conservative and liberal news blogs were selected according to www.blogs.com. Next, a dataset of posts from these weblogs which contained the word mosque and were written within the time frame of the debate was compiled. For more details about the dataset and the data collection process, please refer to Dehghani et al. (2013b).
4.3.3.2 Analysis
In this experiment, we randomly selected 250 posts from the conservative weblogs and 250 posts from the liberal weblogs, but due to encoding issues the final dataset included 226 posts from each group (a total of 452 posts).
4.3.3.3 Results
Similar to the previous experiments, we used the leave-one-out cross-validation procedure
described above. Specifically, we trained the clustering algorithm on 451 documents and predicted to which cluster the left-out document belonged. This process was repeated 452 times so that each document was tested once.
CASSIM-GR was able to successfully predict the correct cluster for a document with 70% accuracy, while tf-idf was 64.4% accurate. This difference is only marginally significant, χ2(1) = 3.134, p = .0767. Next, we combined these two approaches as described in Experiment One. The total accuracy was 72%, which is significantly more accurate than the tf-idf approach alone, χ2(1) = 5.8905, p = .0152.
These results demonstrate that, in some cases, syntactic structure similarity may capture more crucial features needed for clustering compared to the tf-idf approach. However, there are some features that only the tf-idf approach can pick up. Thus, the combination of
these two sets of features is needed for more accurate clustering.
4.4 Discussion and Future Work
Across three studies, we presented and validated a new approach called CASSIM-GR.
CASSIM-GR clusters documents into separate groups based on their syntactic similarity
and calculates a generalized representation of group-level syntax usage by performing four
general steps: First, it creates a syntactic structure similarity matrix of documents using
CASSIM. Second, it uses spectral clustering to group the documents into a pre-defined number of clusters using the syntactic similarity matrix generated in the previous step.
Next, the algorithm selects the document which has the highest syntactic similarity to
the other documents within each cluster and identifies it as the centroid of that cluster.
Finally, it can be used to classify unknown documents based on the document's syntactic
similarity to the clusters' centers.
We applied CASSIM-GR to three unique corpora (Table 4.1) across three experiments
to compare its accuracy to both a bag-of-words approach and a combined approach incor-
porating tf-idf semantic information and CASSIM-GR. As Table 4.2 demonstrates, tf-idf
and CASSIM-GR varied in their relative strength for clustering accuracy across studies.
The combined approach incorporating both syntactic (CASSIM-GR) and semantic (tf-
idf) information resulted in the highest clustering accuracy across all three experiments.
While not a significant improvement beyond both single approaches, the combination approach significantly outperformed tf-idf in two of the three experiments and CASSIM-GR in the second experiment. Therefore, we may conclude that word-level similarity and syntactic similarity capture different aspects of language, and consequently, combining
the two features' similarities results in more accurate clusters.
Our results indicate that methods assessing syntactic similarity may more accurately
cluster documents than methods which rely on semantics alone. While there may be
situations in which groups use the same general words to discuss a topic, syntactic differences could still allow researchers to distinguish between different subsets of
individuals.
More importantly, CASSIM-GR gives researchers an opportunity to study syntactic
differences between groups by analyzing the prototypical syntactic structures at the clusters' centers. The syntactic structures used by a cluster's center document are defined as a generalized representation of the syntactic structures of the documents in that cluster. Assessing differences in these structures may help to capture underlying psychological differences between groups in the ways that they conceptualize a topic or how they communicate with each other.
A vital component of CASSIM-GR is measuring syntactic similarity among docu-
ments using CASSIM. As mentioned previously, CASSIM's general focus is on comparing
constituency parse trees. Building on CASSIM, we intend to compare dependency parse
trees among sentences and documents to add another syntactic similarity measurement
to CASSIM. Unlike constituency parse trees, which capture the connections between part-of-speech tags, dependency parse trees reveal the relationships between the words in a sentence. By incorporating this feature into CASSIM, researchers may further use CASSIM-GR not only to generalize the syntactic structure of a group of documents, but also their dependency structures. This extension will help researchers study human language in finer-grained detail by looking at the relationships between words.
In summary, we introduced a new method for computing generalized representations
of syntactic structures of documents, allowing researchers to distinguish between groups of
documents based on syntactic differences. Further, in the three experiments, we demonstrated the benefits of including syntactic structure similarity scores in clustering documents. In each experiment, we repeated a clustering procedure, once using CASSIM-GR and once using a tf-idf similarity matrix. Then, we calculated the clustering accuracy of each approach using a leave-one-out cross-validation procedure. Finally, we combined the results of these two approaches and calculated the accuracy when both sets of features were present. Our results support our assumption and demonstrate that syntactic similarity scores capture different aspects of language compared to bag-of-words features, and therefore help
improve clustering accuracy.
Chapter 5
Effect of Syntactic Similarity on Recall and
Agreement
Structural priming theories suggest that individuals more easily process and are more
likely to reproduce syntactic structures to which they have been exposed. Building on
these theories, we hypothesize that familiar syntactic structures enhance both recall and
persuasiveness of arguments. In two separate pre-registered experiments, we demonstrate
that there is a positive relationship between syntax familiarity and individuals' 1. recall of
sentences and 2. acceptance of arguments to which they would normally be opposed. In
our first study, we show that an individual's writing style has an impact on their ability to recall sentences. By obtaining written samples from individuals prior to evaluating them on a sentence recall task, we show that higher similarity between individuals' writing styles and sentences' syntactic structures results in higher recall. In our second study, we provide
evidence that individuals also exhibit higher levels of agreement with statements whose
syntactic structures are familiar, even when the statements disagree with their stated
views. The results of these two studies support our main hypothesis that an individual's
increased familiarity with a syntactic structure renders that structure easier to recall and
more agreeable.
5.1 Introduction
Humans have at their disposal a subtle generative capability for sentence production in
their native languages. Chomsky (1965) details a number of potential circumscriptions on
that capability, including constraints on memory, error in productions and reproductions,
and external distractors. Over the last few decades, however, a handful of studies have
demonstrated an activation process which may provide additional limitations on the full
range of human language reproduction abilities (Bock, 1986b). Through this activation
process, known as syntactic priming, individuals are more likely to reproduce or generate
syntactic structures similar to those to which they have been exposed.
In Bock's original study on the topic, participants were more likely to produce a
syntactic form that had been visually primed when providing descriptions of images
(Bock, 1986b). Another study found similar results, except that participants were exposed
to syntactic primes through spoken conversation instead of in written form (Branigan
et al., 2000). Other studies have focused on underlying mechanisms of syntactic priming
such as short-term activation and long-term adaptation. For example, some argued that
structural priming persists over a fairly long period of time (Bock and Griffin, 2000; Branigan et al., 2000; Hartsuiker and Kolk, 1998), while others demonstrated rapid decay in syntactic priming when other structures were subsequently produced (Levelt and Kelter,
1982; Branigan et al., 1999; Wheeldon and Smith, 2003).
Meanwhile, Potter and Lombardi (1990) and Lombardi and Potter (1992) observed
that sentence recall related primarily to the underlying meaning of a sentence rather
than an exact reproduction. However, they also found that when primed concepts have
more than one possible syntactic representation, such as passive or active voice, a primed
syntactic structure is more likely to be reproduced compared to an unprimed structure.
Although semantic structure is more fundamental to sentence recall, syntax is integral
after controlling for meaning. However, the authors demonstrated in a subsequent study
that perceiving a message accounts for the syntactic priming and its verbatim immediate
recall (Potter and Lombardi, 1998).
In a different line of research, prior work on consumer psychology has found that syntactic complexity affects both recall and effectiveness of advertising (Lowrey, 1998), primarily by impacting participants' motivation to process information. Contrary to classic advertising principles that favor simple syntax ("Buy car now!"), moderate complexity has been found to be most effective (Bradley and Meeds, 2002). Further work has found these effects to be moderated by factors including prior knowledge and gender (Putrevu et al., 2004). These studies suggest that the key factor is not an abstract notion of complexity but rather individual comfort, a concept which moderating demographic variables can at best roughly approximate. This provided part of our motivation for directly assessing individuals' default syntactic patterns. One person might find passive-voice sentences more natural while another is more comfortable with simple declarative sentences. To our knowledge, this is the first work to evaluate the persuasive impact of
syntactic congruence in terms of individual-level measures of syntactic familiarity.
These findings raise questions about the broader impact of syntactic similarity on general interactability with written content. To address this, we first considered whether an
individual's own writing style has an impact on their ability to recall sentence structures.
That is, will an individual demonstrate deeper interaction vis-a-vis short-term memory
with a syntactic structure that is more familiar to them or more closely related to their
own writing style? In the first study, we attempt to answer this question by combining a text analysis tool with online written corpora drawn from the ICIC Corpus of Fundraising Texts from the American National Corpus (Reppen et al., 2005). The initial study evaluates
how individuals' own syntactic styles impact recall by obtaining written samples from
individuals prior to sentence recall of syntactically-diverse sentences. We predict that
individuals' recall abilities will be directly proportional to the similarity between the style
of a sentence for recall and each participant's writing style. In other words, increased
syntactic similarity between an individual's syntactic style and the recall prompt would
correspond with greater recall accuracy.
Another manner in which individuals interact with written content is based on their
congruence with positional material. A further question, then, is whether they also find
nd statements with similar syntactic structures more acceptable. That is, given two
semantically-similar statements that take a certain position, will an individual express
more agreement with a statement that more closely resembles their syntactic style? Thus,
in the second study we further assess the impact of syntactic structure on interactabil-
ity by analyzing agreement with politically-valenced statements of different syntactic
structures. The primary hypothesis for the second study is that individuals will express
stronger agreement with a sentence that they disagree with but which more closely resem-
bles their writing style than with a similarly-valenced statement that differs more from
their own writing style.
Since this question also lacks a meaningful amount of previous research, this hy-
pothesis is also exploratory. We predict that, when individuals are given politically-
valenced statements opposing their views on global warming, they will express stronger
disagreement with syntactically-dissimilar statements compared to syntactically-similar
statements.
In the following, we explain the two experiments and their observed results. For each,
we first introduce our method and data, and then report the results. Finally, we discuss the findings and possible future directions.
5.2 Experiment 1
In this experiment, we explore how syntactic similarity between a sentence and an indi-
vidual's writing style impacts their ability to recall the sentence. Sentence structure is
a fundamental component of conceptual understanding and writing style. Our primary
goal is to explore the extent to which its role also extends to the ability to recall and agree with a statement. In the first study, we aim to examine the first component of our goal. We hypothesize that when the syntactic style of an individual is more similar to the prompts they are asked to recall, their recall accuracy is increased.

To test this hypothesis, participants, recruited through the Amazon TurkPrime platform (Litman et al., 2017), were first asked to write a few sentences about a polarizing topic.
Next, they were given nine sentence prompts with three distinct syntactic structures to
read and then reproduce. Syntactic similarity was measured between 1. the participants' first writing piece and the nine recall prompts and 2. the nine recall prompts and the nine participant-generated reproductions of the prompts. We used linear mixed effects models to assess trends of syntactic similarity and recall performance. This experiment was approved by the IRB (UP-17-00502) and pre-registered on OSF (https://osf.io/yp7ch/register/565fb3678c5e4a66b5582f67).
5.2.1 Method
In the first experiment, we are primarily interested in the effect of syntactic similarity on recall performance; therefore, we used two computational tools to produce syntactic structures and examine individuals' ability to recall prompt sentences. We briefly describe the two tools used in this study, CASSIM-GR (Boghrati et al., 2017d) and CASSIM (Boghrati et al., 2017b), before delving into the design and results of the experiment.
5.2.1.1 Syntax Prompt Production
To measure participants' recall performance, we presented them with a set of nine sentence
prompts to memorize and recall later. The procedure to generate the nine prompts was
as follows:
We applied the ConversAtion level Syntax SImilarity Metric-Group Representation (CASSIM-GR; Boghrati et al., 2017d) to the ICIC Corpus of Fundraising Texts from the American National Corpus (Reppen et al., 2005). CASSIM-GR is a tool which clusters sentences
based on the similarity of their syntactic structures, i.e., sentences with similar syntactic
structures fall within the same cluster. Figure 4.1 outlines the steps that CASSIM-GR
undergoes in order to cluster sentences and compute a generalized syntactic representa-
tion. We do not use the prediction component in our study and therefore only explain
the first three components step by step.
First, at its core, CASSIM-GR utilizes the ConversAtion level Syntax SImilarity Metric (CASSIM; Boghrati et al., 2017b) to calculate a syntactic similarity score between two given sentences. CASSIM is a computational tool which calculates the syntactic similarity between each pair of documents and works as follows: First, it generates constituency parse trees for all the sentences in the two documents. Next, it uses edit distance as a measure of difference between each pair of parse trees across the two documents. Third, it uses a minimum weight perfect matching algorithm, here the Hungarian algorithm (Kuhn, 1955), to match the most syntactically similar sentences across the two documents. Finally, it averages the similarity of the matched sentences and calculates a final syntactic similarity score for the two documents.
Using the syntactic similarity scores measured by CASSIM between each pair of documents, CASSIM-GR builds a syntactic similarity matrix. Assuming N documents in a corpus, the syntax similarity matrix is A_{N×N}, where A_{i,j} is the syntactic similarity of the two documents i and j. Second, spectral clustering Shi and Malik (2000) is used to cluster documents into a pre-defined number of groups. The third step in CASSIM-GR is calculating a centroid for each cluster. A cluster center is defined as the document which has the highest syntactic similarity to the other documents in its cluster. To identify a cluster's center, CASSIM-GR calculates the average syntactic similarity of each document to the other documents in its cluster and returns the document with the highest average similarity.

Table 5.1: Cluster Center Prompts for Recall

Cluster ID  Sentence Prompt
1           In these volatile times, as children and their parents are subjected to mounting pressures in schools, our mission is even more critical.
2           There are many ways for us to become involved with local charities, NGOs, and nonprofit organizations.
3           The auction will benefit the numerous scholarship programs of the Community Center of our neighborhood, a nonprofit agency.
In the current experiment, applying CASSIM-GR to 738 sentences from 245 documents from the ICIC Corpus (Reppen et al., 2005) resulted in three separate clusters. As previously discussed, sentences in each cluster tend to have different syntactic structures
from sentences in the other clusters.
For each cluster, we extracted the cluster center, or the representative sentence, as a general representation of the syntactic structures used in that cluster. Table 5.1 displays the
cluster centers for the three clusters. Further, we manually generated two syntactically
similar sentences per cluster center. Generating the sentences gave us the opportunity
to control for sentence length, sentence complexity, and word length and frequency, as shown in Table 5.2, which would not have been possible if we had simply chosen two other sentences from the clusters. To verify the representativeness of these sentences, we applied CASSIM-GR to the nine sentences (six manually generated sentences
and three representative sentences). CASSIM-GR clustered the sentences into three clus-
ters, with three sentences per cluster. The generated sentences fell into the same cluster
as their corresponding representative sentence, showing that they are syntactically more similar to their representative sentence compared to the other sentences. Therefore, we may conclude that the nine sentences are drawn from three distinct syntactic structures.

Table 5.2: Generated Sentence Prompts for Recall

Cluster ID  Sentence Prompt
1           In these summer months as children are exposed to increasing heat in the outdoors, their hydration is even more critical.
1           For these upcoming projects as students are subjected to shortening deadlines in the semester, their concentration is even more crucial.
2           It is very common for them to come prepared with relevant questions, pressing concerns, and new suggestions.
2           There were few options for me to remain unnoticed by local press, social media, and national television.
3           The presentation will highlight the recent technological advances in the field of self-driving vehicles, a fascinating subject.
3           The movie theater will present the selected works of some directors of the independent film movement, an older genre.
5.2.1.2 Study
First, to assess participants' syntactic style, we asked Amazon TurkPrime participants to
write down their opinion about global warming. We specifically chose a publicly "controversial topic" to ensure that participants were aware of, and had opinions about, the topic prior to study entry, and that they could easily write down a few sentences about their
opinion on the topic and provide enough information for us to measure their writing style
and syntactic structure use. Participants were told to only use formal language when
writing their response.
We used the nine validated sentences as prompts in our recall trials. The nine prompts
were presented to the participants in a randomized order, one word at a time, using the
Rapid Serial Visual Presentation paradigm (Van de Velde and Meyer, 2014). At the
beginning of each trial, we gave participants instructions about how the prompts would
be presented on the screen. After clicking on `continue', a cross symbol was shown on
the screen for 400ms to draw participants' focus to where words would be presented;
following the presentation of the cross symbol, a sentence was shown word by word, with
each word being displayed on the screen for 200ms.
Participants were asked to solve a very simple math task (addition) to prevent them
from mentally maintaining the sentences they were exposed to and to verify their en-
gagement in the survey Lombardi and Potter (1992). After each sentence was revealed,
a math problem was displayed on the screen and participants had five seconds to solve
the problem and enter the sum. After providing their solution or running out of time,
participants were given a text block in which to write down the sentence they had just
read to the best of their ability. Figure 5.1 summarizes the recall experiment steps.
5.2.2 Analysis
Using CASSIM (Boghrati et al., 2017b), we calculated the syntactic similarity between
essays and the nine prompts for each participant. Further, to measure recall performance,
we calculated the syntactic similarity between each recalled sentence and its corresponding
sentence prompt.
Since we are interested in the effect of syntax on participants' recall performance, we conducted our analysis at the syntax cluster level; that is, we averaged both essays'
similarity to sentence prompts and recalled sentences' similarity to sentence prompts over
the three sentence prompts per cluster.
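The cluster-level aggregation can be sketched as follows, again assuming a hypothetical cassim_similarity(text_a, text_b) wrapper around CASSIM; prompts_by_cluster and recalls_by_cluster are assumed dictionaries mapping each of the three cluster ids to its three prompts and the corresponding recalled sentences.

import numpy as np

def cluster_level_scores(essay, prompts_by_cluster, recalls_by_cluster, cassim_similarity):
    """For each cluster, average (1) essay-to-prompt similarity and
    (2) recalled-sentence-to-prompt similarity over the cluster's prompts."""
    rows = []
    for cid, prompts in prompts_by_cluster.items():
        essay_sim = np.mean([cassim_similarity(essay, p) for p in prompts])
        recall_sim = np.mean([cassim_similarity(r, p)
                              for r, p in zip(recalls_by_cluster[cid], prompts)])
        rows.append({"cluster": cid, "essay_sim": essay_sim, "recall_sim": recall_sim})
    return rows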
Figure 5.1: Recall Experiment Process (global warming essay, followed by nine rounds of sentence prompt, math task, and sentence recall).
5.2.2.1 Participants
We launched our experiment on TurkPrime (Litman et al., 2017). We first conducted a pilot study to estimate the number of participants needed for our mixed effects analyses
and recruited 60 Amazon TurkPrime participants, 30 with liberal and 30 with conservative
political views to get diverse opinions about global warming. All the participants were
recruited from the United States with English being their native language.
Next, we ran simulations with the pilot data to determine the number of participants necessary for sufficient power and an acceptable Type I error rate. To test our hypothesis, we applied a linear mixed effects model: the syntactic similarity of recalled sentences and prompt sentences was the dependent variable, the syntactic similarity of participants' essays and prompt sentences was the fixed effect, and participant identifier and cluster identifier (corresponding to the first, second, or third cluster) were random effects. Simulating the mixed effects model with 500 participants, we achieved a power of 0.89.
We collected data for 510 participants, including 329 liberals and 181 conservatives.
Each participant wrote an essay about global warming and then recalled nine sentence
prompts with three different syntactic structures.
5.2.2.2 Results
Our results indicate that participants more accurately recalled sentence prompts in a cluster with which they had higher syntactic similarity compared to the other sentence prompts. Specifically, using the lme4 package (1.1-15) in R (Bates et al., 2014) and fitting a linear mixed effects model revealed that higher similarity with a syntactic structure leads to more accurate recall, β = .44, SE = 0.0028, p < .001, 0.0438 [0.0372, 0.0499]. We used the average syntactic similarity of participants' essays to the prompts over the three prompts in each cluster as the fixed effect, the average syntactic similarity of recalled sentences to their corresponding prompts over the three prompts in each cluster as the dependent variable, and user id and cluster id (corresponding to the first, second, or third cluster) as random effects.
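The model was fit in R with lme4; the sketch below is a simplified Python analogue using statsmodels with a random intercept for participant only (the cluster random effect is omitted), so it is illustrative and will not reproduce the reported estimates. The data frame df and its column names are assumptions.

import pandas as pd
import statsmodels.formula.api as smf

def fit_recall_model(df: pd.DataFrame):
    """df is assumed to hold one row per participant x cluster with columns
    'recall_sim', 'essay_sim', and 'participant' (from the earlier sketch).
    Random intercept for participant only; the cluster effect is omitted here."""
    model = smf.mixedlm("recall_sim ~ essay_sim", data=df, groups=df["participant"])
    return model.fit()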
5.2.3 Discussion
In this experiment, we examined the first component of our hypothesis, which is the relationship between individuals' syntax style and their ability to recall a sentence with a certain syntactic structure. We hypothesized that individuals interact with familiar syntactic structures more effectively. Notably, in this study, we addressed individuals' recall performance when exposed to familiar syntactic structures. In our study, we first assessed participants' writing style and syntactic familiarity by asking them to write a short paragraph about global warming. Then, they were presented with nine validated sentence prompts with three distinct syntactic structures. After answering a distractor task, participants were instructed to write down the sentence they had just seen to the best of their ability. Our linear mixed effects model showed that participants have higher recall performance for sentences whose syntax is similar to their own writing style. This result supports our hypothesis that familiarity with a syntactic structure may assist individuals in memorizing it more accurately.
5.3 Experiment 2
In this section, we explain our approach to examining the relationship between syntactic familiarity and individuals' congruence with positional material. The primary research question in this study is whether syntactic familiarity influences individuals' interactions with written content. Notably, as we demonstrated in Experiment 1, being familiar with a certain syntactic structure facilitates the ability to recall that structure. In Experiment 2, we build on our findings from the first experiment by examining whether syntactic familiarity also increases the level of agreement with a polarized statement. We hypothesize that individuals perceive statements with familiar syntax as more favorable even if the statements disagree with their previously established view. To test our hypothesis, we first asked participants whether they are pro or anti global warming. Next, we randomly displayed a statement with an opposing view to the participant's stated view and asked them to specify their level of agreement with that statement on a scale of 1 to 7. Using the Brunner-Munzel test (Brunner and Munzel, 2000), we showed that participants expressed higher levels of agreement with statements which used a familiar syntactic structure. This experiment was pre-registered on OSF (https://osf.io/uc8r3/register/565fb3678c5e4a66b5582f67).
5.3.1 Method
In the second experiment, we are interested in the effect of syntactic structure on participants' levels of agreement with politically-valenced statements. In this study, we rely on four statements that express different opinions and syntactic structures.
5.3.1.1 Stimuli Selection
As stated before, the second experiment stems from the previous recall experiment; there-
fore, we relied on participants' essays about global warming to select the stimuli in the
second survey.
First, we coded the global warming essays from the recall study as being either pro
or anti global warming. Next, we used CASSIM-GR to investigate whether there are
syntactic differences between the two groups. Our leave-one-out classification approach achieved 60% accuracy and provided evidence that syntax usage was different between pro global warming and anti global warming essays, χ2(1) = 6.5369, p < .05.
Second, we extracted the group center for both pro and anti global warming groups.
The group center is a statement to which the other statements in its group have the highest syntactic similarity, and it serves as a general representation of the syntactic structures used
in that group. Further, for each group, we also extracted the statement which has the
highest average syntactic similarity with all the statements in the opposite group and
called it the group outlier statement. In sum, we selected the following four structures
for this study:
1. A pro global warming statement which reflects the general syntactic structures used
by pro global warming essays.
2. A pro global warming statement which has a syntactically similar structure to anti
global warming statements.
3. An anti global warming statement which reflects the general syntactic structures
used by anti global warming statements.
4. An anti global warming statement which has a syntactically similar structure to pro
global warming statements.
We seek to examine whether conveying a message in different syntactic structures impacts individuals' perception of that message; therefore, the four aforementioned structures are required. Particularly, our goal is to test whether expressing a pro (or anti) global warming opinion using either pro or anti global warming syntax impacts anti (or pro)
global warming participants' level of agreement with the statement. Table 5.3 shows the
statements used in our study.
5.3.1.2 Study
Our study was launched on TurkPrime (Litman et al., 2017). First, we asked participants
to indicate their stance on global warming, and then asked them to indicate their level
of agreement with one of the stimuli discussed above.
Using the four stimuli (Table 5.3), we developed our study in two parts: First, participants were asked to answer a few demographic questions and to indicate their political orientation and their stance toward global warming. Second, depending on a participant's stance, one of the opposing statements was displayed and participants were asked to indicate their level of agreement with the statement on a scale from 1 to 7. If an individual did not believe global warming is human-made, then either the first or second pro statement was randomly shown to them. Similarly, if an individual believed in human-made global warming, either the third or the fourth anti statement was displayed.
Table 5.3: Statement Prompts for Agreement

ID  Structure     Statement
A   Pro Center    Global warming is a very serious issue. The temperatures all over the globe are fluctuating and there is a lack of seasons. This will affect wildlife, oceans, animals etc. There has also been an increase in natural disasters.
B   Pro Outlier   Global warming has been increasing in last decade. We have seen devastation in the most unexpected places on our planet. It is a serious problem today that threatens our future generations. The sea levels have been rising. Theres a quite amount of icebergs melting in the Antarctic. Bringing powerful storms to communities that have never seen before. Like just last year in Texas, Puerto Rico. And this year 2018 it's expected to be one of coldest winter the U.S. will ever experienced.
C   Anti Center   Global warming does not exist. It was made up by the government to create more revenue via recycling, donations, and environmentally-friendly products. The government incites fear of climate change and disasters to scare us into doing what it wants. It is a hoax. People stupidly believe it.
D   Anti Outlier  Many scientists do not believe man is responsible for global warming. In order for a statement to be a fact, there can be no doubt. The average temperature for the past 2000 years have been roughly the same. Many glaciers are expanding. Maybe only 2% of glaciers are melting. Natural disasters like tornadoes and hurricanes have not been scientifically linked to global warming. The spread of disease has not been scientifically linked to global warming. There is more evidence that the spread of disease is linked to economics. Sunspots have been linked to affect the Earth's temperature more than global warming.
5.3.2 Analysis
To compare the difference between participants' agreement levels with statements which used either familiar or unfamiliar syntactic structures, we conducted two Brunner-Munzel tests. For each group of anti and pro global warming participants, we ran a separate model to examine whether using familiar syntactic structures resulted in a higher agreement level with an argument which supported an opposing view.
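The test itself can be sketched with scipy's implementation of the Brunner-Munzel statistic (the analyses reported below were run with the lawstat package in R); the rating lists and their names are assumptions.

from scipy.stats import brunnermunzel

def compare_agreement(center_ratings, outlier_ratings):
    """Brunner-Munzel test on two lists of 1-7 agreement ratings, e.g. the
    ratings given to the 'center' vs. 'outlier' opposing statement within one
    participant group."""
    result = brunnermunzel(center_ratings, outlier_ratings)
    return result.statistic, result.pvalue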
5.3.2.1 Participants
To estimate the number of participants needed for our model, we first conducted a pilot
study on TurkPrime and recruited 50 Amazon TurkPrime participants, 25 with liberal
political views and 25 with conservative political views. All the participants were native
English-speakers recruited from the United States.
Next, we simulated our model with the pilot data. Because we did not have enough
data for anti global warming participants, we used a sample of 26 pro global warming
individuals and their levels of agreement with a given statement. The main research ques-
tion is whether participants show higher levels of agreement with a statement containing
a familiar compared to an unfamiliar syntactic structure. Given that the outcome data are ordinal (level of agreement on a scale of 1 to 7), the final analyses are conducted using the Brunner-Munzel test. However, power is not easily calculated for this method, and thus power analyses were instead conducted for the independent samples t-test. Since the t-test is underpowered for the comparison of ordinal data, the resulting sample size requirements are conservative (i.e., they call for more participants) compared to what is actually
necessary for the Brunner-Munzel test.
We specified our test to be powered at 0.90 or greater, at a significance level of 0.05. Our initial results gave an effect size of d = 1.19; given the extremely high value of the effect size combined with the small sample size, there were concerns that the results observed were related to the small sample size and insufficient power. Consequently, we powered our test to detect a moderate effect size of d = 0.5. At this specification, a power of 0.9 would be achieved with 86 individuals per group: anti global warming participants' agreement level with either pro global warming statement and pro global warming participants' agreement level with either anti global warming statement. Since we cannot control for participant allegiance to the groups of interest (i.e., pro and anti global warming), we accounted for a potential imbalance in the groups by pre-registering at least 400 total individuals for this study.
We collected data from 565 Amazon TurkPrime participants. Due to the imbalance in participants' stances toward global warming, 410 participants with a pro global warming viewpoint and 155 participants with an anti global warming viewpoint were included in the analyses.
5.3.2.2 Results
Our results indicate that familiarity with the syntactic structures of opposing statements affected pro global warming participants' agreement level, while no effect was evident in the anti global warming group. We used the lawstat package (3.2) in R (Gastwirth, 2008) to conduct two separate Brunner-Munzel tests for pro and anti global warming participants' agreement levels. Comparing pro global warming participants' levels of agreement with an argument which used the "anti center" syntactic structure and an argument which used the "anti outlier" syntactic structure showed that participants disagreed with the "anti outlier" argument less strongly compared to the "anti center" argument, t(352.51) = 4.7236, p < .001, 0.6297 [0.5757, 0.6837]. However, we did not find any difference in anti global warming participants' level of agreement with the "pro center" and "pro outlier" arguments, t(148.51) = 1.6068, p = 0.1102, 0.4271 [0.3374, 0.5167].
5.3.3 Discussion
In Experiment 2, we sought to examine the second component of our main hypothesis.
Building on the results of the first experiment, we hypothesized that individuals find opposing arguments with familiar syntactic structures more persuasive and therefore express weaker disagreement. In this study, we first asked participants to indicate their stance
on global warming by gauging whether they believe global warming is human-made or
not. Next, based on their position on global warming, one of the two opposing stimuli was
randomly presented on the screen and participants were instructed to specify their level of
agreement on scale of 1 to 7. Applying the Brunner-Munzel test on the pro global warming
group demonstrated that participants perceived statement D (the argument with familiar
syntactic structure) more agreeably compared to statement C (the argument with less
familiar syntactic structure). This novel finding could have widespread implications for the fields of persuasion and negotiation: as arguments for a position adopt syntactic structures familiar to their audience, they too may be perceived as more agreeable by those with opposing views. Conversely, the same effect did not hold with anti global warming participants. One possible explanation is that 410 out of 565 of the participants recruited through Amazon TurkPrime were pro
global warming and thus we had a smaller proportion of participants for the anti global
warming analysis. Additionally, in the pilot study, we conducted a power analysis using only pro global warming group data, while a power analysis with the anti global warming group might have suggested a larger number of participants. Thus, increasing the number of anti global warming participants may lead to a stronger conclusion. Another explanation may be that the anti global warming group is less flexible in facing
opposing views.
5.4 General Discussion
The overarching goal of the current research is to gain a better understanding of the
implications of syntactic familiarity, both in terms of an individual's ability to recall
statements and tendency to agree with an opinion. By investigating these questions, we
hope to fill the scientific gap regarding the role that syntax plays in both the recall of statements and the persuasiveness of different forms of the same argument.
Structural priming studies suggest that individuals tend to follow the syntactic structure
to which they have been recently exposed, both in their written and spoken language.
Various relevant studies have looked at the impact of syntactic structure on recall as well
as on the effectiveness of statements (Potter and Lombardi, 1998; Lowrey, 1998). However, to
the best of our knowledge, our work is the first to study the effect of syntactic familiarity
on recall performance and argument persuasiveness.
Building on these findings, we hypothesized that individuals interact with familiar
syntactic structures more effectively. In two pre-registered experiments, we
examined how syntactic familiarity facilitates both argument recall and persuasiveness.
Specifically, in the first experiment, we sought to investigate the first component of our
hypothesis: whether syntactic familiarity affects individuals' recall performance. We
instructed our participants to first write a short paragraph describing their opinion
toward global warming, a controversial topic. Next, we presented the participants with
nine sentence prompts in a randomized order and asked them to write down what they
recalled to the best of their ability. The nine sentence prompts, drawn from the American
National Corpus, came from three distinct syntactic structure clusters. We then calculated
the syntactic similarity of (1) each individual's essay to the nine sentence prompts and
(2) each recalled sentence to its corresponding sentence prompt. The results of this
experiment support our hypothesis that individuals more accurately recall sentences whose
syntactic structures are similar to their own writing style.
Later, in the second experiment, we focused on the effect of syntactic familiarity on
the persuasiveness of arguments. After participants indicated their stance on global warming,
they were presented with an argument opposing their own view and were asked
to specify their agreement level. Our results demonstrate that pro global warming participants
perceived arguments with familiar syntactic structure as more agreeable. However,
our analysis did not show any significant effect for anti global warming participants. The
findings in this study provide preliminary evidence for our main hypothesis. However,
we did not achieve the pre-registered effects for the anti global warming group. One
potential (post-hoc) explanation may be that anti global warming individuals are less
flexible when they are exposed to opposing arguments. A larger number of participants
may also give us better insight into the effect of syntax familiarity on this group's
agreement level.
Our work is an example of how advances in computational text analysis tools have
paved the way for researchers to explore a wider variety of novel and existing psychological
questions by focusing not only on word usage but also on syntactic features (Lu, 2010; Kyle
and Crossley, 2015; Graesser et al., 2004; Kyle, 2016; Niederhoffer and Pennebaker, 2002;
Ireland and Pennebaker, 2010). For example, CASSIM and CASSIM-GR, which have
been developed recently, provide researchers with an opportunity to study and explore
patterns in the syntactic structures used by groups or individuals.
Despite the novelty of the current studies, there are a few limitations. In the first
study, creating three clusters from the American National Corpus likely does not fully
account for all the variations of syntactic structures used in our participants' essays. A
starting point could involve using more clusters that map onto an even broader range
of syntax. Clusters could also be generated based on participants' essay responses
to cater directly to their personal style. Additionally, the stimuli in both
experiments were limited to a single topic (global warming). These studies should be
repeated on other polarizing, as well as less polarizing, topics to show that the observed
trends still hold. Lastly, all our participants were recruited through a single platform, which
limits the generalizability of our studies to a narrow segment of society. Further
replication of this study in various contexts and among different populations should be
conducted to examine its generalizability.
Appendix A
A.1 Hungarian Algorithm
The Hungarian algorithm operates as follows: First, the complete bipartite graph is
converted to a matrix in which rows correspond to nodes in set A and columns correspond
to nodes in set B. The value in cell (i, j) represents the weight of the edge between nodes
i and j. Next, the minimum number in each row (and then each column) is found and
subtracted from all the elements of that row (or column). Then, if the minimum number of
lines that can cover all the zeros in the matrix is equal to the number of rows (or columns),
one zero is selected from each row and column, and the edges represented by those cells
are returned as the final output of the Hungarian algorithm. Otherwise, the minimum
uncovered number in the matrix is subtracted from all the cells that are not covered and
added to the cells that are covered twice. This step is repeated until the minimum number
of lines needed to cover all the zeros in the matrix equals the number of rows (or columns)
(Figure A.1).
Figure A.1: An example of the Hungarian algorithm process. 1- Convert the complete
bipartite graph to a matrix, with cells representing the edges' weights. 2- In each row, subtract
the minimum number from all elements of the row. For example, in the first row the
smallest number is 1, so 1 is subtracted from every element of that row. 3- In each column,
subtract the minimum number from all elements of the column. 4- Cover all the zeros in the
matrix with the minimum number of lines. 5- If the minimum number of lines is less than
the number of rows (or columns), find the minimum uncovered number in the matrix (here 2).
Then subtract this number from all the cells that are not covered and add it to the cells
that are covered twice, in our example cell (a2, b1). Return to step 4 and repeat until the
number of lines equals the number of rows (or columns); then go to step 6. 6- From each row
and column, choose one zero. 7- The cells chosen in the previous step are the edges in the
optimal solution to the minimum-weight perfect matching problem. The sum of the actual
weights of the selected edges is the minimum weight, in our example 1 + 3 + 9 = 13.
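As a minimal illustration (not the CASSIM implementation itself), the worked example in Figure A.1 can be reproduced with SciPy's off-the-shelf assignment solver, which solves the same minimum-weight perfect matching problem (using a modified Jonker-Volgenant algorithm rather than the classical Hungarian steps):

# Illustrative only: solve the assignment problem from Figure A.1 with SciPy.
import numpy as np
from scipy.optimize import linear_sum_assignment

# Edge weights between nodes a1..a3 (rows) and b1..b3 (columns),
# taken from step 1 of Figure A.1.
weights = np.array([[1, 6, 7],
                    [2, 4, 3],
                    [5, 9, 8]])

row_ind, col_ind = linear_sum_assignment(weights)  # minimum-weight matching
for r, c in zip(row_ind, col_ind):
    print(f"a{r + 1} -> b{c + 1} (weight {weights[r, c]})")
print("total weight:", weights[row_ind, col_ind].sum())

Running this sketch recovers a matching of total weight 13 for the matrix shown in step 1 of the figure.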
A.2 Instructions for Experiment 1
Thank you for agreeing to participate in this task. In this task, you'll be prompted to
write sentences which are grammatically similar to other sentences.
A sentence with very simple grammar would be something like "Julia chased Billy".
This sentence has a simple subject (Julia), a verb (chased) and a single object (Billy).
If asked to write a sentence similar to that, you might try something like "The cat eats
its food." This sentence is about different things, but uses a very similar grammatical
structure.
From there, we can continue to add complexity. A sentence like "Julia gives Billy the
book" adds an indirect object (the book) while a sentence like "Julia chases Billy into
the cage" adds a prepositional phrase (into the cage).
Don't worry about remembering the details of the rules! There won't be any tests
asking you to identify the past perfect progressive! Just use your intuitive understanding
of how complicated a given sentence is and what its structure is like.
Each task has between one and three sentences. For each set of sentences, please
create new sentences that are grammatically similar to the original. Please follow similar
grammatical rules as the ones used in the example sentences but do not use the same
exact words when creating these sentences.
Here is an example with one sentence (base) and three other sentences (Solutions 1,
2 and 3) that use the same grammatical rule as the base sentence.
First Example: Base: If you haven't got anything nice to say about anybody, come
sit next to me.
Solution #1: If you don't know where you're going, any road'll take you there.
Solution #2: When you judge a fish by its ability to climb a tree, it will live its whole
life believing that it is stupid.
Solution #3: When you do things from your soul, you feel a river moving in you.
All of the above sentences are complex sentences with a dependent clause followed by
an independent clause, but you don't have to know that in order to generate a solution.
Just use your intuition.
And here is an example with two sentences in the base, and three solutions that use
the same grammar.
Second Example: Base: He couldn't come because he was sick. I can't go out because
I have a lot of homework.
Solution #1: I couldn't sleep well because it was noisy outside. I have to go soon
because I left the engine running.
Solution #2: They had to change their schedule because the train arrived late. I
couldn't get out of my garage because there was a car in the way.
Solution #3: She always speaks to him in a loud voice because he's hard of hearing.
She hired him as an interpreter because she had heard that he was the best.
Once you are ready to start, please press the continue button.
A.3 Prompts
A.3.1 Prompt 1
Please compose two sentences that have a similar syntax structure as the following two
sentences:
The two most important days in your life are the day you are born and the day you
find out why. The nice thing about being a celebrity is that you bore people and they
think it's their fault.
A.3.2 Prompt 2
Please compose two sentences that have a similar syntax structure as the following two
sentences:
I am enough of an artist to draw freely upon my imagination. Imagination is more
important than knowledge. Knowledge is limited. Imagination encircles the world.
A.3.3 Prompt 3
Please compose two sentences that have a similar syntax structure as the following two
sentences:
When we love, we always strive to become better than we are. When we strive to
become better than we are, everything around us becomes better too.
A.3.4 Prompt 4
Please compose a sentence that has a similar syntax structure as the following sentence:
What is the point of being alive if you don't at least try to do something remarkable?
A.4 Detailed Results of Null and Alternative Models
Table A.1: Validation Study, Analysis 1
Method Model Type Estimate Std. Error p
CASSIM Null -0.1080 0.0153
Alternative 1.7541 0.0696 <.0001
LSM Null -0.0789 0.0241
Alternative 1.2396 0.0721 <.0001
SYNMEDPOS Null 0.3174 0.0220
Alternative -1.4161 0.0653 <.0001
SYNSTRUT Null -0.3325 0.0177
Alternative 1.4589 0.0749 <.0001
Table A.2: CAT Study, Analysis 1
Method Model Type Estimate Std. Error p
CASSIM Null 0.1105 0.0283
Alternative 0.1694 0.0262 <.0001
LSM Null 0.138 0.041
Alternative 0.1287 0.0306 <.0001
SYNMEDPOS Null -0.2119 0.0539
Alternative -0.0128 0.0213 =0.547
SYNSTRUT Null 0.0669 0.0338
Alternative 0.0585 0.0220 <.01
Table A.3: CAT Study, Analysis 2
Method Model Type Estimate Std. Error p
CASSIM Null 0.0544 0.0455
Alternative 0.0825 0.0179 <0.0001
LSM Null 0.0831 0.0687
Alternative 0.110 0.0160 <.0001
SYNMEDPOS Null -0.0865 0.0535
Alternative -0.0462 0.0183 <.05
SYNSTRUT Null 0.0141 0.0125
Alternative -0.0295 0.0195 =0.13
A.5 Comparison of Processing Time
To compare the processing time of the different algorithms, we picked a random post with
10 of its comments and used the four algorithms below to calculate the syntax similarity of
each comment to the post. As stated in the paper, since the SYNMEDPOS and SYNSTRUT
indices are built for document coherence calculations, we followed the original instructions and
implemented these two indices to fit our needs; therefore there might be undocumented
underlying optimization techniques that we are not aware of. Tables A.4 and A.5 show the
statistics of post and comment length and the processing times, respectively.
Table A.4: Processing Time Corpus Statistics
Post Comment (mean, SD, range)
Words 378 137.7, 100.89, [27,322]
Sentences 14 8.1, 5.7, [2,17]
Table A.5: Processing Time
Method Time
CASSIM 6 Minutes
LSM <1 Minute
SYNMEDPOS 60 Minutes
SYNSTRUT 70 Minutes
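For completeness, a timing comparison of this kind can be run with a simple wall-clock harness. The sketch below is only illustrative: the similarity functions it references (cassim_similarity, lsm_similarity, and so on) are hypothetical placeholders standing in for the four implementations compared in Table A.5, not real APIs.

# Illustrative timing harness for comparing similarity algorithms on one post
# and its comments. The *_similarity functions named below are hypothetical
# placeholders for the actual implementations timed in Table A.5.
import time

def time_method(name, similarity_fn, post, comments):
    """Run similarity_fn on every (comment, post) pair and report elapsed time."""
    start = time.perf_counter()
    scores = [similarity_fn(comment, post) for comment in comments]
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.1f} s for {len(scores)} comments")
    return scores

# Hypothetical usage, assuming post (str) and comments (list of str) are loaded:
# time_method("CASSIM", cassim_similarity, post, comments)
# time_method("LSM", lsm_similarity, post, comments)
# time_method("SYNMEDPOS", synmedpos_similarity, post, comments)
# time_method("SYNSTRUT", synstrut_similarity, post, comments)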
Bibliography
D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042, 2016.
D. J. Barr, R. Levy, C. Scheepers, and H. J. Tily. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3):255–278, 2013.
D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823, 2014.
R. C. Berwick, A. D. Friederici, N. Chomsky, and J. J. Bolhuis. Evolution, brain, and the nature of language. Trends in Cognitive Sciences, 17(2):89–98, 2013.
J. K. Bock. Meaning, sound, and syntax: Lexical priming in sentence production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12(4):575, 1986a.
J. K. Bock. Syntactic persistence in language production. Cognitive Psychology, 18(3):355–387, 1986b.
K. Bock and Z. M. Griffin. The persistence of structural priming: Transient activation or implicit learning? Journal of Experimental Psychology: General, 129(2):177, 2000.
R. Boghrati, J. Hoover, K. M. Johnson, J. Garten, and M. Dehghani. Conversation level syntax similarity metric. Behavior Research Methods, 2017a.
R. Boghrati, J. Hoover, K. M. Johnson, J. Garten, and M. Dehghani. Conversation level syntax similarity metric. Behavior Research Methods, pages 1–19, 2017b.
R. Boghrati, K. M. Johnson, and M. Dehghani. Generalized representation of syntactic structures. In CogSci, 2017c.
R. Boghrati, K. M. Johnson, and M. Dehghani. Generalized representation of syntactic structures. CogSci, 2017d.
S. D. Bradley and R. Meeds. Surface-structure transformations and advertising slogans: The case for moderate syntactic complexity. Psychology & Marketing, 19(7-8):595–619, 2002.
H. P. Branigan, M. J. Pickering, and A. A. Cleland. Syntactic priming in written production: Evidence for rapid decay. Psychonomic Bulletin & Review, 6(4):635–640, 1999.
H. P. Branigan, M. J. Pickering, and A. A. Cleland. Syntactic co-ordination in dialogue. Cognition, 75(2):B13–B25, 2000.
H. P. Branigan, M. J. Pickering, J. Pearson, and J. F. McLean. Linguistic alignment between people and computers. Journal of Pragmatics, 42(9):2355–2368, 2010.
H. P. Branigan, M. J. Pickering, J. Pearson, J. F. McLean, and A. Brown. The role of beliefs in lexical alignment: Evidence from dialogs with humans and computers. Cognition, 121(1):41–57, 2011.
W. D. Brent. Introduction to graph theory, 1999.
J. Bresnan and J. Hay. Gradient grammar: An effect of animacy on the syntax of give in New Zealand and American English. Lingua, 118(2):245–259, 2008.
M. B. Brewer. The social self: On being the same and different at the same time. Personality and Social Psychology Bulletin, 17(5):475–482, 1991.
E. Brill. Automatic grammar induction and parsing free text: A transformation-based approach. In Proceedings of the Workshop on Human Language Technology, pages 237–242. Association for Computational Linguistics, 1993.
E. Brunner and U. Munzel. The nonparametric Behrens-Fisher problem: Asymptotic theory and a small-sample approximation. Biometrical Journal, 42:286–317, 2000.
F. Chang, G. S. Dell, and K. Bock. Becoming syntactic. Psychological Review, 113(2):234, 2006.
N. Chomsky. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press, 1965.
S. A. Crossley, J. Greenfield, and D. S. McNamara. Assessing text readability using cognitively based indices. TESOL Quarterly, 42(3):475–493, 2008.
S. A. Crossley, D. S. McNamara, et al. Adaptive Educational Technologies for Literacy Instruction. Routledge, 2016.
C. Danescu-Niculescu-Mizil, M. Gamon, and S. Dumais. Mark my words!: Linguistic style accommodation in social media. In Proceedings of the 20th International Conference on World Wide Web, pages 745–754. ACM, 2011.
C. Danescu-Niculescu-Mizil, L. Lee, B. Pang, and J. Kleinberg. Echoes of power: Language effects and power differences in social interaction. In Proceedings of the 21st International Conference on World Wide Web, pages 699–708. ACM, 2012.
M. Dehghani, M. Bang, D. Medin, A. Marin, E. Leddon, and S. Waxman. Epistemologies in the text of children's books: Native- and non-native-authored books. International Journal of Science Education, 35(13):2133–2151, 2013a.
M. Dehghani, K. Sagae, S. Sachdeva, and J. Gratch. Linguistic analysis of the debate over the construction of the ground zero mosque. Journal of Information Technology & Politics, 2013b. Advance online publication.
M. Dehghani, K. Sagae, S. Sachdeva, and J. Gratch. Analyzing political rhetoric in conservative and liberal weblogs related to the construction of the ground zero mosque. Journal of Information Technology & Politics, 11(1):1–14, 2014.
M. Dehghani, K. Johnson, J. Hoover, E. Sagi, J. Garten, N. J. Parmar, S. Vaisey, R. Iliev, and J. Graham. Purity homophily in social networks. Journal of Experimental Psychology: General, 2016a.
M. Dehghani, K. M. Johnson, J. Garten, R. Boghrati, J. Hoover, V. Balasubramanian, A. Singh, Y. Shankar, L. Pulickal, A. Rajkumar, and N. J. Parmar. TACIT: An open-source text analysis, crawling and interpretation tool. Behavior Research Methods, 2016b.
F. M. del Prado Martín and J. W. Du Bois. Syntactic alignment is an index of affective alignment: An information-theoretical study of natural dialogue. In Proceedings of the Cognitive Science Society, 2015.
J. Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102, 1970.
R. M. Emerson. Power-dependence relations. American Sociological Review, pages 31–41, 1962.
E. Fast and E. Horvitz. Identifying dogmatism in social media: Signals and models. arXiv preprint arXiv:1609.00425, 2016.
D. Fernández-González and A. F. Martins. Parsing as reduction. arXiv preprint arXiv:1503.00030, 2015.
J. Förster, R. S. Friedman, and N. Liberman. Temporal construal effects on abstract and concrete thinking: Consequences for insight and creative cognition. Journal of Personality and Social Psychology, 87(2):177, 2004.
S. Freud. The Psychopathology of Everyday Life. WW Norton & Company, 1966.
R. Fusaroli and K. Tylén. Investigating conversational dynamics: Interactive alignment, interpersonal synergy, and collective task performance. Cognitive Science, 40(1):145–171, 2016.
R. Fusaroli, B. Bahrami, K. Olsen, A. Roepstorff, G. Rees, C. Frith, and K. Tylén. Coming to terms: Quantifying the benefits of linguistic coordination. Psychological Science, page 0956797612436816, 2012.
D. Garcia and S. Sikström. The dark side of Facebook: Semantic representations of status updates predict the dark triad of personality. Personality and Individual Differences, 67:92–96, 2014.
J. Gastwirth. lawstat: An R package for law, public policy and biostatistics. 2008.
B. Gawda. Syntax of emotional narratives of persons diagnosed with antisocial personality. Journal of Psycholinguistic Research, 39(4):273–283, 2010.
H. Giles and P. Smith. Accommodation theory: Optimal levels of convergence. In H. Giles and St. Clair (eds.), Language and Social Psychology, pages 45–87, 1979.
H. Giles. Communication accommodation theory. Sage Publications, Inc, 2008.
H. Giles, J. Coupland, and N. Coupland. Contexts of Accommodation: Developments in Applied Sociolinguistics. Cambridge University Press, 1991.
A. C. Graesser, D. S. McNamara, M. M. Louwerse, and Z. Cai. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36(2):193–202, 2004.
J. Graham, J. Haidt, and B. A. Nosek. Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5):1029, 2009.
C. J. Groom and J. W. Pennebaker. The language of love: Sex, sexual orientation, and language use in online personal advertisements. Sex Roles, 52(7-8):447–461, 2005.
N. Guéguen. Mimicry and seduction: An evaluation in a courtship context. Social Influence, 4(4):249–255, 2009.
R. J. Hartsuiker and H. H. Kolk. Syntactic persistence in Dutch. Language and Speech, 41(2):143–184, 1998.
T. Hawes, J. Lin, and P. Resnik. Elements of a computational model for multi-party discourse: The turn-taking behavior of Supreme Court justices. Journal of the Association for Information Science and Technology, 60(8):1607–1615, 2009.
P. G. Healey, M. Purver, and C. Howes. Divergence in dialogue. PLoS ONE, 9(6):e98598, 2014.
M. E. Ireland and M. D. Henderson. Language style matching, engagement, and impasse in negotiations. Negotiation and Conflict Management Research, 7(1):1–16, 2014.
M. E. Ireland and J. W. Pennebaker. Language style matching in writing: Synchrony in essays, correspondence, and poetry. Journal of Personality and Social Psychology, 99(3):549, 2010.
M. E. Ireland, R. B. Slatcher, P. W. Eastwick, L. E. Scissors, E. J. Finkel, and J. W. Pennebaker. Language style matching predicts relationship initiation and stability. Psychological Science, 22(1):39–44, 2011.
C. Jacob, N. Guéguen, A. Martin, and G. Boulbry. Retail salespeople's mimicry of customers: Effects on consumer behavior. Journal of Retailing and Consumer Services, 18(5):381–388, 2011.
E. H. Jahr. Middle-aged male syntax. International Journal of the Sociology of Language, 94(1):123–134, 1992.
E. Kacewicz, J. W. Pennebaker, M. Davis, M. Jeon, and A. C. Graesser. Pronoun use reflects standings in social hierarchies. Journal of Language and Social Psychology, page 0261927X13502654, 2013.
Y. Kim and K. McDonough. Learners' production of passives during syntactic priming activities. Applied Linguistics, 29(1):149–154, 2008.
D. Klein and C. D. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423–430. Association for Computational Linguistics, 2003.
A. Koller and K. Striegnitz. Generation as dependency parsing. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 17–24. Association for Computational Linguistics, 2002.
E. Kross, E. Bruehlman-Senecal, J. Park, A. Burson, A. Dougherty, H. Shablack, R. Bremner, J. Moser, and O. Ayduk. Self-talk as a regulatory mechanism: How you do it matters. Journal of Personality and Social Psychology, 106(2):304, 2014.
H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
K. Kyle. Measuring Syntactic Development in L2 Writing: Fine Grained Indices of Syntactic Complexity and Usage-Based Indices of Syntactic Sophistication. PhD thesis, Georgia State University, 2016.
K. Kyle and S. A. Crossley. Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, 49(4):757–786, 2015.
C. M. Laserna, Y.-T. Seih, and J. W. Pennebaker. Um... who like says you know: Filler word use as a function of age, gender, and personality. Journal of Language and Social Psychology, page 0261927X14526993, 2014.
W. J. Levelt and S. Kelter. Surface form and memory in question answering. Cognitive Psychology, 14(1):78–106, 1982.
D. D. Lewis. Feature selection and feature extraction for text categorization. In Proceedings of the Workshop on Speech and Natural Language, pages 212–217. Association for Computational Linguistics, 1992.
D. D. Lewis and W. B. Croft. Term clustering of syntactic phrases. In Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 385–404. ACM, 1989.
J. Lin and W. J. Wilbur. Syntactic sentence compression in the biomedical domain: Facilitating access to related articles. Information Retrieval, 10(4-5):393–414, 2007.
Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee. A similarity measure for text classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 26(7):1575–1590, 2014.
L. Litman, J. Robinson, and T. Abberbock. TurkPrime.com: A versatile crowdsourcing data acquisition platform for the behavioral sciences. Behavior Research Methods, 49(2):433–442, 2017.
L. Liu, J. Kang, J. Yu, and Z. Wang. A comparative study on unsupervised feature selection methods for text clustering. In Natural Language Processing and Knowledge Engineering, 2005. IEEE NLP-KE'05. Proceedings of 2005 IEEE International Conference on, pages 597–601. IEEE, 2005.
T. Liu, S. Liu, Z. Chen, and W.-Y. Ma. An evaluation on feature selection for text clustering. In ICML, volume 3, pages 488–495, 2003.
L. Lombardi and M. C. Potter. The regeneration of syntax in short term memory. Journal of Memory and Language, 31(6):713–733, 1992.
T. M. Lowrey. The effects of syntactic complexity on advertising persuasiveness. Journal of Consumer Psychology, 7(2):187–206, 1998.
X. Lu. Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4):474–496, 2010.
A. Maass, M. Karasawa, F. Politi, and S. Suga. Do verbs and adjectives play different roles in different cultures? A cross-linguistic analysis of person representation. Journal of Personality and Social Psychology, 90(5):734, 2006.
C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pages 55–60, 2014.
M. P. Marcus, B. Santorini, M. A. Marcinkiewicz, and A. Taylor. Treebank-3, LDC99T42. CD-ROM. Philadelphia, Penn.: Linguistic Data Consortium, 1999.
K. McDonough. Interaction and syntactic priming: English L2 speakers' production of dative constructions. Studies in Second Language Acquisition, 28(02):179–207, 2006.
K. McDonough and W. Chaikitmongkol. Collaborative syntactic priming activities and EFL learners' production of wh-questions. Canadian Modern Language Review, 66(6):817–841, 2010.
D. S. McNamara, A. C. Graesser, P. M. McCarthy, and Z. Cai. Automated Evaluation of Text and Discourse with Coh-Metrix. Cambridge University Press, 2014.
M. R. Mehl, M. L. Robbins, and S. E. Holleran. How taking a word for a word can be problematic: Context-dependent linguistic markers of extraversion and neuroticism. Journal of Methods and Measurement in the Social Sciences, 3(2):30–50, 2012.
H. A. Murray. Thematic apperception test. 1943.
G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31–88, 2001.
T. Nguyen, D. Q. Phung, B. Adams, and S. Venkatesh. A sentiment-aware approach to community formation in social media. In ICWSM, 2012.
K. G. Niederhoffer and J. W. Pennebaker. Linguistic style matching in social interaction. Journal of Language and Social Psychology, 21(4):337–360, 2002.
M. A. Pasca and S. M. Harabagiu. High performance question/answering. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 366–374. ACM, 2001.
J. W. Pennebaker. The secret life of pronouns. New Scientist, 211(2828):42–45, 2011.
J. W. Pennebaker and L. D. Stone. Words of wisdom: Language use over the life span. Journal of Personality and Social Psychology, 85(2):291, 2003.
J. W. Pennebaker, M. E. Francis, and R. J. Booth. Linguistic Inquiry and Word Count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 71:2001, 2001.
J. W. Pennebaker, R. J. Booth, and M. E. Francis. Linguistic Inquiry and Word Count: LIWC [computer software]. Austin, TX: liwc.net, 2007.
M. J. Pickering and S. Garrod. Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27(02):169–190, 2004.
M. C. Potter and L. Lombardi. Regeneration in the short-term recall of sentences. Journal of Memory and Language, 29(6):633–654, 1990.
M. C. Potter and L. Lombardi. Syntactic priming in immediate recall of sentences. Journal of Memory and Language, 38(3):265–282, 1998.
S. Putrevu, J. Tan, and K. R. Lord. Consumer responses to complex advertisements: The moderating role of need for cognition, knowledge, and gender. Journal of Current Issues & Research in Advertising, 26(1):9–24, 2004.
N. Ramirez-Esparza, C. K. Chung, E. Kacewicz, and J. W. Pennebaker. The psychology of word use in depression forums in English and in Spanish: Testing two text analytic approaches. In ICWSM, 2008.
D. Reitter and J. D. Moore. Alignment and task success in spoken dialogue. Journal of Memory and Language, 76:29–46, 2014.
D. Reitter, F. Keller, and J. D. Moore. Computational modeling of structural priming in dialogue. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 121–124. Association for Computational Linguistics, 2006.
D. Reitter, F. Keller, and J. D. Moore. A computational cognitive model of syntactic priming. Cognitive Science, 35(4):587–637, 2011.
R. Reppen, N. Ide, and K. Suderman. American National Corpus (ANC) second release. Linguistic Data Consortium, 2005.
M. A. Riley, M. Richardson, K. Shockley, and V. C. Ramenzoni. Interpersonal synergies. Frontiers in Psychology, 2:38, 2011.
H. Rorschach. Psychodiagnostic: Methodology and results of a perception-diagnostic experiment [Psychodiagnostics: A diagnostic test based on perception]. Bern, Switzerland: Ernst Bucher, 1921.
E. Sagi and M. Dehghani. Measuring moral rhetoric in text. Social Science Computer Review, 32(2):132–144, 2014.
M. Sasaki and H. Shinnou. Spam detection using text clustering. In Cyberworlds, 2005 International Conference on, pages 4 pp. IEEE, 2005.
L. Schoot, E. Heyselaar, P. Hagoort, and K. Segaert. Does syntactic alignment effectively influence how speakers are perceived by their conversation partner? PLoS ONE, 11(4):e0153521, 2016.
J. R. Searle. Indirect speech acts. na, 1975.
J. Sedding and D. Kazakov. WordNet-based text document clustering. In Proceedings of the 3rd Workshop on Robust Methods in Analysis of Natural Language Data, pages 104–113. Association for Computational Linguistics, 2004.
G. R. Semin and K. Fiedler. The linguistic category model, its bases, applications and range. European Review of Social Psychology, 2(1):1–30, 1991.
C. A. Shepard, H. Giles, and B. A. Le Poire. Communication accommodation theory. The New Handbook of Language and Social Psychology, (1.2):33–56, 2001.
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
J. A. Skoyen, A. K. Randall, M. R. Mehl, and E. A. Butler. "We" overeat, but "I" can stay thin: Pronoun use and body weight in couples who eat to regulate emotion. Journal of Social and Clinical Psychology, 33(8):743, 2014.
R. B. Slatcher, S. Vazire, and J. W. Pennebaker. Am I more important than we? Couples' word use in instant messages. Personal Relationships, 15(4):407–424, 2008.
C. E. Snow. The development of conversation between mothers and babies. Journal of Child Language, 4(01):1–22, 1977.
W. Song, C. H. Li, and S. C. Park. Genetic algorithm for text clustering using ontology and evaluating the validity of various semantic similarity measures. Expert Systems with Applications, 36(5):9095–9104, 2009.
R. J. Tanner, R. Ferraro, T. L. Chartrand, J. R. Bettman, and R. Van Baaren. Of chameleons and consumption: The impact of mimicry on choice and preferences. Journal of Consumer Research, 34(6):754–766, 2008.
P. J. Taylor and S. Thomas. Linguistic style matching and negotiation outcome. Negotiation and Conflict Management Research, 1(3):263–281, 2008.
M. Tomita. LR parsers for natural languages. In Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting on Association for Computational Linguistics, pages 354–357. Association for Computational Linguistics, 1984.
Y. Trope and N. Liberman. Construal-level theory of psychological distance. Psychological Review, 117(2):440, 2010.
R. B. Van Baaren, R. W. Holland, B. Steenaert, and A. van Knippenberg. Mimicry for money: Behavioral consequences of imitation. Journal of Experimental Social Psychology, 39(4):393–398, 2003.
M. Van de Velde and A. S. Meyer. Syntactic flexibility and planning scope: The effect of verb bias on advance planning during sentence recall. Frontiers in Psychology, 5, 2014.
G. Vigliocco and J. Franck. When sex and syntax go hand in hand: Gender agreement in language production. Journal of Memory and Language, 40(4):455–478, 1999.
U. Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
D. Weiss, C. Alberti, M. Collins, and S. Petrov. Structured training for neural network transition-based parsing. arXiv preprint arXiv:1506.06158, 2015.
T. Weninger, X. A. Zhu, and J. Han. An exploration of discussion threads in social news sites: A case study of the Reddit community. In Advances in Social Networks Analysis and Mining (ASONAM), 2013 IEEE/ACM International Conference on, pages 579–583. IEEE, 2013.
R. West and L. H. Turner. Introducing Communication Theory: Analysis and Application (2013 ed.). 2013.
L. Wheeldon and M. Smith. Phrase structure priming: A short-lived effect. Language and Cognitive Processes, 18(4):431–442, 2003.
K. Yamada and K. Knight. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 523–530. Association for Computational Linguistics, 2001.
H. Zhang and R. McDonald. Generalized higher-order dependency parsing with cube pruning. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 320–331. Association for Computational Linguistics, 2012.
H.-T. Zheng, B.-Y. Kang, and H.-G. Kim. Exploiting noun phrases and semantic relationships for text document clustering. Information Sciences, 179(13):2249–2262, 2009.