LEXICAL COMPLEXITY-DRIVEN REPRESENTATION LEARNING
by
Nikhil Wani
A Thesis Presented to the
FACULTY OF THE USC VITERBI SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(COMPUTER SCIENCE)
August 2022
Copyright 2022 Nikhil Wani
Acknowledgements
I’d like to highlight that this master’s thesis was worked on and drafted at USC in Los Angeles
during the peak of the 2020 COVID-19 pandemic. It wouldn’t have seen the light of day
without a few people whom I must thank. I will always be indebted to them.
First, I want to thank my mother who selflessly sacrificed her comfort in India so that I could
afford an education and this master’s degree in the US. This thesis is a result of her constant
support at each step during the entire course of the pandemic. She taught me how to think
clearly during extremely adverse times, which in turn has had a lot of positive impact on my
research progress.
I would also like to extend thanks to my advisor Prof. Saty, who believed in me more than
I did in myself. I’m very grateful to him for his flexibility in letting me take on risky projects
and for teaching me the art of communicating ideas. I would also like to thank my committee
members Prof. Rejati and Prof. Nakano for their insights and guidance.
Lastly, I’m indebted and very thankful to my late uncle, who helped me board my first flight
to the US, but couldn’t be here to see this day after fighting a long and hard battle
with COVID-19. His contribution will never be forgotten, and I hope to have made him proud. I
would also like to thank my family (my mom, tai, pappa), and all of my friends for their support.
Table of Contents
Acknowledgements ii
List Of Tables v
List Of Figures vi
Abstract vii
Chapter 1: Introduction 1
Chapter 2: Related Work 3
2.1 Shared Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 SemEval 2012 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 SemEval 2016 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.3 NAACL 2018 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 CNNs vs Feature Engineering (Voting Ensembles) . . . . . . . . . . . . . . . . . . 6
2.3 CWI as a Sequence Labeling Task (BiLSTM) . . . . . . . . . . . . . . . . . . . . . 6
Chapter 3: Complex Word and Phrase Identification as an Entity Recognition
Task 7
3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Data Formatting for Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 WordPiece Embedding and Tokenizer . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.4 Character and Word-level Transformations. . . . . . . . . . . . . . . . . . . . . . . 9
3.5 Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.6 Training and Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.7 Testing and Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.8 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.9 Challenges and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.10 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Chapter 4: Complex Word and Phrase Identification as a Text Classification Task 14
4.1 SBERT as a feature extractor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Lexical WordNet Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.3 Lexicon-vocabulary based Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.4 Frequency-based Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.5 Size based Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.6 Named Entity Recognition (NER) tags, Part-of-Speech (POS) tags, Word N-gram 17
4.7 Dependency Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.8 Complex Phrase Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.9 Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.10 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.11 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Chapter 5: Conclusions 22
References 23
List Of Tables
3.1 Sentence distribution across datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 T-Wani and H-Wani system results for complex words + phrases compared with
other architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.1 Results of each of the classifiers ten-fold cross-validated on the training for the
complex class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 H-Wani (Hybrid) model results compared with other models . . . . . . . . . . . . . 20
List Of Figures
3.1 Character-level replacements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Word-level replacements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Abstract
Lexical complexity encoding requires multi-granular syntactic and semantic context; however,
the currently available representations for lexically challenging NLP shared tasks such as complex
phrase and word identification rely solely on word-level properties with either no context or limited
contextual attention. Lexical simplification pipelines also rely on accurate contextual substitution
of individual complex words, making it imperative to encode context.
In this thesis, we leverage the self-attention mechanism of the Transformer architecture. We
show that at-par results can be achieved without the need for expensive feature engineering, which
has been dominant among the best-performing models for the task of complex word and phrase
identification. Experiments show that BERT can be used in its dual form, both as a feature
extractor and fine-tuned for token classification, for complex word and phrase identification.
We contrast results and show that our model outperforms the existing dominant ensembles and CNNs,
and performs equally well as Bi-LSTMs.
A part of this thesis was published in the NAACL 2018 CWI Shared Task (Wani et al., 2018).
Chapter 1
Introduction
Text miscomprehension can often go undetected and unintentional; however, it has had serious
consequences globally. A larger human goal has always been to understand and be understood. Often,
when text is not understood, it is not appreciated. A study by Gerritsen et al. (2000) shows that
one-third of Dutch commercials contain English words and phrases that are announced incorrectly
due to contextual misinterpretation. Moreover, consumers display a rather negative attitude towards
the English used in the commercials, and only one-third were able to give a rough indication of
its meaning. Furthermore, a reported staggering 5 to 15 percent of Americans - 14.5 to 43.5 million
children and adults - have dyslexia, a language comprehension disability
(https://www.ldonline.org/ld-topics/reading-dyslexia/dyslexia-what-brain-research-reveals-about-reading).
These statistics suggest not only that comprehension ability varies across all ages, but they also
confirm the need for a sturdy computational textual simplification pipeline. The probability of
miscomprehension is even higher among people from countries where the national language is not
English, ESL (English as a Second Language) learners, and people with limited educational
backgrounds.
Lexical simplification (LS) is a subprocess of textual simplification where the goal is to replace
a word or a phrase with simpler and more popular alternatives. The first step in the process is to
identify the complex word(s) or phrases that the reader may find difficult to understand. This stage
is called complex word and phrase identification (CWI). Textual simplification consists of both
syntactic and lexical simplification, and multiple transformations, such as lexical simplification,
take place over the course of the simplification process. Applying lexical simplification has been
proven to improve readers’ comprehension and retention (Leroy et al., 2013). One popular method for
substituting the word is to use the highest-frequency synonym from WordNet (Brown, 2005).
Complex word and phrase identification aims to classify a target word or phrase as complex
or not based on the context in which it is used in the sentence. It significantly improves the lexical
simplification process (Shardlow, 2014; G. H. Paetzold and Specia, 2016). CWI has been
researched recently as a standalone task through two shared tasks (SemEval 2016 and NAACL 2018).
The current state-of-the-art models (Gooding and Kochmar, 2019) for this task rely solely on word-
level properties with either no context or limited contextual attention. CWI can be advantageous
in other applications such as lexical summarization as well, where complex phrases need to be
shortened without losing information.
This thesis presents two novel approaches to CWI: 1. modeling CWI as an entity recognition
task, and 2. modeling CWI as a text classification task. For the first method, we use a fine-tuning
approach and rely on the self-attention mechanism of the Transformer without explicit feature
engineering. For the second method, we follow a hybrid approach: we concatenate Sentence
Transformer embeddings with syntactic and semantic manually engineered features. Both approaches
show results that nearly equal the state of the art on multiple datasets.
An overview of this thesis follows. First, Chapter 2 provides a comprehensive overview of previous
work (datasets, models, and evaluation techniques) on the CWI task. Chapter 3 describes the first
of our two novel approaches: it details the modeling of CWI as an entity recognition task. Chapter 4
describes the second approach of modeling CWI as a text classification task. Lastly, Chapter 5
summarizes the key findings of the thesis.
Chapter 2
Related Work
2.1 Shared Tasks
2.1.1 SemEval 2012
There have been three major shared tasks related to complex word and phrase identification. The
first one was introduced in SemEval 2012 (Specia, Jauhar, and Mihalcea, 2012). The objective
of this shared task was to build a lexical simplification ranking system. The shared task was
primarily on English corpora. Given a sentence and a list of candidate substitutes for a target
word within the sentence context, the system was supposed to rank the candidates based on the
level of word complexity. For this task, the training set consisted of 300 contexts and the test
set consisted of 1,710 contexts. This was a larger corpus than those previously available (Shardlow,
2013). Annotators were asked to provide word alternatives as substitutes for the target word.
The word-alternative list was ranked based on the annotators' votes: the more annotators chose a word
as a substitute, the higher it ranked in the list. Cohen’s kappa coefficient was used as the
evaluation metric for this shared task.
The best-performing submission, UOW-SHEF-SimpLex (SHEF) (Jauhar and Specia, 2012),
modeled an SVM with three features: an adapted N-gram model (contextual frequency), a bag-
of-words model, and psycholinguistic features. They were the first to introduce simplicity-based
psycholinguistic features from the MRC Psycholinguistic Database (Wilson, 1988) and the Bristol
Norms (Stadthagen-Gonzalez and Davis, 2006). The four major features were: Concreteness (the level
of abstractness of the target word), Imageability (the ability of the word to arouse mental images),
Familiarity (the frequency of word exposure), and Age of Acquisition (the age at which the word is
acquired by a speaker).
2.1.2 SemEval 2016
The second shared task was conducted in SemEval 2016 (G. Paetzold and Specia, 2016b). This was
the first explicit complex word identification task. The objective was to build a binary classifier
that classifies a given target word or phrase in a sentence as either complex or non-complex.
The training dataset was provided in both joint and decomposed formats. A total of 20 annotators
provided a class annotation for each of the target words. For the decomposed dataset format, 20
binary complexity labels were provided and majority voting decided the final label. For the
joint dataset format, a single complexity label was provided: if even 1 of the 20 annotators voted
for the complex class, the final label would be complex, otherwise non-complex.
The training set consisted of a total of 2237 instances, which included a sentence, the target
word, the position of the target word in the sentence, and the complexity label(s). The testing
set consisted of 88,221 instances and followed the same format as for the joint training dataset,
however, the complexity label was generated by only 1 annotator. This was done to tailor to
a specific audience - in this case, non-native English speakers. The annotators were non-native
English speakers from a range of different backgrounds. Hence, the annotator agreement was
as low as 0.244. For evaluation of the model, in addition to the traditional accuracy, precision,
recall, and F-score, a new metric was developed for CWI - the G-score. It is the harmonic mean
of accuracy and recall. It aims at avoiding both false negatives and false positives (accuracy)
while simultaneously capturing a majority of the complex words (recall).
The best-performing submission for the second shared task was SV000gg (G. Paetzold and
Specia, 2016c). They introduced two ensemble methods: the first with a hard voter and
the second with a performance-oriented soft voter that weights votes according to performance
rather than prediction confidence. This allows the combination of completely heterogeneous
systems. Their system ranked first and second overall in the shared task. Some of the findings
include: 1. Complex words tend to be rarer, less ambiguous, and shorter. 2. Word frequencies
remain the most reliable predictor of word complexity.
2.1.3 NAACL 2018
The third shared task was conducted at NAACL 2018 (Yimam, Biemann, et al., 2018). It extended
the SemEval 2016 shared task and introduced the task of complex phrase identification along
with complex word identification. Additionally, the probability of the complexity annotation was
provided along with the final complexity class. The annotators were a mix of both native
and non-native English speakers, 10 each. Annotations were performed using Amazon Mechanical Turk.
The dataset covers three different genres: 1. Wikipedia articles, 2. News articles written by
professionals (News), 3. Informal news articles (WikiNews). The training set consists of 27,299
instances, the dev set consists of 3,328 instances, and the test set consists of 4,252 instances.
F1 score, which is the harmonic mean of both precision and recall of a system, was used as the
evaluation metric for the shared task.
The best-performing submission, CAMB (Gooding and Kochmar, 2018), used a traditional
feature-engineering-based approach. Word frequency and length features contributed heavily
to the effectiveness of the model. An interesting insight was that word-embedding and neural-
network-based approaches did not perform well, which the authors attribute to the small amount
of available data. Wani et al. (2018) also used a majority vote for their ensemble along with an
embedding-based voter. Their system outperforms multiple other models and falls within 0.026
to 0.042 percent of the best-performing model’s score in the shared task.
2.2 CNNs vs Feature Engineering (Voting Ensembles)
CNNs have been explored for the task of CWI with reasonable results. Aroyehun et al. (2018)
submitted their system NLP-CIC to the NAACL 2018 shared task. They described two
approaches: the first was a feature-engineered tree ensemble and the second used a CNN.
The performance scores of the two approaches were within 0.01 F1 points of each other.
The CNN model also performed well on the Spanish dataset and was placed third in
the shared task. They demonstrated that CNNs are flexible alternatives to feature engineering
and can be used on any CWI dataset where pre-trained embeddings are available. The major
shortcoming of the CNN, however, was that it made mistakes on longer target texts. The authors
attributed this behavior to the skewness of the training set.
2.3 CWI as a Sequence Labeling Task (BiLSTM)
BiLSTMs are the current state-of-the-art models for CWI. Gooding and Kochmar (2019) present
SEQ (BiLSTM), which outperforms their previous NAACL 2018 CAMB submission. SEQ relies
on a sequence modeling approach that takes limited context into account and utilizes sub-word
information and a language model objective. They use GloVe embeddings (D = 300) (Pennington,
Socher, and Manning, 2014) and a sequential architecture by Rei (2017) that has achieved
SOTA on multiple NLP tasks and on tasks similar to CWI such as error detection. The CAMB
model requires 27 features based on different sources and relies on individually tailored systems
to maximize results across datasets. In contrast, the SEQ model is a “one size fits all” model
that works out of the box across all datasets by harnessing character-level morphology and word
context embeddings. The SEQ model improved on the winning CWI system score of G. Paetzold and
Specia (2016c) from the SemEval 2016 shared task as well as the nearest centroid (NC) approach
by Yimam, Štajner, et al. (2017).
Chapter 3
Complex Word and Phrase Identification as an Entity
Recognition Task
We remodel the task of CWI as a token classification task. Tokens (T) can take several forms;
for our task, we consider each word (w_i) of a unique sentence (s_i) as a single token.
Token classification is also popularly recognized as an entity recognition task. The most popular
entity recognition task is named entity recognition (NER) (Tjong Kim Sang and De Meulder,
2003), where the model must identify words or phrases that contain the names of persons,
organizations, and locations. In our CWI modeling, we consider binary entities: each token belongs
to either the complex entity or the non-complex entity. During training, we pass a unique sentence
with each of its tokens in parallel as a tuple (word, tag). During inference, we pass the unique
sentence and receive a token-classified sentence as tuples (word, tag). We then compare each token's
predicted tag to its original label to compute our F1 score. We argue that an F0.5 score could also
be a better evaluation metric because of the class imbalance.
3.1 Dataset
We use the NAACL 2018 CWI English monolingual dataset (Yimam, Biemann, et al., 2018) for
our experiments. It includes data from three sources: News, WikiNews, and Wikipedia articles.
The News dataset consisted of news articles written by professional writers while the WikiNews
dataset consisted of informal news articles. Table 3.1 shows the distribution of sentences across
all three datasets.
Dataset            Total sentences   Unique sentences
News-Train         14002             1016
News-Test           2095              175
WikiNews-Train      7746              652
WikiNews-Test       1287              105
Wikipedia-Train     5551              387
Wikipedia-Test       870               61

Table 3.1: Sentence distribution across datasets
A fair and bias-free annotation goal was pursued by having 10 non-native English speakers
as well as 10 native English speakers from diverse backgrounds. The Amazon Mechanical Turk tool was
used for annotations. The training set consists of 27,299 instances, the dev set consists of 3,328
instances, and the test set consists of 4,252 instances.
3.2 Data Formatting for Transformers
The corpus provides raw, unprocessed sentences. Along with each of these sentences, the target
words and phrases to be identified as complex or not complex are provided.
The start and end character indexes of the target words and phrases are also part of the metadata.
BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) has shown
state-of-the-art performance on the NER task. We use BERT in its fine-tuning form for our task. It
requires the data to be formatted according to the WordPiece model.
3.3 WordPiece Embedding and Tokenizer
The word tokens are first mapped to their respective BERT vocabulary IDs. Word tokens that
are not part of BERT’s vocabulary are broken into subwords and then mapped. These
subwords are marked internally by the BERT tokenizer with the prefix “##”. For example, the word
“embeddings” is broken into the tokens “em”, “##bed”, “##ding”, and “##s”. We also prepend the
special [CLS] token at the start of each sentence. The special [SEP] token is appended to the end
of the sentence. All the sentences are padded to the same length (490) with the special [PAD]
token. This is required for batch processing. We also provide attention masks so that BERT does
not attach any meaning to the [PAD] tokens.
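To make the formatting concrete, here is a minimal sketch (the sentence and the exact call pattern are illustrative assumptions, not the thesis' released code) of preparing one input with the Huggingface BertTokenizer: WordPiece IDs, [CLS]/[SEP] special tokens, padding to a fixed length, and an attention mask.

```python
# Illustrative sketch: encoding one sentence for BERT as described in Section 3.3.
# The padding length of 490 follows the value reported above.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "The council held a plebiscite today."
encoding = tokenizer.encode_plus(
    sentence,
    add_special_tokens=True,      # prepend [CLS] and append [SEP]
    max_length=490,               # pad every sentence to the same length
    padding="max_length",
    truncation=True,
    return_attention_mask=True,   # 1 for real tokens, 0 for [PAD]
    return_tensors="pt",
)

input_ids = encoding["input_ids"]            # WordPiece vocabulary IDs
attention_mask = encoding["attention_mask"]  # lets BERT ignore [PAD] positions
print(tokenizer.convert_ids_to_tokens(input_ids[0][:10].tolist()))
```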
3.4 Character and Word-level Transformations
Token labels need to be assigned to each of the words in the raw sentences. Since we only
have access to the target word’s start and end characters, we first perform a character-level
transformation of the available target words with labels of the complex class. This is shown in
the first part of Figure 3.1. The rest of the characters of the remaining words are assigned the
non-complex class, as shown in the second half of Figure 3.1. Character-level replacements are also
required because different libraries tokenize sentences differently.
Figure 3.1: Character-level replacements
Once we have the character-transformed sentence, we move up in granularity: we remove the repeated
character labels within each word so that the transformed sentence becomes the token-level class
label sequence for the original sentence. This is shown in Figure 3.2.
We released this tokenized dataset for future CWI sequence labeling tasks (nikhilwani.github.io).
Figure 3.2: Word-level replacements
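A minimal sketch of this two-step transformation (the helper name, the label symbols, and the example sentence are assumptions for illustration; the released dataset may use a different encoding):

```python
# Sketch of Section 3.4: character-level labels derived from the target span,
# then collapsed into one label per whitespace-separated word.
def make_token_labels(sentence: str, target_start: int, target_end: int):
    # Step 1: character-level labels -- 'C' inside the target span, 'N' elsewhere.
    char_labels = [
        "C" if target_start <= i < target_end else "N"
        for i in range(len(sentence))
    ]

    # Step 2: word-level labels -- collapse the repeated character labels of each
    # word into a single class label for that word.
    labels, cursor = [], 0
    for word in sentence.split():
        start = sentence.index(word, cursor)
        span = char_labels[start:start + len(word)]
        labels.append("C" if "C" in span else "N")
        cursor = start + len(word)
    return list(zip(sentence.split(), labels))

# Example: "plebiscite" occupies character indexes 19-28 (end index 29 exclusive).
print(make_token_labels("The council held a plebiscite today.", 19, 29))
```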
3.5 Design and Implementation
We design our experiments to fine-tune BERT (12 layers) plus one additional layer on our task and
tokenized CWI dataset. We prepare our input sentences with the WordPiece embedding (tokenizer)
and positional embedding (word-order) model for the Transformer units. We also pad
and append special tokens as detailed in Section 3.3. We obtain our final predictions from the
final layer of the model.
To obtain the correct character start index for reindexing the noise-driven sentences (tokens),
we use the Stanford Stanza NLP pipeline APIs in our implementation. We use the Huggingface
Transformers library (https://huggingface.co/docs/transformers/model_doc/bert) for implementing
and fine-tuning the BERT model. We use the BertTokenizer class with tokenizer.encode() to obtain
the transformation. We use the BertForTokenClassification class with bert-base-uncased (uncased
vocabulary) and an additional linear layer for the token classification task. We use the
sklearn.metrics API to compute the F1 scores for evaluation.
3.6 Training and Fine-tuning
We use a batch size of 32 and use PyTorch’s DataLoader(), which helps us save memory as it
does not need to load the entire dataset into memory. We use the BertForTokenClassification class
with bert-base-uncased, which is the standard 12-layer BERT with uncased vocabulary, plus an
additional linear layer that is used for classifying each token’s entity class. The entire set of
pre-trained BERT layers and our additional untrained classification layer are trained on our CWI
task. We use the Adam optimizer and train and fine-tune for 4 epochs. We observe the training loss
reduce after each epoch and converge.
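A rough sketch of this fine-tuning setup (the learning rate and the dummy tensors are assumptions for illustration; the batch size of 32 and the 4 epochs follow the text above):

```python
# Sketch of Section 3.6: jointly training the pre-trained BERT layers and the
# untrained token-classification head on the CWI data.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)  # assumed learning rate

# Dummy tensors stand in for the real encoded dataset (examples x sequence length).
input_ids = torch.randint(0, 30522, (64, 128))
attention_masks = torch.ones_like(input_ids)
labels = torch.randint(0, 2, (64, 128))   # 1 = complex token, 0 = non-complex

train_loader = DataLoader(
    TensorDataset(input_ids, attention_masks, labels), batch_size=32, shuffle=True
)

model.train()
for epoch in range(4):
    for batch_ids, batch_mask, batch_labels in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_ids, attention_mask=batch_mask, labels=batch_labels)
        outputs.loss.backward()   # cross-entropy over the token labels
        optimizer.step()
    print(f"epoch {epoch + 1}: loss {outputs.loss.item():.4f}")
```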
3.7 Testing and Evaluation
For testing, we perform all the same data formatting that we did for training. We first remove all
the pad tokens so that, when we evaluate performance, we only take into account predictions for
the non-pad tokens. Once the test set is prepared, we use our fine-tuned model to generate
predictions on the test set. After reshaping our results, we then score the predictions.
In the final step, we compute the F1 score for evaluation. It is the harmonic mean of precision
and recall, as shown in (3.3).

$$\mathrm{Precision} = \frac{T_p}{T_p + F_p} \qquad (3.1)$$

$$\mathrm{Recall} = \frac{T_p}{T_p + F_n} \qquad (3.2)$$

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (3.3)$$

In equations (3.1) and (3.2), T_p are the true positives, F_p are the false positives, and F_n are
the false negatives. An F0.5 score was also considered given the class imbalance; however, to
maintain consistency and comparability with previous systems, we report only the F1 score.
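A small sketch of this evaluation step (shapes, names, and the label convention 1 = complex are assumptions), showing how [PAD] positions are dropped via the attention mask before sklearn computes the F1 score:

```python
# Sketch of Section 3.7: score only the non-[PAD] token predictions with sklearn.
import torch
from sklearn.metrics import f1_score

def evaluate(model, input_ids, attention_mask, gold_labels):
    model.eval()
    with torch.no_grad():
        logits = model(input_ids, attention_mask=attention_mask).logits
    preds = logits.argmax(dim=-1)

    keep = attention_mask.bool()          # True only for non-[PAD] positions
    y_true = gold_labels[keep].cpu().numpy()
    y_pred = preds[keep].cpu().numpy()
    return f1_score(y_true, y_pred)       # F1 for the complex class (label 1)
```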
3.8 Results and Discussion
We report the results obtained by both the token classification model (T-Wani) and the hybrid
model (H-Wani). The hybrid model is described in the next chapter. We also compare our results
with the current best-performing model by Gooding and Kochmar (2019) that uses BiLSTMs,
as well as previous top-performing models such as CIC-CNN, CAMB (Ensemble), and our own
previous submission Wani (Ensemble) (Wani et al., 2018). The result reporting format in Table
3.2 for T-Wani (Transformer) is the T-Wani F1 score followed by (+Wani(Ensemble), -SEQ(BiLSTM)),
where +Wani(Ensemble) is the difference between T-Wani and Wani(Ensemble). The same follows for
-SEQ(BiLSTM).
Dataset      Wani (Ensemble)   CAMB (Ensemble)   CIC (CNN)   SEQ (BiLSTM)   H-Wani (Hybrid)   T-Wani (Transformer)
News         0.8554            0.8736            0.855       0.8763         0.8700            0.8754 (+0.02, -0.0009)
WikiNews     0.8213            0.8400            0.824       0.8505         0.8402            0.8413 (+0.02, -0.0092)
Wikipedia    0.777             0.8115            0.772       0.8158         0.8113            0.8113 (+0.0343, -0.0045)

Table 3.2: T-Wani and H-Wani system results (F1 score) for complex words + phrases compared with other architectures.
As shown in Table 3.2, the F1 scores for the News and WikiNews datasets are close to SEQ,
which is just 0.0009 higher on News. For the Wikipedia dataset as well, the difference between the
two models is just 0.0045. These results illustrate that Transformers can be effectively applied to
the task of CWI. Another significant observation is that, for the Wikipedia dataset, the T-Wani model
shows a significant gain of 0.0343 over the Wani (Ensemble) model. This could be because of
the WordPiece embeddings that BERT uses, which are able to effectively handle out-of-vocabulary
words with subword information. Additionally, BiLSTMs have been reported to perform poorly
on long sentences; the Transformer model, however, does well on them.
3.9 Challenges and Limitations
Tokenization is the single most dominantly challenging component. For token classification or
fine-tuning, assigning the correct token label determines the effectiveness of the model. In our
CWI task, we had to create the labels from scratch. Aligning the start and the end character
indexofthetargetwordinanoise-drivensentencewasextremelychallengingandtimeconsuming.
Aftercleaningthesentence,theindexesofthecharacterwouldchange. Weovercamethisbyusing
StanfordCoreNLP’s Stanza to retokenize all the sentences and retrieve the correct start and end
character indexes.
Another challenge was handling multi-word expressions and phrases. For example, the word
“london-bridge” is tokenized by most NLP tokenizers as a single token; however, the BERT
tokenizer breaks this word down into three individual tokens: “london”, “-”, and “bridge”.
This is done because the word london-bridge is not part of BERT’s vocabulary, but each of the three
individual pieces is present. When calculating and assigning token labels in such scenarios,
misalignment had to be taken into account. Each of the subwords had to be tracked with a separate
unique index for resolution. Special characters such as opening and closing quotation marks (“ ”)
and apostrophes (‘) were among the other challenging components during the tokenization process.
One of the limitations of the SemEval 2016 dataset is that it does not provide the start
and end character index of the target word, but instead provides the start index of the word in
the sentence. Different tokenizers index words differently during tokenization; hence, in order to
use SemEval 2016 for token classification, the entire dataset must be passed through
Stanza and reindexed with the start and end character indexes of the target words.
3.10 Future work
An analysis of the results suggests that a token verification process and unit testing of the token
label assignment would boost results. Additionally, adding more data should also boost
results. This can be done by adding the SemEval 2016 dataset and the Complex Word Identification
2.0 dataset (Shardlow, Evans, and Zampieri, 2022) that was recently made available. It would also
help improve the token classification for the complex class and tackle the class imbalance problem.
Eye-tracking experiments and cognitive NLP are also directions that can be explored. Features
such as eye fixation counts and saccade counts would be invaluable and would better represent human
understanding of text in vectors.
Chapter 4
Complex Word and Phrase Identification as a Text
Classification Task
Traditionally dominant models for the task of CWI took into account properties of only the
target word without encoding any context of the sentence it appears in. We address this problem
by concatenating vectors that represent the sentence with the target word feature vector.
We use SBERT to obtain the sentence representation. Each of the following features was
implemented and computed independently.
4.1 SBERT as a feature extractor
We use the SentenceTransformer [32] with the all-mpnet-base-v2 model from the Huggingface library
to extract a 384-dimensional sentence representation. The model truncates sentences longer than
384 tokens. It is trained on more than 1 billion sentence pairs from more than 32 sources. The
model uses a contrastive learning objective that allows it to serve as a sentence and short-
paragraph encoder. For our task, we only use the passage embedding part for our sentence
representation.
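A brief sketch (assumed usage, not the thesis' exact script) of using SBERT as a feature extractor, encoding each unique sentence once so its vector can be concatenated with the target-word features:

```python
# Sketch of Section 4.1: SBERT sentence embeddings as contextual features.
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-mpnet-base-v2")

sentences = [
    "The council held a plebiscite today.",
    "The committee postponed the referendum.",
]
sentence_vectors = sbert.encode(sentences)   # one dense vector per sentence
print(sentence_vectors.shape)
```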
4.2 Lexical WordNet Features
The following features were extracted using WordNet (Fellbaum, 1998) for the target word; a short extraction sketch follows the list:
• Degree of Polysemy (DP): It is the number of senses of the target word in WordNet. We
compute it by counting the number of Synsets of the target word in WordNet. We observed
that words with larger WordNet Synset sizes have several senses and were found to be more
unclear and hence complex.
• Hyponym (Ho) and Hypernym (He) Tree Depth (TD): These help in finding lexical
relations. To find the position of the word in WordNet’s hierarchical tree, we consider
capturing its depth. General and simple words tend to be at the top of the tree. By
computing the average depth among all the target-word Synsets, we count the number of
Hyponyms and Hypernyms as a feature.
• Holonym Count (HC) and Meronym Count (MC): An alternative way to traverse
WordNet’s hierarchical tree is by considering the relationship of the target word to its com-
ponents (Meronyms) or to the things it is contained in (Holonyms). Holonyms tend to
be simpler than meronyms because meronyms are usually more specific: a holonym is a
generalized word for a group of entities, while meronyms refer to specific entities in that group.
• Verb Entailments (VE): Verbs, being action words, often contain entailment relationships.
For example, the act of roosting involves the act of sitting, so roosting entails sitting. Target
words with multiple entailments were, on average, found to be relatively complex, since they
tend to be visually more vivid when one tries to comprehend them. Hence, the number of verb
entailments of the target word was also part of our feature set.
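A rough sketch of extracting the WordNet features above for a single target word, assuming NLTK's WordNet interface (the specific relation types used for the meronym/holonym counts are one plausible reading of the feature definitions):

```python
# Sketch of Section 4.2; requires: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def wordnet_features(word: str) -> dict:
    synsets = wn.synsets(word)
    return {
        # Degree of Polysemy: number of senses (synsets) of the target word.
        "degree_of_polysemy": len(synsets),
        # Average position in the hypernym hierarchy across all senses.
        "avg_tree_depth": (
            sum(s.min_depth() for s in synsets) / len(synsets) if synsets else 0
        ),
        # Lexical-relation counts aggregated over all senses.
        "hyponym_count": sum(len(s.hyponyms()) for s in synsets),
        "hypernym_count": sum(len(s.hypernyms()) for s in synsets),
        "holonym_count": sum(len(s.member_holonyms()) for s in synsets),
        "meronym_count": sum(len(s.part_meronyms()) for s in synsets),
        # Verb entailments (non-zero only for verb senses, e.g. "roost" -> "sit").
        "verb_entailments": sum(len(s.entailments()) for s in synsets),
    }

print(wordnet_features("roost"))
```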
4.3 Lexicon-vocabulary based Features
• SubIMDB: We compute the top 2000 words in the subtitles of the “Movies and Series for
Children” section of the SubIMDB corpus (G. Paetzold and Specia, 2016a). We then check
for the presence of the target word in this list.
• Ogden’s Basic Lexicons (OB): We extract the top 1000 words from Ogden’s word list
of basic English words (Ogden, 1968).
• Ogden’s Freq. Lexicons (OF): We extract the top 1500 words from Ogden’s word list
of high-frequency English words.
• Barron’s Lexicons (BW): We use Barron’s GRE wordlist to check for the presence of a
target word. Complex words tend to occur in this list.
4.4 Frequency-based Features
• Google N-grams: To compute the frequency of the target word, we use Google’s dataset
of syntactic N-grams (Goldberg and Orwant, 2013). The frequency measure is in word
occurrences per million words. Word frequencies have been shown to be among the most
informative features across most systems (Specia, Jauhar, and Mihalcea, 2012).
• SUBTLEX Frequencies: SUBTLEX (Van Heuven et al., 2014) is a resource that contains
word frequencies based on subtitles of British television programmes and the British National
Corpus (BNC). The counts used include CBBC subtitles, CBeebies subtitles, BNC, and a total
frequency count. These frequency counts are also in word occurrences per million words.
4.5 Size based Features
• Word Count (WC): The number of words in the target word is counted.
• Word Length (WL): The number of characters in the target word is counted.
• Vowels Count (VC): The number of vowels in the target word is counted.
• Syllable Count (SC): The number of syllables in the target word is counted.
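A small sketch of these size-based counts; the syllable count uses a simple vowel-group heuristic, which is an assumption since the thesis does not specify its counting method:

```python
# Sketch of Section 4.5: size-based features of the target word or phrase.
def size_features(target: str) -> dict:
    words = target.split()
    vowels = set("aeiou")

    def syllables(word: str) -> int:
        # Rough heuristic: count maximal runs of vowels.
        count, prev_vowel = 0, False
        for ch in word.lower():
            is_vowel = ch in vowels
            if is_vowel and not prev_vowel:
                count += 1
            prev_vowel = is_vowel
        return max(count, 1)

    return {
        "word_count": len(words),
        "word_length": sum(len(w) for w in words),     # characters in the target
        "vowel_count": sum(ch in vowels for ch in target.lower()),
        "syllable_count": sum(syllables(w) for w in words),
    }

print(size_features("plebiscite"))
```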
4.6 Named Entity Recognition (NER) tags, Part-of-Speech (POS) tags, Word N-gram
News and Wikipedia articles contain names of cities, states, countries, organizations, people, and
locations. To capture whether the target word is one of these entities, we create a binary feature
for each of the following entity types: O (Other), PERSON, LOCATION, ORGANIZATION, MISC,
CITY, STATE OR PROVINCE, COUNTRY, NATIONALITY. We use the Stanford CoreNLP
Stanza toolkit to compute the NER tags.
To compute a syntactic representation of the target word, we perform part-of-speech
tagging on all the unique sentences in the corpus. We use the Stanford CoreNLP Stanza toolkit
with the 35-tag POS tagset. A binary value representation is used to indicate the presence of the
POS tags.
Character-level n-gram representations of words have been effective and have improved perfor-
mance on several tasks (Klein et al., 2003). Certain sequences of characters can be perceived as
complex. We use the character bi-grams of the target word as the vocabulary for a one-hot vector
representation, with binary values indicating the presence of each bi-gram.
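A sketch of extracting NER and POS information for a target word with Stanza (assumed usage; the entity inventory produced by the tagger's default English model may differ from the list above, so the raw type is returned and expanded into binary indicators downstream):

```python
# Sketch of Section 4.6; requires: import stanza; stanza.download("en")
import stanza

nlp = stanza.Pipeline("en", processors="tokenize,pos,ner")

def ner_pos_features(sentence: str, target: str) -> dict:
    doc = nlp(sentence)
    entity_type, pos_tag = "O", None
    for sent in doc.sentences:
        for word in sent.words:
            if word.text.lower() == target.lower():
                pos_tag = word.xpos            # Penn Treebank POS tag
    for ent in doc.ents:                        # named-entity spans
        if target.lower() in ent.text.lower():
            entity_type = ent.type
    # Downstream, these are expanded into binary presence indicators per type/tag.
    return {"entity_type": entity_type, "pos_tag": pos_tag}

print(ner_pos_features("London is a large city.", "London"))
```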
4.7 Dependency Parser
To capture the relation of the target word with the sentence, we use the CoreNLP pipeline and
count the number of dependency relations of the target word, using it as a feature. A dependency
parse represents a sentence's grammatical structure by defining relationships between the "head"
words and the other words that modify those heads. Encoding this relation of the target word with
the rest of the words in the sentence helps represent its contextual information.
After SBERT, this is the only other feature that also tries to encode information about words
other than the target word, which turned out to be very essential.
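A short sketch (assumed Stanza usage, consistent with the toolkits mentioned elsewhere in this thesis) of counting how many dependency relations involve the target word:

```python
# Sketch of Section 4.7; requires: import stanza; stanza.download("en")
import stanza

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

def dependency_relation_count(sentence: str, target: str) -> int:
    doc = nlp(sentence)
    count = 0
    for sent in doc.sentences:
        for word in sent.words:
            head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
            # Count edges where the target word is either the dependent or the head.
            if target.lower() in (word.text.lower(), head.lower()):
                count += 1
    return count

print(dependency_relation_count("The council held a controversial plebiscite.", "plebiscite"))
```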
4.8 Complex Phrase Features
For phrases, we compute all the features above for each of the words in the phrase and then take
the average of their values for numeric features. For binary features, we make sure that each word
in the phrase is represented.
Additionally, for phrase-based frequency encoding we use the Corpus of Contemporary Amer-
ican English (COCA) bi- and tri-gram corpus (https://www.ngrams.info/). The corpus consists of
the 1 million most frequently used 2-, 3-, 4-, and 5-gram word combinations. We also use the
SUBTLEX bigrams (Van Heuven et al., 2014), a database of 1.5 million bigram counts from a range
of television subtitles.
4.9 Design and Implementation
We use the same ensemble architecture as our previous submission (Wani et al., 2018) with all
of the features mentioned in the previous sections. This includes 8 independent base classifiers
(Random Forest, Random Tree, REP Tree, Logistic Model Tree, J48 Decision Tree, JRip,
PART, and SVM). We use the Java Weka APIs (https://weka.sourceforge.io/doc.stable-3-8/) to
implement these models. We use a hard voter that makes the final prediction by picking the
majority class based on the base classifier outputs. The only change in our current model is the
input representation: we concatenate the SBERT features (D = 384) with all the features from the
previous sections. We call it a hybrid model (H-Wani) because we use both pre-trained
embeddings/features and manually engineered features.
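The thesis implements this ensemble with the Java Weka APIs; the following is a rough Python/scikit-learn stand-in (an acknowledged substitution, with dummy data) illustrating the same idea of hard majority voting over independent base classifiers trained on the concatenated SBERT + engineered features:

```python
# Sketch of Section 4.9: hard (majority) voting over independent base classifiers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X: rows are instances, columns are SBERT dimensions followed by engineered features.
X = np.random.rand(200, 400)            # dummy feature matrix for illustration
y = np.random.randint(0, 2, size=200)   # 1 = complex, 0 = non-complex (dummy labels)

ensemble = VotingClassifier(
    estimators=[
        ("random_forest", RandomForestClassifier(n_estimators=100)),
        ("decision_tree", DecisionTreeClassifier()),
        ("svm", SVC()),
    ],
    voting="hard",   # majority vote over the base classifiers' predicted classes
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```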
Classifier Precision Recall F1-Score
Selected Base Classifiers
Random Forest 0.792 0.781 0.787
J48 Decision Tree 0.777 0.777 0.777
Logistic Model Tree 0.778 0.762 0.770
REP Tree 0.768 0.765 0.766
Random Tree 0.796 0.717 0.754
SVM 0.745 0.780 0.762
PART 0.715 0.793 0.752
JRip Rules Tree 0.754 0.737 0.745
Rejected Classifiers (F1 <0.70)
Decision Table 0.739 0.652 0.693
Decision Stump 0.665 0.696 0.680
Hoeffding Tree 0.686 0.666 0.676
Logistic Regression 0.732 0.591 0.654
SMO 0.751 0.550 0.635
OneR 0.735 0.550 0.629
ZeroR 0.000 0.000 0.000
Table 4.1: Results of each of the classifiers, ten-fold cross-validated on the training set, for the
complex class
Table 4.1 describes our reasoning for selecting these 8 classifiers. Classifiers with an F1 score above
0.70 from our previous experiment (Wani et al., 2018) were selected based on the ten-fold cross-
validation results on the training set for the complex class. The ZeroR classifier has an F1 score
of 0 because the majority class is the non-complex class. Hard voting was implemented: if more
than 4 classifiers vote for a class, we assign that as the final class. In case of a tie, we consider
rankings from the top 3 models (Random Forest, J48 Decision Tree, and Logistic Model Tree) and
determine the final class.
4.10 Results and Discussion
We report the results of the hybrid (H-Wani) model and compare it with other architectures in
Table 3.2 and Table 4.2. The hybrid model outperforms all other architectures except SEQ
and T-Wani. The addition of SBERT to the traditional features helped boost the F1 score.
Dataset      Wani (Ensemble)   SEQ (BiLSTM)   H-Wani (Hybrid)   T-Wani (Transformer)
News         0.8554            0.8763         0.8700            0.8754 (+0.02, -0.0009)
WikiNews     0.8213            0.8505         0.8402            0.8413 (+0.02, -0.0092)
Wikipedia    0.777             0.8158         0.8113            0.8113 (+0.0343, -0.0045)

Table 4.2: H-Wani (Hybrid) model results (F1 score) compared with other models
SBERT provides the missing context to the target word representation that the traditional
models in the NAACL 2018 shared task did not take into account. This also illustrates that
feature-engineered ensembles can be replaced with either hybrid pre-trained models like H-Wani
or Transformer models like T-Wani. For hybrid models, the representation must take into account all
levels of granularity, i.e., character-, sub-word-, word-, and phrase-level representations.
4.11 Future Work
Since the representation is in hybrid pre-trained form, we think the ensemble architecture can
also be made hybrid by attaching a neural network as the voter instead of simply counting the
majority. We can use the KL divergence loss for this neural-network-based voter. Additionally,
with more data from the upcoming CWI 2.0 dataset, we could obtain better results with this
neural-network-based hybrid model. Psycholinguistic features from the MRC Psycholinguistic
Database (Wilson, 1988) and the Bristol Norms (Stadthagen-Gonzalez and Davis, 2006) can be
explored, as in SemEval 2012.
Another future direction could focus on optimizing the set of base classifiers. Ablation
tests can be performed to see whether including Random Tree and Random Forest together is indeed
significant. Statistical tests can be performed to justify the small noise in the results, which
could then be used to claim state-of-the-art performance.
Chapter 5
Conclusions
This thesis presented work on novel lexical representations for complexity-driven tasks such as
complex word and phrase identification. We presented two models: 1. a hybrid pre-trained and
feature-engineered voting ensemble (H-Wani), and 2. a fine-tuning-based Transformer model (T-
Wani). Our best model (T-Wani) achieves nearly equal results (as little as -0.0009 F1 difference)
compared with the current state-of-the-art BiLSTM SEQ on two datasets (News and WikiNews). For
the Wikipedia dataset the difference is also very close (-0.0045 F1). We believe the self-attention
mechanism of the Transformer, along with the WordPiece embeddings that BERT uses to effectively
handle out-of-vocabulary words, synergized into an effective CWI model (T-Wani). We show that
at-par results can be achieved without the need for expensive feature engineering, which has been
dominant among the best-performing models for the task of complex word and phrase identification.
We also show that the Transformer model performs well on long sentences, whereas BiLSTMs have
traditionally been reported to perform poorly on them.
In the future, we hope to run statistical tests to justify the small difference between the results
of our models and the current state-of-the-art as statistical noise. We also look forward to
replicating our results on the new Complex Word Identification 2.0 dataset (Shardlow, Evans, and
Zampieri, 2022) that was made available recently.
References
1. Aroyehun, Segun Taofeek et al. (June 2018). “Complex Word Identification: Convolutional
Neural Network vs. Feature Engineering”. In: Proceedings of the Thirteenth Workshop on
Innovative Use of NLP for Building Educational Applications. New Orleans, Louisiana: As-
sociation for Computational Linguistics, pp. 322–327. doi: 10.18653/v1/W18-0538. url:
https://aclanthology.org/W18-0538.
2. Brown, Keith (2005). Encyclopedia of language and linguistics. Vol. 1. Elsevier.
3. Devlin, Jacob et al. (June 2019). “BERT: Pre-training of Deep Bidirectional Transformers
for Language Understanding”. In: Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational
Linguistics, pp. 4171–4186. doi: 10.18653/v1/N19-1423. url: https://aclanthology.org/N19-1423.
4. Fellbaum, Christiane (1998). “A semantic network of english: the mother of all WordNets”.
In: EuroWordNet: A multilingual database with lexical semantic networks. Springer, pp. 137–
148.
5. Gerritsen, Marinel et al. (2000). “English in Dutch commercials: Not understood and not
appreciated”. In: Journal of advertising research 40.4, pp. 17–31.
6. Goldberg, Yoav and Jon Orwant (2013). “A dataset of syntactic-ngrams over time from a
very large corpus of english books”. In.
7. Gooding, Sian and Ekaterina Kochmar (June 2018). “CAMB at CWI Shared Task 2018:
Complex Word Identification with Ensemble-Based Voting”. In: Proceedings of the Thir-
teenth Workshop on Innovative Use of NLP for Building Educational Applications. New Or-
leans, Louisiana: Association for Computational Linguistics, pp. 184–194. doi: 10.18653/
v1/W18-0520. url: https://aclanthology.org/W18-0520.
8. — (July 2019). “Complex Word Identification as a Sequence Labelling Task”. In: Proceedings
of the 57th Annual Meeting of the Association for Computational Linguistics. Florence,
Italy: Association for Computational Linguistics, pp. 1148–1153. doi: 10.18653/v1/P19-1109.
url: https://aclanthology.org/P19-1109.
9. Jauhar, Sujay Kumar and Lucia Specia (July 2012). “UOW-SHEF: SimpLex – Lexical Sim-
plicity Ranking based on Contextual and Psycholinguistic Features”. In: *SEM 2012: The
First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings
of the main conference and the shared task, and Volume 2: Proceedings of the Sixth Interna-
tional Workshop on Semantic Evaluation (SemEval 2012). Montréal, Canada: Association
for Computational Linguistics, pp. 477–481. url: https://aclanthology.org/S12-1066.
10. Klein, Dan et al. (2003). “Named entity recognition with character-level models”. In: Pro-
ceedings of the seventh conference on Natural language learning at HLT-NAACL 2003,
pp. 180–183.
11. Leroy, Gondy et al. (2013). “User evaluation of the effects of a text simplification algorithm
using term familiarity on perception, understanding, learning, and information retention”.
In: Journal of medical Internet research 15.7, e2569.
12. Ogden, Charles Kay (1968). Basic English: international second language. Harcourt, Brace
& World.
13. Paetzold, Gustavo and Lucia Specia (2016a). “Collecting and exploring everyday language
for predicting psycholinguistic properties of words”. In: Proceedings of COLING 2016, the
26th International Conference on Computational Linguistics: Technical Papers, pp. 1669–
1679.
14. — (June 2016b). “SemEval 2016 Task 11: Complex Word Identification”. In: Proceedings
of the 10th International Workshop on Semantic Evaluation (SemEval-2016). San Diego,
California: Association for Computational Linguistics, pp. 560–569. doi: 10.18653/v1/S16-
1085. url: https://aclanthology.org/S16-1085.
15. — (June 2016c). “SV000gg at SemEval-2016 Task 11: Heavy Gauge Complex Word Iden-
tification with System Voting”. In: Proceedings of the 10th International Workshop on Se-
mantic Evaluation (SemEval-2016). San Diego, California: Association for Computational
Linguistics, pp. 969–974. doi: 10.18653/v1/S16-1149. url: https://aclanthology.org/
S16-1149.
16. Paetzold, Gustavo H and Lucia Specia (2016). “Plumberr: An automatic error identification
framework for lexical simplification”. In: Proceedings of the first international workshop on
Quality Assessment for Text Simplification (QATS) , pp. 1–9.
17. Pennington, Jeffrey, Richard Socher, and Christopher D Manning (2014). “Glove: Global
vectors for word representation”. In: Proceedings of the 2014 conference on empirical methods
in natural language processing (EMNLP), pp. 1532–1543.
18. Rei, Marek (July 2017). “Semi-supervised Multitask Learning for Sequence Labeling”. In:
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics,
pp. 2121–2130. doi: 10.18653/v1/P17-1194. url: https://aclanthology.org/P17-1194.
19. Shardlow, Matthew (2013). “A Comparison of Techniques to Automatically Identify Com-
plex Words.” In: 51st annual meeting of the association for computational linguistics pro-
ceedings of the student research workshop, pp. 103–109.
20. — (2014). “Out in the open: Finding and categorising errors in the lexical simplification
pipeline”.In: Proceedings of the Ninth International Conference on Language Resources and
Evaluation (LREC’14), pp. 1583–1590.
21. Shardlow, Matthew, Richard Evans, and Marcos Zampieri (2022). “Predicting lexical com-
plexity in English texts: the Complex 2.0 dataset”. In: Language Resources and Evaluation,
pp. 1–42.
22. Specia, Lucia, Sujay Kumar Jauhar, and Rada Mihalcea (July 2012). “SemEval-2012 Task
1: English Lexical Simplification”. In: *SEM 2012: The First Joint Conference on Lexical
and Computational Semantics – Volume 1: Proceedings of the main conference and the
shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic
Evaluation (SemEval 2012). Montréal, Canada: Association for Computational Linguistics,
pp. 347–355. url: https://aclanthology.org/S12-1046.
23. Stadthagen-Gonzalez, Hans and Colin J Davis (2006). “The Bristol norms for age of acqui-
sition, imageability, and familiarity”. In: Behavior research methods 38.4, pp. 598–605.
24. Tjong Kim Sang, Erik F. and Fien De Meulder (2003). “Introduction to the CoNLL-2003
Shared Task: Language-Independent Named Entity Recognition”. In: Proceedings of the
Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147. url:
https://aclanthology.org/W03-0419.
25. Van Heuven, Walter JB et al. (2014). “SUBTLEX-UK: A new and improved word fre-
quency database for British English”. In: Quarterly journal of experimental psychology 67.6,
pp. 1176–1190.
26. Wani, Nikhil et al. (June 2018). “The Whole is Greater than the Sum of its Parts: Towards
the Effectiveness of Voting Ensemble Classifiers for Complex Word Identification”. In: Pro-
ceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational
Applications. New Orleans, Louisiana: Association for Computational Linguistics, pp. 200–
205. doi: 10.18653/v1/W18-0522. url: https://aclanthology.org/W18-0522.
27. Wilson, Michael (1988). “MRC psycholinguistic database: Machine-usable dictionary, ver-
sion 2.00”. In: Behavior research methods, instruments, & computers 20.1, pp. 6–10.
28. Yimam, Seid Muhie, Chris Biemann, et al. (June 2018). “A Report on the Complex Word
Identification Shared Task 2018”. In: Proceedings of the Thirteenth Workshop on Innovative
Use of NLP for Building Educational Applications. New Orleans, Louisiana: Association
for Computational Linguistics, pp. 66–78. doi: 10.18653/v1/W18-0507. url: https:
//aclanthology.org/W18-0507.
29. Yimam, Seid Muhie, Sanja Štajner, et al. (2017). “CWIG3G2-complex word identification
task across three text genres and two user groups”. In: Proceedings of the Eighth Interna-
tional Joint Conference on Natural Language Processing (Volume 2: Short Papers),pp.401–
407.