Word, Sentence and Knowledge Graph Embedding Techniques: Theory and Performance Evaluation

by

Bin Wang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

May 2021

Copyright 2021 Bin Wang

Table of Contents

List of Tables
List of Figures
Abstract

Chapter 1: Introduction
1.1 Significance of the Research
1.2 Review of Previous Work
1.3 Contributions of the Research
1.3.1 Word Embedding
1.3.2 Sentence Embedding
1.3.3 Knowledge Graph Embedding
1.4 Organization of the Dissertation

Chapter 2: Research Background
2.1 Word Embedding Models
2.2 Desired Properties of Embedding Models
2.3 Desired Properties of Evaluators
2.4 Evaluation Methods for Word Embedding
2.4.1 Intrinsic Evaluators
2.4.2 Extrinsic Evaluators
2.5 Sentence Embedding
2.6 Knowledge Graph Embedding

Chapter 3: Word Representation Learning: Enhancement and Evaluation
3.1 Introduction
3.1.1 Enhancement of Word Embedding
3.1.2 Evaluation of Word Embedding
3.2 Proposed Two Enhancement Methods for Word Representation
3.2.1 Post-processing via Variance Normalization
3.2.2 Post-processing via Dynamic Embedding
3.2.3 Experimental Results and Analysis
3.3 Experimental Study on Word Embedding Evaluators
3.3.1 Experimental Results of Intrinsic Evaluators
3.3.2 Experimental Results of Extrinsic Evaluators
3.3.3 Consistency Study via Correlation Analysis
3.4 Conclusion

Chapter 4: Sentence Embedding by Semantic Subspace Analysis
4.1 Introduction
4.1.1 Related Work
4.1.2 Proposed Method on Static Word Representation
4.1.3 Proposed Method on Contextualized Word Representation
4.2 Sentence Embedding via Semantic Subspace Analysis on Static Word Representations
4.2.1 Methodology
4.2.2 Experimental Results and Analysis
4.2.3 Discussion
4.3 Sentence Embedding by Dissecting Contextualized Word Models
4.3.1 Word Representation Evolution across Layers
4.3.2 Proposed SBERT-WK Method
4.3.3 Experiments
4.4 Conclusion and Future Work

Chapter 5: Commonsense Knowledge Graph Embedding and Completion
5.1 Introduction and Related Work
5.1.1 Introduction
5.1.2 Related Work
5.2 Problem Definition
5.3 Dataset Preparation
5.3.1 Standard Split: CN-100K, CN-82K, ATOMIC
5.3.2 Inductive Split: CN-82K-Ind, ATOMIC-Ind
5.4 Proposed InductivE Model
5.4.1 Free-Text Encoder
5.4.2 Graph Encoder
5.4.3 Decoder: Conv-TransE
5.5 Experimental Results and Analysis
5.5.1 Experimental Setup
5.5.2 Results and Analysis
5.5.3 Ablation Study and Analysis
5.6 Conclusion

Chapter 6: Domain-Specific Word Embedding from Pre-trained Language Models
6.1 Introduction
6.2 Domain-Specific Word Embedding from Pre-trained Language Models
6.2.1 DomainWE
6.2.2 Domain-Specific Corpus
6.2.3 Domain-Specific Language Models
6.3 Experiments
6.3.1 Baselines
6.3.2 Evaluation Datasets
6.3.3 Results and Discussion
6.3.4 Model Analysis
6.4 Conclusion

Chapter 7: Conclusion and Future Work
7.1 Summary of the Research
7.2 Future Research Directions

Bibliography

List of Tables

3.1 Word similarity datasets used in our experiments, where pairs indicates the number of word pairs in each dataset.
3.2 The SRCC performance comparison (x100) for SGNS alone, SGNS+PPA and SGNS+PVN against word similarity datasets, where the last row is the average performance weighted by the pair number of each dataset.
3.3 The SRCC performance comparison (x100) for SGNS alone, SGNS+PPA and SGNS+PVN against word analogy datasets.
3.4 Extrinsic evaluation for SGNS alone and SGNS+PVN. The first value is from the original model while the second value is from our post-processed embedding model.
3.5 The SRCC performance comparison (x100) for SGNS alone and SGNS+PDE against word similarity and analogy datasets.
3.6 The SRCC performance comparison (x100) for SGNS alone and the SGNS+PVN/PDE model against word similarity and analogy datasets.
3.7 Word similarity datasets used in our experiments, where pairs indicates the number of word pairs in each dataset.
3.8 Performance comparison (x100) of six word embedding baseline models against 13 word similarity datasets.
3.9 Performance comparison (x100) of six word embedding baseline models against word analogy datasets.
3.10 Performance comparison (x100) of six word embedding baseline models against three concept categorization datasets.
3.11 Performance comparison of six word embedding baseline models against outlier detection datasets.
3.12 QVEC performance comparison (x100) of six word embedding baseline models.
3.13 Datasets for POS tagging, chunking and NER.
3.14 Sentiment analysis datasets.
3.15 Extrinsic evaluation results.
4.1 Experimental results on textual similarity tasks in terms of the Pearson correlation coefficients (%), where the best results for parameterized and non-parameterized models are in bold, respectively.
4.2 Experimental results on supervised tasks, where sentence embeddings are fixed during the training process and the best results for parameterized and non-parameterized models are marked in bold, respectively.
4.3 Examples in downstream tasks.
4.4 Inference time comparison. Data are collected from 5 trials.
4.5 Word groups based on the variance level. Less significant words in a sentence are underlined.
4.6 Examples in the STS12-STS16, STS-B and SICK datasets.
4.7 Experimental results on various textual similarity tasks in terms of the Pearson correlation coefficients (%), where the best results are shown in bold face.
4.8 Datasets used in supervised downstream tasks.
4.9 Experimental results on eight supervised downstream tasks, where the best results are shown in bold face.
4.10 Experimental results on 10 probing tasks, where the best results are shown in bold face.
4.11 Comparison of different configurations to demonstrate the effectiveness of each module of the proposed SBERT-WK method. The averaged Pearson correlation coefficients (%) for the STS12-STS16 and STSB datasets are reported.
4.12 Inference time comparison of InferSent, BERT, XLNET, SBERT and SBERT-WK. Data are collected from 5 trials.
5.1 Statistics of CKG datasets. Unseen Entity % indicates the percentage of unseen entities among all test entities.
5.2 Comparison of CKG completion results on the CN-100K, CN-82K and ATOMIC datasets. Improvement is computed by comparing with [103].
5.3 Comparison of CKG completion results on unseen entities for CN-82K-Ind and ATOMIC-Ind.
5.4 Ablation study on the CN-82K-Ind dataset.
5.5 Analysis of our graph densifier.
5.6 Top-3 nearest neighbors of unseen entities.
6.1 Experimental results on the sentence relatedness domain. Both Pearson correlation and Spearman's correlation are reported (Pearson/Spearman).
6.2 Experimental results on sentiment analysis. Accuracy is reported.
6.3 Experimental results on scientific text classification. Accuracy is reported.
6.4 Model performance of DomainWE with different corpora. Accuracy is reported.
6.5 Maximum number of sentence samples for each word.
6.6 Model size and inference speed for different models.

List of Figures

3.1 Pearson's correlation between intrinsic and extrinsic evaluators, where the x-axis shows extrinsic evaluators while the y-axis indicates intrinsic evaluators. Warm colors indicate positive correlation while cool colors indicate negative correlation.
4.1 Overview of the proposed S3E method.
4.2 Comparison of results with different settings of the cluster number. The STSB result is presented in Pearson correlation coefficients (%). SICK-E and SST2 are presented in accuracy.
4.3 Evolving word representation patterns across layers measured by cosine similarity, where (a-d) show the similarity across layers and (e-h) show the similarity over different hops. Four contextualized word representation models (BERT, SBERT, RoBERTa and XLNET) are tested.
4.4 Illustration of the proposed SBERT-WK model.
4.5 Performance comparison with respect to (a) window size and (b) starting layer, where the performance for the STS datasets is the Pearson correlation coefficients (%) while the performance for the SICK-E and SST2 datasets is test accuracy.
5.1 Illustration of a fragment of the ConceptNet commonsense knowledge graph: link prediction for seen entities (transductive) and unseen entities (inductive).
5.2 The triplet-degree distribution for the ConceptNet dataset, where the triplet-degree is the average of head and tail degrees. Triplets with high degrees are easier to predict in general. The CN-100K split is clearly unbalanced compared with CN-82K.
5.3 Illustration of the proposed InductivE model. The encoder contains multiple GR-GCN layers applied to the graph to aggregate each node's local information. The decoder uses an enhanced Conv-TransE model.
6.1 Performance of DomainWE when changing the maximum number of samples for each word.
6.2 Performance on the MR dataset with different word drop probabilities.
6.3 Performance on the SST-2 dataset with different word drop probabilities.
7.1 Computational flowchart used in BERT.
Abstract

Natural language is a kind of unstructured data source that does not have a natural numerical representation in its original form. In most natural language processing tasks, a distributed representation of text entities is necessary. Current state-of-the-art textual representation learning tries to extract the information of interest and embed it in vector representations (i.e., embeddings). Even though a lot of progress has been witnessed in representation learning techniques for natural language processing, especially the recent development of large pre-trained language models, there are still several research challenges, especially considering that languages have multiple levels of entities and span different domains. This thesis investigates and proposes representation learning techniques to learn semantically enhanced embeddings for both words and sentences, and studies their application to different domains, including commonsense knowledge representation and domain-specific word embeddings.

First, we analyze the desired properties of word embeddings and discuss different evaluation metrics for word embeddings and their connection with downstream tasks. With that, we introduce space enhancement methods to avoid the hubness problem and to increase the capability of word embeddings to capture word semantics and sequential information from context. Second, through the analysis of semantic groups of words, we introduce a new sentence representation technique that measures the correlation between different semantic groups within a sentence. We also propose to leverage deep pre-trained language models and introduce a pooling method that fuses the information learned from different layers into an enhanced sentence representation. A detailed discussion of the motivation behind the proposed pooling method is also given. Last, we study the problem of transferring representations to different textual domains. We propose a method to extract domain-specific word embedding models from pre-trained language models and apply it to domains such as sentiment sentence classification, sentence similarity, and scientific articles. We also investigate the application of representation learning to enable the inductive learning capability of the commonsense knowledge base completion task.

Chapter 1: Introduction

1.1 Significance of the Research

Natural language processing (NLP) is a branch of artificial intelligence that aims to help computers understand, interpret and process human language. The field is developing very fast and has many real-world applications. Machine translation [17, 43] tries to encode the meaning of a sentence in the source language and generate a corresponding sentence in the target language. Dialogue systems or conversational agents [94, 170] model conversations with humans and rely heavily on natural language understanding. Sentiment analysis [1, 117] is the task of examining a piece of text and determining the opinion of its author.
Most tasks in natural language processing, including the ones mentioned above, rely heavily on understanding the semantic meaning of natural languages. Therefore, how to model the meaning conveyed in natural languages is the key to more sophisticated problems.

The symbolic approach was the first attempt to model semantics in natural languages with hand-coded rules such as written grammars. Because of the complexity and variety of natural languages, it is impossible to model all language rules manually. With the development of modern data processing methods, statistical approaches and machine learning have become effective tools for modeling and extracting both syntactic and semantic information by exploiting language patterns learned from a large corpus.

Neural network models have recently become the most successful solution for NLP tasks. Their success relies heavily on the fact that they can learn and use continuous numerical representations for words and sentences. Finding compact representations of words and sentences avoids the dimension explosion problem introduced by one-hot representations. At the same time, high-quality word and sentence embeddings are desired for any natural language model because such embeddings capture the semantic patterns learned from a large corpus and can be transferred to the desired tasks. Therefore, the semantic representations of words and sentences are generally useful and not limited to any specific task. How to compute a better representation that can be widely applied to downstream tasks is one of the hot areas in natural language processing. Meanwhile, how to adapt generic embeddings to a specific domain is also challenging, considering the different properties of domain text.

Technically, learning good semantic representations of words and sentences is not trivial. Compared with structured languages such as programming languages, natural language is unstructured, which makes it hard to model. Moreover, natural language is still developing and evolving over time. Therefore, modeling the development of meaning and extracting it from existing language resources automatically is necessary yet challenging. Expert knowledge of linguistics, signal processing and data processing is important for solving these problems. Solid mathematical models are becoming the key to extracting language information and language representations. The semantic representation problem has still not been studied very thoroughly and requires more research effort.

1.2 Review of Previous Work

Word embedding has a long history of development. It can be rooted in the 1960s with the development of the vector space model for information retrieval on text data. Later, statistical approaches, including the application of singular value decomposition and latent semantic analysis to word models, led to a better understanding of the word embedding space representation. However, due to the high-dimensionality problem of statistical approaches, neural probabilistic language models were employed to reduce high-dimensional word representations to compact distributed representations.

Word embedding then developed into two major streams. One studies the co-occurrence statistics of words appearing within a certain window threshold. Matrix factorization techniques are widely used to find the semantic representation embedded in the co-occurrence matrix [93]. By leveraging knowledge from linguistics, assumptions can be made to model different semantic aspects of words. Kernel methods are also adopted in this direction.
Neural network models mostly model language as a generation process and learn word vectors from the linguistic contexts in which words occur. Neural network methods developed gradually and took over the field after 2013 with the word2vec model [109]. The quality of the obtained word vectors and the training speed of the model improved by a large margin.

Recently, deep contextualized word representations [121] and language models [52] have achieved superior performance in the NLP field. Different from vector space models, which treat words as static vector representations, deep contextualized word representations adjust the word representations based on their context words. Therefore, the input to a deep contextualized model must be a sentence, and there is no lookup table for word-to-vector correspondence. As a result, deep contextualized models are powerful in addressing the polysemy problem in word representations, which is hard to handle with static word embeddings.

Word embedding has been studied for decades and great improvements have been made. Yet, many NLP applications operate at the sentence level or on longer pieces of text. Building on the success of word representations, sentence embedding has also attracted a lot of attention. Currently, sentence embedding methods can be categorized into two types: non-parameterized models and parameterized models. Non-parameterized models rely on high-quality word embeddings. The simplest idea is to average individual word embeddings, which already provides a tough-to-beat baseline and is extremely efficient in computation. Analyzing semantic components through semantic subspace construction is another branch of work: by dividing sentence components into sub-groups, a more detailed analysis can be made and the complexity within each cluster is reduced. Capturing the sequential information in a sentence is an important step, and dynamic modeling tools from signal processing have also been incorporated in this process.
In contrast, parameterized models mainly rely on neural network models and demand training to update their parameters. Both supervised and unsupervised objectives can be incorporated into the neural network model. Yet, how to design the objectives for universal sentence embedding remains a challenging problem. The bias between multi-task training objectives should be avoided in order to obtain a high-quality sentence encoder.

1.3 Contributions of the Research

1.3.1 Word Embedding

Word embedding is a crucial step for natural language processing. Although embedded vector representations of words offer impressive performance on many natural language processing (NLP) applications, the information from the corpus is not completely modeled. Our research on word embedding focuses on two major aspects. First, we propose two post-processing techniques for existing word embedding models to 1) better handle sequential information in the context and 2) enhance the geometry of the word embedding space, leading to a more isotropic representation. Second, we study the possibility of obtaining domain-specific word embeddings from pre-trained language models to obtain competitive yet robust representations for the respective application domains.

• Embedded words usually share a large mean and several dominant principal components. As a consequence, the distribution of embedded words is not isotropic. Word vectors that are isotropically distributed (or uniformly distributed in spatial angles) are more differentiable from each other. To make the embedding scheme more robust and alleviate the hubness problem [127], previous work proposed to remove the dominant principal components of embedded words. On the other hand, some linguistic properties are still captured by these dominant principal components. Instead of removing dominant principal components completely, we propose a new post-processing technique that imposes regularization on principal components. We call this the Post-processing via Variance Normalization (PVN) method.

• Ordered information in language models also plays an important role in context-dependent representations. It is not used effectively and should be taken into consideration in context-independent word embedding methods. Therefore, we propose the Post-processing via Dynamic Embedding (PDE) method to inject contextual information into existing word embedding models.

• Evaluation of word embedding models involves intrinsic and extrinsic evaluators. However, the relationship and correlation between intrinsic and extrinsic evaluators are not yet well studied. We study the linguistic properties of these evaluators and give guidance for future word embedding evaluation and the design of word embedding models.

• Most existing works focus on generic word embedding. However, tasks from different domains favor different types of word embedding. With the current development of applying pre-trained language models to domain-specific tasks, we aim to obtain a more robust and efficient solution for domain-specific word embedding by transferring knowledge from pre-trained language models.

1.3.2 Sentence Embedding

Two methods are proposed for the universal sentence embedding problem to model the semantics conveyed in a sentence.

• A novel sentence embedding method built upon semantic subspace analysis, called semantic subspace sentence embedding (S3E), is proposed in this work. Given that word embeddings can capture semantic relationships and that semantically similar words tend to form semantic groups in a high-dimensional embedding space, we develop a sentence representation scheme by analyzing the semantic subspaces of its constituent words. Specifically, we construct a sentence model from two aspects. First, we represent words that lie in the same semantic group using an intra-group descriptor. Second, we characterize the interaction between multiple semantic groups with an inter-group descriptor. The S3E method is evaluated on both textual similarity tasks and supervised tasks. Experimental results show that it offers comparable or better performance than the state-of-the-art, and its complexity is much lower than that of other parameterized models.

• A contextualized word representation, called BERT, achieves state-of-the-art performance in quite a few NLP tasks. Yet, it is an open problem to generate a high-quality sentence representation from BERT-based word models. It was shown in a previous study that different layers of BERT capture different linguistic properties. This allows us to fuse information across layers to find a better sentence representation. We propose SBERT-WK to obtain sentence representations by dissecting BERT-based word models, and it achieves superior performance.

• Even though a lot of work has been done to explain the success of deep contextualized models, how the isolated token representations evolve across layers had not been studied before. We study the layer-wise patterns of word representations in deep contextualized models and provide insights into how the evolving patterns reflect language properties. These findings shed light on future research directions for the understanding of BERT-based models and inspire future sentence embedding work.

1.3.3 Knowledge Graph Embedding

Commonsense knowledge is hard to capture, and current commonsense knowledge graphs are far from complete.
We bring text-enriched entity embeddings to the knowledge graph field to enable inductive learning for commonsense knowledge graph completion. Promising inductive learning results are obtained when the semantic information of entities is considered.

A commonsense knowledge graph (CKG) is a special type of knowledge graph (KG) in which entities are composed of free-form text. Existing CKG completion methods focus on the transductive learning setting, where all entities are present during training. Here, we propose the first inductive learning setting for CKG completion, where unseen entities may appear at test time. We emphasize that the inductive learning setting is crucial for CKGs, because unseen entities are frequently introduced due to the fact that CKGs are dynamic and highly sparse. We propose InductivE as the first framework targeted at the inductive CKG completion task. InductivE first ensures the inductive learning capability by directly computing entity embeddings from raw entity attributes. Second, a graph neural network with a novel densification process is proposed to further enhance unseen entity representations with neighboring structural information. Experimental results show that InductivE performs especially well in inductive scenarios, where it achieves above 48% improvement over previous methods, while also outperforming state-of-the-art baselines in transductive settings.

1.4 Organization of the Dissertation

The rest of the thesis is organized as follows. In Chapter 2, we review the background, including word embedding, evaluation methods for word embedding, and sentence embedding models. In Chapter 3, we propose two off-the-shelf post-processing methods based on the variance of the word vector space and words' sequential information, and we briefly study the correlation between intrinsic and extrinsic evaluators. In Chapter 4, two models are proposed to obtain high-quality universal sentence embeddings: one is based on static word embeddings and the other is obtained through the analysis of contextualized word embedding models. In Chapter 5, we investigate the possibility of enhancing commonsense knowledge graphs with semantic sentence embeddings; the inductive learning capability for commonsense knowledge graphs is discussed and enabled with our proposed InductivE model. In Chapter 6, we obtain domain-specific word embeddings from pre-trained language models, which provides a robust and efficient solution for representation learning on domain-specific data. Finally, concluding remarks and future research directions are given in Chapter 7.

Chapter 2: Research Background

First, we introduce popular word embedding models and discuss desired properties of word models and evaluation methods (or evaluators). Then, we categorize evaluators into two types: intrinsic and extrinsic. Intrinsic evaluators test the quality of a representation independently of specific natural language processing tasks, while extrinsic evaluators use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task.

Static word embedding is a popular learning technique that transfers prior knowledge from a large unlabeled corpus [30, 108, 119]. Most recent sentence embedding methods are rooted in the fact that static word representations can be embedded with rich syntactic and semantic information. It is desirable to extend word-level embedding to the sentence level, which covers a longer piece of text.

In the following, we discuss the research background on word embedding, word embedding evaluation, and sentence embedding methods.
2.1 Word Embedding Models

Word embedding is a real-valued vector representation of words that embeds both semantic and syntactic meanings obtained from an unlabeled large corpus. It is a powerful tool widely used in modern natural language processing (NLP) tasks, including semantic analysis [181], information retrieval [140], dependency parsing [41, 114, 145], question answering [71, 184] and machine translation [40, 182, 184]. Learning a high-quality representation is extremely important for these tasks, yet the question "what is a good word embedding model" remains an open problem.

Several word embedding models are summarized below.

Neural Network Language Model (NNLM)

The Neural Network Language Model (NNLM) [28] jointly learns a word vector representation and a statistical language model with a feed-forward neural network that contains a linear projection layer and a non-linear hidden layer. An N-dimensional one-hot vector that represents the word is used as the input, where N is the size of the vocabulary. The input is first projected onto the projection layer. Afterward, a softmax operation is used to compute the probability distribution over all words in the vocabulary. As a result of its non-linear hidden layers, the NNLM model is computationally very complex. To lower the complexity, an NNLM is first trained using continuous word vectors learned from simple models; then, another N-gram NNLM is trained from these word vectors.

Continuous Bag-of-Words (CBOW) and Skip-Gram

Two iteration-based methods were proposed in the word2vec paper [108]. The first one is the Continuous Bag-of-Words (CBOW) model, which predicts the center word from its surrounding context. This model maximizes the probability of a word being in a specific context in the form of

    P(w_i | w_{i-c}, ..., w_{i-1}, w_{i+1}, ..., w_{i+c}),    (2.1)

where w_i is the word at position i and c is the window size. Thus, it yields a model that is contingent on the distributional similarity of words.

We focus on the first iteration in the discussion below. Let W be the vocabulary set containing all words. The CBOW model trains two matrices: 1) an input word matrix denoted by V in R^{N x |W|}, where the i-th column of V is the N-dimensional embedded vector for input word v_i, and 2) an output word matrix denoted by U in R^{|W| x N}, where the j-th row of U is the N-dimensional embedded vector for output word u_j. To embed input context words, we use the one-hot representation for each word initially and apply V^T to get the corresponding word vector embeddings of dimension N. We then apply U to the averaged context vector to generate a score vector and use the softmax operation to convert the score vector into a probability vector of size |W|. This process aims to yield a probability vector that matches the one-hot vector of the output word. The CBOW model is obtained by minimizing the cross-entropy loss between the probability vector and the one-hot vector of the output word. This is achieved by minimizing the following objective function:

    J(u_i) = -u_i^T v_hat + log sum_{j=1}^{|W|} exp(u_j^T v_hat),    (2.2)

where u_i is the i-th row of matrix U and v_hat is the average of the embedded input words.

Initial values for matrices V and U are randomly assigned. The dimension N of the word embedding can vary based on the application scenario; usually, it ranges from 50 to 300. After training, either V or U can be used alone, or the two can be averaged, to obtain the final word embedding matrix.

The skip-gram model [108] predicts the surrounding context words given a center word. It focuses on maximizing the probability of the context words given a specific center word, which can be written as

    P(w_{i-c}, ..., w_{i-1}, w_{i+1}, ..., w_{i+c} | w_i).    (2.3)

The optimization procedure is similar to that of the CBOW model but with the roles of context and center words reversed.

The softmax function mentioned above is a method to generate probability distributions from word vectors. It can be written as

    P(w_c | w_i) = exp(v_{w_c}^T v_{w_i}) / sum_{w=1}^{|W|} exp(v_w^T v_{w_i}).    (2.4)

This softmax function is not the most efficient one, since we must sum over all |W| words to normalize it. More efficient alternatives include negative sampling and hierarchical softmax [109]. Negative sampling maximizes the log probability of the softmax model by summing over only a small subset of the |W| words. Hierarchical softmax approximates the full softmax function by evaluating only log_2 |W| words; it uses a binary tree representation of the output layer, where the words are leaves and every node represents the relative probabilities of its child nodes. These two approaches do well in making predictions within local context windows and capturing complex linguistic patterns. Yet, they could be further improved if global co-occurrence statistics were leveraged.
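To make the negative-sampling objective concrete, the following is a minimal NumPy sketch of a single skip-gram-with-negative-sampling (SGNS) update. The function name, the learning rate and the way negatives are supplied are illustrative assumptions, not the reference word2vec implementation.

```python
import numpy as np

def sgns_update(V, U, center, context, neg_ids, lr=0.025):
    """One skip-gram-with-negative-sampling gradient step (sketch).

    V: (vocab, dim) center-word embeddings, U: (vocab, dim) context embeddings.
    center/context are word indices; neg_ids holds sampled negative indices.
    """
    v_c = V[center]                                    # center word vector
    ids = np.concatenate(([context], np.asarray(neg_ids)))
    labels = np.zeros(len(ids)); labels[0] = 1.0       # 1 for true context, 0 for negatives
    scores = 1.0 / (1.0 + np.exp(-U[ids] @ v_c))       # sigmoid(u_j^T v_c)
    grad = scores - labels                             # gradient of the loss w.r.t. each score
    dV = U[ids].T @ grad                               # gradient w.r.t. the center vector
    U[ids] -= lr * np.outer(grad, v_c)                 # update context/negative vectors
    V[center] -= lr * dV                               # update center vector
    # loss = -log sigma(u_o^T v_c) - sum_k log(1 - sigma(u_k^T v_c))
    return -np.log(scores[0] + 1e-10) - np.sum(np.log(1.0 - scores[1:] + 1e-10))
```

In practice such updates are run over all (center, context) pairs extracted from the corpus with a sliding window, and negatives are drawn from a smoothed unigram distribution.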
Co-occurrence Matrix

In the present context, the co-occurrence matrix is a word-document matrix. The (i, j) entry, X_{ij}, of the co-occurrence matrix X is the number of times word i appears in document j. This definition can be generalized to a window-based co-occurrence matrix, which records the number of times a certain word appears in a window of a given size around a center word. In contrast with window-based log-linear model representations (e.g., CBOW or skip-gram) that use local information only, this approach exploits global statistical information.

One method to process co-occurrence matrices is the singular value decomposition (SVD). The co-occurrence matrix is expressed as the matrix product U S V^T, where the first k columns of both U and V are word embedding matrices that transform vectors into a k-dimensional space, with the objective of capturing the semantics of words. Although embedded vectors derived by this procedure are good at capturing semantic and syntactic information, they still face problems such as the imbalance in word frequency, the sparsity and high dimensionality of embedded vectors, and computational complexity.

To combine the benefits of the SVD-based model and the log-linear models, the Global Vectors (GloVe) method [119] adopts a weighted least-squares model. It has a framework similar to that of the skip-gram model, yet it has a different objective function that contains co-occurrence counts. We first define a word-word co-occurrence matrix that records the number of times word j occurs in the context of word i. By modifying the objective function adopted by the skip-gram model, we derive a new objective function of the form

    J_hat = sum_{i=1}^{|W|} sum_{j=1}^{|W|} f(X_{ij}) (u_j^T v_i - log X_{ij})^2,    (2.5)

where X_{ij} is the number of times word j occurs in the context of word i and f(X_{ij}) is a weighting function.

The GloVe model is more efficient since its objective function involves only the nonzero elements of the word-word co-occurrence matrix. Besides, it produces more accurate results because it takes co-occurrence counts into account.
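The objective in Eq. (2.5) can be evaluated directly from the nonzero entries of the co-occurrence matrix, as the sketch below shows. The weighting function f is the clipped power function used by GloVe; the bias terms of the full GloVe model are omitted to mirror Eq. (2.5), and the dense matrix X is an assumption made for brevity.

```python
import numpy as np

def glove_loss_and_grads(V, U, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares objective of Eq. (2.5) and its gradients.

    V: (vocab, dim) word vectors v_i, U: (vocab, dim) context vectors u_j,
    X: dense word-word co-occurrence counts.
    """
    i_idx, j_idx = np.nonzero(X)                        # only nonzero co-occurrences
    x = X[i_idx, j_idx].astype(float)
    f = np.minimum((x / x_max) ** alpha, 1.0)           # GloVe weighting function f(X_ij)
    err = np.sum(V[i_idx] * U[j_idx], axis=1) - np.log(x)   # u_j^T v_i - log X_ij
    loss = np.sum(f * err ** 2)
    gV = np.zeros_like(V); gU = np.zeros_like(U)        # gradients for a plain SGD step
    np.add.at(gV, i_idx, (2 * f * err)[:, None] * U[j_idx])
    np.add.at(gU, j_idx, (2 * f * err)[:, None] * V[i_idx])
    return loss, gV, gU
```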
FastText

The methods described above do not use subword information explicitly, and the embeddings of rarely used words can be poorly estimated. Several methods have been proposed to remedy this issue, including the FastText method. It is still based on the skip-gram model, but each word is represented as a bag of character n-grams, or subword units [30]. A vector representation is associated with each character n-gram, and the average of these vectors gives the final representation of the word. This model improves the performance on syntactic tasks significantly, but not as much on semantic questions.
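As an illustration of the subword idea, the following sketch builds boundary-marked character n-grams and averages their vectors. It is not the actual FastText implementation (which hashes n-grams into buckets and also includes the full word token); ngram_table and dim are hypothetical names for a learned n-gram lookup table and the embedding dimension.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with '<' and '>' boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def subword_vector(word, ngram_table, dim=100):
    """Average of the n-gram vectors gives the word representation.

    ngram_table maps an n-gram string to its learned vector; n-grams that
    were never seen during training are simply skipped in this sketch.
    """
    vecs = [ngram_table[g] for g in char_ngrams(word) if g in ngram_table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```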
N-gram Model

The N-gram model is an important concept in language models, and it has been used in many NLP tasks. The ngram2vec method [183] incorporates the n-gram model into various baseline embedding models such as word2vec, GloVe, PPMI and SVD. Furthermore, instead of using traditional training sample pairs or sub-word-level information as FastText does, the ngram2vec method considers word-word-level co-occurrence and enlarges the reception window by adding word-ngram and ngram-ngram co-occurrence information. Its performance on word analogy and word similarity tasks is significantly improved. It is also able to learn negation word pairs/phrases like "not interesting", which is a difficult case for other models.

Dictionary Model

Even with larger text data available, extracting and embedding all linguistic properties into a word representation directly is a challenging task. Lexical databases such as WordNet are helpful for learning word embeddings, yet labeling large lexical databases is a time-consuming and error-prone task. In contrast, a dictionary is a large and refined data source for describing words. The dict2vec method learns word representations from dictionary entries as well as a large unlabeled corpus [156]. Using the semantic information from a dictionary, semantically related words tend to be closer in the high-dimensional vector space. Also, negative sampling is used to filter out pairs that are not correlated in the dictionary.

Deep Contextualized Model

To represent complex characteristics of words and word usage across different linguistic contexts effectively, a new model for deep contextualized word representation was introduced in [121]. First, an Embeddings from Language Models (ELMo) representation is generated with a function that takes an entire sentence as the input. The function is generated by a bidirectional LSTM network trained with a coupled language model objective. Existing embedding models can be improved by incorporating the ELMo representation, as it is effective in incorporating sentence information. Following ELMo, a series of pre-trained neural network models for language tasks have been proposed, such as BERT [52] and OpenAI GPT [124]. Their effectiveness has been demonstrated in many language tasks.

2.2 Desired Properties of Embedding Models

Different word embedding models yield different vector representations. There are a few properties that all good representations should aim for.

• Non-conflation [176]. Different local contexts around a word should give rise to specific properties of the word, e.g., the plural or singular form, the tenses, etc. Embedding models should be able to discern differences in the contexts and encode these details into a meaningful representation in the word subspace.

• Robustness Against Lexical Ambiguity [176]. All senses (or meanings) of a word should be represented. Models should be able to discern the sense of a word from its context and find the appropriate embedding. This is needed to avoid meaningless representations arising from the conflicting properties of polysemous words. For example, word models should be able to represent the difference between "the bow of a ship" and "bow and arrows".

• Demonstration of Multifacetedness [176]. The facets of a word, including its phonetic, morphological, syntactic and other properties, should contribute to its final representation. This is important, as word models should yield meaningful word representations and perhaps find relationships between different words. For example, the representation of a word should change when the tense is changed or a prefix is added.

• Reliability [73]. The results of a word embedding model should be reliable. This is important because word vectors are randomly initialized when being trained. Even if a model creates different representations from the same dataset because of random initialization, the performance of the various representations should score consistently.

• Good Geometry [66]. The geometry of an embedding space should have a good spread. Generally speaking, a smaller set of more frequent, unrelated words should be evenly distributed throughout the space, while a larger set of rare words should cluster around frequent words. Word models should overcome the difficulty arising from the inconsistent frequency of word usage and derive some meaning from word frequency.

2.3 Desired Properties of Evaluators

The goal of an evaluator is to compare characteristics of different word embedding models with a quantitative and representative metric. However, it is not easy to find a concrete and uniform way to evaluate these abstract characteristics. Generally, a good word embedding evaluator should aim for the following properties.

• Good Testing Data. To ensure a reliable and representative score, the testing data should be varied, with a good spread in the span of the word space. Both frequently and rarely occurring words should be included in the evaluation. Furthermore, the data should be reliable in the sense that they are correct and objective.

• Comprehensiveness. Ideally, an evaluator should test for many properties of a word embedding model. This is an important property not only for giving a representative score but also for determining the effectiveness of an evaluator.

• High Correlation. The score of a word model in an intrinsic evaluation task should correlate well with the performance of the model in downstream natural language processing tasks. This is important for determining the effectiveness of an evaluator.

• Efficiency. Evaluators should be computationally efficient. Most models are created to solve computationally expensive downstream tasks. Model evaluators should be simple yet able to predict the downstream performance of a model.

• Statistical Significance. The performance of different word embedding models with respect to an evaluator should have enough statistical significance, or enough variance between score distributions, to be differentiated [142]. This is needed for judging whether one model is better than another and is helpful in determining performance rankings between models.

2.4 Evaluation Methods for Word Embedding

Various evaluation methods (or evaluators) have been proposed to test the quality of word embedding models. As introduced in [18], there are two main categories of evaluation methods: intrinsic and extrinsic evaluators. Extrinsic evaluators use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task. Examples include part-of-speech tagging [97], named-entity recognition [175], sentiment analysis [129] and machine translation [17]. Extrinsic evaluators are more computationally expensive, and they may not always be directly applicable.
Intrinsic evaluators test the quality of a representation independently of specific natural language processing tasks. They measure syntactic or semantic relationships among words directly, and aggregate scores are obtained by testing the vectors on selected sets of query terms and semantically related target words. One can further classify intrinsic evaluators into two types: 1) absolute evaluation, where embeddings are evaluated individually and only their final scores are compared, and 2) comparative evaluation, where people are asked about their preferences among different word embeddings [139]. Since comparative intrinsic evaluators demand additional resources for subjective tests, they are not as popular as the absolute ones.

2.4.1 Intrinsic Evaluators

A set of absolute intrinsic evaluators is discussed in this section.

2.4.1.1 Word Similarity

The word similarity evaluator correlates the distance between word vectors with human-perceived semantic similarity. The goal is to measure how well the notion of human-perceived similarity is captured by the word vector representations and to validate the distributional hypothesis, namely that the meaning of a word is related to the contexts it occurs in. For the latter, the way distributional semantic models simulate similarity is still ambiguous [61].

One commonly used measure is the cosine similarity, defined by

    cos(w_x, w_y) = (w_x . w_y) / (||w_x|| ||w_y||),    (2.6)

where w_x and w_y are two word vectors and ||w_x|| and ||w_y|| are their L2 norms. This test computes the correlation over all vector dimensions, independently of their relevance for a given word pair or for a semantic cluster.

Because its scores are normalized by the vector length, it is robust to scaling, and it is computationally inexpensive. Thus, it is easy to compare multiple scores from a model, and it can be used in a word model's prototyping and development. Furthermore, word similarity can be used to test a model's robustness against lexical ambiguity, since a dataset aimed at testing multiple senses of a word can be created.

On the other hand, it has several problems, as discussed in [61]. This test is aimed at finding the distributional similarity between pairs of words, but this is often conflated with morphological relations and simple collocations. Similarity may also be confused with relatedness. For example, car and train are two similar words, while car and road are two related words. The correlation between the score from this intrinsic test and other extrinsic downstream tasks can be low in some cases. There is doubt about the effectiveness of this evaluator because it might not be comprehensive.

2.4.1.2 Word Analogy

Given a pair of words a and a* and a third word b, the analogy relationship between a and a* can be used to find the corresponding word b* for b. Mathematically, it is expressed as

    a : a* :: b : __ ,    (2.7)

where the blank is b*. One example could be

    write : writing :: read : reading.    (2.8)

The 3CosAdd method [110] solves for b* using the following equation:

    b* = argmax_{b'} cos(b', a* - a + b).    (2.9)

Thus, a high cosine similarity means that the vectors share a similar direction. However, it is important to note that the 3CosAdd method normalizes vector lengths using the cosine similarity [110]. Alternatively, there is the 3CosMul method [92], which is defined as

    b* = argmax_{b'} [cos(b', b) cos(b', a*)] / [cos(b', a) + eps],    (2.10)

where eps = 0.001 is used to prevent division by zero. The 3CosMul method has the same effect as taking the logarithm of each term before summation; that is, small differences are enlarged while large ones are suppressed. Therefore, it is observed that the 3CosMul method offers a better balance between different aspects.

It was stated in [132] that many models score under 30% on analogy tests, suggesting that not all relations can be identified in this way. In particular, lexical-semantic relations like synonymy and antonymy are most difficult. They also concluded that the analogy test is most successful when all three source vectors are relatively close to the target vector; the accuracy of this test decreases as their distance increases. Another seemingly counter-intuitive finding is that words with denser neighborhoods yield higher accuracy, perhaps because of the correlation between density and distance. Another problem with this test is subjectivity. Analogies are fundamental to human reasoning and logic, yet the datasets on which current word models are trained do not encode our sense of reasoning and are rather different from the way humans learn natural languages. Thus, given a word pair, the vector space model may find a different relationship from what humans may find.

Generally speaking, this evaluator serves as a good benchmark for testing multifacetedness. A pair of words a and a* can be chosen based on the facet or property of interest, with the hope that the relationship between them is preserved in the vector space. This contributes to a better vector representation of words.
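For reference, Eqs. (2.6), (2.9) and (2.10) translate into the sketch below. It assumes emb is a dictionary mapping words to vectors and exclude contains the three query words, which are removed from the candidate set as is standard practice; real evaluators vectorize the search over the whole vocabulary instead of looping.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of Eq. (2.6)."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def solve_analogy(a, a_star, b, emb, exclude, method="3CosAdd", eps=1e-3):
    """Return b* completing a : a* :: b : __ via Eq. (2.9) or Eq. (2.10)."""
    best_word, best_score = None, -np.inf
    target = emb[a_star] - emb[a] + emb[b]
    for w, v in emb.items():
        if w in exclude:
            continue
        if method == "3CosAdd":
            score = cosine(v, target)
        else:  # 3CosMul
            score = cosine(v, emb[b]) * cosine(v, emb[a_star]) / (cosine(v, emb[a]) + eps)
        if score > best_score:
            best_word, best_score = w, score
    return best_word
```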
2.4.1.3 Concept Categorization

An evaluator that is somewhat different from both word similarity and word analogy is concept categorization. Here, the goal is to split a given set of words into different categorical subsets. For example, given the task of separating words into two categories, the model should be able to categorize the words sandwich, tea, pasta and water into two groups.

In general, the test is conducted as follows. First, the vector corresponding to each word is calculated. Then, a clustering algorithm (e.g., the k-means algorithm) is used to separate the set of word vectors into n different categories. A performance metric is then defined based on cluster purity, where purity refers to whether each cluster contains concepts from the same or different categories [23].

Looking at the datasets provided for this evaluator, we would like to point out some challenges. First, the datasets do not have standardized splits. Second, no specific clustering method is defined for this evaluator. It is important to note that clustering can be computationally expensive, especially when there are a large number of words and categories. Third, the clustering methods may be unreliable if there are either uneven distributions of word vectors or no clearly defined clusters.

Subjectivity is another main issue. As stated by Senel et al. [141], humans can group words by inference using concepts that word embeddings may gloss over. Given the words lemon, sun, banana, blueberry, ocean and iris, one could group them into yellow objects (lemon, sun, banana) and blue objects (blueberry, ocean, iris). Since words can belong to multiple categories, we may also argue that lemon, banana, blueberry and iris are in the plant category while sun and ocean are in the nature category. However, due to the uncompromising nature of the performance metric, there is no adequate method for evaluating each cluster's quality.

The property that the sets of words and categories seem to test for is semantic relation, as words are grouped into concept categories. One good property of this evaluator is its ability to test for the frequency effect and the hubness problem, since it is good at revealing whether frequent words are clustered together.
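A minimal sketch of how such a test can be run is given below; as noted above, no clustering method or split is standardized, so the choices here (scikit-learn's KMeans and a majority-label purity score) are only one reasonable instantiation, not the protocol used in any particular benchmark.

```python
import numpy as np
from sklearn.cluster import KMeans

def categorization_purity(vectors, gold_labels, n_clusters):
    """Cluster word vectors with k-means and report cluster purity.

    vectors: (n_words, dim) array of word embeddings.
    gold_labels: integer category id for each word.
    """
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    gold = np.asarray(gold_labels)
    correct = 0
    for c in range(n_clusters):
        members = gold[pred == c]
        if len(members):
            correct += np.bincount(members).max()   # size of the majority gold label in cluster c
    return correct / len(gold)
```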
2.4.1.4 Outlier Detection

A relatively new method that evaluates word clustering in vector space models is outlier detection [36]. The goal is to find the word that does not belong to a given group of words. This evaluator tests the semantic coherence of vector space models, where semantic clusters can first be identified. There is a clear gold standard for this evaluator, since human performance on this task is extremely high compared to word similarity tasks, and it is less subjective. To formalize this evaluator mathematically, we take a set of words

    W = {w_1, w_2, ..., w_{n+1}},    (2.11)

in which there is one outlier. Next, we take the compactness score of a word w as

    c(w) = 1 / (n(n-1)) sum_{w_i in W \ {w}} sum_{w_j in W \ {w}, w_j != w_i} sim(w_i, w_j).    (2.12)

Intuitively, the compactness score of a word is the average of all pairwise semantic similarities of the remaining words in W. The outlier is the word with the lowest compactness score. There is less research on this evaluator than on word similarity and word analogy. Yet, it provides a good metric to check whether the geometry of an embedding space is good. If frequent words are clustered into hubs while rarer words are not clustered around the more frequent words they relate to, the embedding will not perform well on this metric.

There is some subjectivity involved in this evaluator, as the relationship between different word groups can be interpreted in different ways. However, since human perceptions are often correlated, it may be safe to assume that this evaluator is objective enough [36]. Also, similar to the word analogy evaluator, this evaluator relies heavily on human reasoning and logic. The outliers identified by humans are strongly influenced by the characteristics of words perceived to be important, yet the recognized patterns might not be immediately clear to word embedding models.
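Eq. (2.12) translates directly into code. The sketch below uses cosine similarity as sim(., .) and returns the word with the lowest compactness score; emb is assumed to map each word to its vector.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def find_outlier(words, emb):
    """Return the word with the lowest compactness score c(w) of Eq. (2.12)."""
    def compactness(w):
        rest = [x for x in words if x != w]          # W without the candidate outlier
        n = len(rest)
        total = sum(cosine(emb[a], emb[b])
                    for a in rest for b in rest if a != b)
        return total / (n * (n - 1))
    return min(words, key=compactness)
```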
2.4.1.5 QVEC

QVEC [161] is an intrinsic evaluator that measures the component-wise correlation between word vectors from a word embedding model and manually constructed linguistic word vectors from the SemCor dataset. These linguistic word vectors are constructed in an attempt to encode well-defined linguistic properties. QVEC is grounded in the hypothesis that dimensions in the distributional vectors correspond to linguistic properties of words, so that linear combinations of vector dimensions produce relevant content. Furthermore, QVEC is a recall-oriented measure: highly correlated alignments provide evaluation and annotation of vector dimensions, while missing information or noisy dimensions do not significantly affect the score.

The most prevalent problem with this evaluator is the subjectivity of the man-made linguistic vectors. Current word embedding techniques perform much better than man-made models, as they are based on statistical relations found in data. Having a score based on the correlation between the word embeddings and the linguistic word vectors may therefore seem counter-intuitive, and the QVEC scores are not very representative of the performance in downstream tasks. On the other hand, because the linguistic vectors are manually generated, we know exactly which properties the method is testing for.

2.4.2 Extrinsic Evaluators

Based on the definition of extrinsic evaluators, any NLP downstream task can be chosen as an evaluation method. Here, we present five extrinsic evaluators: 1) part-of-speech tagging, 2) chunking, 3) named-entity recognition, 4) sentiment analysis and 5) neural machine translation.

2.4.2.1 Part-of-Speech (POS) Tagging

Part-of-speech (POS) tagging, also called grammar tagging, aims to assign to each input token a tag with its part of speech, such as noun, verb, adverb or conjunction. Due to the availability of labeled corpora, many methods can complete this task, either by learning probability distributions through linguistic properties or by statistical machine learning. As a low-level linguistic resource, POS tagging can be used for several purposes such as text indexing and retrieval.

2.4.2.2 Chunking

The goal of chunking, also called shallow parsing, is to label segments of a sentence with syntactic constituents. Each word is first assigned one tag indicating its properties, such as noun or verb phrase, which is then used to group words syntactically into correlated phrases. Compared with POS tagging, chunking provides more clues about the structure of the sentence or of phrases within the sentence.

2.4.2.3 Named-Entity Recognition

The named-entity recognition (NER) task is widely used in natural language processing. It focuses on recognizing information units such as names (including person, location and organization) and numeric expressions (e.g., time and percentage). Like POS tagging, NER systems use both linguistic grammar-based techniques and statistical models. A grammar-based system demands a lot of effort from experienced linguists. In contrast, a statistical NER system requires a large amount of human-labeled data for training, and it can achieve higher precision. However, current NER systems based on machine learning are heavily dependent on the training data; they may not be robust and may not generalize well to different linguistic domains.

2.4.2.4 Sentiment Analysis

Sentiment analysis is a particular text classification problem. Usually, a text fragment is marked with a binary or multi-level label representing the positiveness or negativeness of the text's sentiment. An example is the IMDb dataset [102], where the task is to decide whether a given movie review is positive or negative. Word phrases are an important factor in the final decision; negation words such as "no" or "not" can completely reverse the meaning of a sentence. Because we are working on sentence-level or paragraph-level data, word order and parsing play important roles in analyzing sentiment. Traditional methods focus more on human-labeled sentence structures. With the development of machine learning, more statistical and data-driven approaches have been proposed to deal with the sentiment analysis task [129]. Compared to unlabeled monolingual data, labeled sentiment analysis data are limited. Word embedding is commonly used in sentiment analysis tasks, serving as knowledge transferred from a large generic corpus. Furthermore, the inference tool is also an important factor and might play a significant role in the final result. For example, when conducting sentiment analysis, we may use bag-of-words, SVM, LSTM or CNN models on top of a certain word model, and the performance boost can differ considerably across inference tools.

2.4.2.5 Neural Machine Translation (NMT)

Neural machine translation (NMT) [17] refers to a category of deep-learning-based methods for machine translation. With large-scale parallel corpus data available, NMT provides state-of-the-art results for machine translation and has a large gain over traditional machine translation methods. Even with large-scale parallel data available, domain adaptation is still important to further improve performance. Domain adaptation methods are able to leverage monolingual corpora for existing machine translation tasks. Compared to parallel corpora, monolingual corpora are much larger and can provide a model with richer linguistic properties. One representative domain adaptation method is word embedding, which is why NMT can be used as an extrinsic evaluation task.
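The extrinsic protocol described above can be made concrete with a small sketch: the word embedding is frozen, each sentence is represented by its averaged word vectors, and a simple classifier is trained on the labeled task. The classifier choice (logistic regression) and the cross-validation setup are illustrative assumptions, not the exact pipeline used in the experiments later in this thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def sentence_features(sentences, emb, dim):
    """Average the pre-trained word vectors of each tokenized sentence."""
    feats = np.zeros((len(sentences), dim))
    for k, toks in enumerate(sentences):
        vecs = [emb[t] for t in toks if t in emb]
        if vecs:
            feats[k] = np.mean(vecs, axis=0)
    return feats

def extrinsic_score(sentences, labels, emb, dim=300):
    """Score a frozen word embedding on a sentiment dataset with a linear classifier."""
    X = sentence_features(sentences, emb, dim)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=5).mean()
```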
For example, it improves downstream tasks such as machine translation [88], syntactic parsing [56], and text classification [144]. Yet, many NLP applications operate at the sentence level or on a longer piece of text. Although sentence embedding has received a lot of attention recently, encoding a sentence into a fixed-length vector that captures different linguistic properties remains a challenge.

Universal sentence embedding methods can be categorized into two types: i) parameterized models and ii) non-parameterized models. Parameterized models demand training to update their parameters. The skip-thought model [81] adopts an encoder-decoder model to predict context sentences in an unsupervised manner. The InferSent model [47] is trained on high-quality supervised data; namely, the Natural Language Inference data [34]. The STN model [151] leverages a multi-tasking framework for sentence embedding. Different parameterized models attempt to capture semantic and syntactic meanings from different aspects. Yet, their performance gain is marginal as compared with simple averaging [120], which is not satisfying given their high computational cost in training.

Non-parameterized sentence embedding methods rely on high-quality word embeddings. The simplest idea is to average individual word embeddings, which already offers a tough-to-beat baseline. Following this line, several weighted averaging methods have been proposed, including tf-idf, SIF [11], uSIF [59] and GEM [179]. Concatenating vector representations of different resources yields another family of methods. Examples include SCDV [106] and p-mean [135]. To better capture the sequential information, DCT [9] and EigenSent [79] were proposed from a signal processing perspective.

2.6 Knowledge Graph Embedding

In the past several years, a large body of work has been proposed targeting KG completion. Knowledge graph completion by predicting missing links has been intensively investigated in recent years. Most methods are embedding-based. However, relatively little work targets CKG completion. Given that CKGs are quite important in many NLP applications, including question answering and natural language understanding, it is desirable to have more exploration of CKG datasets.

Most KG embedding models focus on non-attributed graphs. Even though some work tries to incorporate entity descriptions, this is not the mainstream approach given that descriptions are sometimes incomplete or noisy [143]. Different from KG datasets, every entity in a CKG dataset has its own short textual description. This provides extra information to leverage but also leads to a larger and sparser graph. The inductive learning problem becomes even more essential in CKG datasets.

One previous work on CKG completion also leverages textual descriptions [103]. However, it still focuses on the transductive setting and cannot be directly applied to the inductive setting for two reasons. First, the text embeddings for entities are only used for initialization and are jointly optimized during training. Therefore, the final entity embeddings are obtained during training, and all entities are required to be present in order to make predictions. Second, the input embeddings to the graph encoder are randomly initialized and optimized during training, which also requires all entities to be present in the training stage. Therefore, this approach cannot be used for the inductive link prediction task, and our work provides the first benchmark on inductive CKG completion.
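To make the notion of embedding-based link prediction referenced above concrete, the sketch below scores candidate triples with a TransE-style translation score. TransE is used here only as a generic, well-known example of this family; it is not one of the CKG-specific methods discussed in this section, and the entity/relation tables and dimensions are random placeholders.

```python
# Illustrative TransE-style scorer for embedding-based link prediction.
# Entity/relation tables and sizes are placeholders, not tied to any dataset above.
import numpy as np

rng = np.random.default_rng(0)
dim, n_entities, n_relations = 64, 1000, 20
E = rng.normal(size=(n_entities, dim))   # entity embeddings
R = rng.normal(size=(n_relations, dim))  # relation embeddings

def transe_score(h, r, t):
    """Higher score = more plausible triple (negative L2 norm of h + r - t)."""
    return -np.linalg.norm(E[h] + R[r] - E[t])

def rank_tails(h, r, candidates):
    """Rank candidate tail entities for a (head, relation, ?) query."""
    return sorted(candidates, key=lambda t: -transe_score(h, r, t))

print(rank_tails(h=3, r=5, candidates=range(10))[:3])
```

In a transductive setting the tables E and R are learned for a fixed entity set; the inductive setting discussed above requires producing representations for entities unseen during training, e.g., from their textual descriptions.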
Chapter 3
Word Representation Learning: Enhancement and Evaluation

3.1 Introduction

In this section, we introduce our work on the enhancement and evaluation of word embedding models.

3.1.1 Enhancement of Word Embedding

Language processing is becoming more and more important in multimedia processing. Although embedded vector representations of words offer impressive performance in many natural language processing (NLP) applications, the information of ordered input sequences is lost to some extent if only context-based samples are used in the training. For further performance improvement, two new post-processing techniques, called post-processing via variance normalization (PVN) and post-processing via dynamic embedding (PDE), are proposed in this work. The PVN method normalizes the variance of principal components of word vectors, while the PDE method learns orthogonal latent variables from ordered input sequences. The PVN and the PDE methods can be integrated to achieve better performance. We apply these post-processing techniques to several popular word embedding methods to yield their post-processed representations. Extensive experiments are conducted to demonstrate the effectiveness of the proposed post-processing techniques.

By transferring prior knowledge from a large unlabeled corpus, one can embed words into high-dimensional vectors with both semantic and syntactic meanings in their distributional representations. The design of effective word embedding methods has attracted the attention of researchers in recent years because of their superior performance in many downstream natural language processing (NLP) tasks, including sentiment analysis [146], information retrieval [140] and machine translation [53].

PCA-based post-processing methods have been examined in various research fields. In the word embedding field, it is observed that learned word vectors usually share a large mean and several dominant principal components, which prevents word embeddings from being isotropic. Word vectors that are isotropically distributed (or uniformly distributed in spatial angles) can be differentiated from each other more easily. A post-processing algorithm (PPA) was recently proposed in [112] to exploit this property. That is, the mean and several dominant principal components are removed by the PPA method. On the other hand, their complete removal may not be the best choice since they may still provide some useful information. Instead of removing them, we propose a new post-processing technique that normalizes the variance of embedded words and call it the PVN method here. The PVN method imposes constraints on dominant principal components instead of erasing their contributions completely.

Existing word embedding methods are primarily built upon the concept that “You shall know a word by the company it keeps." [64]. As a result, most current word embedding methods are based on training samples of “(word, context)". Most context-based word embedding methods do not differentiate the word order in sentences; that is, they ignore the relative distance between the target and the context words in a chosen context window. Intuitively, words that are closer in a sentence should have stronger correlation. This has been verified in [80]. Thus, it is promising to design a new word embedding method that not only captures the context information but also models the dynamics in a word sequence.

To achieve further performance improvement, we propose the second post-processing technique, which is called the PDE method.
Inspired by dynamic principal component analysis (Dynamic-PCA) [55], the PDE method projects existing word vectors onto an orthogonal subspace that captures the sequential information optimally under a pre-defined language model. The PVN method and the PDE method can work together to boost the overall performance. Extensive experiments are conducted to demonstrate the effectiveness of PVN/PDE post-processed representations over their original ones.

Post-processing and dimensionality reduction techniques in word embedding have primarily been based on principal component analysis (PCA). There is a long history of high-dimensional correlated data analysis using latent variable extraction, including PCA, singular spectrum analysis (SSA) and canonical correlation analysis (CCA). They are shown to be effective in various applications. Among them, PCA is a widely used data-driven dimensionality reduction technique as it maximizes the variance of the extracted latent variables. However, the conventional PCA focuses on static variance while ignoring time dependence between data distributions. It demands additional work to apply the PCA to dynamic data.

It is pointed out in [112] that embedded words usually share a large mean and several dominant principal components. As a consequence, the distribution of embedded words is not isotropic. Word vectors that are isotropically distributed (or uniformly distributed in spatial angles) are more differentiable from each other. To make the embedding scheme more robust and alleviate the hubness problem [127], they proposed to remove dominant principal components of embedded words. On the other hand, some linguistic properties are still captured by these dominant principal components. Instead of removing dominant principal components completely, we propose a new post-processing technique that imposes regularization on principal components in this work.

Recently, contextualized word embedding has gained attention since it tackles the word meaning problem using information from the whole sentence [121]. It contains a bi-directional long short-term memory (bi-LSTM) module to learn a language model whose inputs are sequences. The performance of this model indicates that the ordered information plays an important role in the context-dependent representation, and it should also be taken into consideration in the design of context-independent word embedding methods.

There are three main contributions in the enhancement part. We propose two new post-processing techniques in Sec. 3.2.1 and Sec. 3.2.2, respectively. Then, we apply the developed techniques to several popular word embedding methods and generate their post-processed representations. Extensive experiments are conducted over various baseline models including SGNS [109], CBOW [109], GloVe [119] and Dict2vec [156] in Sec. 3.2.3 to demonstrate the effectiveness of the post-processed representations over their original ones.

3.1.2 Evaluation of Word Embedding

Extensive evaluation of a large number of word embedding models for language processing applications is conducted in this work. First, we introduce popular word embedding models and discuss desired properties of word models and evaluation methods (or evaluators). Then, we categorize evaluators into two types: intrinsic and extrinsic. Intrinsic evaluators test the quality of a representation independent of specific natural language processing tasks, while extrinsic evaluators use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task. We report experimental results of intrinsic and extrinsic evaluators on six word embedding models.
It is shown that different evaluators focus on different aspects of word models, and some are more correlated with natural language processing tasks. Finally, we adopt correlation analysis to study the performance consistency of extrinsic and intrinsic evaluators.

A good word representation should have certain good properties. An ideal word evaluator should be able to analyze word embedding models from different perspectives. Yet, existing evaluators put emphasis on a certain aspect, with or without consciousness. There is no unified evaluator that analyzes word embedding models comprehensively. Researchers have a hard time selecting among word embedding models because models do not always perform at the same level on different intrinsic evaluators. As a result, the gold standard for a good word embedding model differs for different language tasks. In this work, we conduct a correlation study between intrinsic evaluators and language tasks so as to provide insights into various evaluators and help people select word embedding models for specific language tasks.

Although the correlation between intrinsic and extrinsic evaluators was studied before [42, 123], this topic has never been thoroughly and seriously treated. For example, producing models by changing only the window size does not happen often in real-world applications, and the conclusion drawn in [42] might be biased. The work in [123] only focused on Chinese characters with limited experiments. We provide the most comprehensive study and try to avoid such bias as much as possible in this work. Representative performance metrics of intrinsic evaluation experimental results are offered in Sec. 3.3.1. Representative performance metrics of extrinsic evaluation experimental results are provided in Sec. 3.3.2. We conduct a consistency study on intrinsic and extrinsic evaluators using correlation analysis in Sec. 3.3.3.

3.2 Proposed Two Enhancement Methods for Word Representation

3.2.1 Post-processing via Variance Normalization

We modify the PPA method [112] by regularizing the variances of the leading components to a similar level and call it the post-processing algorithm with variance normalization (PVN). The PVN method is described in Algorithm 1, where V denotes the vocabulary set.

In Step 4, (u_i^T ṽ(w)) is the projection of ṽ(w) onto the i-th principal component. We multiply it by a ratio factor (σ_i − σ_{d+1})/σ_i to constrain its variance. Then, we project it back to the original bases and subtract it from the mean-removed word vector.

Algorithm 1: Post-Processing via Variance Normalization (PVN)
Input: Word representations v(w), w ∈ V, and threshold parameter d.
1. Remove the mean of {v(w), w ∈ V}:
   μ ← (1/|V|) Σ_{w ∈ V} v(w), and ṽ(w) ← v(w) − μ.
2. Compute the first d + 1 PCA components:
   u_1, ..., u_{d+1} ← PCA({ṽ(w), w ∈ V}).
3. Compute the standard deviations of the first d + 1 PCA components:
   σ_1, ..., σ_{d+1} ← standard deviations of the projections onto u_1, ..., u_{d+1}.
4. Determine the new representation:
   v'(w) ← ṽ(w) − Σ_{i=1}^{d} ((σ_i − σ_{d+1})/σ_i) (u_i^T ṽ(w)) u_i.
Output: Processed representations v'(w), w ∈ V.

To compute the standard deviation of the j-th principal component of the processed representation v'(w), we project v'(w) onto basis u_j:

u_j^T v'(w) = u_j^T \tilde{v}(w) - \sum_{i=1}^{d} \frac{\sigma_i - \sigma_{d+1}}{\sigma_i} (u_i^T \tilde{v}(w))\, u_j^T u_i
            = u_j^T \tilde{v}(w) - \frac{\sigma_j - \sigma_{d+1}}{\sigma_j} (u_j^T \tilde{v}(w))\, u_j^T u_j
            = \frac{\sigma_{d+1}}{\sigma_j}\, u_j^T \tilde{v}(w).    (3.1)

Thus, the standard deviation of the post-processed j-th principal component, 1 ≤ j ≤ d, is equal to σ_{d+1}. In other words, all variances of the leading d + 1 principal components are normalized to the same level by the PVN. This makes embedding vectors more evenly distributed across all dimensions.

The only hyper-parameter to tune is the threshold parameter d. The optimal setting may vary with different word embedding baselines.
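Before turning to how d is chosen, the steps of Algorithm 1 can be summarized in a short sketch. This is a minimal NumPy/scikit-learn illustration under the assumption that the embeddings are given as a matrix with one word per row; it is not the implementation used for the experiments reported later, and the random matrix stands in for a real embedding table.

```python
# Minimal NumPy sketch of post-processing via variance normalization (PVN, Algorithm 1).
# `vectors` is a (|V|, D) matrix of word embeddings; d is the threshold parameter.
import numpy as np
from sklearn.decomposition import PCA

def pvn(vectors, d):
    # Step 1: remove the mean.
    mu = vectors.mean(axis=0)
    v_tilde = vectors - mu
    # Step 2: first d + 1 principal components.
    pca = PCA(n_components=d + 1).fit(v_tilde)
    U = pca.components_                      # shape (d + 1, D)
    # Step 3: standard deviations of the leading d + 1 components.
    sigma = np.sqrt(pca.explained_variance_)
    # Step 4: shrink each leading component so its std matches sigma[d] (= sigma_{d+1}).
    proj = v_tilde @ U.T                     # projections onto u_1, ..., u_{d+1}
    ratios = (sigma[:d] - sigma[d]) / sigma[:d]
    return v_tilde - (proj[:, :d] * ratios) @ U[:d]

processed = pvn(np.random.rand(10000, 300), d=11)
```

After this step, the first d + 1 components of `processed` share the same standard deviation, matching the derivation in Eq. (3.1).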
A good rule of thumb is to choose d ≈ D/50, where D is the dimension of the word embeddings. Also, we can determine the dimension threshold d by examining the energy ratios of the principal components.

3.2.2 Post-processing via Dynamic Embedding

3.2.2.1 Language Model

Our language model is a linear transformation that predicts the current word given its ordered context words. For a sequence of words w_1, w_2, w_3, ..., w_n, the word embedding representation is v(w_1), v(w_2), v(w_3), ..., v(w_n). Two baseline models, SGNS and GloVe, are considered. In other words, v(w_i) is the embedded word of w_i using one of these methods. Our objective is to maximize the conditional probability

p(w_i \mid w_{i-c}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+c}),    (3.2)

where c is the context window size. As compared to other language models that use tokens from the past only [121], we consider the two-sided context as shown in Eq. (3.2) since both sides are about equally important to the center word distribution in language modeling.

The linear language model can be written as

\tilde{v}(w_i) = \sum_{i-c \le j \le i+c,\, j \ne i} b_j\, \tilde{v}(w_j) + \epsilon,    (3.3)

where ṽ(w) is the word embedding representation after the latent variable transform to be discussed in Sec. 3.2.2.2. The term ε is used to represent the information loss that cannot be well modeled by the linear model. We treat ε as a negligible term and Eq. (3.3) as a linear approximation of the original language model.

3.2.2.2 Dynamic Latent Variable Extraction

We apply the dynamic latent variable technique to the word embedding problem in this subsection. To begin with, we define an objective function to extract the dynamic latent variables. The word sequence data is denoted by

W = [w_1, w_2, \ldots, w_n],    (3.4)

and the data matrix W_i derived from W is formed using the chosen context window size, with its word embedding representation V_i derived from W_i:

W_i = [w_{i-c}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+c}] \in \mathbb{R}^{1 \times 2c},    (3.5)

V_i = [v(w_{i-c}), \ldots, v(w_{i-1}), v(w_{i+1}), \ldots, v(w_{i+c})] \in \mathbb{R}^{D \times 2c},    (3.6)-(3.7)

where D is the word embedding dimension. Then, the objective function used to extract the dynamic latent variables can be written as

\max_{A, b} \sum_i \big\langle A^T V_i b,\; A^T v(w_i) \big\rangle,    (3.8)

where A is a matrix of dimension D × k, b ∈ R^{2c} is the weight vector used to form a weighted sum of the context word representations, and k is the selected dynamic dimension.

We interpret A in Eq. (3.8) as a matrix that stores dynamic latent variables. If A contains all learned dynamic principal components of dimension k, A^T V_i is the projection of all context word representations onto the dynamic principal components and A^T v(w_i) is the projection of the center word representation v(w_i). Vector b gives a weighted sum of context representations used for prediction. We seek optimal A and b to maximize the sum, over all positions i, of the inner products between the predicted center word representation A^T V_i b and the i-th center word representation A^T v(w_i). The choice of the inner product rather than other distance measures is to maximize the variance over the extracted dynamic latent variables. For further details, we refer to [55].

3.2.2.3 Optimization

There is no analytical solution to Eq. (3.8) since A and b are coupled [39]. Besides, we need to impose constraints on A and b. That is, the columns of A are orthonormal vectors while ||b|| = 1. The orthogonality constraint on matrix A plays an important role. For example, the orthogonality constraint is introduced in bilingual word embedding for several reasons: the original word embedding space is invariant and self-consistent under an orthogonal transform [12, 14]. Moreover, without such a constraint, the learned dynamic latent variables would have to be extracted iteratively, which is time consuming. Furthermore, the extracted latent variables would tend to be close to each other with a small angle.
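Before describing the optimization procedure, the quantity being maximized in Eq. (3.8) can be made concrete with a small sketch that evaluates the objective for a single center word. The dimensions, the random orthonormal A, and the unit-norm b below are placeholder values that merely satisfy the constraints above; they are not learned parameters.

```python
# Sketch of the dynamic-embedding objective in Eq. (3.8) for a single window.
# Shapes follow the text: A is D x k, b has length 2c, V_i stacks the 2c context vectors.
import numpy as np

rng = np.random.default_rng(1)
D, k, c = 300, 60, 2                            # embedding dim, dynamic dim, half window (illustrative)

A = np.linalg.qr(rng.normal(size=(D, k)))[0]    # columns of A are orthonormal, as required
b = rng.normal(size=2 * c)
b /= np.linalg.norm(b)                          # ||b|| = 1

def window_objective(V_i, v_center, A, b):
    """Inner product between the predicted and the actual center word in the dynamic subspace."""
    predicted = A.T @ V_i @ b                   # A^T V_i b, shape (k,)
    target = A.T @ v_center                     # A^T v(w_i), shape (k,)
    return float(predicted @ target)

V_i = rng.normal(size=(D, 2 * c))               # embeddings of the 2c ordered context words
v_center = rng.normal(size=D)
print(window_objective(V_i, v_center, A, b))
```

Training amounts to adjusting A and b so that this quantity, summed over all windows in the corpus, is maximized subject to the orthonormality and unit-norm constraints.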
We adopt the optimization procedure in Algorithm 2 to solve the optimization problem. Note that the parameter β is used to control the orthogonality-wise convergence rate.

Algorithm 2: Optimization for extracting dynamic latent variables
Initialize A and b randomly as learnable parameters.
for training batches W_i, i = 1, ..., m do
    Find the corresponding word embeddings V_i.
    Predict the center word: v̂(w_i) = A^T V_i b.
    Extract negative samples based on word frequency.
    Update A and b by gradient descent to optimize Eq. (3.9).
    b := b / ||b||.
    A := (1 + β) A − β A A^T A.
end for

We maximize the inner product over all tokens as shown in Eq. (3.8) in theory, yet we adopt negative sampling for the parameter update in practice to save computation. The objective function can be rewritten as max_{A,b} L(A, b), where

L(A, b) = \log \sigma\big( (A^T V_i b)^T A^T v(w_i) \big) + \sum_{n=1}^{N} \mathbb{E}_{w_n \sim P_n(w)} \big[ \log \sigma\big( -(A^T V_i b)^T A^T v(w_n) \big) \big],    (3.9)

and where σ(x) = 1/(1 + exp(−x)), N is the number of negative samples used per positive training sample, and negative samples v(w_n) are sampled based on their overall frequency distribution.

The final word embedding vector is a concatenation of two parts: 1) the dynamic dimensions obtained by projecting v(w_i) onto the learned dynamic subspace A in the form of A^T v(w_i), and 2) the static dimensions obtained from static PCA dimension reduction in the form of P(v(w_i)).

3.2.3 Experimental Results and Analysis

3.2.3.1 Baselines and Hyper-parameter Settings

We conduct the PVN on top of several baseline models. For all SGNS-related experiments, the wiki2010 corpus1 (around 6G) is used for training. The vocabulary set contains 830k words; that is, words that occur more than 10 times are included. For CBOW, GloVe and Dict2vec, we adopt the officially released code and train on the same dataset as SGNS. Here we set d = 11 across all experiments.

For the PDE method, we obtain training pairs from the wiki2010 corpus to maintain consistency with the SGNS model. The vocabulary size is 800k. Words with low frequency are assigned to the same word vector for simplicity.

Name           Pairs   Year
WS-353         353     2002
WS-353-SIM     203     2009
WS-353-REL     252     2009
Rare-Word      2034    2013
MEN            3000    2012
MTurk-287      287     2011
MTurk-771      771     2012
SimLex-999     999     2014
Verb-143       143     2014
SimVerb-3500   3500    2016
Table 3.1: Word similarity datasets used in our experiments, where Pairs indicates the number of word pairs in each dataset.

3.2.3.2 Datasets

We consider two popular intrinsic evaluation benchmarks in our evaluation: 1) word similarity and 2) word analogy. A detailed introduction can be found in [168]. Our proposed post-processing methods work well with both evaluation methods.

Word Similarity

Word similarity evaluation is widely used in evaluating word embedding quality. It focuses on the semantic meaning of words. Here, we use the cosine distance measure and Spearman's rank correlation coefficient (SRCC) [149] to measure the distance and to evaluate the similarity between our results and human scores, respectively. We conduct tests on 10 popular datasets (see Table 3.1) to avoid evaluation occasionality. For more information on each dataset, we refer to the website.2

Word Analogy

Due to the limitation of performance comparison in terms of word similarity [61], performance comparison in word analogy is adopted as a complementary tool to evaluate the quality of word embedding methods. Both addition and multiplication operations are implemented to predict word d here. In the PDE method, we report the commonly used addition operation for simplicity.

We choose two major datasets for word analogy evaluation. They are: 1) the Google dataset [108] and 2) the MSR dataset [110]. The Google dataset contains 19,544 questions.

1 http://nlp.stanford.edu/data/WestburyLab.wikicorp.201004.txt.bz2
They belong totwomajorcategories: semanticandmorpho-syntactic,eachofwhichcontains8,869and10,675 questions, respectively. We also report the results conducted on two Google subsets. The MSR dataset contains 8,000 analogy questions. Out-of-vocabulary words were removed from both datasets.3 ExtrinsicEvaluation For PVN method, we conduct further experiments over extrinsic evaluation tasks including sentiment analysis and neural machine translation (NMT). For both tasks, Bi-directional LSTM is utilized as the inference tool. Two sentiment analysis dataset is utilized: Internet Movie Database (IMDb)andSentimentTreebankdataset(SST).Europarlv8datasetforEnglish-Frenchtranslation is utilized in our neural machine translation task. We report accuracy for IMDb and SST dataset andvalidationaccuracyforNMT. 37 Type SGNS PPA PVN(ours) WS-353 65.7 67.6 68.1 WS-353-SIM 73.2 73.8 73.9 WS-353-REL 58.1 59.4 60.7 Rare-Word 39.5 42.4 42.9 MEN 70.2 72.5 73.2 MTurk-287 62.8 64.7 66.4 MTurk-771 64.6 66.2 66.8 SimLex-999 41.6 42.6 42.8 Verb-143 35.0 38.9 39.5 SimVerb-3500 26.5 28.1 28.5 Average 47.8 49.8 50.3 Table 3.2: The SRCC performance comparison (100) for SGNS alone, SGNS+PPA and SGNS+PVN against word similarity datasets, where the last row is the average performance weightedbythepairnumberofeachdataset. Type: SGNS PPA PVN(ours) Google Add 59.6 61.3 62.1 Mul 61.2 60.3 61.9 Semantic Add 57.8 62.4 62.4 Mul 59.3 59.5 60.9 Syntactic Add 61.1 60.5 61.8 Mul 62.7 61.0 62.7 MSR Add 51.0 53.0 53.4 Mul 53.3 53.3 54.9 Table 3.3: The SRCC performance comparison (100) for SGNS alone, SGNS+PPA and SGNS+PVNagainstwordanalogydatasets. 3.2.3.3 PerformanceEvaluationofPVN The performance of the PVN as a post-processing tool for the SGNS baseline methods is given in Table 3.3. It also shows results of the baselines and the baseline+PPA [112] for performance bench-marking. Table3.2comparestheSRCCscoresoftheSGNSalone,theSGNS+PPAandtheSGNS+PVN againstwordsimilaritydatasets. WeseethattheSGNS+PVNperformsbetterthantheSGNS+PPA. We observe the largest performance gain of the SGNS+PVN reaches 5.2% in the average SRCC 2http://www.wordvectors.org/ 3Out-of-vocabularywordsarethoseappear less than 10 times in the wiki2010 dataset. 38 Baselines IMDb SST NMT SGNS 80.92/86.03 66.00/66.76 50.50/50.62 CBOW 85.20/85.81 67.12/66.94 49.78/49.97 GloVe 83.51/84.88 64.53/67.74 50.31/50.58 Dict2vec 80.62/84.40 65.06/66.89 50.45/50.56 Table 3.4: Extrinsic Evaluation for SGNS alone and SGNS+PVN. The first value is from orignal modelwhilesecondvalueisfromourpost-processingembeddingmodel. scores. It is also interesting to point out that the PVN is more robust than the PPA with different settingsin3. Table 3.3 compares the SRCC scores of SGNS, SGNS+PPA and SGNS+PVN against word analogy datasets. We use addition as well as multiplication evaluation methods. PVN performs betterthanPPAinboth. Forthemultiplicationevaluation,theperformanceofPPAisworsethanthe baseline. In contrast, the proposed PVN method has no negative effect as it performs consistently well. This can be explained below. When the multiplication evaluation is adopted, the missing dimensions of the PPA influence the relative angles of vectors a lot. This is further verified by the fact that some linguistic properties are captured by these high-variance dimensions and their total eliminationissub-optimal. Table 3.4 indicates extrinsic evaluation results. We can see that our PVN post-processing methodperformsmuchbettercomparedwiththeoriginalresultinvariousdownstreamtasks. 
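For reference, the SRCC word-similarity scores reported in Table 3.2 and in the later similarity tables follow the standard protocol: the cosine similarity of each embedded word pair is compared against human ratings with Spearman's rank correlation. A generic sketch of this protocol is given below; the toy embedding table and word pairs are placeholders rather than an actual benchmark file.

```python
# Generic sketch of the word-similarity evaluation behind the SRCC scores:
# cosine similarity of embedded word pairs vs. human ratings, via Spearman's rho.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def srcc(embeddings, pairs):
    """pairs: list of (word1, word2, human_score); out-of-vocabulary pairs are skipped."""
    model_scores, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    return spearmanr(model_scores, human_scores).correlation

emb = {w: np.random.rand(300) for w in ["car", "automobile", "bird", "stone"]}
pairs = [("car", "automobile", 9.0), ("bird", "stone", 1.5), ("car", "bird", 3.0)]
print(srcc(emb, pairs))
```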
Type SGNS PDE(ours) WS-353 65.7 65.9 WS-353-SIM 73.2 73.6 WS-353-REL 58.1 59.3 Rare-Word 39.5 38.6 Google 59.6 60.8 Semantic 57.8 59.6 Syntactic 61.1 61.8 MSR 51.0 51.6 Table 3.5: The SRCC performance comparison (100) for SGNS alone and SGNS + PDE against wordsimilarityandanalogydatasets. 39 3.2.3.4 PerformanceEvaluationofPDE WeadoptthesamesettinginevaluatingthePDEmethodsuchasthewindowsize,vocabularysize, numberofnegativesamples,trainingdata,etc. whenitisappliedtoSGNSbaseline. Thefinalword representationiscomposedbytwoparts: ˜ E¹Fº=»E B ¹Fº ) E 3 ¹Fº ) ¼ ) ,whereE B ¹Fº isthestaticpart obtained from dimension reduction using PCA andE 3 ¹Fº = A ) E¹Fº is the projection ofE¹Fº to dynamicsubspaceA. Here,wesetthedimensionsofE B ¹Fº andE 3 ¹Fº to240and60,respectively. The SRCC performance comparison of SGNS alone and SGNS+PDE against the word similarity and analogy datasets are shown in Table 3.5. By adding the ordered information via PDE, we see thatthequalityofwordrepresentationsisimprovedinbothevaluationtasks. Type SGNS PVN+PDE(ours) WS-353 65.7 69.0 WS-353-SIM 73.2 75.3 WS-353-REL 58.1 61.9 Verb-143 35.0 44.1 Rare-Word 39.5 42.5 Google 59.6 62.8 Semantic 57.8 62.8 Syntactic 61.1 62.8 MSR 51.0 53.7 Table 3.6: The SRCC performance comparison (100) for SGNS alone and SGNS + PVN / PDE modelagainstwordsimilarityandanalogydatasets. 3.2.3.5 PerformanceEvaluationofIntegratedPVN/PDE We can integrate PVN and PDE together to improve their individual performance. Since the PVN provides a better word embedding, it can help PDE learn better. Furthermore, normalizing variances for dominant principal components is beneficial since they occupy too much energy and mask the contributions of remaining components. On the other hand, components with very low variances may contain much noise. They should be removed or replaced while the PDE can be usedtoreplacethenoisycomponents. 40 The SRCC performances of the baseline SGNS method and the SGNS+PVN/PDE method for the word similarity and the word analogy tasks are listed in Table 3.6. Better results are obtained across all datasets. The improvement over the Verb-143 dataset has a high ranking among all datasets with either joint PVN/PDE or PDE alone. This matches our expectation since the context orderhasmorecontributionoververbs. 3.3 ExperimentalStudy on Word Embedding Evaluators Besides the study on post-processing techniques on word embedding, Extensive evaluation on a largenumberofwordembeddingmodelsforlanguageprocessingapplicationsisconductedinthis work. We report experimental results of intrinsic and extrinsic evaluators on six word embedding models. It is shown that different evaluators focus on different aspects of word models, and some are more correlated with natural language processing tasks. Finally, we adopt correlation analysis tostudyperformanceconsistencyofextrinsicandintrinsicevalutors. 3.3.1 ExperimentalResultsofIntrinsicEvaluators We conduct extensive evaluation experiments on six word embedding models with intrinsic eval- uators in this section. The performance metrics of consideration include: 1) word similarity, 2) wordanalogy,3)conceptcategorization,4)outlierdetectionand5)QVEC. 3.3.1.1 ExperimentalSetup We select six word embedding models in the experiments. They are SGNS, CBOW, GloVe, FastText, ngram2vec and Dict2vec. For consistency, we perform training on the same corpus – wiki20104. Itisadatasetofmediumsize(around6G)withoutXMLtags. Afterpreprocessing,all special symbolsare removed. 
By choosing amiddle-sized trainingdataset, we attemptto keepthe generality of real world situations. Some models may perform better when being trained on larger 4http://nlp.stanford.edu/data/WestburyLab.wikicorp.201004.txt.bz2 41 datasets while others are less dataset dependent. Here, the same training dataset is used to fit a moregeneralsituationforfaircomparisonamongdifferentwordembeddingmodels. Forallembeddingmodels,weusedtheirofficialreleasedtoolkitanddefaultsettingfortraining. ForSGNSandCBOW,weusedthedefaultsettingprovidedbytheofficialreleasedtoolkit5. GloVe toolkit is available from their official website6. For FastText, we used their codes7. Since FastText usessub-wordasbasicunits,itcandealwiththeout-of-vocabulary(OOV)problemwell,whichis one of the main advantages of FastText. Here, to compare the word vector quality only, we set the vocabularysetforFastTexttobethesameasothermodels. Forngram2vecmodel8,becauseitcan betrainedovermultiplebaselines,wechosethebestmodelreportedintheiroriginalpaper. Finally, codesforDict2veccanbeobtainedfromwebsite9. Thetrainingtimeforallmodelsareacceptable (within several hours) using a modern computer. The threshold for vocabulary is set to 10 for all models. Itmeans,forwordswithfrequencylowerthan10,theyareassignedwiththesamevectors. 3.3.1.2 ExperimentalResults WordSimilarity We choose 13 datasets for word similarity evaluation. They are listed in Table 3.7. The information ofeach datasetis provided. Amongthe 13datasets, WS-353, WS-353-SIM,WS-353- REL,Rare-Wordaremorepopularonesbecauseoftheirhighqualityofwordpairs. TheRare-Word (RW)datasetcanbeusedtotestmodel’sabilitytolearnwordswithlowfrequency. Theevaluation result is shown in Table 3.8. We see that SGNS-based models perform better generally. Note that ngram2vec is an improvement over the SGNS model, and its performance is the best. Also, The Dict2vec model provides the best result against the RW dataset. This could be attributed to that Dict2vec is fine-tuned word vectors based on dictionaries. Since infrequent words are treated 5https://code.google.com/archive/p/word2vec/ 6https://nlp.stanford.edu/projects/glove/ 7https://github.com/facebookresearch/fastText 8https://github.com/zhezhaoa/ngram2vec 9https://github.com/tca19/dict2vec 42 Name Pairs Year WS-353 [63] 353 2002 WS-353-SIM[2] 203 2009 WS-353-REL[2] 252 2009 MC-30[111] 30 1991 RG-65[134] 65 1965 Rare-Word(RW)[101] 2034 2013 MEN[35] 3000 2012 MTurk-287[126] 287 2011 MTurk-771[68] 771 2012 YP-130[162] 130 2006 SimLex-999[75] 999 2014 Verb-143[19] 143 2014 SimVerb-3500[65] 3500 2016 Table 3.7: Word similarity datasets used in our experiments where pairs indicate the number of wordpairsineachdataset. WordSimilarity Datasets WS WS-SIM WS-REL MC RG RW MEN Mturk287 Mturk771 YP SimLex Verb SimVerb SGNS 71.6 78.7 62.8 81.1 79.3 46.6 76.1 67.3 67.8 53.6 39.8 45.6 28.9 CBOW 64.3 74.0 53.4 74.7 81.3 43.3 72.4 67.4 63.6 41.6 37.2 40.9 24.5 GloVe 59.7 66.8 55.9 74.2 75.1 32.5 68.5 61.9 63.0 53.4 32.4 36.7 17.2 FastText 64.8 72.1 56.4 76.3 77.3 46.6 73.0 63.0 63.0 49.0 35.2 35.0 21.9 ngram2vec 74.2 81.5 67.8 85.7 79.5 45.0 75.1 66.5 66.5 56.4 42.5 47.8 32.1 Dict2vec 69.4 72.8 57.3 80.5 85.7 49.9 73.3 60.0 65.5 59.6 41.7 18.9 41.7 Table 3.8: Performance comparison (100) of six word embedding baseline models against 13 wordsimilaritydatasets. equally with others in dictionaries, the Dict2vec model is able to give better representation over rarewords. WordAnalogy Twodatasetsareadoptedforthewordanalogyevaluationtask. 
Theyare: 1)theGoogledataset [108] and 2) the MSR dataset [110]. The Google dataset contains 19,544 questions. They are divided into “semantic" and “morpho-syntactic" two categories, each of which contains 8,869 and 10,675 questions, respectively. Results for these two subsets are also reported. The MSR dataset contains 8,000 analogy questions. Both 3CosAdd and 3CosMul inference methods are implemented. WeshowthewordanalogyevaluationresultsinTable3.9. SGNSperformsthebest. One word set for the analogy task has four words. Since ngram2vec considers n-gram models, the 43 WordAnalogyDatasets Google Semantic Syntactic MSR Add Mul Add Mul Add Mul Add Mul SGNS 71.8 73.4 77.6 78.1 67.1 69.5 56.7 59.7 CBOW 70.7 70.8 74.4 74.1 67.6 68.1 56.2 56.8 GloVe 68.4 68.7 76.1 75.9 61.9 62.7 50.3 51.6 FastText 40.5 45.1 19.1 24.8 58.3 61.9 48.6 52.2 ngram2vec 70.1 71.3 75.7 75.7 65.3 67.6 53.8 56.6 Dict2vec 48.5 50.5 45.1 47.4 51.4 53.1 36.5 38.9 Table 3.9: Performance comparison (100) of six word embedding baseline models against word analogydatasets. relationshipwithinwordsetsmaynotbeproperlycaptured. Dictionariesdonothavesuchwordsets and, thus, word analogy is not well-represented in the word vectors of Dict2vec. Finally, FastText usessub-words,itssyntacticresultismuchbetterthanitssemanticresult. ConceptCategorization ConceptCategorizationDatasets AP BLESS BM SGNS 68.2 81.0 46.6 CBOW 65.7 74.0 45.1 GloVe 61.4 82.0 43.6 FastText 59.0 73.0 41.9 ngram2vec 63.2 80.5 45.9 Dict2vec 66.7 82.0 46.5 Table3.10: Performancecomparison(100)ofsixwordembeddingbaselinemodelsagainstthree conceptcategorizationdatasets. Threedatasetsareusedinconceptcategorizationevaluation. Theyare: 1)theAPdataset[10], 2) the BLESS dataset [24] and 3) the BM dataset [25]. The AP dataset contains 402 words that are divided into 21 categories. The BM dataset is a larger one with 5321 words divided into 56 categories. Finally, the BLESS dataset consists of 200 words divided into 27 semantic classes. The results are showed in Table 3.10. We see that the SGNS-based models (including SGNS, ngram2vecandDict2vec)performbetterthanothersonallthreedatasets. Outlier Detection 44 OutlierDetectionDatasets WordSim-500 8-8-8 Accuracy OPP Accuracy OPP SGNS 11.25 83.66 57.81 84.96 CBOW 14.02 85.33 56.25 84.38 GloVe 15.09 85.74 50.0 84.77 FastText 10.68 82.16 57.81 84.38 ngram2vec 10.64 82.83 59.38 86.52 Dict2vec 11.03 82.5 60.94 86.52 Table 3.11: Performance comparison of six word embedding baseline models against outlier detectiondatasets. We adopt two datasets for the outlier detection task: 1) the WordSim-500 dataset and 2) the 8-8-8 dataset. The WordSim-500 consists of 500 clusters, where each cluster is represented by a set of 8 words with 5 to 7 outliers [29]. The 8-8-8 dataset has 8 clusters, where each cluster is representedbyasetof8wordswith8outliers[36]. BothAccuracyandOutlierPositionPercentage (OPP)arecalculated. TheresultsareshowninTable3.11. Theyarenotconsistentwitheachother for the two datasets. For example, GloVe has the best performance on the WordSim-500 dataset butitsaccuracyonthe8-8-8datasetistheworst. Thiscouldbeexplainedbythepropertiesofthese twodatasets. WewillconductcorrelationstudyinSec. 3.3.3toshedlightonthisphenomenon. QVEC QVEC QVEC SGNS 50.62 FastText 49.20 CBOW 50.61 ngram2vec 50.83 GloVe 46.81 Dict2vec 48.29 Table3.12: QVECperformancecomparison(100)ofsixwordembeddingbaselinemodels. We use the QVEC toolkit10 and report the sentiment content evaluation result in Table 3.12. 
Among six word models, ngram2vec achieves the best result while SGNS ranks the second. This ismoreconsistentwithotherintrinsicevaluationresultsdescribedabove. 10https://github.com/ytsvetko/qvec 45 3.3.2 ExperimentalResultsofExtrinsicEvaluators 3.3.2.1 DatasetsandExperimental Setup POSTagging,ChunkingandNamedEntityRecognition By following [45], three downstream tasks for sequential labeling are selected in our exper- iments. The Penn Treebank (PTB) dataset [104], the chunking of CoNLL’00 share task dataset [157]andtheNERofCoNLL’03sharedtaskdataset[158]areusedforthepart-Of-speechtagging, chunking and named-entity recognition, respectively. We adopt standard splitting ratios and eval- uation criteria for all three datasets. The details for datasets splitting and evaluation criteria are showninTable3.13. Name Train(#Tokens) Test(#Tokens) Criteria PTB 337,195 129,892 accuracy CoNLL’00 211,727 47,377 F-score CoNLL’03 203,621 46,435 F-score Table3.13: DatasetsforPOStagging,ChunkingandNER. Forinferencetools,weusethesimplewindow-basedfeed-forwardneuralnetworkarchitecture implementedby[42]. Ittakesinputsoffiveatonetimeandpassesthemthrougha300-unithidden layer, a tanh activation function and a softmax layer before generating the result. We train each modelfor10epochsusingtheAdam optimizationwithabatchsizeof50. SentimentAnalysis We choose two sentiment analysis datasets for evaluation: 1) the Internet Movie Database (IMDb) [102] and 2) the Stanford Sentiment Treebank dataset (SST) [148]. IMDb contains a collection of movie review documents with polarized classes (positive and negative). For SST, we split data into three classes: positive, neutral and negative. Their document formats are different: IMDb consists several sentences while SST contains only single sentence per label. The detailed informationforeachdatasetisgiveninTable3.14. To cover most sentimental analysis inference tools, we test the task using Bi-LSTM and CNN. We choose 2-layer Bi-LSTM with 256 hidden dimensions. The adopted CNN has 3 layers with 46 Classes Train Validation Test SST 3 8544 1101 2210 IMDb 2 17500 7500 25000 Table3.14: Sentimentanalysisdatasets. 100 filters per layer of size [3, 4, 5], respectively. Particularly, the embedding layer for all models are fixed during training. All models are trained for 5 epochs using the Adam optimization with 0.0001learningrate. NeuralMachineTranslation Ascomparedwithsentimentanalysis,neuralmachinetranslation(NMT)isamorechallenging task since it demands a larger network and more training data. We use the same encoder-decoder architecture as that in [82]. The Europarl v8 [83] dataset is used as training corpora. The task is English-French translation. For French word embedding, a pre-trained FastText word embedding model11 is utilized. As to the hyper-parameter setting, we use a single layer bidirectional-LSTM of 500 dimensions for both the encoder and the decoder. Both embedding layers for the encoder and the decoder are fixed during the training process. The batch size is 30 and the total training iterationis100,000. 3.3.2.2 ExperimentalResultsandDiscussion Experimental results of the above-mentioned five extrinsic evaluators are shown in Table 3.15. Generally speaking, both SGNS and ngram2vec perform well in POS tagging, chunking and NER tasks. Actually, the performance differences of all evaluators are small in these three tasks. As to thesentimentalanalysis,theirisnoobviouswinnerwiththeCNNinferencetool. 
Theperformance gaps become larger using the Bi-LSTM inference tool, and we see that Dict2vec and FastText performtheworst. Basedontheseresults,weobservethatthereexisttwodifferentfactorsaffecting the sentiment analysis results: datasets and inference tools. For different datasets with the same inferencetool,theperformancecanbedifferentbecauseofdifferentlinguisticpropertiesofdatasets. Ontheotherhand,differentinferencetoolsmayfavordifferentembeddingmodelsagainstthesame 11https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md 47 dataset since inference tools extract the information from word models in their own manner. For example, Bi-LSTM focuses on long range dependency while CNN treats each token more or less equally. POS Chunking NER SA(IMDb) SA(SST) NMT Bi-LSTM CNN Bi-LSTM CNN Perplexity SGNS 94.54 88.21 87.12 85.36 88.78 64.08 66.93 79.14 CBOW 93.79 84.91 83.83 86.93 85.88 65.63 65.06 102.33 GloVe 93.32 84.11 85.3 70.41 87.56 65.16 65.15 84.20 FastText 94.36 87.96 87.10 73.97 83.69 50.01 63.25 82.60 ngram2vec 94.11 88.74 87.33 79.32 89.29 66.27 66.45 77.79 Dict2vec 93.61 86.54 86.82 62.71 88.94 62.75 66.09 78.84 Table3.15: Extrinsicevaluationresults. Perplexity is used to evaluate the NMT task. It indicates variability of a prediction model. Lower perplexity corresponds to lower entropy and, thus, better performance. We separate 20,000 sentencesfromthesamecorporatogeneratetestingdataandreporttestingperplexityfortheNMT task in Table 3.15. As shown in the table, ngram2vec, Dict2vec and SGNS are the top three word modelsfortheNMTtask,whichisconsistentwiththewordsimilarityevaluationresults. We conclude from Table 3.15 that SGNS-based models including SGNS, ngram2vec and dict2vec tend to work better than other models. However, one drawback of ngram2vec is that it takes more time in processing n-gram data for training. GloVe and FastText are popular in the researchcommunitysincetheirpre-trainedmodelsareeasytodownload. Wealsocomparedresults using pre-trained GloVe and FastText models. Although they are both trained on larger datasets andproperlyfind-tuned,theydonotprovidebetterresultsinourevaluationtasks. 3.3.3 ConsistencyStudyviaCorrelationAnalysis WeconductconsistencystudyofextrinsicandintrinsicevaluatorsusingthePearsoncorrelation(d) analysis [27]. Besides the six word models described above, we add two more pre-trained models of GloVe and FastText to make the total model number eight. Furthermore, we apply the variance 48 normalization technique [166] to the eight models to yield eight more models. Consequently, we haveacollectionofsixteenwordmodels. Fig. 3.1 shows the Pearson correlation of each intrinsic and extrinsic evaluation pair of these sixteenmodels. Forexample,theentryofthefirstrowandthefirstcolumnisthePearsoncorrelation value of WS-353 (an intrinsic evaluator) and POS (an extrinsic evaluator) of sixteen word models (i.e. 16 evaluation data pairs). Note also that we add a negative sign to the correlation value of NMTperplexitysincelowerperplexityisbetter. Figure3.1: Pearson’scorrelationbetweenintrinsicandextrinsicevaluator,wherethex-axisshows extrinsicevaluatorswhilethey-axisindicatesintrinsicevaluators. Thewarmindicatesthepositive correlationwhilethecoolcolorindicatesthenegativecorrelation. 3.3.3.1 ConsistencyofIntrinsicEvaluators • Word Similarity 49 Allembeddingmodelsaretestedover13evaluationdatasetsandtheresultsareshowninthe top13rows. Weseefromthecorrelationresultthatlargerdatasetstendtogivemorereliable andconsistentevaluationresult. 
Amongalldatasets,WS-353,WS-353-SIM,WS-353-REL, MTrurk-771,SimLex-999andSimVerb-3500arerecommendedtoserveasgenericevaluation datasets. Although datasets like MC-30 and RG-65 also provide us with reasonable results, their correlation results are not as consistent as others. This may be attributed to the limited amount of testing samples with only dozens of testing word pairs. The Rare-Word (RW) dataset is a special one that focuses on low-frequency words and gains popularity recently. Yet,basedonthecorrelationstudy,theRWdatasetisnotaseffectiveasexpected. Infrequent wordsmaynotplayanimportantroleinallextrinsicevaluationtasks. Thisiswhyinfrequent words are often set to the same vector. The Rare-Word dataset can be excluded for general purposeevaluationunlessthereisaspecificapplicationdemandingrarewordsmodeling. • Word Analogy Thewordanalogyresultsareshownfromthe14throwtothe21strowinthefigure. Among four word analogy datasets (i.e. Google, Google Semantic, Google Syntactic and MSR), GoogleandGoogleSemanticaremoreeffective. Itdoesnotmakemuchdifferenceinthefinal correlationstudyusingeitherthe3CosAddorthe3CosMulcomputation. GoogleSyntacticis noteffectivesincethemorphologyofwordsdoesnotcontainasmuchinformationassemantic meanings. Thus, although the FastText model performs well in morphology testing based on the average of sub-words, it correlation analysis is worse than other models. In general, word analogy provides most reliable correlation results and has the highest correlation with thesentimentanalysistask. • ConceptCategorization All three datasets (i.e., AP, BLESS and BM) for concept categorization perform well. By categorizingwordsintodifferentgroups,conceptcategorizationfocusesonsemanticclusters. 50 It appears that models that are good at dividing words into semantic collections are more effectiveindownstreamNLPtasks. • OutlierDetection Twodatasets(i.e.,WordSim-500and8-8-8)areusedforoutlierdetection. Ingeneral,outlier detection is not a good evaluation method. Although it tests semantic clusters to some extent, outlier detection is less direct as compared to concept categorization. Also, from the dataset point of view, the size of the 8-8-8 dataset is too small while the WordSim-500 dataset contains too many infrequent words in the clusters. This explains why the accuracy for WordSim-500is low(around 10-20%). When thereare largerand more reliabledatasets available,weexpecttheoutlierdetectiontasktohavebetterperformanceinwordembedding evaluation. • QVEC QVEC is not a good evaluator due to its inherit properties. It attempts to compute the correlation with lexicon-resource based word vectors. Yet, the quality of lexicon-resource based word vectors is to poor to provide a reliable rule. If we can find a more reliable rule, theQVECevaluatorwillperformbetter. Basedontheabovediscussion,weconcludethatwordsimilarity,wordanalogyandconceptcat- egorizationaremoreeffectiveintrinsicevaluators. Differentdatasetsleadtodifferentperformance. In general, larger datasets tend to give better and more reliable results. Intrinsic evaluators may performverydifferentlyfordifferentdownstreamtasks. Thus,whenwetestanewwordembedding model,allthreeintrinsicevaluatorsshouldbeusedandconsideredjointly. 3.3.3.2 ConsistencyofExtrinsicEvaluators For POS tagging, chunking and NER, none of intrinsic evaluators provide high correlation. Their performance depend on their capability in sequential information extraction. Thus, word meaning 51 plays a subsidiary role in all these tasks. 
Sentiment analysis is a dimensionality reduction proce- dure. It focuses more on combination of word meaning. Thus, it has stronger correlation with the properties that the word analogy evaluator is testing. Finally, NMT is sentence-to-sentence conversion, and the mapping between word pairs is more helpful in translation tasks. Thus, the word similarity evaluator has a stronger correlation with the NMT task. We should also point out that some unsupervised machine translation tasks focus on word pairs [13, 15]. This shows the significanceofwordpaircorrespondenceinNMT. 3.4 Conclusion Two post-processing techniques, PVN and PDE, were proposed to improve the quality of baseline wordembeddingmethodsinthiswork. Thetwotechniquescanworkindependentlyorjointly. The effectivenessofthesetechniqueswasdemonstratedbybothintrinsicandextrinsicevaluationtasks. We would like to study the PVN method by exploiting the correlation of dimensions, and applyingittodimensionalityreductionofwordrepresentationsinthenearfuture. Furthermore,we wouldliketoapplythedynamicembeddingtechniquetobothgenericand/ordomain-specificword embeddingmethodswithalimitedamountofdata. Itisalsodesiredtoconsideritsapplicabilityto non-linearlanguagemodels. We also provided in-depth discussion of intrinsic and extrinsic evaluations on many word em- bedding models, showed extensive experimental results and explained the observed phenomena. Our study offers a valuable guidance in selecting suitable evaluation methods for different appli- cation tasks. There are many factors affecting word embedding quality. Furthermore, there are still no perfect evaluation methods testing the word subspace for linguistic relationships because it is difficult to understand exactly how the embedding spaces encode linguistic relations. For this reason, we expect more work to be done in developing better metrics for evaluation on the overall quality of a word model. Such metrics must be computationally efficient while having a high 52 correlationwithextrinsicevaluationtestscores. Thecruxofthisproblemliesindecodinghowthe wordsubspaceencodeslinguisticrelationsandthequalityoftheserelations. Wewouldliketopointoutthatlinguisticrelationsandpropertiescapturedbywordembedding modelsaredifferentfromhowhumanslearnlanguages. Forhumans,alanguageencompassesmany differentavenuese.g.,asenseofreasoning,culturaldifferences,contextualimplicationsandmany others. Thus, a language is filled with subjective complications that interfere with objective goals of models. In contrast, word embedding models perform well in specific applied tasks. They have triumphedovertheworkoflinguistsincreatingtaxonomicstructuresandothermanuallygenerated representations. Yet,differentdatasetsanddifferentmodelsareusedfordifferentspecifictasks. Wedonotseeawordembeddingmodelthatconsistentlyperformswellinalltasks. Thedesign ofamoreuniversalwordembeddingmodelischallenging. Togeneratewordmodelsthataregood at solving specific tasks, task-specific data can be fed into a model for training. Feeding a large amount of generic data can be inefficient and even hurt the performance of a word model since different task-specific data can lead to contending results. It is still not clear what is the proper balancebetweenthetwodesignmethodologies. 53 Chapter4 SentenceEmbeddingbySemantic Subspace Analysis 4.1 Introduction In this section, we proposed two methods to solve the sentence embedding problem. The first one is built upon static word embedding model. 
The another one is to find sentence representation throughdissectingdeepcontextualizedmodels. Thewordembeddingtechniqueiswidelyusedinnaturallanguageprocessing(NLP)tasks. For example,itimprovesdownstreamtaskssuchasmachinetranslation[88],syntacticparsing[56],and textclassification[144]. Yet,manyNLPapplicationsoperateatthesentenceleveloralongerpiece oftexts. Althoughsentenceembeddinghasreceivedalotofattentionrecently,encodingasentence intoafixed-lengthvectortocapturesdifferentlinguisticpropertiesremainstobeachallenge. Static word embedding is a popular learning technique that transfers prior knowledge from a large unlabeled corpus [30, 109, 119]. Most of recent sentence embedding methods are rooted in that static word representations can be embedded with rich syntactic and semantic information. It is desired to extend the word-level embedding to the sentence-level, which contains a longer piece of texts. We have witnessed a breakthrough by replacing the “static" word embedding to the “contextualized" word representation in the last several years, e.g., [52, 121, 124, 178]. A naturalquestiontoaskishowtoexploitcontextualizedwordembeddinginthecontextofsentence embedding. Here,weexaminetheproblemoflearningtheuniversalrepresentationofsentences. A contextualizedwordrepresentation,calledBERT,achievesthestate-of-the-artperformanceinmany 54 natural language processing (NLP) tasks. We aim at developing a sentence embedding solution fromBERT-basedmodelsinthiswork. Asreportedin[78]and[98],differentlayersofBERTlearnsdifferentlevelsofinformationand linguistic properties. While intermediate layers encode the most transferable features, representa- tionfromhigherlayersaremoreexpressiveinhigh-levelsemanticinformation. Thus,information fusion across layers has its potential in providing a stronger representation. Furthermore, by con- ducting experiments on patterns of the isolated word representation across layers in deep models, we observe the following property. Words of richer information in a sentence have higher varia- tion in their representations, while the token representation changes gradually, across layers. This finding helps define “salient" word representations and informative words in computing universal sentenceembedding. AlthoughtheBERT-basedcontextualizedwordembeddingmethodperformswellinNLPtasks [165], it has its own limitations. For example, due to the large model size, it is time consuming to perform sentence pair regression such as clustering and semantic search. The most effective way to solve this problem is through an improved sentence embedding method, which transforms a sentence to a vector that encodes the semantic meaning of the sentence. Currently, a common sentence embedding approach based on BERT-based models is to average the representations obtained from the last layer or using the CLS token for sentence-level prediction. Yet, both are sub-optimal as shown in the experimental section of this paper. To the best of our knowledge, thereisonlyonepaperonsentenceembeddingusingpre-trainedBERT,calledSentence-BERTor SBERT [130]. It leverages further training with high-quality labeled sentence pairs. Apparently, howtoobtainsentenceembeddingfromdeepcontextualizedmodelsisstillanopenproblem. While word embedding is learned using a loss function defined on word pairs, sentence em- bedding demands a loss function defined at the sentence-level. 
Following a path similar to word embedding,unsupervisedlearningofsentenceencoders,e.g.,SkipThought[81]andFastSent[74], build self-supervision from a large unlabeled corpus. Yet, InferSent [47] shows that training on 55 high quality labeled data, e.g., the Stanford Natural Language Inference (SNLI) dataset, can con- sistently outperform unsupervised training objectives. Recently, leveraging training results from multipletaskshasbecomeanewtrendinsentenceembeddingsinceitprovidesbettergeneralization performance. USE [38] incorporates both supervised and unsupervised training objectives on the Transformer architecture. The method in [151] is trained in a multi-tasking manner so as to com- bine inductive biases of diverse training objectives. However, multi-tasking learning for sentence embedding is still under development, and it faces some difficulty in selecting supervised tasks andhandlinginteractionsbetweentasks. Furthermore,supervisedtrainingobjectivesdemandhigh qualitylabeleddatawhichareusuallyexpensive. Beingdrasticallydifferentfromtheabove-mentionedresearch,weinvestigatesentenceembed- dingbystudyingthegeometricstructureofdeepcontextualizedmodelsandproposeanewmethod by dissecting BERT-based word models. It is called the SBERT-WK method. As compared with previoussentenceembeddingmodelsthataretrainedonsentence-levelobjectives,deepcontextual- izedmodelsaretrainedonalargeunlabeledcorpuswithbothword-andsentence-levelobjectives. SBERT-WK inherits the strength of deep contextualized models. It is compatible with most deep contextualizedmodelssuchasBERT[52]andSBERT[130]. 4.1.1 RelatedWork 4.1.1.1 ContextualizedWordEmbedding Traditionalwordembeddingmethodsprovideastaticrepresentationforawordinavocabularyset. Although the static representation is widely adopted in NLP, it has several limitations in modeling the context information. First, it cannot deal with polysemy. Second, it cannot adjust the meaning of a word based on its contexts. To address the shortcomings of static word embedding methods, thereisanewtrendtogofromshallowtodeepcontextualizedrepresentations. Forexample,ELMo [121], GPT1 [124], GPT2 [125] and BERT [52] are pre-trained deep neural language models, and they can be fine-tuned on specific tasks. These new word embedding methods achieve impressive performance on a wide range of NLP tasks. In particular, the BERT-based models are dominating 56 inleaderboardsoflanguageunderstandingtaskssuchasSQuAD2.0[128]andGLUEbenchmarks [165]. ELMoisoneoftheearlierworkinapplyingapre-trainedlanguagemodeltodownstreamtasks [121]. It employs two layer bi-directional LSTM and fuses features from all LSTM outputs using task-specific weights. OpenAI GPT [124] incorporates a fine-tuning process when it is applied to downstream tasks. Tasks-specific parameters are introduced and fine-tuned with all pre-trained parameters. BERT employs the Transformer architecture [164], which is composed by multiple multi-head attention layers. It can be trained more efficiently than LSTM. It is trained on a large unlabeledcorpuswithseveralobjectivestolearnbothword-andsentence-levelinformation,where theobjectivesincludemaskedlanguagemodelingaswellasthenextsentenceprediction. Acouple of variants have been proposed based on BERT. RoBERTa [100] attempts to improve BERT by providingabetterrecipeinBERTmodeltraining. ALBERT[89]targetsatcompressingthemodel size of BERT by introducing two parameter-reduction techniques. At the same time, it achieves better performance. 
XLNET [178] adopts a generalized auto-regressive pre-training method that hasthemeritsofauto-regressiveandauto-encoderlanguagemodels. Because of the superior performance of BERT-based models, it is important to have a better understanding of BERT-based models and the transformer architecture. Efforts have been made alongthisdirectionrecentlyasreviewedbelow. Liuetal. [98]andPetronietal. [122]usedword- level probing tasks to investigate the linguistic properties learned by the contextualized models experimentally. Kovaleva et al. [84] and Michel et al. [107] attempted to understand the self- attentionschemeinBERT-basedmodels. Haoetal. [72]providedinsightsintoBERTbyvisualizing and analyzing the loss landscapes in the fine-tuning process. Ethayarajh [58] explained how the deepcontextualizedmodellearnsthecontextrepresentationofwords. Despitetheabove-mentioned efforts, the evolving pattern of a word representation across layers in BERT-based models has not been studied before. In this work, we first examine the pattern evolution of a token representation across layers without taking its context into account. With the context-independent analysis, we 57 observe that the evolving patterns are highly related to word properties. This observation in turn inspirestheproposalofanewsentenceembeddingmethod–SBERT-WK. 4.1.1.2 UniversalSentenceEmbedding Bysentenceembedding,weaimatextractinganumericalrepresentationforasentencetoencapsu- lateitsmeanings. Thelinguisticfeatureslearnedbyasentenceembeddingmethodcanbeexternal informationresourcesfordownstreamtasks. Sentenceembeddingmethodscanbecategorizedinto two categories: non-parameterized and parameterized models. Non-parameterized methods usu- allyrelyonhighqualitypre-trainedwordembeddingmethods. Thesimplestexampleistoaverage wordembeddingresultsastherepresentationforasentence. Followingthislineofthought,several weighted averaging methods were proposed, including tf-idf, SIF [11], uSIF [59] and GEM [179]. SIFusestherandomwalktomodelthesentencegenerationprocessandderiveswordweightsusing the maximum likelihood estimation (MLE). uSIF extends SIF by introducing an angular-distance- basedrandomwalkmodel. Nohyper-parametertuningisneededinuSIF.Byexploitinggeometric analysis of the space spanned by word embeddings, GEM determines word weights with several hand-craftedmeasurements. Insteadofweightedaveraging,itusesthe?-mean[135]toconcatenate thepowermeansofwordembeddingsandfusesdifferentwordembeddingmodelssoastoshorten theperformancegapbetweennon-parameterizedandparameterizedmodels. Parameterizedmodelsaremorecomplex,andtheyusuallyperformbetterthannon-parameterized models. The skip-thought model [81] extends the unsupervised training of word2vec [109] from thewordleveltothesentencelevel. Itadoptstheencoder-decoderarchitecturetolearnthesentence encoder. InferSent[47]employsbi-directionalLSTMwithsupervisedtraining. Ittrainsthemodel to predict the entailment or contradiction of sentence pairs with the Stanford Natural Language Inference (SNLI) dataset. It achieves better results than methods with unsupervised learning. The USE (Universal Sentence Encoder) method [38] extends the InferSent model by employing the 58 Transformer architecture with unsupervised as well as supervised training objectives. It was ob- served by later studies [151], [136] that training with multiple objectives in sentence embedding canprovidebettergeneralizability. TheSBERTmethod[130]istheonlyparameterizedsentenceembeddingmodelusingBERTas thebackbone. SBERTshareshighsimilaritywithInferSent[47]. 
ItusestheSiamesenetworkontop of the BERT model and fine-tunes it based on high quality sentence inference data (e.g. the SNLI dataset) to learn more sentence-level information. However, unlike supervised tasks, universal sentenceembeddingmethodsingeneraldonothaveaclearobjectivefunctiontooptimize. Instead of training on more sophisticated multi-tasking objectives, we combine the advantage of both parameterized and non-parameterized methods. SBERT-WK is computed by subspace analysis of themanifoldlearnedbytheparameterizedBERT-basedmodels. 4.1.1.3 SubspaceLearningandAnalysis In signal processing and data science, subspace learning and analysis offer powerful tools for multidimensional data processing. Correlated data of a high dimension can be analyzed using latent variable representation methods such as Principal Component Analysis (PCA), Singular Spectrum Analysis (SSA), Independent Component Analysis (ICA) and Canonical Correlation Analysis (CCA). Subspace analysis has solid mathematical foundation. It is used to explain and understandtheinternalstatesofDeepNeuralNetworks[85],[86],[62]. Thegoalofwordorsentenceembeddingistomapwordsorsentencesontoahigh-dimensional space. Thus, subspace analysis is widely adopted in this field, especially for word embedding. Before learning-based word embedding, the factorization-based word embedding methodology is the mainstream, which is used to analyze the co-occurrence statistics of words. To find word representations, Latent Semantic Analysis (LSA) [90] factorizes the co-occurrence matrix using singular value decomposition. Levy and Goldberg [93] pointed out the connection between the word2vec[109]modelandthefactorization-basedmethods. Recently,subspaceanalysisisadopted 59 forinterpretablewordembeddingbecauseofmathematicaltransparency. Subspaceanalysisisalso widelyusedinpost-processingandevaluationofwordembeddingmodels[166],[161],[168]. Because of the success of subspace analysis in word embedding, it is natural to incorporate subspace analysis in sentence embedding as a sentence is composed by a sequence of words. For example, SCDV [106] determines the sentence/document vector by splitting words into clusters and analyzing them accordingly. GEM [179] models the sentence generation process as a Gram- Schmidtprocedureandexpandsthesubspaceformedbywordvectorsgradually. BothDCT[9]and EigenSent [79] map a sentence matrix into the spectral space and model the high-order dynamics ofasentencefromasignalprocessingperspective. Although subspace analysis has already been applied to sentence embedding, all above- mentionedworkwasbuiltupononstaticwordembeddingmethods. Tothebestofourknowledge, ourworkisthefirstonethatexploitssubspaceanalysistofindgenericsentenceembeddingsbased ondeepcontextualizedwordmodels. WewillshowinthisworkthatSBERT-WKcanconsistently outperform state-of-the-art methods with low computational overhead and good interpretability, which is attributed to high transparency and efficiency of subspace analysis and the power of deep contextualizedwordembedding. 4.1.2 ProposedMethodonStaticWordRepresentation Here, we propose a novel non-parameterized sentence embedding method based on semantic subspace analysis. It is called semantic subspace sentence embedding (S3E) (see Fig. 4.1). The S3E method is motivated by the following observation. Semantically similar words tend to form semantic groups in a high-dimensional embedding space. Thus, we can embed a sentence by analyzing semantic subspaces of its constituent words. 
Specifically, we use the intra- and inter-group descriptors to represent words in the same semantic group and to characterize interactions between multiple semantic groups, respectively.

The Vector of Locally Aggregated Descriptors (VLAD) is a well-known algorithm in the image retrieval field. Like the bag-of-words method, VLAD trains a codebook with clustering techniques and concatenates the features within each cluster as the final representation. A recent work, VLAWE (vector of locally-aggregated word embeddings) [77], introduced this idea into document representation. However, the VLAWE method suffers from a high-dimensionality problem, which is not favored by machine learning models. In this work, a novel clustering method is proposed by taking word frequency into consideration. At the same time, the covariance matrix is used to tackle the dimensionality explosion problem of the VLAWE method.

Recently, a novel document distance metric called the Word Mover's Distance (WMD) [87] was proposed and achieved good performance in classification tasks. Based on the fact that semantically similar words have close vector representations, the distance between two sentences is modeled as the minimal "travel" cost of moving the embedded words from one sentence to the other. In other words, WMD models the distance between sentences in the shared word embedding space. It is then natural to consider computing the sentence representation directly from the word embedding space with semantic distance measures.

A few works attempt to obtain sentence/document representations based on the Word Mover's Distance. D2KE (distances to kernels and embeddings) and WME (word mover's embedding) convert the distance measure into positive definite kernels and have better theoretical guarantees. However, both methods rest on the assumption that the Word Mover's Distance is a good standard for sentence representation. In our work, we borrow the "travel" concept of embedded words from WMD and use the covariance matrix to model the interaction between semantic concepts in a discrete way.

This work has three main contributions.
1. The proposed S3E method contains three steps: 1) semantic group construction, 2) intra-group descriptor and 3) inter-group descriptor. The algorithms inside each step are flexible and, as a result, previous work can be easily incorporated.
2. To the best of our knowledge, this is the first work that leverages correlations between semantic groups to provide a sentence descriptor. Previous work using the covariance descriptor [159] yields a super-high embedding dimension (e.g., 45K dimensions). In contrast, the S3E method can choose the embedding dimension flexibly.
3. The effectiveness of the proposed S3E method in textual similarity and supervised tasks is shown experimentally. Its performance is as competitive as that of very complicated parameterized models.

4.1.3 Proposed Method on Contextualized Word Representation

Sentence embedding is an important research topic in natural language processing (NLP) since it can transfer knowledge to downstream tasks. Meanwhile, a contextualized word representation, called BERT, achieves the state-of-the-art performance in quite a few NLP tasks. Yet, it is an open problem to generate a high-quality sentence representation from BERT-based word models. It was shown in previous studies that different layers of BERT capture different linguistic properties. This allows us to fuse information across layers to find a better sentence representation. In this work, we study the layer-wise pattern of the word representation of deep contextualized models.
Then, we propose a new sentence embedding method by dissecting BERT-based word models through geometric analysis of the space spanned by the word representations. It is called the SBERT-WK method. No further training is required in SBERT-WK. We evaluate SBERT-WK on semantic textual similarity and downstream supervised tasks. Furthermore, ten sentence-level probing tasks are presented for detailed linguistic analysis. Experiments show that SBERT-WK achieves the state-of-the-art performance. Our work has the following three main contributions.
1. We study the evolution of isolated word representation patterns across layers in BERT-based models. These patterns are shown to be highly correlated with a word's content. It provides useful insights into deep contextualized word models.
2. We propose a new sentence embedding method, called SBERT-WK, through geometric analysis of the space learned by deep contextualized models.
3. We evaluate the SBERT-WK method against eight downstream tasks and seven semantic textual similarity tasks, and show that it achieves state-of-the-art performance. Furthermore, we use sentence-level probing tasks to shed light on the linguistic properties learned by SBERT-WK.

Figure 4.1: Overview of the proposed S3E method: word probabilities yield word weights, semantic groups (Group 1, ..., Group K) are constructed, and the inter-group descriptor produces the S3E embedding (illustrated with the sentence "I would like to book a flight on May 17th from Los Angeles to Beijing.").

4.2 Sentence Embedding via Semantic Subspace Analysis on Static Word Representations

4.2.1 Methodology

As illustrated in Fig. 4.1, the S3E method contains three steps: 1) constructing semantic groups based on word vectors; 2) using the intra-group descriptor to find the subspace representation; and 3) using correlations between semantic groups to yield the inter-group (covariance) descriptor. They are detailed below.

Semantic Group Construction. Given word $w$ in the vocabulary $V$, its uni-gram probability and vector are represented by $p(w)$ and $v_w \in \mathbb{R}^d$, respectively. We assign weights to words based on $p(w)$:
$$\mathrm{weight}(w) = \frac{\epsilon}{\epsilon + p(w)} \qquad (4.1)$$
where $\epsilon$ is a small pre-selected parameter, which is added to avoid the explosion of the weight when $p(w)$ is too small. Clearly, $0 \le \mathrm{weight}(w) \le 1$. Words are clustered into $K$ groups using the K-means++ algorithm [16], and the weights are incorporated in the clustering process. This is needed since some words of higher frequency (e.g., 'a', 'and', 'the') are less discriminative by nature. They should be assigned lower weights in the semantic group construction process.

Intra-group Descriptor. After constructing semantic groups, we find the centroid of each group by computing the weighted average of the word vectors in that group. That is, for the $i$-th group, $G_i$, we learn its representation $g_i$ by
$$g_i = \frac{1}{|G_i|} \sum_{w \in G_i} \mathrm{weight}(w)\, v_w \qquad (4.2)$$
where $|G_i|$ is the number of words in group $G_i$. For sentence $S = \{w_1, w_2, \dots, w_m\}$, we allocate the words in $S$ to their semantic groups. To obtain the intra-group descriptor, we compute the cumulative residual between the word vectors and their centroid $g_i$ in the same group. Then, the representation of sentence $S$ in the $i$-th semantic group can be written as
$$v_i = \sum_{w \in S \cap G_i} \mathrm{weight}(w)\,(v_w - g_i) \qquad (4.3)$$
If there are $K$ semantic groups in total, we can represent sentence $S$ with the following matrix:
$$\Phi(S) = \begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_K^T \end{bmatrix} \in \mathbb{R}^{K \times d} \qquad (4.4)$$
where $d$ is the dimension of the word embedding.
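To make the group construction and the intra-group descriptor concrete, a minimal sketch is given below. It assumes pre-trained static word vectors and unigram probabilities are available as Python dictionaries; the function and variable names are illustrative and not part of any released implementation. The inter-group covariance step described in the remainder of this subsection operates on the returned matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def s3e_intra_group(sentence_tokens, word_vec, word_prob, K=10, eps=1e-3, seed=0):
    """Sketch of S3E semantic-group construction and intra-group descriptors.

    word_vec  : dict token -> d-dim np.array (e.g., GloVe/FastText vectors)
    word_prob : dict token -> unigram probability p(w)
    Returns the K x d matrix Phi(S) of Eq. (4.4).
    """
    vocab = sorted(word_vec)
    V = np.stack([word_vec[w] for w in vocab])                        # |V| x d
    weight = {w: eps / (eps + word_prob.get(w, 0.0)) for w in vocab}  # Eq. (4.1)
    sw = np.array([weight[w] for w in vocab])

    # Weighted K-means++: frequent, low-weight words influence the centroids less.
    km = KMeans(n_clusters=K, init="k-means++", n_init=10, random_state=seed)
    km.fit(V, sample_weight=sw)
    group_of = dict(zip(vocab, km.labels_))

    # Weighted centroid g_i of each group, Eq. (4.2).
    d = V.shape[1]
    g = np.zeros((K, d))
    for i in range(K):
        members = [w for w in vocab if group_of[w] == i]
        g[i] = np.mean([weight[w] * word_vec[w] for w in members], axis=0)

    # Intra-group descriptor: cumulative weighted residuals, Eq. (4.3).
    phi = np.zeros((K, d))
    for w in sentence_tokens:
        if w in word_vec:
            i = group_of[w]
            phi[i] += weight[w] * (word_vec[w] - g[i])
    return phi
```

In practice the clustering and centroids depend only on the vocabulary, so they can be pre-computed once and reused for every sentence, which is what makes S3E inference fast.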
Inter-group Descriptor. After obtaining the intra-group descriptor, we measure interactions between semantic groups with covariance coefficients. We can interpret $\Phi(S)$ in Eq. (4.4) as $d$ observations of $K$-dimensional random variables, and use $\mu \in \mathbb{R}^{K}$ to denote the mean of each row in $\Phi$. Then, the inter-group covariance matrix can be computed as
$$\Sigma = [\sigma_{ij}] = \frac{1}{d}(\Phi - \mu)(\Phi - \mu)^T \in \mathbb{R}^{K \times K} \qquad (4.5)$$
where
$$\sigma_{ij} = \frac{(v_i - \mu_i)^T (v_j - \mu_j)}{d} \qquad (4.6)$$
is the covariance between groups $i$ and $j$. Thus, matrix $\Sigma$ can be written as
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1K} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1K} & \sigma_{2K} & \cdots & \sigma_K^2 \end{pmatrix} \qquad (4.7)$$
Since the covariance matrix $\Sigma$ is symmetric, we can vectorize its upper triangular part and use it as the representation for sentence $S$. The Frobenius norm of the original matrix is kept the same as the Euclidean norm of the vectorized matrix. This process produces an embedding of dimension $K(K+1)/2$. Then, the embedding of sentence $S$ becomes
$$v(S) = \mathrm{vect}(\Sigma) = \begin{cases} \sqrt{2}\,\sigma_{ij} & \text{if } i < j, \\ \sigma_{ii} & \text{if } i = j. \end{cases} \qquad (4.8)$$
Finally, the sentence embedding in Eq. (4.8) is $\ell_2$ normalized.

Complexity. The semantic group construction process can be pre-computed for efficiency. Our runtime complexity is $O(dN + Kd^2)$, where $N$ is the length of a sentence, $K$ is the number of semantic groups, and $d$ is the dimension of the word embedding in use. Our algorithm is linear with respect to the sentence length. The S3E method is much faster than all parameterized models and most non-parameterized methods such as [179], where a singular value decomposition is needed during inference. The runtime comparison is discussed in Sec. 4.2.2.3.

4.2.2 Experimental Results and Analysis

We evaluate our method on two kinds of sentence embedding evaluation tasks to verify the generalizability of S3E. Semantic textual similarity tasks are used to test the clustering and retrieval properties of our sentence embedding. The discriminative power of the sentence embedding is evaluated by supervised tasks. For performance benchmarking, we compare S3E with a series of other methods, including parameterized and non-parameterized ones.
1. Non-parameterized models:
a) Avg. GloVe embedding;
b) SIF [11]: derived from an improved random-walk model; it consists of two parts: weighted averaging of word vectors and first principal component removal;
c) p-means [135]: concatenates different word embedding models and different power ratios;
d) DCT [9]: introduces the discrete cosine transform into sentence sequential modeling;
e) VLAWE [77]: introduces VLAD (vector of locally aggregated descriptors) into the sentence embedding field.
2. Parameterized models:
a) Skip-thought [81]: extends the word2vec unsupervised training objective from the word level to the sentence level;
b) InferSent [47]: a bi-directional LSTM encoder trained on high quality sentence inference data;
c) Sent2Vec [115]: learns n-gram word representations and uses their average as the sentence representation;
d) FastSent [74]: an improved Skip-thought model for fast training on a large corpus; it simplifies the recurrent neural network to a bag-of-words representation;
e) ELMo [121]: deep contextualized word embedding; the sentence embedding is computed by averaging all LSTM outputs;
f) Avg. BERT embedding [52]: averages the last-layer word representations of the BERT model;
g) SBERT-WK [167]: a fusion method that combines representations across layers of deep contextualized word models.

4.2.2.1 Textual Similarity Tasks

Table 4.1: Experimental results on textual similarity tasks in terms of the Pearson correlation coefficients (%), where the best results for parameterized and non-parameterized models are in bold, respectively.
Model Dim STS12 STS13 STS14 STS15 STS16 STSB SICK-R Avg.
Parameterized models
skip-thought [81] 4800 30.8 24.8 31.4 31.0 - - 86.0 40.80
InferSent [47] 4096 58.6 51.5 67.8 68.3 70.4 74.7 88.3 68.51
ELMo [121] 3072 55.0 51.0 63.0 69.0 64.0 65.0 84.0 64.43
Avg.
BERT[52] 768 46.9 52.8 57.2 63.5 64.5 65.2 80.5 61.51 SBERT-WK[167] 768 70.2 68.1 75.5 76.9 74.5 80.0 87.4 76.09 Non-parameterized models Avg. GloVe 300 52.3 50.5 55.2 56.7 54.9 65.8 80.0 59.34 SIF[11] 300 56.2 56.6 68.5 71.7 - 72.0 86.0 68.50 ?-mean[135] 3600 54.0 52.0 63.0 66.0 67.0 72.0 86.0 65.71 S3E (GloVe) 355-1575 59.5 62.4 68.5 72.3 70.9 75.5 82.7 69.59 S3E(FastText) 355-1575 62.5 67.8 70.2 76.1 74.3 77.5 84.7 72.64 S3E (L.F.P.) 955-2175 61.0 69.3 73.2 76.1 74.4 78.6 84.7 73.90 We evaluate the performance of the S3E method on the SemEval semantic textual similarity tasks from 2012 to 2016, the STS Benchmark and SICK-Relatedness dataset. The goal is to predict the similarity between sentence pairs. The sentence pairs contains labels between 0 to 5, 67 whichindicatetheirsemanticrelatedness. ThePearsoncorrelationcoefficientsbetweenprediction and human-labeled similarities are reported as the performance measure. For STS 2012 to 2016 datasets, the similarity prediction is computed using the cosine similarity. For STS Benchmark dataset and SICK-R dataset, they are under supervised setting and aims to predict the probability distributionofrelatednessscores. Weadoptthesamesettingwith[153]forthesetwodatasetsand alsoreportthePearsoncorrelationcoefficient. The S3E method can be applied to any static word embedding method. Here, we report three of them; namely, GloVe [119], FastText [30] and L.F.P. 1. Word embedding is normalized using [166]. Parameter n in Eq. (4.1) is set to 10 3 for all experiments. The word frequency, ?¹Fº, is estimated from the wiki dataset2. The number of semantic groups, , is chosen from the set f1020304050g andthebestperformanceisreported. Experimental results on textual similarity tasks are shown in Table 4.1, where both non- parameterizedandparameterizedmodelsarecompared. RecentparameterizedmethodSBERT-WK providesthebestperformanceandoutperformsothermethodbyalargemargin. S3Emethodusing L.F.P word embedding is the second best method in average comparing with both parameterized and non-parameterized methods. As mentioned, our work is compatible with any weight-based methods. With better weighting schemes, the S3E method has a potential to perform even better. Aschoiceofwordembedding,L.F.PperformsbetterthanFastTextandFastTextisbetterthanGloVe vectorinTable4.1,whichisconsistentwiththepreviousfindings[168]. Therefore,choosingmore powerfulwordembeddingmodelscanbehelpfulinperformanceboost. 4.2.2.2 SupervisedTasks TheSentEvaltoolkit3 [46]isusedtoevaluateoneightsupervisedtasks: 1. MR:Sentimentclassificationonmoviereviews. 2. CR:Sentimentclassificationonproductreviews. 1concatenatedLexVec,FastTextandPSL 2https://dumps.wikimedia.org/ 3https://github.com/facebookresearch/SentEval 68 Table 4.2: Experimental results on supervised tasks, where sentence embeddings are fixed during the training process and the best results for parameterized and non-parameterized models are markedinboldrespectively. Model Dim MR CR SUBJ MPQA SST TREC MRPC SICK-E Avg. 
Parameterized models skip-thought[81] 4800 76.6 81.0 93.3 87.1 81.8 91.0 73.2 84.3 83.54 FastSent[74] 300 70.8 78.4 88.7 80.6 - 76.8 72.2 - 77.92 InferSent[47] 4096 79.3 85.5 92.3 90.0 83.2 87.6 75.5 85.1 84.81 Sent2Vec[115] 700 75.8 80.3 91.1 85.9 - 86.4 72.5 - 82.00 USE[38] 512 80.2 86.0 93.7 87.0 86.1 93.8 72.3 83.3 85.30 ELMo[121] 3072 80.9 84.0 94.6 91.0 86.7 93.6 72.9 82.4 85.76 SBERT-WK [167] 768 83.0 89.1 95.2 90.6 89.2 93.2 77.4 85.5 87.90 Non-parameterized models GloVe(Ave) 300 77.6 78.5 91.5 87.9 79.8 83.6 72.1 79.0 81.25 SIF[11] 300 77.3 78.6 90.5 87.0 82.2 78.0 - 84.6 82.60 p-mean[135] 3600 78.3 80.8 92.6 89.1 84.0 88.4 73.2 83.5 83.74 DCT[9] 300-1800 78.5 80.1 92.8 88.4 83.7 89.8 75.0 80.6 83.61 VLAWE[77] 3000 77.7 79.2 91.7 88.1 80.8 87.0 72.8 81.2 82.31 S3E (GloVe) 355-1575 78.3 80.4 92.5 89.4 82.0 88.2 74.9 82.0 83.46 S3E (FastText) 355-1575 78.8 81.4 92.9 88.5 83.5 87.0 75.7 81.4 83.65 S3E(L.F.P.) 955-2175 79.4 81.4 92.9 89.4 83.5 89.0 75.6 82.6 84.23 3. SUBJ:Subjectivity/objectiveclassification. 4. MPQA:Opinionpolarityclassification. 5. SST2: Stanfordsentimenttreebankforsentimentclassification. 6. TREC:Questiontypeclassification. 7. MRPC:Paraphraseidentification. 8. SICK-Entailment: EntailmentclassificationonSICKdataset. The details for each dataset is also shown in Table 4.3. For all tasks, we trained a simple MLP classifier that contain one hidden layer of 50 neurons. It is same as it was done in [9] and only tunedthe! 2 regularizationtermonvalidationsets. Thehyper-parametersettingofS3Eiskeptthe sameasthatintextualsimilaritytasks. Thebatchsizeissetto64andAdamoptimizerisemployed. For MR, CR, SUBJ, MPQA and MRPC datasets, we use the nested 10-fold cross validation. For 69 Table4.3: Examplesindownstreamtasks Dataset #Samples Task Class MR 11k moviereview 2 CR 4k productreview 2 SUBJ 10k subjectivity/objectivity 2 MPQA 11k opinionpolarity 2 SST2 70k sentiment 2 TREC 6k question-type 6 MRPC 5.7k paraphrasedetection 2 SICK-E 10k entailment 3 TREC and SICK-E, we use the cross validation. For SST2 the standard validation is utilized. All experimentsaretrainedwith4epochs. ExperimentalresultsonsupervisedtasksareshowninTable4.2. TheS3Emethodoutperforms all non-parameterized models, including DCT [9], VLAWE [77] and ?-means [135]. The S3E method adopts a word embedding dimension smaller than?-means and VLAWE and also flexible in choosing embedding dimensions. As implemented in other weight-based methods, the S3E method does not consider the order of words in a sentence but splits a sentence into different semantic groups. The S3E method performs the best on the paraphrase identification (MRPC) dataset among all non-parameterized and parameterized methods. This is attributed to that, when paraphrasing, the order is not important since words are usually swapped. In this context, the correlation between semantic components play an important role in determining the similarity betweenapairofsentencesandparaphrases. Comparing with parameterized method, S3E also outperforms a series of them including Skip-thought, FastSent and Sent2Vec. In general, parameterized methods performs better than non-parameterized ones on downstream tasks. The best performance is the recently proposed SBERT-WK method which incorporate a pre-trained deep contextualized word model. However, even though good perform is witnessed, deep models are requiring much more computational resourceswhichmakesithardtointegrateintomobileorterminaldevices. 
Therefore, the S3E method has its own strength in its efficiency and good performance.

4.2.2.3 Inference Speed

Table 4.4: Inference time comparison. Data are collected from 5 trials.
Model           CPU inference time (ms)   GPU inference time (ms)
InferSent       53.07                     15.23
SBERT-WK        179.27                    42.79
GEM             26.54                     -
SIF             1.56                      -
Proposed S3E    0.69                      -

We compare the inference speed of S3E with other models, including non-parameterized and parameterized ones. For a fair comparison, the batch size is set to 1, and all sentences from the STSB dataset are used for evaluation (17,256 sentences). All benchmarks are run on the CPU (Intel i7-5930 of 3.50 GHz with 12 cores) and the GPU (Nvidia GeForce GTX TITAN X). The results are shown in Table 4.4.

Compared with the other methods, S3E is very efficient in inference, which is important for sentence embedding. Without the acceleration of a powerful GPU, a comparison task over 10,000 sentence pairs takes deep contextualized models about 1 hour, while S3E only requires 13 seconds.

4.2.2.4 Sensitivity to Cluster Numbers

We test the sensitivity of S3E to the setting of the cluster number. The cluster number is varied from 5 to 60 with an interval of 5. Results for the STS-Benchmark, SICK-Entailment and SST2 datasets are reported. As we can see from Figure 4.2, the performance of S3E is quite robust to the choice of the cluster number. The performance varies by less than 1% in accuracy or correlation.

Figure 4.2: Comparison of results with different settings of the cluster number. The STSB result is presented in Pearson correlation coefficients (%); SICK-E and SST2 are presented in accuracy.

4.2.3 Discussion

Averaging word embeddings provides a simple baseline for sentence embedding, and a weighted sum of word embeddings should intuitively offer an improvement. Some methods try to improve upon averaging by concatenating word embeddings in several forms, such as p-means [135] and VLAWE [77]. Concatenating word embeddings, however, usually encounters the dimension explosion problem: the number of concatenated components cannot be too large.

Our S3E method is compatible with existing models, and its performance can be further improved by replacing each module with a stronger one. First, we use word weights in constructing semantically similar groups and can incorporate different weighting schemes in our model, such as SIF [11] and GEM [179]. Second, different clustering schemes such as the Gaussian mixture model and dictionary learning can be utilized to construct semantically similar groups [67, 106]. Finally, the intra-group descriptor can be replaced by methods like VLAWE [77] and p-means [135]. In the inter-group descriptor, the correlation between semantic groups can also be modeled in a non-linear way by applying different kernel functions. Another future direction is to add sequential information into the current S3E method.

Figure 4.3: Evolving word representation patterns across layers measured by cosine similarity, where (a)-(d) show the similarity across layers and (e)-(h) show the similarity over different hops (1-hop to 5-hop). Four contextualized word representation models (BERT, SBERT, RoBERTa and XLNET) are tested.
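The layer-to-layer similarity maps summarized in Fig. 4.3 (and analyzed in Sec. 4.3.1 below) can be reproduced along the following lines. This is a minimal sketch assuming a HuggingFace Transformers checkpoint; it averages the per-token similarity maps of a single sentence, whereas the figure averages over all sentences of the STS-Benchmark dataset.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def layerwise_similarity(sentence, model_name="bert-base-uncased"):
    """Sketch: cosine similarity between a token's representations at every
    pair of layers, averaged over the tokens of one sentence (cf. Fig. 4.3)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()

    with torch.no_grad():
        enc = tok(sentence, return_tensors="pt")
        hidden = model(**enc).hidden_states       # (N+1) tensors of shape [1, T, d]

    H = torch.stack(hidden, dim=0).squeeze(1)     # [N+1, T, d]
    H = torch.nn.functional.normalize(H, dim=-1)
    # sim[i, j] = average over tokens t of cos(v_t^i, v_t^j)
    sim = torch.einsum("itd,jtd->ijt", H, H).mean(dim=-1)
    return sim                                    # [(N+1), (N+1)] similarity map
```

The offset-1 diagonal of this map is exactly the "1-hop" curve plotted in Figs. 4.3 (e)-(h), and its per-word variance is what the word-importance module in Sec. 4.3.2.2 relies on.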
73 4.3 Sentence Embedding by Dissecting Contextualized Word Models 4.3.1 WordRepresentationEvolutionacrossLayers Although studies have been done in the understanding of the word representation learned by deep contextualized models, none of them examine how a word representation evolves across layers. To observe such an evolving pattern, we design experiments in this section by considering the followingfourBERT-basedmodels. • BERT[52]. Itemploysthebi-directionaltrainingofthetransformerarchitectureandapplies ittolanguagemodeling. Unsupervisedobjectives,includingthemaskedlanguagemodeland thenextsentenceprediction,are incorporated. • SBERT [130]. It integrates the Siamese network with a pre-trained BERT model. The supervisedtrainingobjectiveisaddedtolearnhighqualitysentenceembedding. • RoBERTa [100]. It adapts the training process of BERT to more general environments such as longer sequences, bigger batches, more data and mask selection schemes, etc. The next sentencepredictionobjectiveisremoved. • XLNET [178]. It adopts the Transformer-XL architecture, which is trained with the Auto- Regressive(AR)objective. The above four BERT-based models have two variants; namely, the 12-layer base model and the 24-layer large model. We choose their base models in the experiments, which are pre-trained on theirrespectivelanguagemodelingtasks. To quantifythe evolutionof word representationsacross layers ofdeep contextualizedmodels, wemeasurethepair-wisecosinesimilaritybetween1-and#-hopneighbors. Bythe1-hopneighbor, werefertotherepresentationintheprecedingorthesucceedinglayerofthecurrentlayer. Generally, 74 wordF has¹#¸1º representationsofdimension3 fora#-layertransformernetwork. Thewhole representationsetforF canbeexpressedas E 0 F E 1 F E # F (4.9) whereE 8 F 2R 3 denotestherepresentationofwordFatthe8-thlayerThepair-wisecosinesimilarity betweenrepresentationsofthe8-thandthe 9-thlayerscanbecomputedas CosSim¹89º= hE 8 F E 9 F i jE 8 F jjE 9 F j (4.10) To obtain statistical results, we extract word representations from all sentences in the popular STSbenchmark dataset [37]. The dataset contains 8628 sentence pairs from three categories: captions, news and forum. The similarity map is non-contextualized. We average the similarity mapforallwordstopresentthepatternforcontextualizedwordembeddingmodels. Table 4.5: Word groups based on the variance level. Less significant words in a sentence are underlined. Variance Low Middle High SBERT can,end,do,would,time,all,say, percent, security, mr, into, military, eating, walking,small,room,person, says,how,before,more,east, she, arms,they, nuclear, head,billion, children, grass, baby, cat,bike, field, be,have,so,could,that,than, on, another, around, their, million, runs, potato,horse,snow,ball,dogs,dancing been,south,united,what,peace, killed, mandela,arrested, wearing,three, men, dog,running,women,boy,jumping, to,states,against,since,first,last his,her,city, through, cutting,green,oil plane,train, man,camera, woman,guitar BERT have,his,their,last,runs, would jumping, on, against,into, man,baby military,nuclear,killed,dancing,percent been,running,all,than,she,that around, walking,person,green,her, peace, plane,united,mr,bike, guitar, to,cat,boy,be,first,woman,how end,through,another, three,so, oil,train,children,arms,east,camera cutting,since,dogs,dog,say, wearing,mandela,south,do, potato, grass, ball,field,room,horse,before,billion could,more,man,small,eating they, what, women,says, can, arrested city,security,million,snow,states, time Figs. 
4.3 (a)-(d) show the similarity matrix across layers for four different models. Figs. 4.3 (e)-(h)showthepatternsalongtheoffsetdiagonal. Ingeneral,weseethattherepresentationsfrom nearbylayerssharealargesimilarityvalueexceptforthatinthelastlayer. Furthermore,weobserve that,exceptforthemaindiagonal,offsetdiagonalsdonothaveanuniformpatternasindicatedbythe bluearrowintheassociatedfigure. ForBERT,SBERTandRoBERTa,thepatternsatintermediate layers are flatter as shown in Figs. 4.3 (e)-(g). The representations between consecutive layers 75 have a cosine similarity value that larger than 0.9. The rapid change mainly comes from the beginningandthelastseverallayersofthenetwork. Thisexplainswhythemiddlelayersaremore transferable to other tasks as observed in [98]. Since the representation in middle layers are more stable,moregeneralizablelinguisticpropertiesarelearnedthere. AscomparedwithBERT,SBERT and RoBERTa, XLNET has a very different evolving pattern of word representations. Its cosine similarity curve as shown in Fig. 4.3 (h) is not concave. This can be explained by the fact that XLNET deviates from BERT significantly from architecture selection to training objectives. It alsoshedslightonwhySBERT[130],whichhasXLNETasthebackboneforsentenceembedding generation,hassentenceembeddingresultsworsethanBERT,giventhatXLNETismorepowerful inotherNLPtasks. WeseefromFigs. 4.3(e)-(g)thatthewordrepresentationevolvingpatternsinthelowerandthe middlelayersofBERT,SBERTandRoBERTaarequitesimilar. Theirdifferencesmainlylieinthe lastseverallayers. SBERThasthelargestdropwhileRoBERTahastheminimumchangeincosine similaritymeasuresinthelastseverallayers. SBERThasthehighestemphasisonthesentence-pair objectivesinceitusestheSiamesenetworkforsentencepairprediction. BERTputssomefocuson the sentence-level objective via next-sentence prediction. In contrast, RoBERTa removes the next sentencepredictioncompletelyintraining. Wearguethatfasterchangesinthelastseverallayersarerelatedtothetrainingwiththesentence- level objective, where the distinct sentence level information is reflected. Generally speaking, if moreinformationisintroducedbyaword,weshouldpayspecialattentiontoitsrepresentation. To quantifysuchaproperty,weproposetwometrics(namely,alignmentandnovelty)inSec. 4.3.2.1. We have so far studied the evolving pattern of word representations across layers. We may ask whether such a pattern is word dependent. This question can be answered below. As shown in Fig. 4.3, the offset diagonal patterns are pretty similar with each other in the mean. Without loss ofgenerality,weconductexperimentsontheoffset-1diagonalthatcontains12valuesasindicated by the arrow in Fig. 4.3. We compute the variances of these 12 values to find the variability of the 1-hop cosine similarity values with respect to different words. The variance is computed for 76 each word in BERT and SBERT6. We only report words that appears more than 50 times to avoid randomnessinTable4.5. ThesamesetofwordswerereportedforBERTandSBERTmodels. The words are split into three categorizes based on their variance values. The insignificant words in a sentence are underlined. We can clearly see from the table that words in the low variance group are in general less informative. In contrast, words in the high variance group are mostly nouns and verbs, which usually carry richer content. We conclude that more informative words in deep contextualized models vary more while insignificant words vary less. This finding motivates us to designamodulethatcandistinguishimportantwordsinasentenceinSec. 
4.3.2.2.

4.3.2 Proposed SBERT-WK Method

We propose a new sentence embedding method called SBERT-WK in this section. The block diagram of the SBERT-WK method is shown in Fig. 4.4. It consists of the following two steps:
1. Determine a unified word representation for each word in a sentence by integrating its representations across layers through an examination of its alignment and novelty properties.
2. Conduct a weighted average of the unified word representations based on the word importance measure to yield the ultimate sentence embedding vector.
They are elaborated in the following two subsections, respectively.

Figure 4.4: Illustration of the proposed SBERT-WK model: the layer-wise representations of a pre-trained Transformer are combined with alignment and novelty weights into unified word representations, which are then fused with word importance weights into the sentence representation (illustrated with the sentence "Tech. trend is unpredictable in 2020").

4.3.2.1 Unified Word Representation Determination

As discussed in Sec. 4.3.1, the word representation evolves across layers. We use $v_w^i$ to denote the representation of word $w$ at the $i$-th layer. To determine the unified word representation, $\hat{v}_w$, of word $w$ in Step 1, we assign weight $\alpha_i$ to its $i$-th layer representation, $v_w^i$, and take an average:
$$\hat{v}_w = \sum_{i=0}^{N} \alpha(v_w^i)\, v_w^i \qquad (4.11)$$
where the weight $\alpha$ can be derived based on two properties: the inverse alignment and the novelty. (Since RoBERTa and XLNET use a special tokenizer whose tokens cannot be linked to real word pieces, we do not test on RoBERTa and XLNET here.)

Inverse Alignment Measure. We define the context matrix of $v_w^i$ as
$$C = [v_w^{i-m}, \dots, v_w^{i-1}, v_w^{i+1}, \dots, v_w^{i+m}] \in \mathbb{R}^{d \times 2m} \qquad (4.12)$$
where $d$ is the word embedding dimension and $m$ is the context window size. We can compute the pair-wise cosine similarity between $v_w^i$ and all elements in the context window $C(v_w^i)$ and use their average to measure how well $v_w^i$ aligns with the word vectors inside its context. The alignment similarity score of $v_w^i$ can be defined as
$$\beta_a(v_w^i) = \frac{1}{2m} \sum_{j=i-m,\, j \ne i}^{i+m} \frac{\langle v_w^i, v_w^j \rangle}{|v_w^i|\,|v_w^j|} \qquad (4.13)$$
If a word representation at a layer aligns well with its context word vectors, it does not provide much additional information. Since it is less informative, we can give it a smaller weight. Thus, we use the inverse of the alignment similarity score as the weight for word $w$ at the $i$-th layer. Mathematically, we have
$$\alpha_a(v_w^i) = \frac{K_a}{\beta_a(v_w^i)} \qquad (4.14)$$
where $K_a$ is a normalization constant independent of $i$ that is chosen to normalize the sum of weights, $\sum_{i=1}^{N} \alpha_a(v_w^i) = 1$. We call $\alpha_a(v_w^i)$ the inverse alignment weight.
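The inverse-alignment weight just defined can be transcribed directly into a small numpy sketch. The handling of boundary layers (where fewer than $2m$ neighbors exist) is not specified in the text, so the truncated window used here is an assumption, and the function name is illustrative.

```python
import numpy as np

def inverse_alignment_weights(layer_reps, m=2):
    """Sketch of the inverse-alignment weight of Eqs. (4.12)-(4.14).

    layer_reps : (N+1) x d array, the representations of ONE word across layers.
    m          : context window size (layers i-m .. i+m, excluding layer i).
    Returns weights alpha_a that sum to one over the layers.
    """
    L, _ = layer_reps.shape
    unit = layer_reps / np.linalg.norm(layer_reps, axis=1, keepdims=True)

    beta = np.zeros(L)
    for i in range(L):
        # Truncated window at the first/last layers (an assumption of this sketch).
        nbrs = [j for j in range(max(0, i - m), min(L, i + m + 1)) if j != i]
        beta[i] = np.mean(unit[nbrs] @ unit[i])   # Eq. (4.13): average cosine

    alpha = 1.0 / beta                            # Eq. (4.14), up to the constant K_a
    return alpha / alpha.sum()                    # normalize so the weights sum to 1
```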
Novelty Measure. Another way to measure the new information in the word representation $v_w^i$ is to study the new information brought by it with respect to the subspace spanned by the words in its context window. Clearly, the words in the context matrix $C$ form a subspace. We can decompose $v_w^i$ into two components: one contained in the subspace and the other orthogonal to the subspace. We view the orthogonal one as its novel component and use its magnitude as the novelty score. By the singular value decomposition (SVD), we can factorize a matrix $M$ of dimension $m \times n$ into the form $M = U\Sigma V$, where $U$ is an $m \times n$ matrix with orthogonal columns, $\Sigma$ is an $n \times n$ diagonal matrix with non-negative numbers on the diagonal, and $V$ is an $n \times n$ orthogonal matrix. In our current context, we decompose the context matrix $C$ in Eq. (4.12) into $C = U\Sigma V$ to find the orthogonal basis for the context words. The orthogonal column basis for $C$ is represented by matrix $U$. Thus, the orthogonal component of $v_w^i$ with respect to $C$ can be computed as
$$q_w^i = v_w^i - U U^T v_w^i \qquad (4.15)$$
The novelty score of $v_w^i$ is computed by
$$\alpha_n(v_w^i) = K_n \frac{\|q_w^i\|_2}{\|v_w^i\|_2} \qquad (4.16)$$
where $K_n$ is a normalization constant independent of $i$ that is chosen to normalize the sum of weights, $\sum_{i=1}^{N} \alpha_n(v_w^i) = 1$. We call $\alpha_n(v_w^i)$ the novelty weight.

Unified Word Representation. We have examined two ways to measure the new information brought by the word representation $v_w^i$ at the $i$-th layer. We may consider a weighted average of the two in the form of
$$\alpha_c(v_w^i, \omega) = \omega\, \alpha_a(v_w^i) + (1-\omega)\, \alpha_n(v_w^i) \qquad (4.17)$$
where $0 \le \omega \le 1$ and $\alpha_c(v_w^i, \omega)$ is called the combined weight. We compare the performance of three cases (namely, the novelty weight $\omega = 0$, the inverse alignment weight $\omega = 1$, and the combined weight $\omega = 0.5$) in the experiments. A unified word representation is computed as a weighted sum of the representations in different layers:
$$\hat{v}_w = \sum_{i=0}^{N} \alpha_c(v_w^i)\, v_w^i \qquad (4.18)$$
We can view $\hat{v}_w$ as the new contextualized word representation for word $w$.

4.3.2.2 Word Importance

As discussed in Sec. 4.3.1, the variances of the pair-wise cosine-similarity matrix can be used to categorize words into different groups. Words with richer information usually have a larger variance. Following this line of thought, we can use the same variance to determine the importance of a word and merge multiple words in a sentence to determine the sentence embedding vector. This is summarized below.

For the $j$-th word in a sentence, denoted by $w(j)$, we first compute its cosine similarity matrix using its word representations from all layers, as shown in Eq. (4.10). Next, we extract the offset-1 diagonal of the cosine similarity matrix, compute the variance of the offset-1 diagonal values and use $\sigma_j^2$ to denote the variance of the $j$-th word. Then, the final sentence embedding $v_s$ can be expressed as
$$v_s = \sum_{j} \omega_j\, \hat{v}_{w(j)} \qquad (4.19)$$
where $\hat{v}_{w(j)}$ is the new contextualized word representation for word $w(j)$ as defined in Eq. (4.18), and
$$\omega_j = \frac{|\sigma_j^2|}{\sum_k |\sigma_k^2|} \qquad (4.20)$$
Note that the weight for each word is the $\ell_1$-normalized variance, as shown in Eq. (4.20). To sum up, in our sentence embedding scheme, words that evolve faster across layers get higher weights since they have larger variances.

4.3.2.3 Computational Complexity

The main computational burden of SBERT-WK comes from the SVD, which allows a more fine-grained analysis in the novelty measure. The context window matrix $C$ is decomposed into the product of three matrices, $C = U\Sigma V$. The orthogonal basis is given by matrix $U$. The context window matrix is of size $d \times 2m$, where $d$ is the word embedding size and $2m$ is the whole window size. In our case, $d$ is much larger than $m$, so that the computational complexity of the SVD is $O(8dm^2)$, where several lower-order terms are ignored.

Instead of performing the SVD, we use the QR factorization in our experiments as an alternative because of its computational efficiency. With the QR factorization, we first concatenate the center word vector representation $v_w^i$ to the context window matrix $C$ to form a new matrix
$$\tilde{C} = [v_w^{i-m}, \dots, v_w^{i-1}, v_w^{i+1}, \dots, v_w^{i+m}, v_w^i] \in \mathbb{R}^{d \times (2m+1)} \qquad (4.21)$$
that has $2m+1$ word representations. We perform the QR factorization on $\tilde{C}$ and obtain $\tilde{C} = QR$, where the non-zero columns of matrix $Q \in \mathbb{R}^{d \times (2m+1)}$ form an orthonormal basis and $R \in \mathbb{R}^{(2m+1) \times (2m+1)}$ is an upper triangular matrix that contains the weights of the word representations under the basis of $Q$. We denote the $i$-th columns of $Q$ and $R$ as $q_i$ and $r_i$, respectively. With the QR factorization, $r_{2m+1}$ is the representation of $v_w^i$ under the orthogonal basis formed by matrix $Q$. The new direction introduced to the context by $v_w^i$ is represented by $q_{2m+1}$.
The last component of $r_{2m+1}$ is then the weight for the new direction, which we denote by $r_{2m+1}^{(-1)}$. The novelty weight can thus be derived as
$$\alpha_n(v_w^i) = K_n \frac{r_{2m+1}^{(-1)}}{\|r_{2m+1}\|} \qquad (4.22)$$
where $K_n$ is the normalization constant. The inverse alignment weight can also be computed under the new basis $Q$.

The complexity of the QR factorization is $O(d(2m+1)^2)$, which is about two times faster than the SVD. In practice, we see little performance difference between these two methods. The experimental runtime is compared in Sec. 4.3.3.5.

4.3.3 Experiments

Since our goal is to obtain a general-purpose sentence embedding method, we evaluate SBERT-WK on three kinds of evaluation tasks.
• Semantic textual similarity tasks. They predict the similarity between two given sentences. They can be used to indicate the embedding ability of a method in terms of clustering and information retrieval via semantic search.
• Supervised downstream tasks. They measure an embedding's transfer capability to downstream tasks, including entailment and sentiment classification.
• Probing tasks. They have been proposed in recent years to measure the linguistic features of an embedding model and provide a fine-grained analysis.
These three kinds of evaluation tasks provide a comprehensive test of our proposed model. The popular SentEval toolkit [46] is used in all experiments. The proposed SBERT-WK method can be built upon several state-of-the-art pre-trained language models such as BERT, SBERT, RoBERTa and XLNET. Here, we evaluate it on top of two models: SBERT and XLNET. The latter is the XLNET pre-trained model obtained from the Transformer Repo. We adopt their base models, which contain 12 self-attention layers.

For performance benchmarking, we compare SBERT-WK with the following 10 different methods, including parameterized and non-parameterized models.
1. Average of GloVe word embeddings;
2. Average of FastText word embeddings;
3. Average of the last-layer token representations of BERT;
4. The [CLS] embedding from BERT, where [CLS] is used for next sentence prediction in BERT;
5. The SIF model [11], which is a non-parameterized model that provides a strong baseline in textual similarity tasks;
6. The p-mean model [135], which incorporates multiple word embedding models;
7. Skip-Thought [81];
8. InferSent [47] with both GloVe and FastText versions;
9. Universal Sentence Encoder [38], which is a strong parameterized sentence embedding method using multiple objectives and the transformer architecture;
10. Sentence-BERT, which is a state-of-the-art sentence embedding model trained with a Siamese network over BERT.
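Before turning to the results, the weighting scheme of Sec. 4.3.2 can be summarized in code. The sketch below reuses the inverse_alignment_weights helper shown after Sec. 4.3.2.1, implements the QR-based novelty weight, and assumes the per-word layer representations have already been extracted from the backbone (e.g., along the lines of the hidden-state sketch after Fig. 4.3). The names and boundary handling are illustrative rather than the released implementation.

```python
import numpy as np

def novelty_weights(layer_reps, m=2):
    """Sketch of the QR-based novelty weight, Eqs. (4.15)-(4.16) and (4.21)-(4.22)."""
    L, d = layer_reps.shape
    alpha = np.zeros(L)
    for i in range(L):
        nbrs = [j for j in range(max(0, i - m), min(L, i + m + 1)) if j != i]
        C_tilde = np.column_stack([layer_reps[j] for j in nbrs] + [layer_reps[i]])
        _, R = np.linalg.qr(C_tilde)              # C_tilde = Q R, Eq. (4.21)
        r_last = R[:, -1]                         # coefficients of v_w^i in the Q basis
        alpha[i] = abs(r_last[-1]) / np.linalg.norm(r_last)   # Eq. (4.22), up to K_n
    return alpha / alpha.sum()

def sbert_wk_sentence(all_layer_reps, m=2, omega=0.5):
    """all_layer_reps : list over words; each entry is an (N+1) x d array of one
    word's representations across layers. Combines Eqs. (4.17)-(4.20)."""
    unified, importance = [], []
    for reps in all_layer_reps:
        # inverse_alignment_weights is the helper sketched in Sec. 4.3.2.1 above.
        a_align = inverse_alignment_weights(reps, m)
        a_novel = novelty_weights(reps, m)
        a_comb = omega * a_align + (1 - omega) * a_novel      # Eq. (4.17)
        unified.append(a_comb @ reps)                         # Eq. (4.18)

        # Word importance: variance of the offset-1 diagonal of the cosine map.
        unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
        hop1 = np.sum(unit[:-1] * unit[1:], axis=1)
        importance.append(np.var(hop1))

    w = np.array(importance) / np.sum(importance)             # Eq. (4.20)
    return np.sum(w[:, None] * np.stack(unified), axis=0)     # Eq. (4.19)
```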
In our experiments, we do not include the representation from the first three layers since their representationsarelesscontextualizedasreportedin[58]. Somesuperficialinformationiscaptured by those representations and they play a subsidiary role in most tasks [78]. We set the context windowsizeto<= 2 inallevaluation tasks. Table4.6: ExamplesinSTS12-STS16,STS-BandSICKdatasets. Dataset Task Sent A Sent B Label STS12-STS16 STS "Don’t cookfor him. He’s a grown up." "Don’tworry abouthim. He’s a grown up." 4.2 STS-B STS "Swimmersareracingina lake." "Women swimmersaredivinginfront of thestarting platform." 1.6 SICK-R STS "Twopeople insnowsuitsarelyinginthe snowand making snowangel." "Twoangels aremaking snowonthelyingchildren" 2.5 TheresultsaregiveninTable4.7. WeseethattheuseofBERToutputsdirectlygeneratesrather poor performance. For example, the CLS token representation gives an average correlation score of 38.93% only. Averaging BERT outputs provides an average correlation score of 61.51%. This is used as the default setting of generating sentence embedding from BERT in the bert-as-service 84 Table4.7: ExperimentalresultsonvarioustextualsimilaritytasksintermsofthePearsoncorrelation coefficients(%),wherethebestresultsareshowninboldface. Model Dim STS12 STS13 STS14 STS15 STS16 STSB SICK-R Avg. Non-Parameterizedmodels Avg. GloVeembeddings 300 52.3 50.5 55.2 56.7 54.9 65.8 80.0 59.34 Ave. FastText embedding 300 58.0 58.0 65.0 68.0 64.0 70.0 82.0 66.43 SIF 300 56.2 56.6 68.5 71.7 - 72.0 86.0 68.50 ?-mean 3600 54.0 52.0 63.0 66.0 67.0 72.0 86.0 65.71 Parameterizedmodels Skip-Thought 4800 41.0 29.8 40.0 46.0 52.0 75.0 86.0 52.83 InferSent-GloVe 4096 59.3 58.8 69.6 71.3 71.5 75.7 88.4 70.66 InferSent-FastText 4096 62.7 54.8 68.4 73.6 71.8 78.5 88.8 71.23 UniversalSentenceEncoder 512 61.0 64.0 71.0 74.0 74.0 78.0 86.0 72.57 BERT[CLS] 768 27.5 22.5 25.6 32.1 42.7 52.1 70.0 38.93 Avg. BERT embedding 768 46.9 52.8 57.2 63.5 64.5 65.2 80.5 61.51 Sentence-BERT 768 64.6 67.5 73.2 74.3 70.1 74.1 84.2 72.57 Proposed SBERT-WK 768 70.2 68.1 75.5 76.9 74.5 80.0 87.4 76.09 toolkit 7. They are both worse than non-parameterized models such as averaging FastText word embedding,whichisastaticwordembeddingscheme. Theirpoorperformancecouldbeattributed tothatthemodelisnottrainedusingasimilarobjectivefunction. Themaskedlanguagemodeland next sentence prediction objectives are not suitable for a linear integration of representations. The study in [60] explains how linearity is exploited in static word embeddings (e.g., word2vec) and it sheds light on contextualized word representations as well. Among the above two methods, we recommend averaging BERT outputs because it captures more inherent structure of the sentence while the CLS token representation is more suitable for some downstream classification tasks as showninTable4.9. WeseefromTable4.7thatInferSent,USEandSBERTprovidethestate-of-the-artperformance on textual similarity tasks. Especially, InferSent and SBERT have a mechanism to incorporate the joint representation of two sentences such as the point-wise difference or the cosine similarity. Then, the training process learns the relationship between sentence representations in a linear manner and compute the correlation using the cosine similarity, which is a perfect fit. Since the original BERT model is not trained in this manner, the use of the BERT representation directly 7https://github.com/hanxiao/bert-as-service 85 would give rather poor performance. 
Similarly, the XLNET model does not provide satisfactory resultsinSTStasks. Ascomparedwithothermethods,SBERT-WKimprovestheperformanceontextualsimilarity tasks by a significant margin. It is worthwhile to emphasize that we use only 768-dimension vectorsforsentenceembeddingwhileInferSentuses4096-dimensionvectors. Asexplainedin[47, 57, 135], the increase in the embedding dimension leads to increased performance for almost all models. This may explain SBERT-WK is slightly inferior to InferSent on the SICK-R dataset. For allothertasks,SBERT-WKachievessubstantialbetterperformanceevenwithasmallerembedding size. 4.3.3.2 SupervisedDownstreamTasks For supervised tasks, we compare SBERT-WK with other sentence embedding methods in the followingeightdownstreamtasks. • MR: Binarysentimentpredictiononmoviereviews[118]. • CR: Binarysentimentpredictiononcustomerproductreviews[76]. • SUBJ:Binarysubjectivitypredictiononmoviereviewsandplotsummaries[116]. • MPQA:Phrase-levelopinionpolarityclassification[171]. • SST2: StanfordSentimentTreebankwithbinarylabels[148]. • TREC:Questiontypeclassificationwith6classes[96]. • MRPC:MicrosoftResearchParaphraseCorpusforparaphraseprediction[54]. • SICK-E:Naturallanguageinferencedataset[105]. MoredetailsonthesedatasetsaregiveninTable4.8. 86 The design of our sentence embedding model targets at the transfer capability to downstream tasks. Typically, one can tailor a pre-trained language model to downstream tasks through tasks- specific fine-tuning. It was shown in previous work [11], [179] that subspace analysis methods are more powerful in semantic similarity tasks. However, we would like to show that sentence embeddingcanprovideanefficientwayfordownstreamtasksaswell. Inparticular,wedemonstrate that SBERT-WK does not hurt the performance of pre-trained language models. Actually, it can even perform better than the original model in downstream tasks under the SBERT and XLNET backbonesettings. Table4.8: Datasetsusedinsuperviseddownstreamtasks. Dataset # Samples Task Class Example Label MR 11k movie review 2 "A fascinatingandfunfilm." pos CR 4k productreview 2 "No way tocontacttheir customer service" neg SUBJ 10k subjectivity/objectivity 2 "She’s anartist, buthasn’t pickedupabrush in ayear." objective MPQA 11k opinion polarity 2 "strong sense of justice" pos SST2 70k sentiment 2 "At first,the sight of ablindman directingafilm ishilarious,but as the film goes on,thejokewearsthin." neg TREC 6k question-type 6 "What is the average speed of thehorsesat theKentuckyDerby?" NUM:speed MRPC 5.7k paraphrasedetection 2 "Theauthor isoneof severaldefenseexperts expected totestify." Paraphrase "Spitzisexpected totestify laterforthedefense. SICK-E 10k entailment 3 "There is nowoman using an eye pencilandapplyingeye liner toher eyelid." Contradiction "A womanis applyingcosmetics toher eyelid." ForSBERT-WK,weusethesamesettingastheoneinsemanticsimilaritytasks. Fordownstream tasks,weadoptamulti-layer-perception(MLP)modelthatcontainsonehiddenlayerof50neurons. The batch size is set to 64 and the Adam optimizer is adopted in the training. All experiments are trained with 4 epochs. For MR, CR, SUBJ, MPQA and MRPC, we use the nested 10-fold cross validation. For SST, we use the standard validation. For TREC and SICK-E, we use the cross validation. TheexperimentalresultsoneightsuperviseddownstreamtasksaregiveninTable4.9. Although it is desired to fine-tune deep models for downstream tasks, we see that SBERT-WK still achieves goodperformancewithoutanyfine-turning. 
Ascomparedwiththeother12benchmarkingmethods, SBERT-WK has the best performance in 5 out of the 8 tasks. For the remaining 3 tasks, it still ranks among the top three. SBERT-WK with SBERT as the backbone achieves the best averaged performance (87.90%). The average accuracy improvements over XLNET and SBERT alone are 4.66% and 1.91%, respectively. For TREC, SBERT-WK is inferior to the two best models, USE and BERT[CLS], by 0.6%. For comparison, the baseline SBERT is much worse than USE, and 87 Table 4.9: Experimental results on eight supervised downstream tasks, where the best results are showninboldface. Model Dim MR CR SUBJ MPQA SST2 TREC MRPC SICK-E Avg. Non-Parameterizedmodels Avg. GloVeembeddings 300 77.9 79.0 91.4 87.8 81.4 83.4 73.2 79.2 81.66 Ave. FastText embedding 300 78.3 80.5 92.4 87.9 84.1 84.6 74.6 79.5 82.74 SIF 300 77.3 78.6 90.5 87.0 82.2 78.0 - 84.6 82.60 ?-mean 3600 78.3 80.8 92.6 73.2 84.1 88.4 73.2 83.5 81.76 Parameterizedmodels Skip-Thought 4800 76.6 81.0 93.3 87.1 81.8 91.0 73.2 84.3 83.54 InferSent-GloVe 4096 81.8 86.6 92.5 90.0 84.2 89.4 75.0 86.7 85.78 InferSent-FastText 4096 79.0 84.1 92.9 89.0 84.1 92.4 76.4 86.7 85.58 UniversalSentence Encoder 512 80.2 86.0 93.7 87.0 86.1 93.8 72.3 83.3 85.30 BERT[CLS] vector 768 82.3 86.9 95.4 88.3 86.9 93.8 72.1 73.8 84.94 Avg. BERTembedding 768 81.7 86.8 95.3 87.8 86.7 91.6 72.5 78.2 85.08 Avg. XLNET embedding 768 81.6 85.7 93.7 87.1 85.1 88.6 66.5 56.7 80.63 Sentence-BERT 768 82.4 88.9 93.9 90.1 88.4 86.4 75.5 82.3 85.99 SBERT-WK(XLNET) 768 83.6 87.4 94.9 89.1 87.1 91.0 74.2 75.0 85.29 SBERT-WK(SBERT) 768 83.0 89.1 95.2 90.6 89.2 93.2 77.4 85.5 87.90 SBERT-WK outperforms SBERT by 6.8%. USE is particularly suitable TREC since it is pre- trainedonquestionansweringdata,whichishighlyrelatedtothequestiontypeclassificationtask. In contrast, SBERT-WK is not trained or fine-tuned on similar tasks. For SICK-E, SBERT-WK is inferior to two InferSent-based methods by 1.2%, which could be attributed to the much larger dimensionofInferSent. We observe that averaging BERT outputs and CLS vectors give pretty similar performance. Although CLS provides poor performance for semantic similarity tasks, CLS is good at classifi- cation tasks. This is because that the classification representation is used in its model training. Furthermore, the use of MLP as the inference tool would allow certain dimensions to have higher importance in the decision process. The cosine similarity adopted in semantic similarity tasks treats all dimension equally. As a result, averaging BERT outputs and CLS token representation are not suitable for semantic similarity tasks. If we plan to apply the CLS representation and/or averaging BERT outputs to semantic textual similarity, clustering and retrieval tasks, we need to learnanadditionaltransformationfunctionwithexternalresources. 88 4.3.3.3 ProbingTasks It is difficult to infer what kind of information is present in sentence representation based on downstream tasks. Probing tasks focus more on language properties and, therefore, help us understand sentence embedding models. We compare SBERT-WK on 10 probing tasks so as to cover a wide range of aspects from superficial properties to deep semantic meanings. They are divide into three types [48]: 1) surface information, 2) syntactic information and 3) semantic information. • SurfaceInformation – SentLen: Predictthelengthrangeoftheinputsentencewith6classes. – WC:Predictwhichwordisinthesentencegiven1000candidates. 
• SyntacticInformation – TreeDepth: Predictdepthoftheparsingtree. – TopConst: Predicttop-constituentsofparsingtreewithin20classes. – BShift: Predictwhethera bigramhasbeenshiftedornot. • SemanticInformation – Tense: Classifythemainclausetensewithpastorpresent. – SubjNum: Classifythesubjectnumberwithsingularorplural. – ObjNum: Classifytheobjectnumberwithsingularorplural. – SOMO:Predictwhetherthenoun/verbhasbeenreplacedbyanotheronewiththesame part-of-speechcharacter. – CoordInv: Sentencesaremadeoftwocoordinateclauses. Predictwhetheritisinverted ornot. 89 We use the same experimental setting as that used for supervised tasks. The MLP model has one hidden layer of 50 neurons. The batch size is set to 64 while Adam is used as the optimizer. All tasks are trained in 4 epochs. The standard validation is employed. Being Different from the work in [120] that uses logistic regression for the WC task in the category of surface information, weusethesameMLPmodeltoprovidesimpleyetfaircomparison. Table4.10: Experimentalresultson10probingtasks,wherethebestresultsareshowninboldface. Surface Syntactic Semantic Model Dim SentLen WC TreeDepth TopConst BShift Tense SubjNum ObjNum SOMO CoordInv Non-Parameterizedmodels Avg. GloVeembeddings 300 71.77 80.61 36.55 66.09 49.90 85.33 79.26 77.66 53.15 54.15 Ave. FastTextembedding 300 64.13 82.10 36.38 66.33 49.67 87.18 80.79 80.26 49.97 52.25 ?-mean 3600 86.42 98.85 38.20 61.66 50.09 88.18 81.73 83.27 53.27 50.45 Parameterized models Skip-Thought 4800 86.03 79.64 41.22 82.77 70.19 90.05 86.06 83.55 54.74 71.89 InferSent-GloVe 4096 84.25 89.74 45.13 78.14 62.74 88.02 86.13 82.31 60.23 70.34 InferSent-FastText 4096 83.36 89.50 40.78 80.93 61.81 88.52 86.16 83.76 53.75 69.47 UniversalSentence Encoder 512 79.84 54.19 30.49 68.73 60.52 86.15 77.78 74.60 58.48 58.19 BERT [CLS]vector 768 68.05 50.15 34.65 75.93 86.41 88.81 83.36 78.56 64.87 74.32 Avg. BERTembedding 768 84.08 61.11 40.08 73.73 88.80 88.74 85.82 82.53 66.76 72.59 Avg. XLNET embedding 768 67.93 42.60 35.84 73.98 72.54 87.29 85.15 78.52 59.15 67.61 Sentence-BERT 768 75.55 58.91 35.56 61.49 77.93 87.32 79.76 78.40 62.85 65.34 SBERT-WK (XLNET) 768 79.91 60.39 43.34 80.70 79.02 88.68 88.16 84.01 61.15 71.71 SBERT-WK (SBERT) 768 92.40 77.50 45.40 79.20 87.87 88.88 86.45 84.53 66.01 71.87 The performance is shown in Table 4.10. We see that SBERT-WK yields better results than SBERTinalltasks. Furthermore,SBERT-WKoffersthebestperformanceinfourofthetentasks. As discussed in [48], there is a trade-off in shallow and deep linguistic properties in a sentence. Thatis,lowerlayerrepresentationscarrymoresurfaceinformationwhiledeeplayerrepresentations representmoresemanticmeanings[78]. Bymerginginformationfromvariouslayers,SBERT-WK cantakecareofthesedifferentaspects. Thecorrelationbetweenprobingtasksanddownstreamtaskswerestudiedin[48]. Theyfound that most downstream tasks only correlates with a subset of the probing tasks. WC is positively correlated with all downstream tasks. This indicates that the word content (WC) in a sentence is the most important factor among all linguistic properties. However, in our finding, although ?-means provides the best WC performance, it is not the best one in downstream tasks. Based on the above discussion, we conclude that “good performance in WC alone does not guarantee satisfactory sentence embedding and we should pay attention to the high level semantic meaning 90 aswell". Otherwise,averagingone-hotwordembeddingwouldgiveperfectperformance,whichis howevernottrue. 
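For reference, the frozen-embedding classifier used in the downstream and probing evaluations above (one hidden layer of 50 neurons, Adam, batch size 64, a few epochs, with only the L2 penalty tuned on validation data) can be approximated outside of SentEval as follows. SentEval applies its own per-dataset cross-validation protocol, so this standalone sketch with hypothetical names is only an approximation of that setup.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

def evaluate_frozen_embeddings(sent_embeddings, labels):
    """Train a one-hidden-layer MLP on fixed sentence embeddings and report
    cross-validated accuracy, tuning only the L2 penalty (alpha)."""
    clf = GridSearchCV(
        MLPClassifier(hidden_layer_sizes=(50,), solver="adam",
                      batch_size=64, max_iter=4),   # 4 epochs; convergence warnings are expected
        param_grid={"alpha": [1e-4, 1e-3, 1e-2, 1e-1]},
        cv=5,
    )
    clf.fit(np.asarray(sent_embeddings), np.asarray(labels))
    return clf.best_score_
```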
The TREC dataset is shown to be highly correlated with a wide range of probing tasks in [48]. SBERT-WK is better than SBERT in all probing tasks and we expect it to yield excellent performance for the TREC dataset. This is verified in Table 4.9. We see that SBERT-WK works wellfortheTRECdatasetwithsubstantialimprovementoverthebaselineSBERTmodel. SBERT is trained using the Siamese Network on top of the BERT model. It is interesting to pointoutthatSBERTunderperformsBERTinprobingtasksconsistently. Thiscouldbeattributed to that SBERT pays more attention to the sentence-level information in its training objective. It focuses more on sentence pair similarities. In contrast, the mask language objective in BERT focusesmoreonword-orphrase-levelandthenextsentencepredictionobjectivecapturestheinter- sentenceinformation. Probingtasksaretestedontheword-levelinformationortheinnerstructure ofasentence. TheyarenotwellcapturedbytheSBERTsentenceembedding. Yet,SBERT-WKcan enhance SBERT significantly through detailed analysis of each word representation. As a result, SBERT-WKcanobtainsimilarorevenbetterresultsthanBERTinprobingtasks. 4.3.3.4 AblationandSensitivityStudy To verify the effectiveness of each module in the proposed SBERT-WK model, we conduct the ablationstudybyaddingonemoduleatatime. Also,theeffectoftwohyperparameters(thecontext windowsizeandthestartinglayerselection)isevaluated. Theaveragedresultsfortextualsemantic similaritydatasets,includingSTS12-STS16andSTSB,arepresented. Ablationstudyofeachmodule’scontribution We present the ablation study results in Table 4.11. It shows that all three components (Align- ment, Novelty, Token Importance) improve the performance of the plain SBERT model. Adding the Alignment weight and the Novelty weight alone provides performance improvement of 1.86% and 2.49%, respectively. The Token Importance module can be applied to the word representation 91 ofthelastlayerorthewordrepresentationobtainedbyaveragingalllayeroutputs. Thecorrespond- ing improvements are 0.55% and 2.2%, respectively. Clearly, all three modules contribute to the performanceofSBERT-WK.Theultimateperformancegaincanreach3.56%. Table4.11: Comparisonofdifferentconfigurationstodemonstratetheeffectivenessofeachmodule oftheproposedSBERT-WKmethod. TheaveragedPearsoncorrelationcoefficients(%)forSTS12- STS16andSTSBdatasetsarereported. Model Avg. STS results SBERTbaseline 70.65 SBERT+Alignment (F= 0) 72.51 SBERT+Novelty (F= 1) 73.14 SBERT+TokenImportance(last layer) 71.20 SBERT+TokenImportance(alllayers) 72.85 SBERT-WK (F= 05) 74.21 Sensitivitytowindowsizeandlayerselection We test the sensitivity of SBERT-WK to two hyper-parameters on STS, SICK-E and SST2 datasets. The results are shown in Fig. 4.5. The window size < is chosen to be 1, 2, 3 and 4. There are at most 13 representations for a 12-layer transformer network. By setting window size to< = 4, we can cover a wide range of representations already. The performance versus the < value is given in Fig. 4.5 (a). As mentioned before, since the first several layers carry little contextualized information, it may not be necessary to include representations in the first several layers. We choose the starting layer; ( to be from 0-6 in the sensitivity study. The performance versus the; ( value is given in Fig. 4.5 (b). We see from both figures that SBERT-WK is robust to different values of< and; ( . By considering the performance and computational efficiency, we set window size < = 2 as the default value. 
For starting layer selection, the perform goes up a little bit when the representations of first three layers are excluded. This is especially true for the SST2 dataset. Therefore, we set; ( = 4 as the default value. These two default settings are used throughoutallreportedexperimentsinothersubsections. 92 1 2 3 4 Window Size 70 80 90 Performance Avg. STS SICK-E SST2 (a) 0 1 2 3 4 5 6 Starting Layer 70 80 90 Performance Avg. STS SICK-E SST2 (b) Figure 4.5: Performance comparison with respect to (a) window size< and (b) starting layer; ( , where the performance for the STS datset is the Pearson Correlation Coefficients (%) while the performancefortheSICK-EandtheSST2datasetsistestaccuracy. 4.3.3.5 InferenceSpeed We evaluate the inference speed against the STSB datasets. For fair comparison, the batch size is set to 1. All benchmarking methods are run on CPU and GPU8. Both results are reported. On the other hand, we report CPU results of SBERT-WK only. All results are given in Table 4.12. With CPU, the total inference time of SBERT-WK (QR) is 8.59 ms (overhead) plus 168.67ms (SBERT baseline). As compared with the baseline BERT model, the overhead is about 5%. SVD computationisslightlyslowerthanQRfactorization. Table 4.12: Inference time comparison of InferSent, BERT, XLNET, SBERT and SBERT-WK. Dataarecollectedfrom5trails. Model CPU(ms) GPU(ms) InferSent 53.07 15.23 BERT 86.89 15.27 XLNET 112.49 20.98 SBERT 168.67 32.19 OverheadofSBERT-WK(SVD) 10.60 - OverheadofSBERT-WK(QR) 8.59 - 8Inteli7-5930Kof3.50GHzandNvidiaGeForceGTXTITANXarechosentobetheCPUandtheGPU,respectively. 93 4.4 ConclusionandFuture Work A sentence embedding method based on semantic subspace analysis was proposed. The proposed S3Emethodhasthreebuildingmodules: semanticgroupconstruction,intra-groupdescriptionand inter-groupdescription. TheS3Emethodcanbeintegratedwithmanyotherexistingmodels. Itwas shown by experimental results that the proposed S3E method offers state-of-the-art performance amongnon-parameterizedmodels. S3Eisoutstandingforitseffectivenesswithlowcomputational complexity. With the development of deep contextualized models, we also proposed a way to generate high quality sentence embedding from deep contextualized models. In our work, we provided in-depthstudyoftheevolvingpatternofwordrepresentationsacrosslayersindeepcontextualized models. Furthermore, we proposed a novel sentence embedding model, called SBERT-WK, by dissectingdeepcontextualizedmodels,leveragingthediverseinformationlearnedindifferentlayers for effective sentence representations. SBERT-WK is efficient, and it demands no further training. Evaluation wasconductedonawiderangeoftaskstoshowtheeffectivenessofSBERT-WK. Based on this foundation, we may explore several new research topics in the future. Subspace analysis and geometric analysis are widely used in distributional semantics. Post-processing of the static word embedding spaces leads to furthermore improvements on downstream tasks [112, 166]. Deep contextualized models have achieved supreme performance in recent natural language processing tasks. It could be beneficial by incorporating subspace analysis in the deep contextualized models to regulate the training or fine-tuning process. This representation might yieldevenbetterresults. Anothertopicistounderstanddeepcontextualizedneuralmodelsthrough subspace analysis. Although deep contextualized models achieve significant improvements, we still do not understand why these models are so effective. 
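As a rough illustration of why the QR-based variant adds so little overhead in Table 4.12, the sketch below forms, for a single word, the matrix of its representations over a window of layers and uses a QR factorization to measure how much of the newest representation lies outside the span of the others. This mirrors the spirit of the novelty weight ablated in Table 4.11, but the full SBERT-WK weighting (alignment weights, token importance, cross-layer averaging) is not reproduced here; the matrix shape and values are made up for the example.

import numpy as np

def novelty_from_qr(window):
    # window: d x k matrix whose columns are one word's representations
    # from k consecutive layers; the last column is the newest layer.
    q, r = np.linalg.qr(window)            # thin QR: q is d x k, r is k x k
    # |r[-1, -1]| is the norm of the part of the last column that is
    # orthogonal to the span of the previous columns; normalize it by
    # the norm of that column to obtain a relative "novelty" score.
    return abs(r[-1, -1]) / np.linalg.norm(window[:, -1])

d, k = 768, 5                              # hidden size, window of 2w + 1 = 5 layers
rng = np.random.default_rng(0)
window = rng.standard_normal((d, k))
print("novelty weight:", novelty_from_qr(window))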
Existing work that attempts to explain BERT and the transformer architecture focuses on experimental evaluation. Theoretical analysis ofthesubspaceslearnedbydeepcontextualizedmodelscouldbethekeyinrevealingthemyth. 94 Chapter5 CommonsenseKnowledge Graph Embedding and Completion 5.1 IntroductionandRelated Work 5.1.1 Introduction Knowledgegraphs(KGs)arerepresentedastripletswhereentities(nodes)areconnectedbyrelation- ships(edges). Itisastructuredknowledgebasewithvariousapplicationssuchasrecommendation systems, question answering, and natural language understanding [49, 99, 169]. In practice, most KGs are far from complete. Therefore, predicting missing facts is one of the most fundamental problems in this field. A lot of embedding-based methods have been proposed and shown to be effective on the KG completion task [20, 21, 32, 50, 152, 154, 160]. However, relatively little work targets at commonsense knowledge graph (CKG) completion. There are unique challenges encountered by applying existing KG embedding methods to CKGs (e.g. ConceptNet [150] and ATOMIC[137]). First, many real-world CKGs are dynamic in nature, and entities with unseen text/names are introduced from time to time. We call these entities unseen entities because they are not involved intrainingandonlyappearintesting. Second,entityattributesofCKGsarecomposedoffree-formtextswhicharenotpresentinnon- attributed KG datasets. As shown in Figure 5.1, entity description has rich semantic meaning and commonsense knowledge can be largely inferred from their implicit semantic relations. However, 95 work out be healthy stretch get fit increase muscle mass physical exercise HasPrerequisite HasFirstSubevent MotivatedByGoal HasSubevent ? ? Unseen Entity Seen Entity Figure5.1: IllustrationofafragmentofConceptNetcommonsenseknowledgegraph: Linkpredic- tionforseenentities(transductive)andunseenentities(inductive). we notice that often entities refer to the same concept are stored as distinct ones, resulting in the graph to be larger and sparser. As shown in Table 1, the average in-degree of ConceptNet and ATOMIC is only 115 and 18 comparing with that of FB15K-237, a popular KG dataset. Since CKGs are highly sparse and can be disconnected, a portion of entities are isolated from the main graph structure. These entities are also unseen entities and how to obtain embeddings for these isolatedentitiesremainschallenging. Therefore, the inductive learning problem on commonsense knowledge graph completion is particularly important with practical necessities. An example for CKG completion is shown in Figure 5.1. Transductive setting targets on predicting missing links for seen entities. The predictions can be made from two perspectives: 1) entity attributes and 2) existing links for seen entities. In contrast, inductive setting works on purely unseen entities where only entity attributes canbeleveragedinthefirstplace. Differentfrompreviousinductivesettingonnon-attributedKGs [8,155],inductivelearningonCKGassumesunseenentitiesarepurelyisolatedanddoesnothave anyexistinglinks. Therefore,thisproblemisuniqueandremainsunexplored. 96 ManyexistingKGembeddingmodelsarefocusingonnon-attributedgraphswithentityembed- dings obtained during training [32, 152, 154] or using attribute embedding solely for initialization [103]. Therefore, all entities are required to present in training. Otherwise, the system will not havetheirembeddingsandhencetransductiveasoriginallyproposed. In this work, we first propose and define the inductive learning problem on CKG comple- tion. 
Then, an inductive learning framework called InductivE is introduced to address the above- mentioned challenges and several components are specially designed to enhance its inductive learningcapabilities. First,theinductivecapabilityofInductivEisguaranteedbydirectlybuilding representations from entity descriptions, not merely using entity textual representation as training initialization. Second, a novel graph encoder with densification is proposed to leverage structural information for unseen entities and entities with limited neighboring connections. Overall, In- ductivE follows an encoder-decoder framework to learn from both semantic representations and updatedgraphstructures. Themaincontributionscanbesummarizedasfollows. 1 1. A formal definition of inductive learning on CKGs is presented. We propose the first benchmarkforinductiveCKGcompletiontask,includingnewdatasplitsandtestingschema, tofacilitatefutureresearch. 2. InductivEisthefirstmodelthatisdedicatedtoinductivelearningoncommonsenseknowledge graphs. It leverages entity attributes based on transfer learning from word embedding and graphstructuresbasedonannovelgraphneuralnetworkwithdensification. 3. Comprehensive experiments are conducted on ConceptNet and ATOMIC datasets. The improvements are demonstrated in both transductive and inductive settings. InductivE per- forms especially well on inductive learning scenarios with an over 48% improvement on MRR comparingwithprevious methods. 1Ourcode isnowpubliclyavailable: here. 97 5.1.2 RelatedWork 5.1.2.1 KnowledgeGraphEmbedding Knowledgegraphcompletionbypredictingmissinglinkshasbeenintensivelyinvestigatedinrecent years. Most methods are embedding-based. TransE [32] models the relationship between entities with nice translation property (4 ¸4 A 4 C ) in the embedding space. ComplEx [160] and RotatE [152] represent embeddings in a complex space to model more complicated relation interactions. Instead of using a simple score function, ConvE [50] applied convolution to embedding so as to allow more interactions among triplet features. To exploit the structural information, GNNs are appliedtomulti-relationalgraphsasdoneinR-GCN[138]andSCAN[143]. Allabove-mentioned methodslearnembeddingbasedonafixedsetofentitiesandaretransductiveasoriginallyproposed. 5.1.2.2 InductiveLearningonGraphs Inductive learning is investigated in the last several years for both graphs and knowledge graphs. GraphSage[70]reliesonnodefeaturesandlearnsalocalaggregationfunctionthatisgeneralizable to newly observed subgraphs. Most KGs embedding models are focusing non-attributed graph so thatinductivelearningismostlyrelyingonlocalconnectionsforunseenentities[8]. [69]generates embeddingforunseennodesbyaggregatingtheinformationfromsurroundingknownnodes. [155] models relation prediction as a subgraph reasoning problem. However, because unseen entities in CKGshavenoexistinglink,structural-basedmethodscannotbeapplied[8,155]. Asanalternative, thereexistsworkthatincorporatesentitydescriptionintheembeddingprocessandcanbeinductive bynature. Forexample,[173]learnsajointembeddingspaceforconventionalentityembeddingand description-basedembedding. Ourworkexploitsbothstructureinformationandtextualdescription forinductivepurposeonCKGs. 98 5.1.2.3 LanguagemodelonCKGs Recently, researchers attempt to link commonsense knowledge with pre-trained language models [51, 122, 180]. COMET [33] is a generative model that transfers knowledge from pre-trained language models and generates new facts in CKGs. 
It could achieve performance close to human beings. However, COMET always introduces novel/unseen entities to an existing graph, leading to an even sparser graph. With unseen entities constantly introduced by generative models, an inductivelearningmethod,suchasInductivE,isimportantinpractice. 5.2 ProblemDefinition Definition1 Commonsense Knowledge Graph (CKG) is represented by =¹+'º, where+ isthesetofnodes/entities, isthesetofedgesand'isthesetofrelations. Edgesconsistoftriplets ¹ACº where head entity and tail entityC are connected by relationA: =f¹ACºj2+C2 +A2 'g, andeachnodecomeswithafree-textdescription. Definition2 CKG Completion Given a commonsense knowledge graph = ¹+'º, CKG completion is defined as the task of predicting missing triplets 0 =f¹ACºj¹ACº 8 g. It includes both transductive and inductive settings. Transductive CKG completion is defined as predictingmissingtriplets 00 =f¹ACºj¹ACº82+C2+A2 'g. Definition3 InductiveCKGcompletionisdefinedaspredictingmissingtriplets 000 =f¹ACºj¹ACº8 2+ 0 >AC2+ 0 A2 'g,where+ 0 \+ =; and+ 0 <;. 5.3 DatasetPreparation Three CKG datasets are used to evaluate the link prediction (i.e., CKG completion) task. Their statistics are shown in Table 5.1. CN-82K is our newly proposed split for better evaluation on link prediction task. Besides the standard split, we create an inductive split for CN-82K and ATOMIC 99 Table5.1: StatisticsofCKGdatasets. UnseenEntity%indicatesthepercentagesofunseenentities inalltestentities. Dataset Entities Relations Train Edges ValidEdges TestEdges Avg. In-Degree Unseen Entity% CN-100K 78,334 34 100,000 1,200 1,200 1.31 6.7% CN-82K 78,334 34 81,920 10,240 10,240 1.31 52.3% ATOMIC 304,388 9 610,536 87,700 87,701 2.58 37.6% ¢ Forcomparison,apopularKGdataset: FB-15K-237has14,541entities,310,115edgesand18.76Avg. In-Degree. ¢ In transductive settings, all entities are assumed visible during training including unseen entities. In contrast, un- seen entities canonlybeused duringtesting stageforinductivesettings. [0,3) [3,15) [15,35) [35,inf) triplet-degree range 0.0 0.2 0.4 0.6 percentage of triplets train val test (a) CN-100K [0,3) [3,15) [15,35) [35,inf) triplet-degree range 0.0 0.1 0.2 0.3 0.4 percentage of triplets train val test (b)CN-82K Figure 5.2: The triplet-degree distribution for the ConceptNet dataset, where the triplet-degree is the average of head and tail degrees. Triplets with high degrees are easier to predict in general. CN-100KsplitisclearlyunbalancedcomparingwithCN-82K. called CN-82K-Ind and ATOMIC-Ind to specifically evaluate model’s generalizability for unseen entities. 5.3.1 StandardSplit: CN-100K,CN-82K,ATOMIC CN-100K was first introduced by [95]. It contains Open Mind Common Sense (OMCS) entries in the ConceptNet 5 dataset [150]. “100K” indicates the number of samples in the training data. In theConceptNet5dataset,eachtripletisassociatedwithaconfidencescorethatindicatesthedegree of trust. In the original split of CN-100K, the most confident 1,200 triplets are selected for testing and the next 1,200 most confident triplets are used for validation. Entities have a text description withanaverageof2.9words. 100 CN-100K was originally proposed to separate true and false triplets. It is not ideal for link prediction for two reasons. First, its data split ratio is biased. For 100,000 training samples, CN-100K contains only 2,400 triplets (2.4%) for validation (1.2%) and testing (1.2%). With such limitedtestingandvalidationsets,weseealargevarianceinevaluation. 
Second,thetestingandthe validationsetsareselectedasthemostconfidentsampleswhicharetheleastchallengingones. As showninFigure5.2(a),weseeanunbalanceddistributionintriplet-degreeamongtrain,validation and test sets. In order to test the performance of the link prediction task, we create and release a newdatasplitcalledCN-82K. CN-82K is a uniformly sampled version of the CN-100K dataset. 80% of the triplets are used forthetraining,10%forvalidation,andtheremaining10%fortesting. AsshowninFigure5.2(b), CN-82Kismorebalancedamongtrain,validationandtestsetsw.r.tthetriplet-degreedistribution. ATOMIC contains everyday commonsense knowledge entities organized as if-then relations [137]. It contains over 300K entities in total and entities are composed of text descriptions with an average of 4.4 words. Here, we use the same data split as done in [103]. Different from the ATOMIC split in [33, 137], the data split is a random split with ratio 80%/10%/10% for train/valid/test. Therefore, there is still a reasonable percentage of entities are unseen during training. 5.3.2 InductiveSplit: CN-82K-Ind,ATOMIC-Ind Toevaluate amodel’sgeneralizability to unseenentities,we createnewvalidationand testsetsfor CN-82K and ATOMIC, while keeping the training set unchanged. The inductive validation and testing sets are subsets of the original validation and test sets, which contain only triplets with at leastoneunseenentities. Afterfiltering,forCN-82K-Ind,thevalidationandtestsetscontain5,715 and 5,655 triplets, respectively. For ATOMIC-Ind, the validation and test sets contain 24,355 and 24,486 triplets, respectively. We show by experiments that previous models fail in such scenarios (please refer to Table 5.3) and propose a new benchmark called InductivE for inductive CKG completion. 
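A minimal sketch of how such an inductive split can be carved out of an existing split is given below: the training triplets stay unchanged, and the validation/test sets are filtered down to triplets that touch at least one entity never seen in training. The triplet lists are assumed to be available as (head, relation, tail) tuples; file formats and the exact dataset handling are not taken from the text.

def make_inductive_split(train, valid, test):
    # train/valid/test: lists of (head, relation, tail) tuples.
    # Returns filtered valid/test sets containing only triplets with at
    # least one entity that does not appear anywhere in training.
    seen = {h for h, _, t in train} | {t for h, _, t in train}
    def keep(triples):
        return [(h, r, t) for h, r, t in triples
                if h not in seen or t not in seen]
    return keep(valid), keep(test)

# Toy example
train = [("work out", "MotivatedByGoal", "be healthy")]
valid = [("stretch", "HasSubevent", "be healthy"),     # "stretch" unseen -> kept
         ("work out", "HasSubevent", "be healthy")]    # all entities seen -> dropped
test  = [("get fit", "HasPrerequisite", "work out")]   # "get fit" unseen -> kept
valid_ind, test_ind = make_inductive_split(train, valid, test)
print(len(valid_ind), len(test_ind))                   # 1 1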
[Figure 5.3 appears here: a block diagram of InductivE, with an encoder (triplet graph G = (V, E), graph densifier, N stacked GR-GCN layers over the densified graph G = (V, E + simE), BERT and fastText embeddings, activation and dropout) feeding entity and relation embeddings into a Conv-TransE decoder (shuffle, convolution kernels, vectorization, linear projection, KvsAll loss).]
Figure 5.3: Illustration of the proposed InductivE model. The encoder contains multiple GR-GCN layers applied to the graph to aggregate each node's local information. The decoder uses an enhanced Conv-TransE model.

5.4 Proposed InductivE Model
In this section, we introduce InductivE, the first benchmark of inductive learning on CKGs, which includes a free-text encoder, a graph encoder with densification, and a convolutional decoder. The overall architecture is illustrated in Figure 5.3.

5.4.1 Free-Text Encoder
Our free-text encoder embeds text attributes with a pre-trained language model and word embeddings. For the pre-trained language model, BERT [51] is applied and further fine-tuned on entity textual attributes, which allows BERT to better model our domain-specific data. For word embeddings, the fastText model is used, and mean-pooling is applied when the text sequence contains more than one word [31, 168]. Finally, the two representations are concatenated as the final entity representation. For inductive purposes, the free-text encoder is viewed as a feature extractor, and entity embeddings are fixed during training.
BERT shows superior performance in many NLP tasks. However, for CKG datasets, the fastText embedding exhibits comparable performance.
We infer that fastText can be close to the BERT feature when handling short text sequences. More discussion is provided in Sec. 5.5.3.

5.4.2 Graph Encoder
In addition to the semantic representation, it is also desirable to investigate whether entity neighboring structures can be used to enhance the inductive ability. However, it is challenging to leverage structural information for unseen entities because, as isolated nodes, unseen entities do not have any existing links at the beginning. Therefore, we first densify the graph by adding similarity links with a novel graph densifier. Then, a gated-relational graph convolutional network is applied to learn from the graph structural information.

5.4.2.1 Graph Densifier
We observe that some entities often share the same ontological concept in CKGs. For example, "work out" and "physical exercise" in Figure 5.1 are similar concepts but are presented as two distinct entities. Similarity edges are added to densify the graph and provide more interactions between similar concepts.
Here, we propose a novel graph densifier to generate high-quality links among semantically similar entities. We first compute the pairwise cosine similarity using the output entity features of our graph encoder rather than the original entity features. Then, for each node i, we identify its k_i nearest neighbors and add directed similarity edges from them to node i to densify the graph. To balance the resulting node degree across the graph, the number of added synthetic edges is node-dependent: more synthetic edges are added to nodes with fewer connections. Specifically, the number of added edges k_i for node i is determined by

k_i = \begin{cases} 0, & \text{if } m \le \mathrm{degree}(i), \\ m - \mathrm{degree}(i), & \text{otherwise}, \end{cases}                    (5.1)

where m is a hyperparameter that determines the number of similarity edges added for each node, and degree(i) represents the degree of node i. The densified graph is updated periodically during training based on the above-mentioned scheme.
In the testing stage, to infer embeddings for unseen entities, we first obtain entity representations by applying our trained graph encoder to the CKG graph, in which the unseen entities are isolated; then, similarity edges are added according to these representations to densify the graph; finally, the densified graph is encoded again with our trained graph encoder to obtain the final entity embeddings, which serve as the input to the decoder.
The logic behind our densification design is as follows: first, we do not threshold raw features, because a lower threshold leads to noisy edges while a higher threshold provides little extra information, as the features already serve as the input; second, more edges are added to unseen entities to ensure that enough structural information is incorporated for inductive purposes. A minimal code sketch of this densification rule is given below.
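The following is a simplified sketch of the rule in Eq. (5.1): k_i = max(0, m - degree(i)) directed similarity edges are added to node i from its nearest neighbors under cosine similarity of the encoder output features. The feature matrix and edge list are toy placeholders, in-degree is used for concreteness where the text simply says degree, and details such as the periodic re-densification during training and the typing of the synthetic edges are omitted.

import numpy as np

def densify(features, edges, m):
    # features: (num_nodes, dim) encoder outputs; edges: list of (src, dst)
    # existing directed edges; m: target minimum degree from Eq. (5.1).
    n = features.shape[0]
    degree = np.zeros(n, dtype=int)
    for src, dst in edges:                       # in-degree of each node
        degree[dst] += 1
    # pairwise cosine similarity between all nodes
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity
    new_edges = []
    for i in range(n):
        k_i = max(0, m - degree[i])              # node-dependent number of edges
        if k_i == 0:
            continue
        neighbors = np.argsort(-sim[i])[:k_i]    # k_i most similar nodes
        new_edges += [(int(j), i) for j in neighbors]   # directed: neighbor -> i
    return new_edges

rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 8))
edges = [(0, 1), (2, 1), (3, 4)]
print(densify(feats, edges, m=2))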
5.4.2.2 Gated-Relational GCN
R-GCN [138] is effective in learning node representations over graphs with relational information between nodes (e.g., knowledge graphs). Our model is an extension of the R-GCN [138] model. First, in CKGs, the neighboring conditions can vary a lot from node to node. It is desirable to adaptively control the amount of information fused into the center node from its neighboring connections. Therefore, a gating function is added to R-GCN for this purpose, based on the interaction of the center and neighboring nodes. Second, to increase the efficiency of the R-GCN model, instead of using relation-type-specific transformation matrices, one unified transformation matrix W_1 is adopted for all neighboring relation types. As a result, our graph encoder is composed of multiple newly proposed gated-relational graph convolutional (GR-GCN) layers.
The first convolutional layer takes the output of the free-text encoder as the input: h_i^{(0)} = x_i. At each layer, the update message is a weighted sum of a transformation of the center node, u_c, and a transformation of its neighbors, u_n, in the form of

h_i^{(l+1)} = \sigma( u_c^{(l)} \odot \beta_i^{(l)} + u_n^{(l)} \odot (1 - \beta_i^{(l)}) ),                    (5.2)

where \beta_i denotes a gating function, \odot denotes element-wise multiplication, and \sigma(\cdot) is a nonlinear activation function. The two transformations in Eq. (5.2) are defined as

u_c^{(l)} = W_o^{(l)} h_i^{(l)},                    (5.3)
u_n^{(l)} = \sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{|N_i|} \alpha_r^{(l)} W_1^{(l)} h_j^{(l)},                    (5.4)

where N_i^r denotes the neighbors of node i with relation type r, R denotes all relation types, and \alpha_r^{(l)} is the relation weight at layer l. For all neighboring nodes, we use one unified transformation, W_1^{(l)}, which differs from the transformation of the self-loop message, denoted by W_o^{(l)}, to account for the relation gap between neighborhood information and the self-connection.
A gating mechanism controls the amount of the neighborhood message flowing into the center node. In this work, a learnable gating function is used, which takes both u_c^{(l)} and u_n^{(l)} as the input. It can be written as

\beta_i^{(l)} = \mathrm{sigmoid}( f([u_c^{(l)}; u_n^{(l)}]) ),                    (5.5)

where [\cdot ; \cdot] is the concatenation of the self-loop and neighboring messages and f is a linear transformation. Finally, the sigmoid function ensures 0 \le \beta_i^{(l)} \le 1.

5.4.3 Decoder: Conv-TransE
A convolutional decoder is effective in scoring triplets in KGs with high parameter efficiency [50]. Conv-TransE [143] is a simplified version of ConvE [50]. It removes the reshaping process before the convolution operation and uses 1D instead of 2D convolution. Yet, the 2D convolution increases the expressive power of the ConvE model, since it allows more interactions between embeddings, as discussed in [50]. InteractE was proposed recently to allow even more interactions [163].
As shown in Figure 5.3, we deploy an improved Conv-TransE model that adds a shuffling operation before convolution to enable more interactions across dimensions, inspired by InteractE [163]. Formally, let \phi_s represent the horizontal shuffling process; the score function in our decoder is defined as

score(e_h, e_r, e_t) = f( \mathrm{vec}( f( \phi_s([e_h; e_r]) * \omega ) ) W ) \, e_t,                    (5.6)

where e_h, e_t \in R^d are the head/tail embeddings and e_r \in R^d is the relation embedding. These entity embeddings come from the output of the graph encoder. The relation embedding is randomly initialized and jointly trained within the end-to-end framework. f denotes a nonlinear activation function. In a feed-forward pass, e_h and e_r are first stacked as a 2 x d matrix and shuffled horizontally. The result is used as the input to a 1D convolutional layer with filters \omega, where each filter has a 1D convolutional kernel of size 2 x n. The output is further reshaped into a vector and projected back to the original d dimensions using a linear transformation. Finally, an inner product with the tail embedding e_t is performed in the d-dimensional space to obtain the final triplet score.
To train the model parameters, we use the KvsAll training schema [50] by considering all entities simultaneously. Instead of scoring each triplet (h, r, t), we take each (h, r) pair and score it against all entities (positive or negative) as the tail. Some pairs can have more than one positive tail entity. Thus, it is a multi-label problem, for which we adopt the binary cross-entropy loss

L(p, t) = -\frac{1}{N} \sum_i ( t_i \log(p_i) + (1 - t_i) \log(1 - p_i) ),                    (5.7)

where t is the true label vector with dimension R^N. The probability of (h, r, t_i) being positive is obtained as p_i = sigmoid(score(e_h, e_r, e_i)).

Table 5.2: Comparison of CKG completion results on CN-100K, CN-82K and ATOMIC datasets. Improvement is computed by comparing with [103].
Model CN-100K CN-82K ATOMIC MRR Hits@3 Hits@10 MRR Hits@3 Hits@10 MRR Hits@3 Hits@10 DistMult 10.62 10.94 22.54 2.80 2.90 5.60 12.39 15.18 18.30 ComplEx 11.52 12.40 20.31 2.60 2.70 5.00 14.24 14.13 15.96 ConvE 20.88 22.91 34.02 8.01 8.67 13.13 10.07 10.29 13.37 RotatE 24.72 28.20 45.41 5.71 6.00 11.02 11.16 11.54 15.60 COMET 6.07 2.92 21.17 - - - 4.91 2.40 21.60 Malaviyaet al. 52.25 58.46 73.50 16.26 17.95 27.51 13.88 14.44 18.38 InductivE 57.35 64.50 78.00 20.35 22.65 33.86 14.21 14.82 20.57 Improvement 9.8% 10.3% 6.1% 25.2% 26.2% 23.1% 2.38% 2.63% 11.92% Table 5.3: Comparison of CKG completion results on unseen entities for CN-82K-Ind and ATOMIC-Ind. Model CN-82K-Ind ATOMIC-Ind MRR Hits@10 MRR Hits@10 ConvE 0.21 0.40 0.08 0.09 RotatE 0.32 0.50 0.10 0.12 Malaviya etal. 12.29 19.36 0.02 0.07 InductivE 18.15 29.37 2.51 5.45 5.5 ExperimentalResults and Analysis 5.5.1 ExperimentalSetup 5.5.1.1 Evaluationprotocol We use link prediction task with standard evaluation metrics including Mean Reciprocal Rank (MRR) and Hits@10, to evaluate CKG completion models. We report the average results in percentage with five runs. Following [50, 152], each triplet¹ACº is measured in two directions: ¹A?º and¹CA 1 ?º. InverserelationsA 1 areaddedasnewrelationtypesandthefilteredsetting isusedtofilteroutallvalidtripletsbeforeranking. 5.5.1.2 Hyper-parametersettings OurgraphencoderconsistsoftwoGR-GCNlayerswithhiddendimension500. Thehyperparameter < usedingraphdensifierissetto5forConceptNetand3forATOMIC.Thegraphisupdatedevery 107 100 epochs for ConceptNet and 500 epochs for ATOMIC. For convolutional decoder, we use 300 kernels of size {25}. For model training, we set the initial learning rate to 3e-4 and 1e-4 for ConceptNet and ATOMIC, respectively, and halve the learning rate when validation metric stops increasing for three times. The checkpoint with the highest Mean Reciprocal Rank (MRR) on validationsetisusedfortesting.2 TrainingGCNonlargegraphsdemandslargememoryspacetokeeptheentiregraphparameters. Here,forefficiencypurposes,weperformuniformrandomsamplingonnodesateachtrainingepoch withsamplingsize50kinallexperiments. 5.5.1.3 Baselines Wecomparewithseveralrepresentativemodels,includingDistMult[177],ComplEx[160],RotatE [152], ConvE [50], Malaviya [103] and COMET [33]. The first four models are competitive KG embedding models without using entity textual attributes. The last two models are focusing on CKGs and also take entity textual attributes into consideration. For non-inductive models, we allow the presence of unseen entities during training in order to perform evaluation. We use their respectiveofficialimplementationstoobtainthebaselineresultsandtuneseveralhyperparameters includingbatchsize,learningrateandembeddingdimensions. Finally,ourobtainedbaselineresults arecomparedwiththebestresultsreportedinexistingliteratureandahigheroneisreported. 5.5.2 ResultandAnalysis 5.5.2.1 Transductivelinkprediction Table 5.2 summarizes results on CN-100K, CN-82K and ATOMIC. For CN-100K, our model outperforms previous state-of-the-art by 9.8% on MRR and 6.1% on HITs@10. In contrast, the performance for CN-82K is much lower since CN-82K has more unseen entities in the testing as shown in Table 5.1. Over 50% of all entities in testing are unseen. This is very challenging for all existing methods. InductivE can learn high-quality embedding for all entities and outperforms 2Moredetailsarepresentedintheourreleased project page. 108 the previous best model by over 20% across all evaluation metrics on CN-82K. 
For ConceptNet datasets, without using the textual information, ConvE and RotatE perform better than ComplEx and DistMult. [103] outperforms ConvE by a large margin with BERT features as initialization. ThisindicatesthatsemanticinformationplaysanimportantroleforConceptNetentities. ForATOMIC,ComplExandInductivEprovidethebestperformanceamongallmethods. Text attribute feature is less effective than that for the ConceptNet dataset. We observe that the relation types from ATOMIC (e.g. xAttr, xIntent, oReact) are more complex and require more high-level reasoning. Thus, it is more difficult to infer directly from semantic embeddings. As compared withComplEx,InductivEisgoodatHITs@10score,whichmeansmoreentitiesarerankedhigher. This is attributed to synthetic edges that help entities with limited connections to get reasonable performanceasmoresyntheticconnectionsareadded. 5.5.2.2 Inductivelinkprediction Table5.3summariesresultsonCN-82K-IndandATOMIC-Ind. Fromthistable,weobservethat: • TheconventionalKGembeddingmodels,ConvEandRotatE,performbadlyinourproposed inductive settings. ConvE and RotatE learn entity embeddings via learning to score posi- tive/negative links between entities. After training, unseen entities remain to be randomly initialized, because no links over unseen entities can be observed during training. These randomembeddingsforunseenentitiesexplainthepoorperformanceofConvEandRotatE. • Comparingwith[103],InductivEachievesthebestperformanceonbothinductivedatasets. It hasanimprovementof48%onMRRwhencomparingwith[103]onCN-82K-Inddataset. As showninthetransductiveresult(Table5.2),textfeaturesforATOMICdatasetarelesseffective than that for ConceptNet dataset, thus results in much worse performance on ATOMIC-Ind thanonCN-82K-Ind. In [103] the model training adopts a two-branch training process: in the GCN branch, node features are randomly initialized; in the text encoder branch, entities embeddings are initialized 109 by BERT features then finetuned during training. For both branches, training does not change the embeddings for unseen entities, because no links over unseen entities can be observed during training. Therefore, at test time, embeddings for unseen entities are computed from random node features (in the GCN branch) plus raw BERT features (in the text encoder branch), which are differentfromthoseoftrainedseenentities,thusresultsinpoorperformance. In contrast, InductivE first ensures inductive learning ability by using fixed text embedding as theinputtoGR-GCNmodelandfurtherenhancestheneighboringstructureforunseenentitiesvia theproposedgraphdensifier. BothmodulescontributetothegoodperformanceofInductivE. To conclude, InductivE is the first benchmark in the inductive setting while providing compet- itiveperformanceinthetransductivesetting. Table5.4: AblationstudyonCN-82K-Inddataset. Model CN-82K-Ind MRR Hits@10 InductivE 18.15 29.37 replaceGR-GCNwith MLP -1.70 -2.60 removegraphdensifier -3.03 -4.60 removegatinginGR-GCN -3.79 -5.37 BERTfeature only 16.57 27.68 fastTextfeature only 16.14 26.87 5.5.3 AblationStudyandAnalysis 5.5.3.1 Ablationstudy To better understand the contribution of different modules in InductivE to the performance, we presentablationstudiesinTable5.4. • Feature Analysis: The performance gap between BERT feature and fastText feature is not huge. 
It is because the text attributes for CKG datasets are short phrases, with an average of 110 2.9wordsforConceptNetand4.4wordsforATOMIC.BERTismorepowerfulforsentence- level sequences and that is why we use a concatenation of features to have multi-view entity representations. • ReplaceGR-GCNwithMLP:TheMRRscoredropsfrom18.15to16.45. Thisindicatesthat learning from the graph structural information can further boost the performance, although ourfree-textencoderprovidesa strongbaselineandmostessentialforinductivelearning. • Remove graph densifier: This means the model only learns from the original triplet graph structure. The MRR score drops from 18.15 to 15.12. This indicates our graph densifier is helpful for inductive learning, since that the added synthetic edges are particularly helpful for unseen entities since they do not have existing neighboring structure to learn from at the beginning. • RemovegatinginGR-GCN:Theperformancedropsfrom18.15to14.36. Thisdemonstrates the importance of the gating function, which adaptively controls the amount of neighboring information flowing into the center node. Neighboring nodes in CKGs are diverse and connected to the center node with different relations. Without the gating function, different informationsourcesareinjecteddirectlytothecenternodewhichcancauseconfusiontothe model. 5.5.3.2 Analysisongraphdensifier Our iterative graph densifier is a special design to provide more structural information for unseen entitieswhichdoesnotcontainanyexistinglinksatthebeginning. Inthemeantime,thereareother alternatives[103]tobuildsyntacticlinkssuchasdirectlybuildingsyntacticlinksfromrawfeatures. Even though constructing synthetic links directly from raw features is easier to implement, it has twodisadvantageswhentargetingonourinductivelearningframework: 1)Alowthresholdleadsto noisy synthetic links and a high threshold provide little extra information as the feature has served 111 as the input. 2) Synthetic links can be unbalanced between entities and some unseen entities will stillhavenoconnectionevenafterdensification. Toverifyourgraphdensifierissuperiorthanitsalternatives,wecompareitwithtwoalternatives: • Raw feature with global thresholding (GS)[103]: The BERT feature is taken and compute pairwise cosine similarity between entities. A global threshold (0.95) is set for the whole graph and similarity links are added between entities if their cosine similarity is above the threshold. • Raw feature with fixed neighboring (FN): To make sure that all entities get a reasonable amountofneighboringinformation,weaddsimilaritylinkstoentitiesuntiltheyhaveatleast 5neighboringentities. ThecandidatesareselectedbyrankingcosinesimilarityoftheBERT feature. Table5.5: Analysisonourgraphdensifier Model CN-82K-Ind MRR Hits@10 Ourgraphdensifier 18.15 29.37 Raw featurewithGS 13.25 22.98 Raw featurewithFN 10.12 17.04 The results are shown in Table 5.5. Our graph densifier achieves the best performance. In comparison, Raw feature with GS is not providing satisfactory result because the raw feature can be noisy and imprecise links can be added. Even though some high-quality similarity links are constructed, it does not provide much extra information. Raw feature with FN is the worst choice becauseaddingfixednumberoflinksdirectlyfromrawfeaturewillallowmorenoisylinksinjected tothegraphstructurewhichleadsmoreconfusioningeneral. 
Sinceourgraphdensifierincorporates an iterative approach, the features we use are generally more fine-tuned with current datasets and keepimprovingduringtrainingandthusprovidingsyntheticlinkswithahigherquality. 112 Table5.6: Top-3nearestneighborsofunseenentities Unseenentity Ours Malaviyaetal. pay forsubway paysubwayfare payfordrink paytraffic ticket payonbillet payfare payforpackage performexperiment tryexperiment doexperiment doexperiment conductexperiment test hishypothesis runexperiment itemfillwithair inflatethingwithair opencontainer balloonflybecausethey wledlike item balloongoupbecause they microscopic thing wait foryouairplane getprepareto wait waitforwhile runshort offly waitforwindyday transportpassenger waitforblue bird 5.5.3.3 Casestudyongraphdensifier Byexaminingourgraphdensifiercarefully,wenoticesomeinterestingphenomenaassociatedwith synthetic similarity links. In Table 5.6, we list top-3 nearest neighbors of unseen entities. As expected, somesimilarity relations canbe discoveredwith ourgraph densifier. Forexample, “pay for subway” shares almost the same semantic meaning with “pay subway fare”. Therefore, when computingembeddingfor“payforsubway”,theembeddingfor“paysubwayfare”canbeareliable reference. Unexpectedly, other complex relation types can also be discovered using our densifier. Entitiesmarkedas boldinTable5.6indicates thecandidateentityhas morecomplexrelationwith the unseen entity. For example, “perform experiment” is the “Goal” to “test his hypothesis”. The “Reason” for “wait for you airplane” is “run short of fly”. With rich local connectivity, unseen entities can perform multi-hop reasoning over graphs and obtain high quality embeddings. Comparingwithpreviousmethod[103],ourproposedgraphdensifiercanconstructhigherquality synthetic edges. This could be especially helpful for unseen entities whose neighboring structures arebuiltusingourgraphdensifier. 113 5.6 Conclusion In this work, we propose to study the inductive learning problem on CKG completion, where unseen entities are involved in link prediction. To better evaluate CKG completion task in both transductiveandinductivesettings,wereleaseonenewConceptNetdatasetandtwoinductivedata splits for future research and development purposes. Dedicate to this task, a new embedding- based framework InductivE is proposed as the first benchmarking on inductive CKG completion. InductivEleveragesentityattributeswithtransferlearningandconsidersstructuralinformationwith GNNs. ExperimentsonbothtransductiveandinductivesettingsshowthatInductivEoutperformed thestate-of-the-artmethodconsiderably. InductivelearningonCKGcompletionisstillatitsinfancy. Therearemanypromisingdirections for future work. For example, large pre-trained language models (LMs) have shown effective in capturing implicit commonsense knowledge from large corpus. How to effectively merge the knowledge in large pre-trained LMs with structured CKG could be an interesting direction to explore. Another direction is to explore inductive learning on unseen/new relations that are truly usefulinrealCKGexpansiontask. 114 Chapter6 Domain-SpecificWordEmbedding from Pre-trained Language Models 6.1 Introduction Word embedding research has been a hot topic in the natural language processing (NLP) field and hasbecomeakeycomponentinalotofNLPapplications. Alotmodelsareproposedandbecomes quite popular in the past decade including word2vec [109], GloVe [119] and fastText [30]. Most existingmodelsshowgreatperformancewhentrainingonalargeunlabeledcorpuswithbillionsof words. 
These types of models are called generic models because they are pre-trained with generic corpusandusuallyappliedtodifferent domainswithoutfurthermodification. Despite generic word embeddings are widely adopted in different kinds of NLP applications, it is still sub-optimal for some applications where some semantic information is domain-specific andcannotbecapturedwellwithgeneraldomaincorporasuchasWikipediaandBookCorpusdata [185]. First, the vocabulary set may vary a lot across different domains and a word may refer to different concepts in different domains. For example, calculus refers to a stone that forms in the bodywhilereferstoanabstracttheoryinthemathfield. Therefore,itbecomesnecessarytoobtain domain-specific word embeddings. Previously, a lot of efforts are made to obtain high-quality domain-specificwordembeddings[131,133,174]. However,consideringthatthedomain-specific 115 corpus is usually much smaller than the general corpus, these methods usually fail to provide satisfactoryresultconsideringtheextraeffortpreparingit. Recently, pre-trained language models have attracted lots of attention and record a new set of the state-of-the-art in many tasks. Even though the general pre-trained language model (e.g. BERT [52]) has shown impressive gains in many tasks, the success mainly focuses on general domain corpora. It is verified that given domains with abundant unlabeled text, domain-specific languagemodelusuallyoutperformgeneraldomainlanguagemodelandaseriesofdomain-specific language models are proposed in the past two years such as SciBERT [26], Sentence-BERT [130] andBioBERT[91]. Incontrast,thedevelopmentofdomain-specificwordembeddingsisrelatively slowandnoobviousbreakthroughiswitnessed. One may question the necessity to design domain-specific word embedding given that great accesshasbeenwitnessedforlanguagemodelsindifferentdomaintasks. However,therearestilla lotofdrawbackswhenbringinglanguagemodelsintodifferentapplications. First,alanguagemodel is usually composed of complex neural architecture (e.g. 12 layer transformer architecture) which requires high computation and memory cost. In contrast, word embedding models are extremely efficientincomputationandmemoryaslongasthevocabularysizeiscontrolled. Second,language models are contextualized models where representation is obtained based on its context. Even though this is usually treated as the benefit of language models, it is hard to handle cases that the input does not match the training data format. For example, when feeding extreme short sentences (e.g. 1-2 words), the performance of the language model downgrades a lot. Therefore, word embedding is still necessary for a lot of NLP applications and how to obtain high-quality domain-specificwordembeddingremainsanunsolvedproblem. Given the success of the domain-specific language model, in this paper, we investigate the possibility to obtain domain-specific word embedding (DomainWE) models by transferring the knowledgelearnedfrompre-trainedlanguagemodels. Ourmaincontributionsaresummarizedasfollows: 116 • We propose a simple yet effective method to obtain word embedding from domain-specific language models called DomainWE. Our method leverages the success of domain-specific languagemodelsandprovideshigh-qualitydomain-specificwordembedding. • DomainWE is not limited to certain domains but a general framework applied to multiple domains. In this chapter, we are focusing on three domains including semantic relatedness, scientificarticle,andsentimenttext. 
• Extensive experiments are conducted in all three selected domains. Clear improvements are observed compared with existing word embedding models. We show by experiments that domain-specific word embedding is extremely efficient and robust to word drop compared with large language models.

6.2 Domain-Specific Word Embedding from Pre-trained Language Models
In this section, we introduce our method to obtain domain-specific word embeddings. Two major resources are needed for extracting word embeddings: 1) a reasonably sized corpus and 2) a pre-trained language model.

6.2.1 DomainWE
To obtain the word embedding for a given word w, we define two steps: word-corpus collection and representation aggregation.
Word-Corpus Collection: Assume we have a corpus C with m different sentences: C = (s_1, s_2, s_3, ..., s_m). For a word w_i, we collect a subset of the whole corpus, C_i, which consists of all sentences that contain the word w_i: C_i = (s_{w_i}^1, s_{w_i}^2, s_{w_i}^3, ..., s_{w_i}^{|w_i|}). Here, |w_i| is the number of times w_i appears in the corpus.
Representation Aggregation: After collecting the corpus for w_i, we feed the sentences into the language model. Assume the language model is represented by f, which takes word w_i and its corresponding context sentence s_{w_i}^j as the input and outputs the contextualized representation of the word:

v_{w_i}^j = f(w_i, s_{w_i}^j).

As a consequence, we obtain different representations of word w_i, namely v_{w_i}^1, v_{w_i}^2, ..., v_{w_i}^{|w_i|}, given the different input sequences s_{w_i}^1, s_{w_i}^2, ..., s_{w_i}^{|w_i|}. To unify the different word representations, we simply use their average as word w_i's static vector:

v_{w_i} = \mathrm{AVG}(v_{w_i}^1, v_{w_i}^2, ..., v_{w_i}^{|w_i|}).

We follow the same process for each word in the vocabulary to obtain the whole set of word embeddings. To avoid the vocabulary explosion problem and to limit the size of the resulting word embedding model, we simply use subword units as our vocabulary, as defined in many pre-trained language models [172]. Since we use a pre-defined subword vocabulary, there is a special case in which a word in the vocabulary does not appear in the corpus, i.e., |w_i| = 0. In this case, we simply feed the single word as the input sequence to obtain its representation.

6.2.2 Domain-Specific Corpus
To obtain domain-specific word embeddings, the corpus used to extract the embeddings is important. In this part, we discuss three different scenarios that we have investigated.

6.2.2.1 Task-Specific Corpus
When evaluating on a domain dataset D, we can extract the domain-specific word embedding using D's corpus. Note that we only use the corpus and do not use the labels/supervision from the dataset. The advantage of using a task-specific corpus is obvious: there is no gap between the corpus used to extract the domain-specific word embedding and the application of interest. However, in practice, the corpus for evaluation may not be available during the embedding extraction period. That leads to our second corpus selection: the task-related corpus.

6.2.2.2 Task-Related Corpus
When the evaluation dataset D is not available during the word embedding extraction period, one intuitive solution is to find a related corpus in the same domain from which to extract word embeddings. Our intuition is that when the corpus used for word embedding extraction is close to the task of interest, good domain-specific word embeddings can still be inferred.

6.2.2.3 Task-Unrelated Corpus
There is an extreme case in which a corpus from a similar domain is unavailable. We try to obtain the word embedding using an unrelated corpus.
Because the corpus used for extracting word embedding is notrelatedtothedomainofinterest,wecouldexpecttheperformanceisnotasgoodastheprevious two choices. However, considering that the language model we used is still domain-specific, there isstillachangetogetwordembeddingthatissuitableforparticulardomains. 6.2.3 Domain-SpecificLanguageModels There are many domain-specific language models proposed in the past two years. In general, they canbedividedintothreemajorcategories: • Train language model from scratch using domain-specific data. This is only possible for domainswithabundantcorpus data. • Fine-tunelanguagemodelwhichispre-trainedonagenerallargecorpuswiththesameself- supervised/unsupervised objective. Comparing with training from scratch, the requirement forthedomain-specificcorpusislow. • Fine-tunepre-trainedlanguagemodelwithasupervisedobjective. Itusuallyrequiresalarge amountoflabeleddatainordertobegeneralizabletorelatedtasks. 119 In this chapter, we select three different domain-specific language models to evaluate the effectiveness of DomainWE model. Without loss of generality, we use the base model in all experiments. BERTweet: BERTweet [113] is training from a scratch model. It is training with RoBERTa [100]objective(MaskedLanguageModel)andEnglishTweetscorpuswhichcontainsaround16B tokens. Here, we use BERTweet for the sentiment analysis domain. Therefore, we select the BERTweet model released by [22] which is trained on 58M tokens and further fine-tuned for the twittersentimentclassificationtask. Sentence-BERT: Sentence-BERT [130] is designed for sentence relatedness tasks. It proposes a Siamesenetworktoencodesentencesandtrainingwithanaturallanguageinference(NLI)dataset. NLI dataset is a three-way classification dataset where the model is asked to predict one of three relations given two sentences. Therefore, the Sentence-BERT is particularly suitable to measure therelatednessbetweensentences. SciBERT: Similar to BERTweet, SciBERT [26] is training from scratch model and use a similar objective with BERT. The corpus is composed of 1.14M scientific publications which are around 3.17Btokens. WeusetheoriginalSciBERTmodelwhichisnotfurtherfine-tunedwithanytasks. 6.3 Experiments In this section, we first introduce the baseline models we are comparing with. Second, we discuss the evaluation datasets used for each domain. Then, the results for each domain are presented and discussed in detail. Further, we conduct a thorough analysis of our DomainWE model about its strengthandweakness. 120 6.3.1 Baselines • For word embedding, we take the pre-trained GloVe [119] 840B model which is the largest GloVe word embedding model released. It is a general purpose word embedding model and widelyadoptedinmanyapplications. • BERT model [52], which is trained on general corpus and showed supreme performance on manytasks. Here,weuseBERTmodelasthecontextualizedwordembeddingextractor. • Domain-specific language model, which is used to extract our DomainWE. They have supremeperformanceondomain-specifictasks. 6.3.2 EvaluationDatasets We perform classification to evaluate the quality of word embeddings. For all models including wordembeddingandlanguagemodels,weusetheaverageastherepresentationforasentence. For the classifier, we adopt a linear classifier for sentence relatedness domain datasets while using a two-layermultiplayerperceptron(MLP)with50hiddendimensionsforallotherdatasets. 
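To make the extraction procedure of Sec. 6.2.1 concrete before turning to the evaluation datasets, the following is a minimal sketch: for a target word, collect the sentences that contain it, run each sentence through a domain-specific language model, and average the contextualized vectors of that word. The Hugging Face transformers API is used only for illustration, the model name is a placeholder (a domain-specific model would be substituted in practice), and for simplicity the target word is assumed to map to a single wordpiece; handling multi-piece words and the |w_i| = 0 fallback described in the text is omitted.

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-uncased"          # placeholder; a domain-specific LM in practice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def domain_we(word, corpus):
    # Average the contextualized vectors of `word` over all sentences
    # in `corpus` that contain it (Sec. 6.2.1).
    word_id = tokenizer.convert_tokens_to_ids(word)   # assumes a single wordpiece
    vectors = []
    with torch.no_grad():
        for sent in corpus:
            if word not in sent.split():
                continue
            enc = tokenizer(sent, return_tensors="pt")
            hidden = model(**enc).last_hidden_state[0]          # (seq_len, dim)
            positions = (enc["input_ids"][0] == word_id).nonzero(as_tuple=True)[0]
            for pos in positions:
                vectors.append(hidden[pos])
    return torch.stack(vectors).mean(dim=0) if vectors else None

corpus = ["the movie was surprisingly good", "a good story told badly"]
vec = domain_we("good", corpus)
print(vec.shape)                                                # torch.Size([768])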
6.3.2.1 SentenceRelatednessDomain STS-Benchmark [37]: it is a popular dataset to evaluate supervised sentence textual relatedness systems. Itcontains8,628sentencesfromthreecategories(captions,newsandforums)anddivided intotrain(5749),dev(1500)andtext(1379). SICK-Relatedness[105]: Ittriedtopredicta5-levelclassificationwhichindicatestherelatedness givetwoinputsentences. SICK-Entailment [105]: A three-way classification to indicate whether the given sentence pairs haveneutral,entailment,orcontradictionrelation. 6.3.2.2 SentimentAnalysisDomain MR[118]: Binarysentimentpredictiononmoviereviews. 121 SST2[148]: StanfordSentimentTreebankwithbinarysentimentlabels. SST5[148]: StanfordSentimentTreebankwithfine-grained(5-level)sentimentlabels. CR[76]: Binarysentimentpredictiononcustomerproductreviews. 6.3.2.3 ScientificArticleDomain Paper Field [26]: It is built from the Microsoft Academic graph [147] and aims to classify paper titleintooneofsevenresearchfields. SciCite [44]: It contains sentences that citing other papers and the goal is to predict the intent for citationwhichincludes‘background’,‘method’and‘result’. 6.3.3 ResultsandDiscussion Inthissection,wewilldiscussourmodelperformanceinthreedifferentdomains. Foreachdomain, the domain-specific language model is selected as the encoder and the task-specific corpus is used forextractingdomain-specificwordembeddingsforalldatasets. 6.3.3.1 SentenceRelatednessDomain Table 6.1: Experimental results on Sentence Relatedness Domain. Both Pearson correlation and Spearman’scorrelationarereported. (Pearson/Spearman) Type Model STS-Benchmark SICK-Relatedness SICK-Entailment LanguageModel BERT-base-uncased 65.30 / 64.32 80.55 / 73.50 78.25 LanguageModel Sentence-BERT 73.05 / 73.63 84.24 / 79.27 82.77 WordEmbedding GloVe 64.79 / 62.96 79.92 / 71.83 79.18 WordEmbedding BERTInputEmbedding 65.40 / 64.24 78.46 / 72.18 77.51 WordEmbedding DomainWE(ours) 75.87 / 75.49 84.78 / 77.90 82.73 Table 6.1 shows the result on sentence relatedness tasks. As expected, the language model shows the best performance across all models. Comparing the word embedding models, our domain-specific word embedding can outperform the GloVe model with a clear margin in all three datasets. On SICK-R and SICK-E tasks, our DomainWE is on-pair with the state-of-the-art 122 language model. Surprisingly, DomainWE performs especially well on the STS-B dataset which evenoutperformsthelanguagemodel thatisbeenusedforextraction. 6.3.3.2 SentimentAnalysisDomain Table6.2: ExperimentalresultsonSentimentAnalysis. Accuracyarereported. Type Model MR SST-2 SST-5 CR LanguageModel BERT-base-uncased 81.67 86.66 46.92 87.02 LanguageModel BERTweet-Sentiment 84.27 91.65 54.43 91.02 WordEmbedding GloVe 77.39 81.44 46.43 79.89 WordEmbedding BERTInput Embedding 75.04 80.72 42.31 79.02 WordEmbedding DomainWE (ours) 82.26 85.89 50.36 87.79 ForthesentimentclassificationdomainasshowninTable6.2,theBERTweet-Sentimentmodel performs the best among all word embedding and general language models. It is obvious that our DomainWE can surpass generic word embedding GloVe with a clear margin. Even comparing with generic language models, DomainWE can achieve comparable performance. Considering the complexity of the models, DomainWE is definitely a good choice for scenarios with limited resources. 6.3.3.3 ScientificArticleDomain Table6.3: ExperimentalresultsonScientificTextClassification. Accuracyarereported. 
Type Model Paper Field SciCite LanguageModel BERT-base-uncased 70.83 82.32 LanguageModel SciBERT 71.14 84.09 WordEmbedding GloVe 70.40 77.32 WordEmbedding BERTInputEmbedding 67.91 74.91 WordEmbedding DomainWE(ours) 70.86 81.41 Table 6.3 shows the result on scientific sentence classification tasks. For Paper Field task, the languagemodeldoesnotoutperformtheGloVewordembeddingalot. DomainWEmodelison-pair with both the language model and GloVe model. For SciCite dataset where the domain-specific 123 language model shows the better result and also leads to a better DomainWE word embedding model. 6.3.4 ModelAnalysis 6.3.4.1 CorpusAnalysis InfluenceofCorpus Table6.4: Modelperformanceof DomainWEwithdifferentcorpus. Accuracyisreported. Corpus MR SST-2 SST-5 CR GloVe 77.39 81.44 46.43 79.89 DomainWE(MR) 82.26 86.88 49.1 76.19 DomainWE(SST-2) 80.41 85.89 50.54 76.82 DomainWE(SST-5) 81.92 87.53 50.36 76.08 DomainWE(CR) 67.71 73.42 37.83 87.79 Besidesthedomain-specificlanguagemodel,thequalityofDomainWEalsoreliesonthecorpus to extract embeddings. For example, only when the corpus contains enough words of our interest, DomainWE can predict the desired meanings of a particular word. For example, when targeting on sentiment classification tasks, we expect the corpus to contain enough sentimental words. It is always desired to use the task-specific corpus to extract domain-specific word embeddings. However, these corpora may not be easily accessible during the word embedding generation stage of our model. Therefore, we test the possibility to use a related corpus to extract domain-specific wordembeddings(DomainWE). Table6.4showstheperformanceforDomainWEmodelwithdifferentcorporaonfoursentiment classification tasks. We can see that the first three corpora are highly transferable with each other. Two things guarantee the transferable domain-specific word embedding: 1) the corpus used for extraction should be related to the task corpus; 2) the corpus used for extraction should contain enough sentence samples in order to have more comprehensive coverage. Therefore, since CR dataset is relatively small dataset with just a few thousand sentence, it does not transfer well to othertaskslikeMRandSST. 124 Maximumnumberofsentencesamplesforeachword Table6.5: Maximumnumberofsentencesamplesforeachword. DomainWE(MR) MaximumSent # 1 2 3 5 10 25 125 625 2000 MR 73.67 76.39 78.07 79.2 80.49 81.68 81.84 82.58 82.26 SST-2 79.02 81.49 84.18 84.35 86.22 86.93 86.88 86.88 86.88 SST-5 40.68 43.48 44.93 46.52 47.65 49.55 49.28 51.22 49.1 10 0 10 1 10 2 10 3 10 4 Maximum Samples per Word 72 74 76 78 80 82 84 86 88 Classification Accuracy MR SST-2 Figure6.1: PerformanceofDomainWEwhenchangingthemaximumnumberofsamplesforeach word. Given a corpus, the number of times a word appears may vary a lot. In this ablation study, we want to verify how many samples are needed for each word in order to have a good prediction. Table 6.5 and Figure 6.1 shows the accuracy when changing the maximum number of sample for each word. We can tell that the performance increase with the number of samples provided for each word. The increase in accuracy gets slow when providing enough samples. In general, more samplesusuallyleadtobetterperformance. 6.3.4.2 RobustnessAnalysis 6.3.4.3 ModelComplexity EventhoughlanguagemodelsshowgreatperformanceinmanyNLPtasks,thedeploymentofthese largemodelsareverychallenginginpractice. 
6.3.4.3 Model Complexity

Even though language models show great performance on many NLP tasks, deploying these large models is very challenging in practice. In contrast, DomainWE is a very competitive method because it has the same inference complexity as generic word embedding models while providing much better performance. In this section, we discuss the model size and inference speed of the language model and the word embedding models.

Table 6.6: Model size and inference speed for different models.

Metric            BERT-base-uncased          GloVe                      DomainWE
Model size        498 MB                     5.6 GB                     97 MB
Inference speed   127 samples/sec (GPU)      7,523 samples/sec (CPU)    7,011 samples/sec (CPU)

Table 6.6 shows the comparison. In terms of model size, DomainWE is the smallest and occupies only 97 MB of memory. GloVe requires 5.6 GB to store the model because its vocabulary has to be very large in order to achieve good coverage. In contrast, DomainWE uses subwords as the vocabulary unit, which limits the vocabulary size to around 30K entries; a word is split into subwords in order to look up its corresponding embeddings.

In terms of inference speed, word embedding models are much faster because they do not involve the matrix multiplications required by language models; a word embedding model uses a simple indexing operation to retrieve the representation of each word. As we can see, even with a GPU-enabled language model, inference is still at least 50 times slower than with the word embedding models.

6.4 Conclusion

Pre-trained language models are ubiquitous in recent NLP systems, and domain-specific language models have been proposed to tackle tasks in specific domains. However, as model sizes grow larger and larger, language models are hard to deploy in resource-limited environments. In contrast, generic word embeddings are faster at inference time but offer less satisfactory performance. We proposed the DomainWE model, which is lightweight thanks to its subword vocabulary and provides competitive performance by extracting word embeddings tailored to each domain. Experimental results also verify the transferability of our model within domains and its robustness when the input sequence is corrupted.

Chapter 7

Conclusion and Future Work

7.1 Summary of the Research

In this dissertation, we focus on semantic-enhanced representation problems in language models: word-level representation learning and sentence-level representation learning.

In word representation learning, we first identify two weaknesses of current word embedding models: 1) word embedding models usually learn a large mean vector and are not isotropically distributed in the high-dimensional vector space; and 2) current context-window-based word embedding models do not learn sequential / order information. To tackle these two problems, two enhancement methods are proposed: post-processing via variance normalization (PVN) and post-processing via dynamic embedding (PDE). By regularizing the embedding space and injecting sequential information into current embedding models, better word embeddings are obtained.
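To recall the core of the variance-normalization idea, the sketch below removes the mean of a set of word vectors and rescales the variances of the leading principal components so that no single direction dominates the embedding space. The number of normalized components and the use of scikit-learn's PCA are assumptions for this example rather than the exact settings of our method.

```python
import numpy as np
from sklearn.decomposition import PCA

def variance_normalize(embeddings, num_components=10):
    """Post-process word vectors by normalizing dominant principal directions.

    embeddings: (vocab_size, dim) array of word vectors.
    Returns vectors whose mean is removed and whose top principal
    components are rescaled to a common standard deviation.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)

    pca = PCA(n_components=num_components)
    coords = pca.fit_transform(centered)          # (vocab_size, num_components)
    stds = coords.std(axis=0)
    target = stds[-1]                             # match the smallest of the top components

    # Shrink each dominant direction so its spread equals the target spread.
    correction = coords * (target / stds - 1.0)
    return centered + correction @ pca.components_

# Usage: post-process a pre-trained matrix, e.g. normalized = variance_normalize(word_matrix).
```

PDE, in contrast, targets the second weakness by injecting order information from the context and is not captured by this sketch.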
Meanwhile, the evaluation methods for word embeddings had not been thoroughly studied, and the discrepancy between intrinsic and extrinsic evaluators makes it hard for follow-up work to evaluate word embedding models. We also conduct experiments and detailed analysis to provide insights into the correlation and relationship between different evaluation methods. Besides generic word embedding models, we introduced a domain-specific word embedding method called DomainWE, which inherits the good performance of pre-trained language models while keeping the efficiency of word embedding models.

In sentence representation learning, non-parameterized models are efficient and easy to interpret compared with parameterized ones. Based on existing word embedding models, we therefore proposed a new framework for finding sentence representations through semantic subspace analysis. By constructing a semantic space over word embeddings, the words of a sentence can be split into several semantic clusters, and the sentence representation is obtained by modeling the interaction between these semantic clusters. Our model achieves state-of-the-art performance compared with non-parameterized methods. Deep contextualized word representation models have achieved supreme performance on a wide range of tasks, but how to obtain high-quality sentence embeddings from contextualized word representations is still an underdeveloped problem. By designing experiments to study the evolving pattern of word representations in deep contextualized models, we found that the information learned by each layer is diverse, and that fusing the information from different layers makes it possible to obtain high-quality sentence embeddings. We proposed a new method for sentence embedding by dissecting deep contextualized models, and our method outperforms state-of-the-art models by a clear margin.

We also investigate the possibility of leveraging semantic representations of entity text attributes to tackle the inductive learning problem on commonsense knowledge graphs (CKGs). With the help of semantic representations, the inductive learning ability on CKGs is guaranteed.

7.2 Future Research Directions

Neural language models have achieved great success in the past several years. By constructing self-supervised training objectives, deep language models are able to extract deep semantic representations at both the word and sentence level.

In the future, we would like to consider two challenges in current deep language models and bring up the following research problems:

• Even though powerful linguistic knowledge is learned by deep language models, the qualitative properties learned by each representation are still unclear. Current deep language models are unstructured, which makes them extremely hard to analyze. We would like to investigate the deeper reasons why they work well and how information is represented and evolves through layer-wise change.

Interpretation of Deep Language Models

The most successful deep language model is BERT, which incorporates the transformer architecture. Compared with CNNs or LSTMs, the transformer architecture has shown supreme performance on language modeling tasks. Even though a lot of work has been proposed to reveal the dark secrets of BERT and to understand the functionality of each layer, it remains a black box.

[Figure 7.1: Computational flowchart used in BERT. (a) BERT; (b) multi-head attention.]

The computational flowchart for BERT is shown in Figure 7.1. The transformer architecture is mainly composed of multi-head self-attention layers and feed-forward layers. In the self-attention layer, the weights for each token are computed from its own representation. The attention can be modeled as

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V, \qquad (7.1)

where Q, K and V are identical and are given by the embeddings of the tokens.
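Equation (7.1) can be written out directly in a few lines of NumPy. The sketch below computes single-head self-attention for a sentence whose token embeddings serve as Q, K and V, matching the description above; the multi-head structure and the learned projection matrices of BERT are omitted for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(token_embeddings):
    """Single-head self-attention as in Eq. (7.1), with Q = K = V."""
    Q = K = V = token_embeddings                   # (seq_len, d)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # pairwise token similarities
    weights = softmax(scores, axis=-1)             # each row sums to one
    return weights @ V                             # context-mixed token representations

# Example: five tokens with 8-dimensional embeddings.
tokens = np.random.randn(5, 8)
print(self_attention(tokens).shape)   # (5, 8)
```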
In the self-attention layer, the weight of a token is determined by its own features after one simple transform matrix. The information exchange between tokens in a sentence are performed in a simple feed forward layer whichisfullyconnected. The core of deep learning is mostly on representation learning. The learned representation are distributed in a shared embedding space and the linguistic information is carried by the space formation. Inoursentenceembeddingwork,weexamtheevolvingpatternoftokenrepresentations across layers. To have a better interpretation of current deep language models, we would like to stick the tool on subspace analysis. With the help of manifold analysis of the learned embedding space,wecanprovidemoretransparentexplanationandguidesfutureresearchonthisdirection. Polysemy ProbleminNeuralLanguageModels The success of language model is because the representation is changing based on current contextinformationofaparticularword. However,theinheritmechanismisstillmysteryandhow the senses corresponding to a word are determined it still mysterious. Different from the current pre-trainedlanguagemodelswherethesensesforaparticularwordcannotbedeterminedeasily. In the future, we want to have an interpretable design for a word-sense model. Our intent is to focus on finding explicit representation for word senses. That is to say, a word should have an explicit number of sense embeddings. During the inference stage, one sense embedding from a word should be retrieved based on its current contextual information. There are several existing attempts trying to provide sense embedding but none of them works well in practice. The sense embeddingisagooddirectiontoexploresinceitcanprovideatransparentmodelwhichisusually desiredforapplicationswithhighsecurityorprivacyrequirements. 131 Bibliography 1. Agarwal, A., Xie, B., Vovsha, I., Rambow, O. & Passonneau, R. J. Sentiment analysis of twitter data in Proceedings of the Workshop on Language in Social Media (LSM 2011) (2011),30–38. 2. Agirre,E.,Alfonseca,E.,Hall,K.,Kravalova,J.,Paşca,M.&Soroa,A.Astudyonsimilarity and relatedness using distributional and wordnet-based approaches in Proceedings of Hu- man Language Technologies: The 2009 Annual Conference of the North American Chapter oftheAssociationforComputationalLinguistics(2009),19–27. 3. Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., et al. Semeval- 2014task10:MultilingualsemantictextualsimilarityinProceedingsofthe8thinternational workshoponsemanticevaluation(SemEval2014)(2014),81–91. 4. Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., et al. Semeval- 2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability in Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015) (2015),252–263. 5. Agirre,E.,Banea,C.,Cer,D.,Diab,M.,Gonzalez-Agirre,A.,Mihalcea,R.,etal.Semeval- 2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016) (2016),497–511. 6. Agirre,E.,Cer,D.,Diab,M.&Gonzalez-Agirre,A.Semeval-2012task6:Apilotonsemantic textualsimilarityin*SEM2012:TheFirstJointConferenceonLexicalandComputational Semantics–Volume1:Proceedingsofthemainconferenceandthesharedtask,andVolume 2:ProceedingsoftheSixthInternationalWorkshoponSemanticEvaluation(SemEval2012) (2012),385–393. 7. Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A. & Guo, W. 
* SEM 2013 shared task: Semantic textual similarity in Second Joint Conference on Lexical and Computational Se- mantics (* SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: SemanticTextualSimilarity(2013),32–43. 8. Albooyeh,M.,Goel,R.&Kazemi,S.M.Out-of-SampleRepresentationLearningforKnowl- edge Graphs in Proceedings of the 2020 Conference on Empirical Methods in Natural LanguageProcessing:Findings(2020),2657–2666. 9. Almarwani, N., Aldarmaki, H. & Diab, M. Efficient sentence embedding using discrete cosine transform. Proceedings of the 2019 Conference on Empirical Methods in Natural LanguageProcessing(2019). 10. Almuhareb,A. AttributesinlexicalacquisitionPhDthesis(UniversityofEssex,2006). 11. Arora,S.,Liang,Y.&Ma,T.Asimplebuttough-to-beatbaselineforsentenceembeddings. InternationalConferenceonLearningRepresentations(ICLR)(2017). 132 12. Artetxe,M.,Labaka,G.&Agirre,E.Generalizingandimprovingbilingualwordembedding mappingswithamulti-stepframeworkoflineartransformationsin AAAI (2018). 13. Artetxe,M.,Labaka,G.&Agirre,E.Learningbilingualwordembeddingswith(almost)no bilingual data in Proceedings of the 55th Annual Meeting of the Association for Computa- tionalLinguistics(Volume1:LongPapers)1(2017),451–462. 14. Artetxe, M., Labaka, G. & Agirre, E. Learning principled bilingual mappings of word embeddingswhilepreservingmonolingualinvariancein EMNLP (2016),2289–2294. 15. Artetxe, M., Labaka, G., Agirre, E. & Cho, K. Unsupervised neural machine translation. arXivpreprintarXiv:1710.11041(2017). 16. Arthur,D.&Vassilvitskii,S.k-means++:TheadvantagesofcarefulseedinginProceedings oftheeighteenthannualACM-SIAMsymposiumonDiscretealgorithms(2007),1027–1035. 17. Bahdanau,D.,Cho,K.&Bengio,Y.Neuralmachinetranslationbyjointlylearningtoalign andtranslate.arXivpreprintarXiv:1409.0473 (2014). 18. Bakarov, A. A Survey of Word Embeddings Evaluation Methods. CoRR abs/1801.09536 (2018). 19. Baker, S., Reichart, R. & Korhonen, A. An unsupervised model for instance level subcat- egorization acquisition in Proceedings of the 2014 Conference on Empirical Methods in NaturalLanguageProcessing(EMNLP)(2014),278–289. 20. Balazevic, I., Allen, C. & Hospedales, T. TuckER: Tensor Factorization for Knowledge GraphCompletionin EMNLP(2019). 21. Bansal,T.,Juan,D.-C.,Ravi,S.&McCallum,A.A2N:AttendingtoNeighborsforKnowledge GraphInferencein ACL (2019). 22. Barbieri, F., Camacho-Collados, J., Neves, L. & Espinosa-Anke, L. TweetEval: Unified BenchmarkandComparativeEvaluationforTweetClassification.arXivpreprintarXiv:2010.12421 (2020). 23. Baroni, M., Dinu, G. & Kruszewski, G. Don’t count, predict! A systematic comparison of context-countingvs.context-predictingsemanticvectorsinProceedingsofthe52ndAnnual MeetingoftheAssociationforComputationalLinguistics(Volume1:LongPapers)1(2014), 238–247. 24. Baroni,M.&Lenci,A.HowweBLESSeddistributionalsemanticevaluationinProceedings oftheGEMS2011WorkshoponGEometricalModelsofNaturalLanguageSemantics(2011), 1–10. 25. Baroni, M., Murphy, B., Barbu, E. & Poesio, M. Strudel: A corpus-based semantic model basedonpropertiesandtypes.Cognitivescience34, 222–254(2010). 26. Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. arXivpreprintarXiv:1903.10676 (2019). 27. Benesty, J., Chen, J., Huang, Y. & Cohen, I. in Noise reduction in speech processing 1–4 (Springer,2009). 28. Bengio,Y.,Ducharme,R.,Vincent,P.&Janvin,C.ANeuralProbabilisticLanguageModel. JournalofMachineLearningResearch3, 1137–1155(2003). 29. Blair, P., Merhav, Y. & Barry, J. 
Automated generation of multilingual clusters for the evaluationofdistributedrepresentations.arXivpreprintarXiv:1611.01547 (2016). 30. Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017). 133 31. Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017). 32. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J. & Yakhnenko, O. Translating em- beddings for modeling multi-relational data in Advances in neural information processing systems(2013),2787–2795. 33. Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Çelikyilmaz, A. & Choi, Y. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction in Proceedings ofthe57thAnnualMeetingoftheAssociationforComputationalLinguistics(ACL)(2019). 34. Bowman,S.R.,Angeli,G.,Potts,C.&Manning,C.D.Alargeannotatedcorpusforlearning naturallanguageinferenceinProceedingsofthe2015ConferenceonEmpiricalMethodsin NaturalLanguageProcessing(AssociationforComputationalLinguistics,Lisbon,Portugal, 2015),632–642.doi:10.18653/v1/D15-1075. 35. Bruni,E.,Tran,N.-K.&Baroni,M.Multimodaldistributionalsemantics.JournalofArtifi- cialIntelligenceResearch49, 1–47(2014). 36. Camacho-Collados, J. & Navigli, R. Find the word that does not belong: A framework for an intrinsic evaluation of word vector representations in Proceedings of the 1st Workshop onEvaluatingVector-SpaceRepresentationsforNLP(2016),43–50. 37. Cer,D.,Diab,M.,Agirre,E.,Lopez-Gazpio,I.&Specia,L.SemEval-2017Task1:Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (Association for Computational Linguistics, Vancouver, Canada, 2017), 1–14. doi:10.18653/v1/S17- 2001. 38. Cer,D.,Yang,Y.,Kong,S.-y.,Hua,N.,Limtiaco,N.,St.John,R.,etal.UniversalSentence EncoderforEnglishinProceedingsofthe2018ConferenceonEmpiricalMethodsinNatural Language Processing: System Demonstrations (Association for Computational Linguistics, Brussels,Belgium,2018),169–174.doi:10.18653/v1/D18-2029. 39. Chen, J. & Liu, K.-C. On-line batch process monitoring using dynamic PCA and dynamic PLSmodels.ChemicalEngineeringScience57, 63–75(2002). 40. Chen, K., Zhao, T., Yang, M., Liu, L., Tamura, A., Wang, R., et al. A Neural Approach to Source Dependence Based Context Model for Statistical Machine Translation. IEEE/ACM TransactionsonAudio,SpeechandLanguageProcessing(TASLP)26, 266–280(2018). 41. Chen,W.,Zhang,M.&Zhang,Y.Distributedfeaturerepresentationsfordependencyparsing. IEEETransactionsonAudio,Speech,andLanguageProcessing23, 451–460(2015). 42. Chiu, B., Korhonen, A. & Pyysalo, S. Intrinsic evaluation of word vectors fails to predict extrinsic performance in Proceedings of the 1st Workshop on Evaluating Vector-Space RepresentationsforNLP(2016),1–6. 43. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation.arXivpreprintarXiv:1406.1078(2014). 44. Cohan, A., Ammar, W., Van Zuylen, M. & Cady, F. Structural scaffolds for citation intent classificationinscientificpublications.arXivpreprintarXiv:1904.01608(2019). 45. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K. & Kuksa, P. Natural language processing (almost) from scratch. 
Journal of Machine Learning Research 12, 2493–2537(2011). 134 46. Conneau, A. & Kiela, D. Senteval: An evaluation toolkit for universal sentence representa- tions.arXivpreprintarXiv:1803.05449(2018). 47. Conneau, A., Kiela, D., Schwenk, H., Barrault, L. & Bordes, A. Supervised Learning of Universal SentenceRepresentations from NaturalLanguage Inference Data in Proceedings ofthe2017ConferenceonEmpiricalMethodsinNaturalLanguageProcessing(Association forComputationalLinguistics,Copenhagen,Denmark,2017),670–680. 48. Conneau, A., Kruszewski, G., Lample, G., Barrault, L. & Baroni, M. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. 56th Annual MeetingoftheAssociationforComputationalLinguistics(ACL)(2018). 49. Cui,W.,Xiao,Y.,Wang,H.,Song,Y.,Hwang,S.-w.&Wang,W.KBQA:learningquestion answeringoverQAcorporaandknowledgebases.arXivpreprintarXiv:1903.02419(2019). 50. Dettmers, T., Minervini, P., Stenetorp, P. & Riedel, S. Convolutional 2d knowledge graph embeddingsin AAAI (2018),1811–1818. 51. Devlin,J.,Chang,M.-W.,Lee,K.&Toutanova,K.BERT:Pre-trainingofDeepBidirectional TransformersforLanguageUnderstandingin NAACL (2019),4171–4186. 52. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformersforlanguageunderstanding.arXivpreprintarXiv:1810.04805(2018). 53. Ding, Y., Liu, Y., Luan, H. & Sun, M. Visualizing and understanding neural machine translationin NAACL 1(2017),1150–1159. 54. Dolan, B., Quirk, C. & Brockett, C. Unsupervised construction of large paraphrase cor- pora: Exploiting massively parallel news sources in Proceedings of the 20th international conferenceonComputationalLinguistics(2004),350. 55. Dong, Y. & Qin, S. J. A novel dynamic PCA algorithm for dynamic data modeling and processmonitoring.JournalofProcessControl 67, 1–11(2018). 56. Dozat, T. & Manning, C. D. Simpler but More Accurate Semantic Dependency Parsing in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume2:ShortPapers)(AssociationforComputationalLinguistics,Melbourne,Australia, 2018),484–490.doi:10.18653/v1/P18-2077. 57. Eger, S., Rücklé, A. & Gurevych, I. Pitfalls in the Evaluation of Sentence Embeddings. 4th WorkshoponRepresentationLearningforNLP(2019). 58. Ethayarajh, K. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. Proceedings of the 2019 Conference onEmpiricalMethodsinNaturalLanguageProcessing(2019). 59. Ethayarajh, K. Unsupervised random walk sentence embeddings: A strong but simple base- line in Proceedings of The Third Workshop on Representation Learning for NLP (2018), 91–100. 60. Ethayarajh, K., Duvenaud, D. & Hirst, G. Towards Understanding Linear Word Analogies inProceedingsofthe57thAnnualMeetingoftheAssociationforComputationalLinguistics (Association for Computational Linguistics, Florence, Italy, 2019), 3253–3262. doi:10. 18653/v1/P19-1315. 61. Faruqui,M.,Tsvetkov,Y.,Rastogi,P.&Dyer,C.Problemswithevaluationofwordembed- dingsusingwordsimilaritytasks.arXivpreprintarXiv:1605.02276 (2016). 62. Fawzi, A., Moosavi-Dezfooli, S.-M. & Frossard, P. The robustness of deep networks: A geometricalperspective.IEEESignalProcessingMagazine 34,50–62(2017). 135 63. Finkelstein,L.,Gabrilovich,E.,Matias,Y.,Rivlin,E.,Solan,Z.,Wolfman,G.,etal.Placing searchincontext:Theconceptrevisited inProceedingsofthe10thinternationalconference onWorldWideWeb(2001),406–414. 64. 
Firth,J.R.Asynopsisoflinguistictheory,1930-1955.Studiesinlinguisticanalysis(1957). 65. Gerz, D., Vulić, I., Hill, F., Reichart, R. & Korhonen, A. SimVerb-3500: A large-scale evaluationsetofverbsimilarity.arXivpreprintarXiv:1608.00869(2016). 66. Gladkova,A.&Drozd,A.Intrinsicevaluationsofwordembeddings:Whatcanwedobetter? in Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP (2016),36–42. 67. Gupta, V., Saw, A., Nokhiz, P., Netrapalli, P., Rai, P. & Talukdar, P. P-SIF: Document EmbeddingsUsingPartitionAveraging. 68. Halawi,G.,Dror,G.,Gabrilovich,E.&Koren,Y.Large-scalelearningofwordrelatedness with constraints in Proceedings of the 18th ACM SIGKDD international conference on Knowledgediscoveryanddata mining(2012),1406–1414. 69. Hamaguchi, T., Oiwa, H., Shimbo, M. & Matsumoto, Y. Knowledge transfer for out-of- knowledge-base entities: A graph neural network approach in Proceedings of the 26th InternationalJointConference onArtificialIntelligence(2017),1802–1808. 70. Hamilton,W.,Ying,Z.&Leskovec,J.Inductiverepresentationlearningonlargegraphsin Advancesinneuralinformation processingsystems(2017),1024–1034. 71. Hao, Y., Zhang, Y., Liu, K., He, S., Liu, Z., Wu, H., et al. An end-to-end model for ques- tion answering over knowledge base with cross-attention combining global knowledge in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume1:LongPapers)1(2017),221–231. 72. Hao, Y., Dong, L., Wei, F. & Xu, K. Visualizing and Understanding the Effectiveness of BERT. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing(2019). 73. Hellrich, J. & Hahn, U. Don’t get fooled by word embeddings: better watch their neigh- borhood in Digital Humanities 2017—Conference Abstracts of the 2017 Conference of the AllianceofDigitalHumanitiesOrganizations(ADHO).Montréal,Quebec,Canada(2017), 250–252. 74. Hill, F., Cho, K. & Korhonen, A. Learning distributed representations of sentences from unlabelled data. Proceedings of the Conference of the North American Chapter of the AssociationforComputationalLinguistics:HumanLanguageTechnologies(2016). 75. Hill,F.,Reichart,R.&Korhonen,A.Simlex-999:Evaluatingsemanticmodelswith(genuine) similarityestimation.ComputationalLinguistics41, 665–695(2015). 76. Hu, M. & Liu, B. Mining and summarizing customer reviews in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (2004), 168–177. 77. Ionescu, R. T. & Butnaru, A. Vector of locally-aggregated word embeddings (VLAWE): A novel document-level representation in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,Volume1(LongandShortPapers)(2019),363–369. 78. Jawahar, G., Sagot, B., Seddah, D., Unicomb, S., Iñiguez, G., Karsai, M., et al. What does BERTlearnaboutthestructureoflanguage?in57thAnnualMeetingoftheAssociationfor ComputationalLinguistics(ACL),Florence,Italy(2019). 136 79. Kayal, S. & Tsatsaronis, G. EigenSent: Spectral sentence embeddings using higher-order DynamicModeDecompositioninProceedingsofthe57thAnnualMeetingoftheAssociation forComputationalLinguistics(2019),4536–4546. 80. Khandelwal, U., He, H. & Qi, P. Sharp Nearby, Fuzzy Far Away: How Neural Language ModelsUseContext.ACL (2018). 81. Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., et al. Skip- thoughtvectorsinAdvancesinneuralinformationprocessingsystems (2015),3294–3302. 82. 
Klein,G.,Kim,Y.,Deng,Y.,Senellart,J.&Rush,A.M.Opennmt:Open-sourcetoolkitfor neuralmachinetranslation.arXivpreprintarXiv:1701.02810(2017). 83. Koehn, P. Europarl: A parallel corpus for statistical machine translation in MT summit 5 (2005),79–86. 84. Kovaleva,O.,Romanov,A.,Rogers,A.&Rumshisky,A.Revealingthedarksecretsofbert. Proceedingsofthe2019ConferenceonEmpiricalMethodsinNaturalLanguageProcessing (2019). 85. Kuo, C.-C. J. Understanding convolutional neural networks with a mathematical model. JournalofVisualCommunicationandImageRepresentation41, 406–413(2016). 86. Kuo,C.-C.J.,Zhang,M.,Li,S.,Duan,J.&Chen,Y.Interpretableconvolutionalneuralnet- worksviafeedforwarddesign.JournalofVisualCommunicationandImageRepresentation 60,346–359(2019). 87. Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K. From word embeddings to document distancesin Internationalconferenceonmachinelearning(2015),957–966. 88. Lample, G., Ott, M., Conneau, A., Denoyer, L. & Ranzato, M. Phrase-Based & Neural Unsupervised Machine Translation in Proceedings of the 2018 Conference on Empirical MethodsinNaturalLanguageProcessing(EMNLP)(2018). 89. Lan,Z.,Chen,M.,Goodman,S.,Gimpel,K.,Sharma,P.&Soricut,R.Albert:Alitebertfor self-supervisedlearningoflanguagerepresentations.InternationalConferenceonLearning Representations(2019). 90. Landauer, T. K., Foltz, P. W. & Laham, D. An introduction to latent semantic analysis. Discourseprocesses25, 259–284(1998). 91. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240(2020). 92. Levy,O.&Goldberg,Y.Linguisticregularitiesinsparseandexplicitwordrepresentations in Proceedings of the eighteenth conference on computational natural language learning (2014),171–180. 93. Levy,O.&Goldberg,Y.NeuralwordembeddingasimplicitmatrixfactorizationinAdvances inneuralinformationprocessingsystems(2014),2177–2185. 94. Li,J.,Monroe,W.,Shi,T.,Jean,S.,Ritter,A.&Jurafsky,D.Adversariallearningforneural dialoguegeneration.arXivpreprintarXiv:1701.06547 (2017). 95. Li,X.,Taheri,A.,Tu,L.&Gimpel,K.CommonsenseknowledgebasecompletioninProceed- ings of the 54th Annual Meeting of the Association for Computational Linguistics (2016), 1445–1455. 96. Li, X. & Roth, D. Learning question classifiers in Proceedings of the 19th international conferenceonComputationallinguistics-Volume1(2002),1–7. 137 97. Li, Z., Zhang, M., Che, W., Liu, T. & Chen, W. Joint optimization for Chinese POS tag- ging and dependency parsing. IEEE/ACM Transactions on Audio, Speech and Language Processing(TASLP)22,274–286(2014). 98. Liu,N.F.,Gardner,M.,Belinkov,Y.,Peters,M.E.&Smith,N.A.LinguisticKnowledgeand TransferabilityofContextualRepresentationsinProceedingsoftheConferenceoftheNorth American Chapter of the Association for Computational Linguistics: Human Language Technologies(2019). 99. Liu, W., Zhou, P., Zhao, Z., Wang, Z., Ju, Q., Deng, H., et al. K-BERT: Enabling Language RepresentationwithKnowledgeGraphinProceedingsoftheAAAIConferenceonArtificial Intelligence(2020),2901–2908. 100. Liu,Y.,Ott,M.,Goyal,N.,Du,J.,Joshi,M.,Chen,D.,etal.Roberta:Arobustlyoptimized bertpretrainingapproach.arXivpreprintarXiv:1907.11692(2019). 101. Luong, T., Socher, R. & Manning, C. Better word representations with recursive neural networks for morphology in Proceedings of the Seventeenth Conference on Computational NaturalLanguageLearning(2013),104–113. 102. Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y. & Potts, C. 
Learning word vectors for sentiment analysis in Proceedings of the 49th annual meeting of the association forcomputationallinguistics:Humanlanguagetechnologies-volume1(2011),142–150. 103. Malaviya, C., Bhagavatula, C., Bosselut, A. & Choi, Y. Commonsense Knowledge Base Completion with Structural and Semantic Context in Proceedings of the AAAI Conference onArtificialIntelligence(2020),2925–2933. 104. Marcus, M. P., Marcinkiewicz, M. A. & Santorini, B. Building a large annotated corpus of English:ThePennTreebank.Computationallinguistics19, 313–330(1993). 105. Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R., Zamparelli, R., et al. A SICK cure for the evaluation of compositional distributional semantic models in LREC (2014),216–223. 106. Mekala, D., Gupta, V., Paranjape, B. & Karnick, H. SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations in Proceedings of the 2017 ConferenceonEmpiricalMethodsinNaturalLanguageProcessing(2017),659–669. 107. Michel, P., Levy, O. & Neubig, G. Are Sixteen Heads Really Better than One? in Advances inNeuralInformationProcessingSystems(2019),14014–14024. 108. Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations invectorspace.ICLR (2013). 109. Mikolov,T.,Sutskever,I.,Chen,K.,Corrado,G.S.&Dean,J.Distributedrepresentationsof wordsandphrasesandtheircompositionalityinAdvancesinneuralinformationprocessing systems(2013),3111–3119. 110. Mikolov, T., Yih, W.-t. & Zweig, G. Linguistic regularities in continuous space word rep- resentations in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2013), 746– 751. 111. Miller, G. A. & Charles, W. G. Contextual correlates of semantic similarity. Language and cognitiveprocesses6,1–28(1991). 112. Mu, J., Bhat, S. & Viswanath, P. All-but-the-top: Simple and effective postprocessing for wordrepresentations.InternationalConferenceonLearningRepresentations(2018). 138 113. Nguyen,D.Q.,Vu,T.&Nguyen,A.T.BERTweet:Apre-trainedlanguagemodelforEnglish Tweets.arXivpreprintarXiv:2005.10200(2020). 114. Ouchi, H., Duh, K., Shindo, H. & Matsumoto, Y. Transition-based dependency parsing exploitingsupertags.IEEE/ACMTransactionsonAudio,Speech,andLanguageProcessing 24,2059–2068(2016). 115. Pagliardini, M., Gupta, P. & Jaggi, M. Unsupervised Learning of Sentence Embeddings usingCompositionaln-GramFeaturesinNAACL2018-ConferenceoftheNorthAmerican ChapteroftheAssociationfor ComputationalLinguistics(2018). 116. Pang, B. & Lee, L. A sentimental education: Sentiment analysis using subjectivity summa- rization based on minimum cuts in Proceedings of the 42nd annual meeting on Association forComputationalLinguistics(2004),271. 117. Pang, B., Lee, L., et al. Opinion mining and sentiment analysis. Foundations and Trends® inInformationRetrieval 2,1–135(2008). 118. Pang,B.&Lee,L.Seeingstars:Exploitingclassrelationshipsforsentimentcategorization with respect to rating scales in Proceedings of the 43rd annual meeting on association for computationallinguistics(2005),115–124. 119. Pennington, J., Socher, R. & Manning, C. Glove: Global vectors for word representation in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)(2014),1532–1543. 120. Perone,C.S.,Silveira,R.&Paula,T.S.Evaluationofsentenceembeddingsindownstream andlinguisticprobingtasks.arXivpreprintarXiv:1806.06259(2018). 121. Peters, M. 
E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., et al. Deep contex- tualizedwordrepresentations.arXivpreprintarXiv:1802.05365(2018). 122. Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., et al. Language Models as Knowledge Bases? Proceedings of the 2019 Conference on Empirical Methods inNaturalLanguageProcessing(2019). 123. Qiu,Y.,Li,H.,Li,S.,Jiang,Y.,Hu,R.&Yang,L.inChineseComputationalLinguisticsand Natural Language Processing Based on Naturally Annotated Big Data 209–221 (Springer, 2018). 124. Radford,A.,Narasimhan,K.,Salimans,T.&Sutskever,I.Improvinglanguageunderstanding bygenerativepre-training.Technicalreport,OpenAI (2018). 125. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. & Sutskever, I. Language models are unsupervisedmultitasklearners.OpenAIBlog1(2019). 126. Radinsky, K., Agichtein, E., Gabrilovich, E. & Markovitch, S. A word at a time: computing wordrelatednessusingtemporalsemanticanalysisinProceedingsofthe20thinternational conferenceonWorldwideweb(2011),337–346. 127. Radovanović, M., Nanopoulos, A. & Ivanović, M. On the existence of obstinate results in vector space models in Proceedings of the 33rd international ACM SIGIR conference on Researchanddevelopmentininformationretrieval (2010),186–193. 128. Rajpurkar, P., Jia, R. & Liang, P. Know What You Don’t Know: Unanswerable Questions for SQuAD in Proceedings of the 56th Annual Meeting of the Association for Computa- tional Linguistics (Volume 2: Short Papers) (Association for Computational Linguistics, Melbourne,Australia,2018),784–789.doi:10.18653/v1/P18-2124. 129. Ravi, K. & Ravi, V. A survey on opinion mining and sentiment analysis: tasks, approaches andapplications.Knowledge-BasedSystems89, 14–46(2015). 139 130. Reimers, N. & Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT- NetworksinProceedingsofthe2019ConferenceonEmpiricalMethodsinNaturalLanguage Processing(AssociationforComputationalLinguistics,2019). 131. Rettig, L., Audiffren, J. & Cudré-Mauroux, P. Fusing Vector Space Models for Domain- Specific Applications in 2019 IEEE 31st International Conference on Tools with Artificial Intelligence(ICTAI)(2019),1110–1117. 132. Rogers, A., Drozd, A. & Li, B. The (Too Many) Problems of Analogical Reasoning with Word Vectors in Proceedings of the 6th Joint Conference on Lexical and Computational Semantics(*SEM2017)(2017),135–148. 133. Roy, A., Park, Y. & Pan, S. Learning domain-specific word embeddings from sparse cyber- securitytexts.arXivpreprintarXiv:1709.07470(2017). 134. Rubenstein, H. & Goodenough, J. B. Contextual correlates of synonymy. Communications oftheACM 8, 627–633(1965). 135. Rücklé,A.,Eger,S.,Peyrard,M.&Gurevych,I.Concatenatedpowermeanwordembeddings asuniversalcross-lingualsentencerepresentations.arXivpreprintarXiv:1803.01400(2018). 136. Sanh, V., Wolf,T. & Ruder, S. A hierarchical multi-taskapproach for learning embeddings from semantic tasks in Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019),6949–6956. 137. Sap,M.,LeBras,R.,Allaway,E.,Bhagavatula,C.,Lourie,N.,Rashkin,H.,etal.ATOMIC: Anatlasofmachinecommonsenseforif-thenreasoninginProceedingsoftheAAAIConfer- enceonArtificialIntelligence33(2019),3027–3035. 138. Schlichtkrull,M.,Kipf,T.N.,Bloem,P.,VanDenBerg,R.,Titov,I.&Welling,M.Modeling relational data with graph convolutional networks in European Semantic Web Conference (2018),593–607. 139. Schnabel, T., Labutov, I., Mimno, D. & Joachims, T. 
Evaluation methods for unsupervised word embeddings in Proceedings of the 2015 Conference on Empirical Methods in Natural LanguageProcessing(2015),298–307. 140. Schütze, H., Manning, C. D. & Raghavan, P. Introduction to information retrieval (Cam- bridgeUniversityPress,2008). 141. Senel,L.K.,Utlu,I.,Yucesoy,V.,Koc,A.&Cukur,T.SemanticStructureandInterpretability ofWordEmbeddings.IEEE/ACMTransactionsonAudio,Speech,andLanguageProcessing (2018). 142. Shalaby,W.&Zadrozny,W.Measuringsemanticrelatednessusingminedsemanticanalysis. CoRR,abs/1512.03465(2015). 143. Shang, C., Tang, Y., Huang, J., Bi, J., He, X. & Zhou, B. End-to-end structure-aware con- volutional networks for knowledge base completion in Proceedings of the AAAI Conference onArtificialIntelligence33(2019),3060–3067. 144. Shen, D., Wang, G., Wang, W., Renqiang Min, M., Su, Q., Zhang, Y., et al. Baseline Needs MoreLove:OnSimpleWord-Embedding-BasedModelsandAssociatedPoolingMechanisms in ACL (2018). 145. Shen, M., Kawahara, D. & Kurohashi, S. Dependency parse reranking with rich subtree features. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, 1208– 1218(2014). 146. Shin,B.,Lee,T.&Choi,J.D.LexiconIntegratedCNNModelswithAttentionforSentiment Analysisin WASSA (2017),149–158. 140 147. Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.-J., et al. An overview of microsoft academicservice(mas)andapplicationsinProceedingsofthe24thinternationalconference onworldwideweb(2015),243–246. 148. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., et al. Recursive deep models for semantic compositionality over a sentiment treebank in Proceedings of the 2013conferenceonempiricalmethodsinnaturallanguageprocessing(2013),1631–1642. 149. Spearman,C.Theproofandmeasurementofassociationbetweentwothings.TheAmerican journalofpsychology15,72–101(1904). 150. Speer,R.&Havasi,C.in ThePeople’sWebMeetsNLP161–176(Springer,2013). 151. Subramanian,S.,Trischler,A.,Bengio,Y.&Pal,C.J.Learninggeneralpurposedistributed sentence representations via large scale multi-task learning. International Conference on LearningRepresentations(2018). 152. Sun,Z.,Deng,Z.-H.,Nie,J.-Y.&Tang,J.Rotate:Knowledgegraphembeddingbyrelational rotation in complex space. International Conference on Learning Representations (ICLR) (2019). 153. Tai, K. S., Socher, R. & Manning, C. D. Improved semantic representations from tree- structuredlongshort-termmemorynetworks.arXivpreprintarXiv:1503.00075(2015). 154. Tang, Y., Huang, J., Wang, G., He, X. & Zhou, B. Orthogonal Relation Transforms with GraphContextModelingforKnowledgeGraphEmbedding.Proceedingsofthe58thAnnual MeetingoftheAssociationforComputationalLinguistics(ACL)(2020). 155. Teru, K. K., Denis, E. & Hamilton, W. L. Inductive Relation Prediction by Subgraph Rea- soning.InternationalConferenceonMachineLearning(2020). 156. Tissier, J., Gravier, C. & Habrard, A. Dict2vec: Learning Word Embeddings using Lexical DictionariesinConferenceonEmpiricalMethodsinNaturalLanguageProcessing(EMNLP 2017)(2017),254–263. 157. Tjong Kim Sang, E. F. & Buchholz, S. Introduction to the CoNLL-2000 shared task: Chunking in Proceedings of the 2nd workshop on Learning language in logic and the 4th conferenceonComputationalnaturallanguagelearning-Volume7 (2000),127–132. 158. Tjong Kim Sang, E. F. & De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition in Proceedings of the seventh conference onNaturallanguagelearningatHLT-NAACL2003-Volume4(2003),142–147. 159. Torki, M. 
A Document Descriptor using Covariance of Word Vectors in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)(AssociationforComputationalLinguistics,Melbourne,Australia,2018),527–532. doi:10.18653/v1/P18-2084. 160. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É. & Bouchard, G. Complex embeddings for simplelinkpredictionin InternationalConferenceonMachineLearning(ICML)(2016). 161. Tsvetkov, Y., Faruqui, M., Ling, W., Lample, G. & Dyer, C. Evaluation of word vector representationsbysubspacealignmentinProceedingsofthe2015ConferenceonEmpirical MethodsinNaturalLanguageProcessing(2015),2049–2054. 162. Turney, P. D. Mining the web for synonyms: PMI-IR versus LSA on TOEFL in European ConferenceonMachineLearning(2001),491–502. 163. Vashishth, S., Sanyal, S., Nitin, V., Agrawal, N. & Talukdar, P. P. InteractE: Improving Convolution-Based Knowledge Graph Embeddings by Increasing Feature Interactions. in ProceedingsoftheAAAIConferenceonArtificialIntelligence(2020),3009–3016. 141 164. Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,etal.Attention isallyouneed inAdvancesinneuralinformationprocessingsystems (2017),5998–6008. 165. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O. & Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (Association for Computational Linguistics, Brussels, Belgium, 2018), 353–355. doi:10.18653/v1/W18-5446. 166. Wang,B.,Chen,F.,Wang,A.&Kuo,C.-C.J.Post-ProcessingofWordRepresentationsvia VarianceNormalizationandDynamicEmbedding.2019IEEEInternationalConferenceon MultimediaandExpo(ICME),718–723(2019). 167. Wang, B. & Kuo, C.-C. J. SBERT-WK: A Sentence Embedding Method By Dissecting BERT-basedWordModels.arXivpreprintarXiv:2002.06652(2020). 168. Wang,B.,Wang,A.,Chen,F.,Wang,Y.&Kuo,C.-C.J.EvaluatingWordEmbeddingMod- els: Methods and Experimental Results. APSIPA Transactions on Signal and Information Processing(2019). 169. Wang, H., Zhang, F., Xie, X. & Guo, M. DKN: Deep knowledge-aware network for news recommendationinProceedingsofthe2018worldwidewebconference(2018),1835–1844. 170. Wen, T.-H., Vandyke, D., Mrksic, N., Gasic, M., Rojas-Barahona, L. M., Su, P.-H., et al. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562(2016). 171. Wiebe, J., Wilson, T. & Cardie, C. Annotating expressions of opinions and emotions in language.Languageresources andevaluation39, 165–210(2005). 172. Wu,Y.,Schuster,M.,Chen,Z.,Le,Q.V.,Norouzi,M.,Macherey,W.,etal.Google’sneural machinetranslationsystem:Bridgingthegapbetweenhumanandmachinetranslation.arXiv preprintarXiv:1609.08144(2016). 173. Xie, R., Liu, Z., Jia, J., Luan, H. & Sun, M. Representation learning of knowledge graphs with entitydescriptionsin ThirtiethAAAIConferenceonArtificialIntelligence(2016). 174. Xu, H., Liu, B., Shu, L. & Yu, P. S. Lifelong domain word embedding via meta-learning. arXivpreprintarXiv:1805.09991(2018). 175. Xu,J.,Sun,X.,He,H.,Ren,X.&Li,S.Cross-DomainandSemi-SupervisedNamedEntity RecognitioninChineseSocialMedia:AUnifiedModel.IEEE/ACMTransactionsonAudio, Speech,andLanguageProcessing(2018). 176. Yaghoobzadeh, Y. & Schütze, H. Intrinsic subspace evaluation of word embedding repre- sentations.arXivpreprintarXiv:1606.07902(2016). 177. Yang,B.,Yih,W.-t.,He,X.,Gao,J.&Deng,L.Embeddingentitiesandrelationsforlearning and inference in knowledge bases. 
International Conference on Learning Representations (ICLR)(2014). 178. Yang,Z.,Dai,Z.,Yang,Y.,Carbonell,J.,Salakhutdinov,R.R.&Le,Q.V.Xlnet:Generalized autoregressive pretraining for language understanding in Advances in neural information processingsystems(2019),5754–5764. 179. Yang,Z.,Zhu,C.&Chen,W.Parameter-freeSentenceEmbeddingviaOrthogonalBasisin Proceedingsofthe2019ConferenceonEmpiricalMethodsinNaturalLanguageProcessing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP)(2019),638–648. 142 180. Yao, L., Mao, C. & Luo, Y. KG-BERT: BERT for knowledge graph completion. arXiv preprintarXiv:1909.03193(2019). 181. Yu, L.-C., Wang, J., Lai, K. R. & Zhang, X. Refining Word Embeddings Using Intensity Scores for Sentiment Analysis. IEEE/ACM Transactions on Audio, Speech and Language Processing(TASLP)26,671–681(2018). 182. Zhang, B., Xiong, D., Su, J. & Duan, H. A Context-Aware Recurrent Encoder for Neural MachineTranslation.IEEE/ACMTransactionsonAudio,SpeechandLanguageProcessing (TASLP)25, 2424–2432(2017). 183. Zhao,Z.,Liu,T.,Li,S.,Li,B.&Du,X.Ngram2vec:Learningimprovedwordrepresentations from ngram co-occurrence statistics in Proceedings of the 2017 Conference on Empirical MethodsinNaturalLanguageProcessing(2017),244–253. 184. Zhou, G., Xie, Z., He, T., Zhao, J. & Hu, X. T. Learning the multilingual translation representations for question retrieval in community question answering via non-negative matrix factorization. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP)24, 1305–1314(2016). 185. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., et al. Align- ing Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and ReadingBooksin TheIEEEInternationalConferenceonComputerVision(ICCV)(2015). 143
Abstract
Natural language is a kind of unstructured data source that has no natural numerical representation in its original form. In most natural language processing tasks, a distributed representation of text entities is necessary. Current state-of-the-art textual representation learning aims to extract the information of interest and encode it in vector representations (i.e., embeddings). Even though much progress has been made on representation learning techniques in natural language processing, especially the recent development of large pre-trained language models, several research challenges remain, particularly because language has multiple levels of entities and spans different domains. This thesis investigates and proposes representation learning techniques that learn semantic-enhanced embeddings for both words and sentences, and studies their application to different domains, including commonsense knowledge representation and domain-specific word embeddings. ❧ First, we analyze the desired properties of word embeddings and discuss different evaluation metrics for word embeddings and their connection with downstream tasks. With that, we introduce embedding-space enhancement methods that mitigate the hubness problem and increase the capability of word embeddings to capture word semantics and sequential information in context. Second, through the analysis of semantic groups of words, we introduce a new sentence representation technique that measures the correlation between different semantic groups within a sentence. We also propose to leverage deep pre-trained language models and introduce a pooling method that fuses the information learned by different layers into an enhanced sentence representation, together with a detailed discussion of the motivation behind the proposed pooling method. Last, we study the problem of transferring to different textual domains. We propose a method that extracts domain-specific word embedding models from pre-trained language models and apply it to domains such as sentence sentiment classification, sentence similarity, and scientific articles. We also investigate the application of representation learning to enable inductive learning in the commonsense knowledge base completion task.