LEARNING DISTRIBUTED REPRESENTATIONS OF CELLS IN TABLES

by

Majid Ghasemi Gol

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2021

Copyright 2021 Majid Ghasemi Gol

Dedication

To my wife, Hana, my father, Reza, and my brothers, Mahdy and Mohammad, for their encouragement, love, and sacrifice.

Acknowledgments

The PhD program has been a long and bittersweet journey for me. It was filled with moments of excitement when thinking about new research ideas, disappointment when my ideas did not work, frustration when my papers got rejected, and satisfaction when all the hard work paid off. I have been very lucky to have supportive friends and mentors by my side during this journey, without whom I cannot imagine getting through the frustrations and disappointments.

First and foremost, I would like to thank my thesis advisor, Prof. Pedro Szekely, for his support, patience, and valuable feedback throughout my PhD research. He not only spent many hours with me guiding, discussing, and developing research ideas, but also taught me the art of public speaking and presentation. He has been both an academic and a life mentor for me throughout these years. He always supported me and gave me the freedom to explore my own way of solving problems and not be afraid to fail.

I am very thankful to Prof. Viktor Prasanna, my thesis committee chair, who supported me through my PhD program. I would also like to thank the other members of both my proposal and dissertation committees for providing valuable comments on my thesis: Prof. Aram Galstyan, Prof. Paul Bogdan, Prof. Ashutosh Nayyar, and Prof. Aiichiro Nakano. I am thankful to all my mentors at ISI, especially Prof. Craig Knoblock, Prof. Jay Pujara, Dr. Steven Minton, Dr. Linhong Zhu, Prof. Mayank Kejriwal, and Dr. Muhao Chen. I would also like to thank all my peers and coauthors, Javad Dousti, Alireza Shafaei, Yanzhi Wang, and Mahdi Nazemi, and many more peers at USC and ISI who gave me valuable support and feedback throughout my research. I am also thankful to Prof. Hossein Asadi and Prof. Massoud Pedram, without whom I would have never started this journey.

Last but definitely not least, I am thankful to my family and friends. Thanks to my beautiful wife, Hana, who was very patient and supportive, and helped me stay positive and determined throughout the past four years, even in the most frustrating moments. Thanks to my father, Reza, and brothers, Mahdy and Mohammad, who supported me throughout my very long academic journey. I was also lucky to have great friends who made this journey more enjoyable; thanks to Arman, Roohy, Ehsan, Negar, Aref, Shaghayegh, and all my other friends who created enjoyable memories during this journey.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
1 Introduction
  1.1 Insights and Hypothesis
  1.2 Dissertation's Solution and Contributions
  1.3 Structure of This Dissertation
2 Unsupervised Learning of Cell Vector Representations
  2.1 Introduction
  2.2 Property Encoder
    2.2.1 Encoding cell textual content
    2.2.2 Encoding cell formatting
    2.2.3 Encoding cell position
    2.2.4 Property encoding vector
  2.3 Cell Encoder Layer
    2.3.1 Bag of cells (BoC)
    2.3.2 Cell graph network (CGN)
    2.3.3 Cell transformer (CTrans)
  2.4 Decoder Layer and Training Objective
    2.4.1 Cell value reconstruction
    2.4.2 Cell feature reconstruction
    2.4.3 Multi-task learning
  2.5 Empirical Analysis
    2.5.1 FastText encoding vectors
    2.5.2 Cell embedding vectors
    2.5.3 Comparison of different cell embedding models
3 Detecting Elements of Tabular Data Layout
  3.1 Introduction
  3.2 Supervised Fine-Tuning of Cell Embeddings
    3.2.1 Multi-layer perceptron classification head
    3.2.2 RNN-based classification head
  3.3 Empirical Evaluation
    3.3.1 Experimental setup
    3.3.2 Experimental results
4 Classifying Web Tables
  4.1 Introduction
  4.2 Methodology
    4.2.1 Table vector calculation
    4.2.2 Table type classification
  4.3 Empirical Evaluation
    4.3.1 Evaluation setup
    4.3.2 Evaluation results
5 Retrieving Tables for Natural Language Query
  5.1 Introduction
  5.2 Methodology
  5.3 Empirical Evaluation
    5.3.1 Data preparation
    5.3.2 Experiment details
    5.3.3 Evaluation results
6 Related work
  6.1 Unsupervised Representation Learning
  6.2 Tabular Data Downstream Problems
7 Conclusion and Future Research Directions
  7.1 Contributions of This Thesis
  7.2 Future Research Directions
Reference List

List of Tables

2.1 List of cell formatting features used in this thesis.
2.2 Evaluation results for predicting syntactic features of cell textual values using FastText encoding vectors. The average F1 scores and their standard deviations across folds are reported.
3.1 Evaluation scores for the in-domain evaluation setting. DeEx, SAUS, and CIUS are the datasets. RF and CRF are baseline methods from previous work. CV and E2E are vanilla baseline models built from parts of our cell embedding model. CE is our full model, using cell embeddings pre-trained with the CTrans model. Note that ±std shows the standard deviation of the scores across different folds, and results marked with an asterisk (*) are statistically significant against the RF baseline using a two-tailed P-value threshold of 0.05.
3.2 Evaluation scores for the out-domain evaluation setting. DeEx, SAUS, and CIUS are the target datasets.
4.1 Number of pages and tables in each dataset.
4.2 Number of tables of each table type in the ground truth for each dataset. R, E, M, L, and ND stand for relational, entity, matrix, list, and non-data, respectively.
4.3 Results for pre-trained baseline models.
4.4 Evaluation results. TabEmb is our method. R, E, M, L, ND, and F1-M stand for relational, entity, matrix, list, non-data, and F1-micro score.
5.1 Evaluation results for the table retrieval task. CE is our system using the pre-trained cell embeddings.

List of Figures

1.1 Examples of tabular documents with complex data layout.
2.1 Overview of our masked cell model (MCM) framework for pre-training cell embeddings. First, target and context cell properties are encoded into real-valued vectors. These vectors are then fed into the cell encoder layer, which embeds all the important information into a single vector representation for the target cell (cell embedding). The decoder layer then tries to reconstruct part of the cell properties as a training objective. Note that we mask the objective part from the initial cell vector (the output of the property encoder) to prevent the network from trivially using that information to produce its objective.
2.2 Cell properties in a spreadsheet table.
2.3 Example of applying FastText to cell textual content. In this example the text is first broken into 3-grams, and the vector for each 3-gram is retrieved from the FastText model. The final V_t vector is obtained by averaging the vectors of all 3-grams.
2.4 Example of generating the cell formatting encoding vector (V_s). The distribution of the length feature in the table is depicted throughout the transformation process.
2.5 Continuous bag of words (CBOW) and skip-gram (SG) word embedding models [1]. CBOW is shown on the left and SG on the right.
2.6 Bag of cells (BoC) cell embedding model. The green cell (C_{i,j}) is the target cell for which we want to calculate the cell embedding, and the blue cells are context cells. The cell encoding vectors are fed into the Enc_ctx and Enc_t networks, which have the same structure but different parameters.
2.7 A simple directed graph where each node has a hidden state vector. The dark blue node receives messages from its two neighbors in neural message passing.
2.8 An example of a cell graph. Nodes in the graph represent cells (or blocks of merged cells), and edges connect adjacent cells.
2.9 Row, column, and table nodes in the tabular graph, which facilitate information propagation between distant cells in the GNN.
2.10 GAT cell graph network. Each GAT layer performs a single message passing step.
2.11 An example of the attention mechanism in generating a target sequence (y) from the source sequence (x). Attention weights are calculated for each element of the source sequence at each step of generating the target sequence. A feed-forward network is then used to predict the next token in the target sequence. The formulation on the right formally describes this process.
2.12 Scaled dot-product attention. The network maps a query (q) and a set of key-value pairs (k, v) to an output. The q, k, and v vectors can be packed together into Q, K, and V tensors, respectively. The scaling layer prevents gradient overflow, and the mask layer enforces only forward propagation of information in the target sequence (similar to a uni-directional LSTM network).
2.13 The Transformer model architecture [2].
2.14 Creating a source sequence of cells for a target cell (highlighted) in the tabular document. The corresponding row and column are concatenated to form the source sequence. Note that the target cell appears twice in the source sequence.
2.15 Overview of the CTrans model. For the target cell C_{i,j} (highlighted), the output of the network is the cell embedding vector E_{i,j}. x is the context source tensor, containing the encoding vectors for all cells in the context sequence.
2.16 CTrans encoder block. L layers of the Transformer encoder are applied to the source tensor x, and the output tensor of the last layer is summed to generate the cell embedding vector E_{i,j}.
2.17 Cell decoder module for reconstructing the cell textual value. The inputs to the GRU units are one-hot encoding vectors for the characters in the text. The output of a GRU unit is fed into a multi-layer perceptron to predict the next character in the sequence.
2.18 Cell decoder module for reconstructing cell features. The input to the feed-forward network is the cell embedding E_{i,j}, and the output ŷ_{i,j} is the prediction for the desired feature.
2.19 Overview of the multi-task learning framework. The various decoders share the same encoder parameters.
2.20 Categorizing by data type of cell textual value (first scenario). Each category contains 500 sample cells randomly selected from the annotated CIUS corpus.
2.21 Example of the five major categories of cell roles in tabular data layout: left attribute (LA), data (D), top attribute (TA), metadata (MD), and footnotes (N) [CC13, KTRML16]. MD presents metadata information for the table and often explains what the content of a table is about. TA represents column headers. LA represents row headers. D cells are the core body of the table. N cells present additional information about the table.
2.22 Categorizing by role type (second scenario). Each category contains 500 sample cells randomly selected from the annotated CIUS corpus.
2.23 t-SNE plots for evaluating cell vectors for numerical cells. The plots show about 60,000 cells containing decimal numbers randomly selected from our table corpus. The color represents the log-transformed magnitude of the numbers.
2.24 Semantic information in the cell embedding vector space. The figures show the t-SNE visualization of cell embeddings for a random sample of 100,000 cells from the WikiTables corpus using the Google embedding projector. The side bar shows the nearest cells to the selected cell in the embedding vector space.
2.25 Comparison of cell embedding models for categorizing by data type of cell textual value (first scenario). Each category contains 500 sample cells randomly selected from the annotated CIUS corpus.
2.26 Comparison of cell embedding models for categorizing by role type in the data layout of the table (second scenario). Each category contains 500 sample cells randomly selected from the annotated CIUS corpus.
3.1 Examples of table layout: (a) from DeEx, (b) from SAUS, (c) from CIUS.
3.2 Overview of our cell classification framework. The raw table is fed into our cell embedding model, which produces a tensor of cell embedding vectors for each cell in the table. A classification head uses the cell embedding vectors to predict the role of each cell in the tabular data layout.
3.3 3-layer perceptron classification head for producing cell label probabilities. FC is a fully connected layer, and LReLU stands for the leaky ReLU activation function.
3.4 RNN-based cell classification head. lstm_row observes the cell embeddings in table rows, and lstm_col observes the cell embeddings in table columns. The outputs of these two LSTM networks are aggregated and used to produce the prediction probabilities.
3.5 Comparison of the vanilla neural-network models, E2E and CV, with our full classification model, CE. E2E trains both the cell embedding network and the classification head end-to-end, while CE uses pre-trained cell embeddings. CV uses a similar classification head, but uses the cell property encoding vectors as features.
3.6 Overview of the workflow for the in-domain and out-domain evaluation settings. In the in-domain evaluation setting, the classification model is trained on the whole training set from the target dataset. In the out-domain evaluation setting, the classification model is trained on out-domain datasets, plus sample documents from the training set of the target dataset.
3.7 2D visualization of cell embeddings for the CIUS dataset. The numbers of TA, D, MD, B, LA, and N points in this plot are 5486, 21210, 914, 6376, 58041, and 2854, respectively.
3.8 Confusion matrices for training the systems within each dataset (in-domain setting).
3.9 Confusion matrices for the fully out-domain training setting, where no training documents from the target domain are introduced to the classifier.
4.1 Examples of data table types on the Web from human trafficking advertisements.
4.2 An example of table vectors shown in 2D space.
4.3 Screenshot of our table cluster labeling tool (in a Jupyter notebook). For each cluster, the system selects 5 tables near the center of the cluster and asks the user to assign a label to the cluster. In this example, the system is showing five tables from four different websites, and the user is labeling the cluster as entity.
4.4 Number of rows and columns for Web tables in each domain. Note that the ranges of the axes differ between domains.
4.5 Silhouette score for varying numbers of clusters.
5.1 Document retrieval overview. Given a keyword query and a set of documents, the goal is to calculate a relevance score for each document and rank the documents accordingly. The top-ranked documents are ideally the most relevant to the user query.
5.2 Overview of our pointwise learning-to-rank framework for table retrieval. The query is encoded into a vector q using the FastText model. The cell embedding tensor is aggregated into a table vector t using an attention mechanism. A multi-layer perceptron classifier then takes the concatenation of the q and t vectors to produce the relevance probability.
5.3 Example of annotations for a keyword query and candidate tables. Each table occurs in a Wikipedia page and includes metadata information from the page, i.e., page title, section title, and table caption. The annotated relevance score determines how relevant the table is to the query: 2 means very relevant, 1 means somewhat relevant, and 0 means not relevant. Note that this figure shows a sample of the candidate tables for the query; the dataset contains 60 candidate tables for this query.
5.4 Example of combining a Web table and its metadata information from the corresponding Wikipedia page into a single tabular document. We set the styling of top attribute cells as bold-faced, and set the font size of the metadata cells larger than that of other cells.

Chapter 1
Introduction

A vast amount of useful information from various domains, such as environment, finance, business, and socio-politics, is available in semi-structured tabular form. The organization of the data in a tabular document is governed by the tabular data layout, which can express complex multi-dimensional data relations. Common formats of tabular data found on the Web include spreadsheets, comma-separated value files, and HTML tables. Tabular data is presented in a two-dimensional matrix of tabular cells, which often contain natural language text. These data formats can incorporate rich cell styling, such as font color, to help human users interpret the data more easily.
The information in tabular documents can potentially be used in various processes, such as business intelligence, defense, and cyber crime investigation; nevertheless, it is challenging for machines to understand the complex data relations in many tabular documents. The effort to utilize the information from tabular data on the Web has a long history, and there has been research on various tasks, including answering user queries on tables. The majority of such research focuses on HTML tables with known, and rather simple, data layouts. However, a large amount of valuable data lies in tabular documents with complex data layouts that include hierarchical relationships and concatenations of disparate data. As an example, there are more than 30,000 Excel documents on data.gov containing valuable information from various domains. Figure 1.1 shows some examples of tables with complex data layouts: Figure 1.1a contains a sub-header structure, and Figure 1.1b contains a concatenation of different indicators for a population in a single table. Understanding such data can be cognitively challenging for humans, and automated techniques for table understanding still struggle to parse arbitrary tabular datasets from arbitrary domains.

Figure 1.1: Examples of tabular documents with complex data layout.

Previous approaches for understanding the complex data layout of tables focused on manually-engineered stylistic, formatting, and typographic features of tabular cells [3, 4, 5]. Examples of such manual features include background color, font size, cell data type, and the presence of capitalized letters. Such approaches have three major shortcomings. First, the proposed features are often dependent on richly-formatted documents in a particular representation (such as Excel documents or HTML), preventing such approaches from being universally applicable. In particular, a large number of published data sources are represented in textual, tab- or comma-separated formats where stylistic features are unavailable; for example, data.gov contains about 19,000 CSV files from various domains. Second, the proposed feature engineering techniques are prone to overfitting on domain-specific conventions and often require extensive re-training for domains where different conventions are used. For example, one domain may use bold-faced fonts for column header styling while another domain may use italic fonts. Third, such prior approaches have ignored the content of each cell beyond simple featurization, such as whether the content starts with a capitalized letter, discarding an essential feature for comprehending the data. In this thesis, we introduce novel methods for utilizing tables with complex data layouts.

1.1 Insights and Hypothesis

In recent years, deep neural networks (DNNs) have shown effective results in solving complex problems in areas such as natural language processing and computer vision. DNNs enable discovering complex data patterns and learning useful continuous representations from raw data without the need for manual feature engineering. Representation learning methods embed complex patterns and relationships in raw data into a vector space in which the relative position of points represents the desired patterns or relationships in the raw data. Learning such representations is often performed in an unsupervised manner, by employing large amounts of raw, unlabeled data.
One of the successful applications of such techniques was the advent of language modeling for natural language processing, which was a turning point for advances in state of the art natural language processing technology [6]. Language modeling (LM) is a task central to natural language processing and language understanding. Language models can accurately place distributions over sentences, and not only encode complexities of language such as grammatical structure, but also distill a fair amount of the knowledge that a corpus may contain. In recent years, language models have been successfully used to achieve complex tasks in different fields of research, such as information retrieval and natural language processing [7, 8, 9].

Tabular data is represented in a structured form following established principles of data organization [10, 11]. Users often follow conventional data organization rules when putting data in tabular form [12]. Therefore, distributional patterns exist in tabular data. For example, users often put headers at the top of the table, put dates in order (the header column in Figure 1.1b), and separate different parts of the document (e.g., separate tables) by empty rows or columns. Tabular data is presented in the form of natural language text; however, two key characteristics prevent off-the-shelf language models (such as BERT [9]) from being applicable to tabular documents with arbitrary layouts. First, unlike natural language text, where the building elements are single words (or n-grams), the building elements of tabular data are cell values, which may vary from a single number to multiple sentences. Second, language models assume a one-dimensional sequence of words, while tables represent a two-dimensional matrix of cells. Also, tables have a non-local nature, and important co-occurrences can be spatially diverse. For example, consider a data cell in the middle of a table, for which the column header may be many rows above and is part of its context.

Inspired by the success of language models and the shortcomings of current methods for understanding complex table data layouts, this thesis aims at devising techniques to learn meaningful representations for tabular cells as the building blocks of tabular data. The main hypothesis I investigate in this thesis is as follows: "an unsupervised representation learning framework can be devised to capture the regularities, general data patterns, and relationships between tabular cells, and employing this framework can improve the state of the art results for automated tabular understanding as well as reduce the manual effort for this process."

1.2 Dissertation's Solution and Contributions

Achieving downstream tasks on tabular data is often performed using supervised learning frameworks and requires annotated data, which is expensive because of manual curation. Moreover, such supervised frameworks are prone to overfitting to the development dataset and often need large training data to learn generalizable patterns. In this thesis, we propose an approach to build more generalizable models for downstream tasks on tabular data, and to reduce the manual effort of annotating training data for each new dataset. Our proposed approach comprises two major components: unsupervised representation learning for tabular cells (pre-training), and using the learnt cell representations in a supervised framework to achieve a target task (fine-tuning).
We propose three deep neural network architectures for learning distributed representations of tabular cells that embed cell properties and context in a dense vector space model, which we refer to as cell embedding models. Our proposed models adapt current techniques in natural language processing, and take the multi-channel cell properties (i.e., textual, positional, and stylistic) and the non-local structure of tables into consideration to enable capturing meaningful contextual information about the cells. We then propose a masked cell model (MCM), which is a self-supervised framework for training the cell embedding models on a large corpus of unlabeled tabular documents. Pre-trained cell embedding models learn general semantic and structural distributional patterns which, as we show, can be transferred to downstream tasks on tabular data using task-specific fine-tuning methods.

In this thesis, we investigate three downstream tasks on tables and propose fine-tuning frameworks to achieve them using our proposed pre-trained cell embeddings. The pre-trained cell embeddings provide a general feature space for tabular cells and help achieve downstream tasks on tabular data with less manual effort. The first task we investigate is classifying tabular cells by their role in the data layout of tables, which aims to detect the elements of complex tabular data layouts. We focus on five major cell types: left attributes, data, top attributes, metadata, and footnotes. The second downstream task is classifying Web tables by their structure and utility in HTML pages. The majority of HTML tables are used for page formatting (non-data tables), and the tables which contain useful data (data tables) can be categorized into four major categories: entity, relational, matrix, and list. We propose a semi-supervised method in which we first cluster the HTML tables using the pre-trained cell embeddings, and then annotate the formed clusters by sampling a few tables and assigning the majority vote of the sampled tables' categories to the tables in the corresponding cluster. This approach significantly reduces the manual annotation effort needed for this task compared to supervised training. The third downstream task we investigate is retrieving and ranking tables based on their relevance to a natural language query. We propose a supervised framework that employs our pre-trained cell embedding models, together with an off-the-shelf pre-trained language model, to achieve this task.

To summarize, in this thesis, we make the following contributions:

• We introduce neural cell embedding models for learning distributed representations of tabular cells which take the multi-channel and non-local nature of tables into consideration.
• We introduce a self-supervised masked cell model as a general framework for pre-training cell embedding models on a large corpus of unlabeled tabular documents.
• We introduce supervised fine-tuning frameworks for using pre-trained cell embeddings to achieve three downstream tasks on tabular data: detecting elements of complex tabular data layouts, classifying Web tables by their structural properties, and ranking tables according to a natural language query.
We also present an empirical analysis of the cell embedding vector space that results from pre-training cell embedding models on a real-world corpus of tabular documents from various domains. Chapter 3 introduces our proposed supervised framework for fine-tuning pre-trained cell embeddings for classifying tabular cells by their role in the data layout of the table; this chapter also presents empirical evaluations on three real-world annotated datasets of tables with complex data layouts. We introduce our proposed semi-supervised framework for classifying Web tables by their structural properties in Chapter 4, along with an empirical evaluation on four real-world datasets of Web tables from various domains. In Chapter 5, we introduce our method for retrieving and ranking tables according to a natural language query. Chapter 6 summarizes the related work, and Chapter 7 concludes this thesis and suggests future research directions.

Chapter 2
Unsupervised Learning of Cell Vector Representations

2.1 Introduction

Deep neural networks (DNNs) have shown state of the art results in several fields, including natural language processing. In these networks, which may contain billions of parameters, various encoding layers are used to transform the complex data representation in natural language text into real-valued feature vector representations. Training these networks for a specific task, such as sentiment analysis, in an end-to-end supervised learning setting requires a large amount of annotated data. Data annotation requires human experts, and thus is a tedious and expensive process. Recent research has shown that we can leverage the distributional data patterns in large corpora of un-annotated natural language documents to pre-train the networks, and significantly reduce the amount of annotation needed for a task by reducing the chance of overfitting to labeled data. A common pre-training method for natural language text is the masked language model (MLM), such as BERT [9]. Pre-trained language models are an essential part of state of the art systems for several natural language processing tasks, including sequence tagging [13], text classification [14], and machine translation [15]. Pre-training is often performed on a large corpus of unlabeled data, which enables capturing general patterns in the data. The resulting pre-trained vector representations embed the general data patterns and can be used as features for various downstream tasks [9]. The methods used for representation learning may differ across applications, but they often follow the same objective, which is to preserve as much information about the raw data as possible while attaining nice properties in the space of vector representations (such as independence) in order to make the subsequent learning task easier [16].

Tabular data is predominantly organized in two-dimensional matrices, where each cell in the matrix has various properties: textual value, position in the matrix, as well as rich formatting in the case of spreadsheets and HTML tables. In order to encode the important information about a tabular cell in its corresponding cell vector representation, we not only need to encode its properties, but also the important contextual information from other cells in the table. Important context cells can be spatially diverse, and a cell's context may include both nearby elements (local context) and more distant elements (global context).
As an example of nearby cell context, consider a tabular column with hierarchical headers, where the context of a lower-level header cell includes the higher-level header cell in the row above. As an example of distant cell context, consider a data cell in the middle of a table, for which the column header may be many rows above and is part of its context. In this chapter, we introduce deep neural network models that embed the important information about a tabular cell and its context, in order to capture semantic and structural distributional patterns in tabular data. We also introduce pre-training objectives to train such models on a large corpus of unlabeled tabular documents.

Figure 2.1: Overview of our masked cell model (MCM) framework for pre-training cell embeddings. First, target and context cell properties are encoded into real-valued vectors. These vectors are then fed into the cell encoder layer, which embeds all the important information into a single vector representation for the target cell (cell embedding). The decoder layer then tries to reconstruct part of the cell properties as a training objective. Note that we mask the objective part from the initial cell vector (the output of the property encoder) to prevent the network from trivially using that information to produce its objective.

Let us formally express a given tabular document D as a tabular matrix with N rows and M columns, i.e., D = {C_{i,j}; 1 ≤ i ≤ N, 1 ≤ j ≤ M}. We wish to learn an embedding operator (E) that maps information regarding a target cell C_{i,j} and its context to a k-dimensional cell embedding, E_{i,j} ∈ R^k. In this thesis, we introduce various cell encoding layers as embedding operators, and a masked cell model (MCM) framework to train these operators. Figure 2.1 shows an overview of our proposed framework for pre-training cell embeddings. Each tabular cell may have complex textual, formatting, and positional information, which cannot be used in the neural networks as is. We thus introduce property encoding methods which use pre-trained language models as well as feature transformations to encode cell property information into a vector representation. The cell encoder layer then uses the (masked) vector representations of a target cell and its potential context cells, and embeds the important cell information into the output cell embedding vector. The choice of potential context cells used in the cell encoder layer is a design decision: it can be constrained to the local neighborhood of the target cell, or relaxed to include all the other cells in the table. The decoder layer then tries to use the cell embedding vector to reconstruct some masked properties of the target cell, as the training objective of the MCM framework. The remainder of this chapter introduces our methods for the different layers of the MCM framework.
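To make the masking objective concrete, the following is a minimal sketch of one MCM training step in PyTorch. The encoder and decoder modules, dimensions, and mask layout are illustrative stand-ins for the layers described in this chapter, not the author's exact implementation.

```python
import torch
import torch.nn as nn

d, k, n_masked = 64, 32, 16        # property / embedding / masked-slice sizes (illustrative)
masked_dims = slice(0, n_masked)   # the property dimensions the decoder must reconstruct

cell_encoder = nn.Sequential(nn.Linear(d, k), nn.ReLU())  # stand-in for BoC / CGN / CTrans
decoder = nn.Linear(k, n_masked)                          # stand-in reconstruction head

v = torch.randn(10, d)             # property encoding vectors for 10 cells of a toy table
target_idx = 3
objective = v[target_idx, masked_dims].clone()

v_in = v.clone()
v_in[target_idx, masked_dims] = 0.0          # mask the objective part of the target cell

e = cell_encoder(v_in[target_idx])           # cell embedding E_{i,j}
loss = nn.functional.mse_loss(decoder(e), objective)
loss.backward()                              # one self-supervised pre-training step
```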
2.2 Property Encoder

Among the cell properties, textual content and position (row and column index) can be found in all forms of tabular data, including CSV files. Moreover, basic textual formatting such as capitalization and punctuation is also available in all data formats. Spreadsheets contain much richer formatting information, such as data type, font style, background style, and border style. In this section we focus on spreadsheets; however, some of the formatting features that we introduce are available in HTML tables as well. Figure 2.2 shows an example of cell properties in a spreadsheet. The remainder of this section introduces our methods for encoding cell properties into real-valued vectors.

Figure 2.2: Cell properties in a spreadsheet table.

2.2.1 Encoding cell textual content

In word embedding methods [17], a vocabulary of words is assumed to be available during the training stage, allowing the generation of vector representations for all words. In our problem setting, cell values in tabular documents have a large variety and may range from a single number to multiple sentences, violating this assumption. For our system to be able to use the cell values, they need to be encoded in a latent vector representation. More formally, for each cell value t, we wish to associate a vector representation V_t ∈ R^d. To obtain such vector representations, an encoder module for the cell values could be trained along with the cell encoder layer; however, in our experiments we could not achieve stable performance with such designs, which we hypothesize may be due to our computational constraints on training with a very large corpus. We address this issue by using pre-trained text encoding models. We experimented with various popular systems for encoding sentences and short texts (Universal Sentence Encoder [18], InferSent [19], and FastText [20]) to generate vector representations for cell values. We choose FastText for our model, since it showed better performance in our development experiments. Figure 2.3 shows how FastText uses sub-word embeddings to generate a textual encoding vector (V_t) for a tabular cell.

Figure 2.3: Example of applying FastText to cell textual content. In this example the text is first broken into 3-grams, and the vector for each 3-gram is retrieved from the FastText model. The final V_t vector is obtained by averaging the vectors of all 3-grams.
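As a sketch of this encoding step, the snippet below averages FastText vectors over a cell value's character 3-grams, mirroring Figure 2.3. It assumes the official fasttext Python package and a locally available pre-trained model file (the file name is illustrative); in practice, fastText's built-in get_sentence_vector performs a comparable aggregation.

```python
import fasttext
import numpy as np

# Assumes a pre-trained fastText model on disk; "cc.en.300.bin" is illustrative.
model = fasttext.load_model("cc.en.300.bin")

def cell_text_vector(text: str) -> np.ndarray:
    """V_t: the average of fastText vectors over the cell value's character 3-grams."""
    grams = [text[i:i + 3] for i in range(max(len(text) - 2, 1))]
    return np.mean([model.get_word_vector(g) for g in grams], axis=0)

v_t = cell_text_vector("Total revenue")  # a d-dimensional encoding of the cell text
```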
2.2.2 Encoding cell formatting

In this thesis, we use a set of manual features that Koci et al. [21] extracted from the rich formatting of tabular cells available in spreadsheets. Table 2.1 shows the list of 57 formatting features that we use in this thesis. Some of these features take integer values with a long-tailed distribution over different cells; for example, cell text length can range from a few characters to hundreds of characters. In order to use this feature vector in our neural network models, we perform a log transformation to mitigate the long-tailed distribution, and then perform min-max scaling (with respect to the other cells within the table) to bring the feature values into the (0, 1) domain. Figure 2.4 depicts the steps taken to generate the formatting encoding vector (V_s) for a tabular cell.

Table 2.1: List of cell formatting features used in this thesis.

Feature | Description | Domain
all upper | all text characters are uppercase | 0/1
capitalized | any character is capitalized | 0/1
contains colon | the text contains a colon | 0/1
first char num | the first character is a number | 0/1
first char special | the first character is special (e.g. %) | 0/1
in year range | the text represents an integer in the range 1985-2025 | 0/1
is alpha | the text is alpha-numeric | 0/1
leading spaces | number of leading spaces | integer
length | number of characters | integer
punctuations | punctuation (e.g. ",") is present | 0/1
special chars | special characters are present | 0/1
words | number of words | integer
words like table | words similar to "table" are present | 0/1
words like total | words similar to "total" are present | 0/1
num of neighbors=0 | number of non-empty neighbors is 0 | 0/1
num of neighbors=1 | number of non-empty neighbors is 1 | 0/1
num of neighbors=2 | number of non-empty neighbors is 2 | 0/1
num of neighbors=3 | number of non-empty neighbors is 3 | 0/1
num of neighbors=4 | number of non-empty neighbors is 4 | 0/1
h alignment=0 | horizontally aligned to the left | 0/1
h alignment=1 | horizontally aligned to the center | 0/1
h alignment=2 | horizontally aligned to the right | 0/1
v alignment=0 | vertically aligned to the top | 0/1
v alignment=1 | vertically aligned to the center | 0/1
v alignment=2 | vertically aligned to the bottom | 0/1
border bottom type=0 | the bottom border is thin | 0/1
border left type=0 | the left border is thin | 0/1
border right type=0 | the right border is thin | 0/1
border top type=0 | the top border is thin | 0/1
border bottom type=1 | the bottom border is medium | 0/1
border left type=1 | the left border is medium | 0/1
border right type=1 | the right border is medium | 0/1
border top type=1 | the top border is medium | 0/1
border bottom type=2 | the bottom border is thick | 0/1
border left type=2 | the left border is thick | 0/1
border right type=2 | the right border is thick | 0/1
border top type=2 | the top border is thick | 0/1
fill pattern=0 | no fill pattern exists | 0/1
cell borders=0 | no cell borders exist | 0/1
cell type=0 | cell type is text | 0/1
cell type=1 | cell type is numeric | 0/1
cell type=2 | cell type is formula | 0/1
is aggr formula=1 | cell contains an aggregate formula | 0/1
is font color default | the font color is the default | 0/1
font height | font height | integer
is bold | the font is bold-faced | 0/1
is italic | the font is italic | 0/1
underline type=0 | the text is underlined | 0/1
indentation | cell indentation value | integer
is wraptext | the cell wraps text | 0/1
num of cells | number of cells (>1 for merged cells) | integer
first col num | column index; for merged cells, the left-most column index | integer
first row num | row index; for merged cells, the top-most row index | integer
is bg color default | the cell background color is the default | 0/1

Figure 2.4: Example of generating the cell formatting encoding vector (V_s). The distribution of the length feature in the table is depicted throughout the transformation process.
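The log and min-max transformation of the integer-valued features can be sketched as follows; the feature-matrix layout and the set of integer columns are assumptions for illustration, and scaling is done per table, as described above.

```python
import numpy as np

def formatting_vectors(raw: np.ndarray, integer_cols: list[int]) -> np.ndarray:
    """raw: (n_cells, n_features) feature matrix for one table; returns V_s rows in (0, 1).

    Binary 0/1 features pass through unchanged; long-tailed integer features
    (e.g. length, words) are log-transformed, then min-max scaled with respect
    to the other cells of the same table.
    """
    v = raw.astype(np.float64).copy()
    for c in integer_cols:
        col = np.log1p(v[:, c])                  # mitigate the long-tailed distribution
        lo, hi = col.min(), col.max()
        v[:, c] = (col - lo) / (hi - lo) if hi > lo else 0.0  # per-table min-max scaling
    return v
```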
2.2.3 Encoding cell position

The position of a cell in the table carries useful information; for example, top attributes often appear in the top rows. We therefore employ the row and column indices of a tabular cell to calculate its positional encoding. Unlike language models, which work on one-dimensional sequences of natural text, tabular data representation is two-dimensional: for a given cell C_{i,j}, both the row index i and the column index j define the cell position. To embed this information, we use a sinusoidal approach similar to the Transformer model in [2]. For each cell at position (i, j) in the tabular matrix, we calculate a positional encoding vector V_p ∈ R^d, where d is the vector dimension. V_p is calculated according to Equation (2.1), where 0 ≤ k < d:

V_p^{(k)} = sin(i / 10000^{2k/d})        if k < d/2 and i is even
          = cos(i / 10000^{2k/d})        if k < d/2 and i is odd
          = sin(j / 10000^{(2k/d)-1})    if k ≥ d/2 and j is even
          = cos(j / 10000^{(2k/d)-1})    if k ≥ d/2 and j is odd        (2.1)

2.2.4 Property encoding vector

After calculating the textual, formatting, and positional encoding vectors for a tabular cell, we combine these vectors to compute a final property encoding vector. In this thesis we calculate this final vector according to Equation (2.2): the textual and positional encoding vectors for a tabular cell at position (i, j) are summed, and the result is concatenated with the formatting encoding vector.

V_{i,j} = [V_{t,i,j} + V_{p,i,j}; V_{s,i,j}]        (2.2)
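A NumPy sketch of Equations (2.1) and (2.2) follows, with the even/odd cases exactly as written above; the assumption that d is even (so the row and column halves have equal size) is mine.

```python
import numpy as np

def positional_encoding(i: int, j: int, d: int) -> np.ndarray:
    """V_p for the cell at (i, j); the first half encodes the row, the second the column."""
    v = np.empty(d)
    for k in range(d // 2):                      # row half
        angle = i / 10000 ** (2 * k / d)
        v[k] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    for k in range(d // 2, d):                   # column half
        angle = j / 10000 ** ((2 * k / d) - 1)
        v[k] = np.sin(angle) if j % 2 == 0 else np.cos(angle)
    return v

def property_vector(v_t: np.ndarray, v_s: np.ndarray, i: int, j: int) -> np.ndarray:
    """Equation (2.2): V_{i,j} = [V_t + V_p ; V_s]."""
    v_p = positional_encoding(i, j, v_t.shape[0])
    return np.concatenate([v_t + v_p, v_s])
```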
2.3 Cell Encoder Layer

Applying the property encoder layer to a tabular document results in a table tensor v^(T), which contains a corresponding vector for each cell in the table. Each property vector contains independent cell information, which does not carry all the necessary information without context. For example, the same text, such as "Price", may occur in vastly different contexts, such as the table title or a column header. Moreover, the formatting features of a cell are more meaningful when compared to other cells in the table; for example, a table title may have a larger font size than other cells in the table. Therefore, a cell embedding based on the cell properties alone is insufficient: in order to calculate a meaningful cell representation, the context in which tabular cells appear should be taken into account. The purpose of the cell encoder layer is to identify useful context cells, combine the important information from those context cells with the target cell information, and generate the cell embedding vector.

This section introduces our proposed model architectures for the cell encoder layer: bag of cells (BoC), cell graph network (CGN), and cell transformer (CTrans). These architectures enable various choices of potential context cells, and use multiple encoder layers to embed the cell information as well as its context into the cell embedding vector. The rest of this section goes through these methods.

2.3.1 Bag of cells (BoC)

Earlier methods for learning word embeddings relied on the local context of a word in natural language text. Mikolov et al. introduced the continuous bag of words (CBOW) and skip-gram (SG) models, which were able to embed semantic information about words in a large corpus of natural text using simple pre-training objectives. The resulting vectors showed significant improvements in various NLP tasks [22, 23]. Figure 2.5 shows an overview of the CBOW and SG models. These models assume that the co-occurrence of words in the local context in natural language text contains meaningful information, and try to learn distributed word representations from huge datasets with billions of words and with millions of words in the vocabulary. Another assumption in these models is that there exists a fixed set of words in the vocabulary; during training, they learn a representation for each word. The training objective of the CBOW architecture is to predict the current word w(t) based on the context, w(t-2) through w(t+2). The SG model has the training objective of predicting the surrounding words given the current word.

Figure 2.5: Continuous bag of words (CBOW) and skip-gram (SG) word embedding models [1]. CBOW is shown on the left and SG on the right.

Similar to natural language text, where neighboring words contain valuable distributional patterns, surrounding cells in tabular documents also contain important distributional patterns, since tabular data formation is often homogeneous along tabular rows or columns. In the bag of cells (BoC) model, we employ a method similar to CBOW and SG [17] to capture such local distributional patterns about tabular cells in the cell embedding vectors. We define the local context of a target cell as its adjacent cells to the left, right, above, and below. Based on preliminary experiments on our development set, we achieved the best performance with a neighborhood window size of 2, so our method uses 8 neighboring cells in the horizontal and vertical directions as local context. Figure 2.6 depicts the target cell and its local context. More formally, we define the local context of a target cell C_{i,j} in a tabular document D as X_{C_{i,j}} = (C_{i-2,j}, C_{i-1,j}, C_{i+1,j}, C_{i+2,j}, C_{i,j-2}, C_{i,j-1}, C_{i,j+1}, C_{i,j+2}).

Figure 2.6: Bag of cells (BoC) cell embedding model. The green cell (C_{i,j}) is the target cell for which we want to calculate the cell embedding, and the blue cells are context cells. The cell encoding vectors are fed into the Enc_ctx and Enc_t networks, which have the same structure but different parameters.

Unlike the CBOW and SG models, we cannot assume a fixed vocabulary of tabular cells; the purpose of the BoC model is to learn a transformation function (a cell embedding operator) from the target and context cells to the cell embedding vector. In our model, we obtain this cell embedding vector by concatenating the outputs of the contextual encoder (E^ctx_c) and the target encoder (E^t_c), i.e., E_{i,j} = [E^ctx_{c,i,j}; E^t_{c,i,j}].
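The following is a minimal PyTorch sketch of the BoC encoder under the assumptions stated above: two networks with the same structure but separate parameters, and the final embedding formed by concatenation. The layer sizes and the mean-pooling of the 8 context vectors before Enc_ctx are illustrative choices, not the author's exact design.

```python
import torch
import torch.nn as nn

class BoCEncoder(nn.Module):
    """E_{i,j} = [Enc_ctx(aggregated context vectors) ; Enc_t(target vector)]."""
    def __init__(self, d: int = 128, k: int = 64):
        super().__init__()
        # Same structure, different parameters, as in Figure 2.6.
        self.enc_ctx = nn.Sequential(nn.Linear(d, k), nn.ReLU(), nn.Linear(k, k))
        self.enc_t = nn.Sequential(nn.Linear(d, k), nn.ReLU(), nn.Linear(k, k))

    def forward(self, v_target: torch.Tensor, v_context: torch.Tensor) -> torch.Tensor:
        # v_target: (d,); v_context: (8, d) for the 2-cell window in each direction.
        e_ctx = self.enc_ctx(v_context.mean(dim=0))  # CBOW-style averaging (assumption)
        e_t = self.enc_t(v_target)
        return torch.cat([e_ctx, e_t])               # cell embedding E_{i,j}

emb = BoCEncoder()(torch.randn(128), torch.randn(8, 128))
```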
2.3.2 Cell graph network (CGN)

The BoC model utilizes the local context of a target cell to calculate its cell embedding vector. We can formulate such local dependencies as a graph structure with local connectivity between tabular cells. In this section we introduce a cell embedding model based on such a graph structure, and we use a graph neural network (GNN) architecture to compute cell embeddings from the graph representation. Moreover, to bring distant context information into the cell embeddings, we add global nodes to the graph, which reduces the number of hops needed to reach a distant cell in a graph traversal. In the remainder of this section we first go through the GNN architecture, and then introduce our proposed CGN model.

Graph neural network (GNN)

Graphs are data structures which model a set of objects (nodes) and their relationships (edges). Graphs have great expressive power and are easily interpreted, so they are used in various areas including social networks [24], physics systems [25], and knowledge graphs [26]. Graph neural networks (GNNs) [27] are deep learning models that operate on the graph domain. We employ the message passing neural network (MPNN) framework [28] to formally introduce GNNs; MPNN outlines a general framework covering a large category of GNNs [29]. In this framework, each node v in the graph has a hidden state h^t_v at the current time t, and each edge e_vw has edge parameters x_{e_vw}. At each step of message passing, a node receives the hidden states of its neighbors, and by applying an aggregation function the node calculates a new hidden state h^{t+1}_v for itself. There are two main parts in each message passing step: obtaining the message from neighboring nodes, and updating the hidden state of the node. Equation (2.3) formalizes a general form of these two steps: at each message passing step, a node v receives the current hidden states of its neighbors (the h^t_w's) and creates a message m^{t+1}_v; the updated hidden state h^{t+1}_v is then calculated based on m^{t+1}_v and the current hidden state h^t_v. Note that M and U are functions with learnable parameters.

m_v^{t+1} = Σ_{w ∈ N(v)} M(h_v^t, h_w^t, x_{e_vw})
h_v^{t+1} = U(h_v^t, m_v^{t+1})        (2.3)

After performing a certain number of message passing steps, each node in the graph has a final hidden state vector which aggregates the information from its local neighborhood in the graph. This local neighborhood can include distant nodes if we repeat the message passing step enough times; the number of message passing steps is a design choice and is problem specific.

Figure 2.7: A simple directed graph where each node has a hidden state vector. The dark blue node receives messages from its two neighbors in neural message passing.

Figure 2.7 shows an example of a simple graph with hidden state vectors for each node (the h_v's) and edge properties (the x_{e_vw}'s). There are various forms of GNNs based on the choice of message propagation method, which are introduced in detail in [30, 29]. The variant we use in this thesis is the graph attention network (GAT) [31]. In GAT, attention mechanisms are used to control the contribution of each neighboring node w to the message vector m^{t+1}_v. Equation (2.4) shows the update rule for the GAT network: each neighboring hidden state h^t_w passes through a fully connected layer with parameters W^{t+1}; the message vector m^{t+1}_v is then calculated as a weighted sum of the resulting vectors, where the weights α^{t+1}_{vw} are learned through the attention mechanism in Equation (2.5); and the updated hidden state h^{t+1}_v is obtained by applying a sigmoid activation σ to the message vector m^{t+1}_v.

m_v^{t+1} = Σ_{w ∈ N(v) ∪ {v}} α_{vw}^{t+1} W^{t+1} h_w^t
h_v^{t+1} = σ(m_v^{t+1})        (2.4)

α_{vw}^{t+1} = softmax_w( g( a^T [W^{t+1} h_v^t || W^{t+1} h_w^t] ) )        (2.5)

Note that the attention weight α^{t+1}_{vw} is a scalar which measures the connective strength between the node v and its neighbor w. In Equation (2.5), a is a vector of learnable parameters, and the softmax function ensures that the attention weights sum to one over all neighbors of v. Moreover, the W^{t+1} are learnable parameters in the multi-head attention mechanism [2] that GAT employs, and g is an activation function (often a LeakyReLU).
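A single-head version of the GAT update in Equations (2.4) and (2.5) can be sketched as follows. This is an illustrative re-implementation, not the author's code; libraries such as PyTorch Geometric provide an equivalent GATConv layer.

```python
import torch
import torch.nn as nn

class GATStep(nn.Module):
    """One message passing step of Equations (2.4)-(2.5), with a single attention head."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.a = nn.Linear(2 * d_out, 1, bias=False)   # the learnable vector a
        self.g = nn.LeakyReLU(0.2)

    def forward(self, h: torch.Tensor, neighbors: list[list[int]]) -> torch.Tensor:
        wh = self.W(h)                                 # (n_nodes, d_out)
        out = torch.empty_like(wh)
        for v, nbrs in enumerate(neighbors):
            idx = nbrs + [v]                           # N(v) ∪ {v}
            scores = self.g(self.a(torch.cat(
                [wh[v].expand(len(idx), -1), wh[idx]], dim=1)))
            alpha = torch.softmax(scores, dim=0)       # attention weights α_vw
            out[v] = torch.sigmoid((alpha * wh[idx]).sum(dim=0))  # h^{t+1}_v
        return out

# A 5-node path graph: node v's neighbor list gives N(v).
h1 = GATStep(16, 16)(torch.randn(5, 16), [[1], [0, 2], [1, 3], [2, 4], [3]])
```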
Cell graph presentation

To express a tabular document as a graph suitable for a GNN, we need to define nodes, edges, and node feature vectors. We consider the tabular graph as a grid of tabular cells, where each node represents a table cell (C_{i,j}). The edges in the tabular graph connect adjacent cells, i.e., the cells at the top, right, bottom, and left. Figure 2.8 shows the cell graph presentation for our example table; note that in this graph, we assign a single node to a block of merged cells.

Figure 2.8: An example of a cell graph. Nodes in the graph represent cells (or blocks of merged cells), and edges connect adjacent cells.

The graph representation in Figure 2.8 uses the spatial adjacency of cells to connect the edges. This helps with local propagation of information in the GNN, and with more message passing steps the cell embeddings can capture information from more distant cells. However, tables have a non-local nature, and this way of propagating information is not efficient. In our model, we therefore add global nodes to facilitate the propagation of information between distant nodes. More specifically, we add three types of global nodes: row nodes, column nodes, and a table node. A row (or column) node connects to the nodes in the corresponding table row (or column). The table node, which is a single node for the whole table, connects to all the row and column nodes. Figure 2.9 shows an example of the tabular graph with row, column, and table nodes.

Figure 2.9: Row, column, and table nodes in the tabular graph, which facilitate information propagation between distant cells in the GNN.

Table GAT formulation

Up to here, we introduced the tabular graph, in which nodes represent tabular cells. Per the GAT formulation in (2.5), we start from initial node feature vectors (h^0_v), and after t message passing steps each node has a latent representation (h^t_v). In our tabular graph, we use the cell encoding vectors introduced in Section 2.2 as the initial node feature vectors of the cell nodes. We set the initial global node features as learnable parameters; more specifically, we introduce three parameter vectors H_r, H_c, and H_t as the feature vectors of the row, column, and table nodes, respectively. These parameter vectors are trained along with the other GAT parameters. Note that while the initial feature vectors of all row (or column) nodes are the same, the latent hidden states of those nodes may differ, since they receive messages from different cells.

As mentioned earlier, the number of message passing steps in GAT is a design choice; we choose to have 4 layers of GAT in our model, as shown in Figure 2.10. The first GAT layer takes the cell encoding vectors (and global node feature vectors), performs a single message passing step, and outputs updated node hidden states. Note that the dimension of the input hidden state vectors in GAT can differ from that of the output hidden state vectors; in our model, the first GAT layer outputs hidden state vectors with a chosen model dimension d. Each later GAT layer takes the output of the previous layer, performs a message passing step, and outputs the updated node hidden states; these later layers do not change the hidden state vector dimensions. The last GAT layer produces the cell embeddings for all the cells in the table (the E_{i,j}'s) in its output.

Figure 2.10: GAT cell graph network. Each GAT layer performs a single message passing step.
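Constructing the node list and edges of this tabular graph, including the row, column, and table global nodes, might look as follows; the edge-list representation and node numbering are assumptions for illustration, and merged-cell handling is omitted.

```python
def build_cell_graph(n_rows: int, n_cols: int):
    """Return (n_nodes, edges) for the grid graph plus row/column/table global nodes.
    Cell (i, j) is node i * n_cols + j; merged-cell blocks are not collapsed here."""
    cell = lambda i, j: i * n_cols + j
    edges = []
    for i in range(n_rows):
        for j in range(n_cols):
            if j + 1 < n_cols:
                edges.append((cell(i, j), cell(i, j + 1)))   # left-right adjacency
            if i + 1 < n_rows:
                edges.append((cell(i, j), cell(i + 1, j)))   # top-bottom adjacency
    row0 = n_rows * n_cols                   # first row global node
    col0 = row0 + n_rows                     # first column global node
    table = col0 + n_cols                    # single table global node
    for i in range(n_rows):
        edges += [(row0 + i, cell(i, j)) for j in range(n_cols)]  # row node to its cells
        edges.append((table, row0 + i))                           # table node to row node
    for j in range(n_cols):
        edges += [(col0 + j, cell(i, j)) for i in range(n_rows)]  # column node to its cells
        edges.append((table, col0 + j))                           # table node to column node
    return table + 1, edges

n_nodes, edges = build_cell_graph(4, 3)
```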
2.3.3 Cell transformer (CTrans)

The BoC model which we introduced in Section 2.3.1 utilizes only the local cell context in the cell embedding vectors. The CGN model in Section 2.3.2 tries to bring in distant context information by adding global nodes. In this section, we introduce our proposed cell transformer (CTrans) model, which utilizes the contextual information from cells in the same row and column as a target cell to compute its embedding vector. We first explain some background on Transformer models in natural language processing, and then introduce our proposed CTrans model.

Transformer model

Language models try to learn word (or n-gram) representations by modeling the sequence of words in natural language text [32]. Such models often use encoder-decoder architectures, where they encode the word sequence with some masked information in the encoder layer, and the decoder layer tries to generate a target sequence and recover the masked information [2]. We introduced the CBOW and SG models in Section 2.3.1, where the local context of a word in the sentence was used for learning distributed word representations. In recent years, language models have been shown to be central to natural language processing tasks, and several improved models have been introduced to capture more accurate word representations [32]. The local word context, however, does not always contain the necessary information, and important contextual information can be in more distant words in the sentence, or even in other sentences. To incorporate such information, attention mechanisms are essential components of language models. In an attention mechanism, the objective is to find proper attention weights for various pieces of the word sequence, according to the importance of those pieces to the target word.

Figure 2.11: An example of an attention mechanism in generating a target sequence (y) using the source sequence (x). Attention weights are calculated for each element of the source sequence, for each given step in generating the target sequence. A feed-forward network is then used to predict the next token in the target sequence. The formulation on the right formally describes this process.

Attention mechanisms emerged from sequence to sequence models for machine translation [33]. Earlier works used recurrent neural networks, especially long short-term memory networks (LSTMs), to encode the input sequence, and used an attention mechanism to generate the target sequence in another language [33]. Self-attention mechanisms were later introduced for the language modeling task, where a sentence is used as both the input and the target sequence [34]. In sequence to sequence models, the objective is to generate a target sequence given the source sequence. The attention mechanisms help to identify important information in the source sequence when predicting the next word in the target sequence. There are various attention-based models for this task [35]; here we focus on dot-product global attention [35]. Figure 2.11 shows an example of such attention-based models. The objective of the network is to predict the correct y_2, which is the next word in the target sequence. To capture the dependency of words in the sequence, LSTM chains are applied on the target (y) and source (x) sequences. h_1^t is the current hidden state of the target LSTM, and the h_i^s's are hidden states of the source LSTM. The attention mechanism calculates the attention weights (a_1^{(i)}'s) using a dot-product formula, and combines the source hidden states into a context vector c_1, which is then used for predicting y_2.

Figure 2.12: Scaled dot-product attention. The network maps a query (q) and a set of key-value pairs (k, v) to an output. The q, k, and v vectors can be packed together into Q, K, V tensors respectively. The scaling layer prevents gradient overflow, and the mask layer enforces only forward propagation of information in the target sequence (similar to a uni-directional LSTM network).

Research from Google showed that LSTM encoders are not efficient, and proposed a model purely based on the dot-product attention mechanism, which is more efficient, can handle longer sequences, and performs better compared to the LSTM-based models [2]. The sequence to sequence models based on this mechanism are called Transformers, and they are central to many state-of-the-art methods in natural language processing and machine translation [32].
Figure 2.12 shows an overview of the dot-product attention mechanism in Transformers [2]. This attention network is similar to the one in Figure 2.11, and maps a query and a set of key-value pairs to an output. The output of the network is computed as a weighted sum of the values, where the attention weight assigned to each value is computed by a compatibility function of the query with the corresponding key. Q, K, and V are tensors containing all the query, key, and value vectors. Equation (2.6) shows the formula for the attention mechanism. Here, d_k is the dimension of the key vectors. Note that in a self-attention mechanism, the Q, K, and V tensors are the same.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V    (2.6)

Figure 2.13: The Transformer model architecture [2].

Figure 2.13 shows the full sequence to sequence architecture of the Transformer model. The Transformer contains an encoder network which acts similarly to the source LSTM network in Figure 2.11. This encoder takes the source sequence, and contains a scaled dot-product self-attention layer followed by a feed-forward layer. Note that the encoder network may contain multiple encoder layers. The decoder takes the target sequence, and first applies a masked scaled dot-product self-attention layer. The masking prevents information flow from words in the sequence to previous positions. This masked self-attention layer acts similarly to the target LSTM network in Figure 2.11. The output of the encoder network and of the decoder self-attention layer is then fed into another scaled dot-product attention layer followed by a feed-forward layer. The network ultimately tries to predict the next word in the target sequence (p(y_t | y_{<t}, x)), similar to Figure 2.11. The Transformer model combines positional encoding vectors with the input and target sequences to carry information about the position of each word in the sequence. This model allows the computations to be done in parallel (as opposed to LSTM encoders) and hence improves the efficiency of the attention model. Moreover, the attention mechanism can be computed on segments of the input tensors in a multi-head scaled dot-product attention network [2] to further improve efficiency.
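For reference, equation (2.6) can be transcribed directly into a short PyTorch function; this is a sketch that omits the masking and multi-head splitting discussed above.

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (..., seq_len, d_k); in self-attention all three are the same tensor
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)                # attention weights
    return weights @ V                                 # weighted sum of the values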
Cell transformer model

A tabular cell C_{i,j} is associated with two sequences of cells, i.e., the ith row and the jth column. In our CTrans model, we combine the row and column sequences and use an encoder layer similar to the Transformer architecture for generating cell embeddings. For each target cell, the self-attention mechanism identifies relevant context cells in the corresponding table row and column. Examples of such relevant context cells are the column header(s) and row attribute(s). In the remainder of this section, we describe our CTrans model.

We first need to define the source tensor for the Transformer encoder. In Section 2.2 we introduced the cell property encoding method, which maps each tabular cell C_{i,j} to a continuous vector v_{i,j}. Note that these vectors also contain the positional cell information, as we explained in Section 2.2.3. Given a tabular document and a target cell C_{i,j}, we first concatenate the sequences of cell encoding vectors in the corresponding row (v_{i,:}) and column (v_{:,j}) and form a tensor x.

Figure 2.14: Creating a source sequence of cells for a target cell (highlighted) in the tabular document. The corresponding row and column are concatenated to form the source sequence. Note that the target cell appears twice in the source sequence.

Figure 2.14 shows an example of this source sequence. In order to differentiate the row context and the column context in the input sequence, we define two parameter vectors V_row and V_col, which respectively correspond to the context from the row and column cell sequences. We create the source tensor x̄ by adding the corresponding parameter vector to the cell encoding vectors in x, as described in equation (2.7); a sketch of this construction is given at the end of this subsection. Note that V_row and V_col are learned along with the other parameters during training.

\bar{x}^{(k)} = \begin{cases} x^{(k)} + V_{row}, & x^{(k)} \text{ in row context} \\ x^{(k)} + V_{col}, & x^{(k)} \text{ in column context} \end{cases}    (2.7)

Figure 2.15: Overview of the CTrans model. For the target cell C_{i,j} (highlighted), the output of the network is the cell embedding vector E_{i,j}. x is the tensor containing all the encoding vectors for cells in the context sequence. x̄ is the context source tensor.

Figure 2.15 shows an overview of the CTrans model. For each target cell C_{i,j}, the output of the network is the cell embedding vector E_{i,j}. In this figure, the tensor x contains all the encoding vectors for cells in the context sequence. Also, x̄ is the context source tensor, which is fed into the CTrans encoder block, where multiple levels of Transformer encoder layers are applied to compute the cell embedding E_{i,j}. The CTrans block is depicted in Figure 2.16.

Figure 2.16: CTrans encoder block. L layers of Transformer encoder apply to the source tensor x̄, and the output tensor of the last layer is summed up to generate the cell embedding vector E_{i,j}.

We use several layers of Transformer encoders, similar to [2], to transform the source tensor x̄ into a latent tensor o. The output tensor o has the same shape as x̄, and contains a vector associated with each cell in the context cell sequence. To achieve a single embedding vector E_{i,j} for the target cell, we sum up all these vectors and normalize the result.
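The following is a minimal PyTorch sketch of building the context source tensor of equation (2.7) for a single target cell; the function name and tensor shapes are illustrative assumptions.

import torch

def build_source_tensor(cell_enc, i, j, V_row, V_col):
    # cell_enc: (n_rows, n_cols, d) cell property encoding vectors v_{i,j}
    # V_row, V_col: (d,) learnable context-type parameter vectors
    row_ctx = cell_enc[i, :, :] + V_row     # the i-th row, marked as row context
    col_ctx = cell_enc[:, j, :] + V_col     # the j-th column, marked as column context
    # concatenate the two sequences; the target cell appears once in each
    return torch.cat([row_ctx, col_ctx], dim=0)  # x-bar: (n_cols + n_rows, d)

The resulting tensor is fed into the stack of Transformer encoder layers, whose output vectors are summed and normalized to produce E_{i,j}.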
2.4 Decoder Layer and Training Objective

The masked cell framework we introduced earlier follows an encoder-decoder architecture. We introduced various contextual encoding models in Section 2.3. In this section, we propose architectures and objectives for the decoder layer in order to train the model on unlabeled tabular documents. The decoder layer tries to recover some masked property of a target cell in the tabular document. The choice of training objective shapes the cell embedding vector space to retain certain information. In this thesis, we consider textual and formatting cell properties, and our training objectives try to capture semantic and syntactic features of the target cell text, as well as cell stylistic features. More specifically, we first consider a cell value reconstruction objective, where the decoder layer tries to generate the target cell text. This is similar to the sequence to sequence models in language modeling and machine translation [9]. Moreover, we consider a cell feature reconstruction objective, where the decoder layer tries to predict some manually extracted features of the cell text or styling (e.g., length of text, font size, whether the font is bold-faced, etc.). Such objectives can be formulated as classification or regression problems. In the remainder of this section, we introduce our proposed decoder architectures for the cell text reconstruction and cell feature reconstruction objectives. We also propose a multi-task decoder layer which combines the different objectives, in order to learn a cell embedding vector space that captures various cell information.

2.4.1 Cell value reconstruction

We use a forward GRU network to reconstruct the textual value of a cell from the cell embedding vector, similar to [36]. The output of the cell embedding layer (E_{i,j}) feeds into the hidden input of the first GRU block in our decoder, and the decoder tries to reconstruct the masked cell text character by character. Formally, let us focus on a target cell C_{i,j} and denote its textual value as T_{i,j} and the output of the cell embedding layer as E_{i,j}. T_{i,j} can be denoted as a sequence of characters [<bs>, t_{i,j}^1, ..., t_{i,j}^K, <es>], where <bs> and <es> denote special characters indicating the beginning and end of the cell text, and K is the number of characters in the cell text. We assume there exists a vocabulary of all the characters in the corpus, which also contains <bs> and <es>. The objective of the decoder network is to predict the next character in the sequence given E_{i,j} and the previous characters, i.e., p(t_{i,j}^k | t_{i,j}^{<k}, E_{i,j}). We use a one-hot encoding scheme to introduce the previous characters to the network. Let us denote the index of a character t_{i,j}^k in the vocabulary as c_k; then the one-hot encoded vector for t_{i,j}^k has the same length as the vocabulary size V, and has a 1 at position c_k and 0's elsewhere.

Figure 2.17: Cell decoder module for reconstructing the cell textual value. The inputs to the GRU units are one-hot encoding vectors for characters in the text. The output of a GRU unit is fed into a multi-layer perceptron to predict the next character in the sequence.

Figure 2.17 shows an overview of our cell text decoder module. For each step k in the sequence, the GRU decoder tries to predict the corresponding character t_{i,j}^k, using the cell state of the GRU block and the previous characters in the sequence, i.e., p(t_{i,j}^k | t_{i,j}^{<k}, E_{i,j}). The output of the GRU network at position k contains the information about E_{i,j} and all the characters up to position k (t_{i,j}^{<k}). Therefore, we use a multi-layer perceptron block on the output of the GRU network at each step to predict the next character. We can formally describe the cell text decoder layer as in equation (2.8). Here, h_{i,j}^k is the output of the kth GRU unit, z_{i,j} is an intermediate hidden vector, and W_1, b_1, W_2, b_2 are learnable parameters.

p(t_{i,j}^k \mid t_{i,j}^{<k}, E_{i,j}) = \frac{e^{y_{i,j}^{(c_k)}}}{\sum_{k'=1}^{V} e^{y_{i,j}^{(k')}}}    (2.8)
y_{i,j} = W_1 z_{i,j} + b_1
z_{i,j} = \mathrm{ReLU}(W_2 h_{i,j}^k + b_2)
h_{i,j}^k = \mathrm{GRU}(t_{i,j}^k, h_{i,j}^{k-1})
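A minimal PyTorch sketch of this decoder follows, assuming the GRU hidden size equals the cell embedding dimension and using teacher forcing during training; the class and argument names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CellTextDecoder(nn.Module):
    """Reconstructs the cell text character by character from E_{i,j}."""
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        # one-hot character vectors feed the GRU; E_{i,j} initializes its hidden state
        self.gru = nn.GRU(vocab_size, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, vocab_size))

    def forward(self, cell_embedding, char_ids):
        # cell_embedding: (batch, hidden_dim), i.e. E_{i,j} (assumes d == hidden_dim)
        # char_ids: (batch, seq_len) indices of [<bs>, t^1, ..., t^K] (teacher forcing)
        one_hot = F.one_hot(char_ids, self.gru.input_size).float()
        h0 = cell_embedding.unsqueeze(0)   # initial GRU hidden state from E_{i,j}
        out, _ = self.gru(one_hot, h0)     # h^k_{i,j} for every position k
        return self.mlp(out)               # logits over the next character at each step

At training time, the logits at position k are scored against the character at position k+1 with a cross-entropy loss, matching equation (2.8).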
2.4.2 Cell feature reconstruction

In this section, we choose four cell features from Table 2.1 to be reconstructed by our decoder network. The first two are syntactic features about the cell text: length (number of characters) and words (number of words). The other two are stylistic cell features available in spreadsheets: isbold and fontheight. The fontheight, length, and words features are normalized with respect to the other cells in the table, as explained in Section 2.2, and take Float values between 0 and 1. We choose these four features since they have the most variability in our corpus. Given a target cell C_{i,j} and the corresponding output of the contextual embedding layer E_{i,j}, we use a feed-forward deep neural network that takes E_{i,j} and predicts the desired feature y_{i,j}. Note that we use a Sigmoid function on the output of the feed-forward network for Boolean features (i.e., isbold).

Figure 2.18: Cell decoder module for reconstructing cell features. The input to the feed-forward network is the cell embedding E_{i,j}, and the output ŷ_{i,j} is the prediction for the desired feature.

Figure 2.18 shows an overview of the cell feature decoder network. This network can be formally described as in equation (2.9). Here, z_{i,j}^{(1)} and z_{i,j}^{(2)} are intermediate hidden vectors, and W_1, b_1, W_2, b_2, W_3, and b_3 are learnable parameters.

y_{i,j} = \begin{cases} \hat{y}_{i,j}, & y \text{ is Float} \\ \frac{1}{1 + e^{-\hat{y}_{i,j}}}, & y \text{ is Boolean} \end{cases}    (2.9)
\hat{y}_{i,j} = W_1 z_{i,j}^{(1)} + b_1
z_{i,j}^{(1)} = \mathrm{ReLU}(W_2 z_{i,j}^{(2)} + b_2)
z_{i,j}^{(2)} = \mathrm{ReLU}(W_3 E_{i,j} + b_3)

2.4.3 Multi-task learning

We introduced various objectives and decoder architectures to train the cell embedding models. The training objective affects the cell embedding vector space in the sense that different objectives embed different information in the cell embedding vectors. We would like to embed as much general cell information as possible in the cell embedding vectors; therefore we use a multi-task learning approach to train our masked cell model framework, in order to optimize for all of the different objectives we introduced. In this thesis, we use a hard parameter sharing [37] approach for the multi-task learning, where the parameters of the layers prior to the decoder are shared between the different decoders. Figure 2.19 shows an overview of our multi-task learning approach, and a sketch of a training step follows below.

Figure 2.19: Overview of the multi-task learning framework. The various decoders share the same encoder parameters.
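A minimal sketch of one hard-parameter-sharing training step follows, assuming one loss function per decoder head; the task keys, the decoder call signatures, and the unweighted sum of losses are illustrative placeholders.

import torch

def multitask_step(encoder, decoders, losses, batch, optimizer):
    # decoders, losses: dicts keyed by task, e.g. {"text": ..., "isbold": ...}
    E = encoder(batch["table"])                  # shared cell embeddings
    total = torch.zeros(())
    for task, decoder in decoders.items():
        pred = decoder(E)                        # task-specific decoder head
        total = total + losses[task](pred, batch[task])
    optimizer.zero_grad()
    total.backward()                             # gradients flow into the shared encoder
    optimizer.step()
    return total.item()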
2.5 Empirical Analysis

The goal of representation learning is to embed the characteristics and relationships in the raw data into the vector space. Therefore, to evaluate the cell representations, we focus on designing methods to evaluate how well the characteristics and relationships of tabular cells translate into the vector space. In this section, we present empirical evaluations of the vector spaces resulting from the output of the encoder layers in our proposed MCM framework. We first evaluate the effectiveness of our proposed property encoder layer at capturing individual cell properties; in particular, we evaluate the FastText encoding vectors. We then evaluate the cell embedding vector space resulting from the cell encoder layer, and inspect the structural and semantic information about tabular cells which is embedded in the vector space. To perform these evaluations, we pre-train the MCM framework on real-world Excel sheets from data.gov and HTML tables from Wikipedia. We create a diverse corpus of tabular documents from various domains. This corpus contains about 31,000 spreadsheets, consisting of 1,369 spreadsheets from the 2010 Statistical Abstract of the United States (SAUS) (http://dbgroup.eecs.umich.edu/project/sheets/datasets.html), 1,005 spreadsheets from the Crime in the US (CIUS) data [38], and about 29,000 spreadsheets from data.gov. Our corpus also includes a sample of 200,000 HTML tables from Wikipedia in the WikiTables dataset (http://websail-fe.cs.northwestern.edu/TabEL/).

2.5.1 FastText encoding vectors

We utilized the FastText model to encode textual cell values in Section 2.2. Tabular cells contain heterogeneous textual values, and may contain a single number, a short phrase, or a long sentence. Although FastText has been shown to be very effective for encoding general natural language text [39], it has not been used for encoding numbers. In this section, we evaluate the FastText encoding for capturing syntactic information about cell textual values. Specifically, we try to predict four syntactic features from the cell text using the FastText encoding vector. These four features are as follows: isNumeric (whether the text represents a number), isInt (whether the text represents an integer number), isFloat (whether the text represents a floating point number), and hasSlash (whether the text contains a forward slash, which often occurs in date formats). These features are useful features about a numeric cell, because we often observe homogeneous numeric data in blocks of cells representing the same concept in a table. For example, in Figure 2.2 the "Evaluation Scores" column contains floating point numbers, and the "Exam Date" column contains date-formatted texts.

We use a Random Forest binary classifier for this experiment. The FastText encoding vector of the cell text is fed to the classifier, which outputs a probability for the feature being present in the text. A probability score of 0.5 or above means that the feature is predicted as present in the text. We take a sample of 1,000,000 cells from our table corpus for this experiment, and perform 5-fold cross validation, where we randomly split the cells into 5 splits and use four of the splits to train the classifier and one of them to test the performance (80% train and 20% test). Table 2.2 shows the results of this experiment, which confirm that FastText encoding vectors capture important syntactic information about cells with short textual values, especially numeric cells. A sketch of this probing setup follows Table 2.2 below.

Table 2.2: Evaluation results for predicting syntactic features about cell textual values using FastText encoding vectors. The average F1 scores and their standard deviations across folds are reported.

Feature     F1 Score
isNumeric   1.00 ± 0.0001
isInt       1.00 ± 0.0003
isFloat     1.00 ± 0.0034
hasSlash    0.92 ± 0.0084
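A minimal sketch of this probing experiment for the isNumeric feature follows, assuming a pre-trained FastText model file and a hypothetical corpus loader (load_sampled_cells); the labeling heuristic is an illustrative simplification.

import fasttext
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

ft = fasttext.load_model("cc.en.300.bin")       # assumed pre-trained FastText model

def is_numeric(text):
    try:
        float(text.replace(",", ""))
        return 1
    except ValueError:
        return 0

cell_texts = load_sampled_cells()               # hypothetical loader for the cell sample
X = np.array([ft.get_sentence_vector(t.replace("\n", " ")) for t in cell_texts])
y = np.array([is_numeric(t) for t in cell_texts])
clf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")  # 5-fold F1, as in Table 2.2
print(scores.mean(), scores.std())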
2.5.2 Cell embedding vectors

In this section, we present experimental results on the cell embedding vector space obtained from pre-training our MCM framework on the corpus of tabular documents. We investigate what cell property information is preserved in the cell embedding vectors. To this end, we first perform experiments to evaluate the cell embedding vectors on the structural information of the tabular data representation, where we focus on the cell syntactic and structural information and its role in the data layout of the table. We then perform experiments to evaluate the semantic information captured by the cell embedding vectors, where we visually inspect the local neighborhood of the cells in the vector space and search for semantic relationships.

Cell structural information

We focus on two main structural properties of a tabular cell: its data type, and its role in the tabular data layout. To evaluate the information preserved in the cell embedding vectors about these properties, we compare the cell embedding vectors with the cell encoding vectors we introduced in Section 2.2. Moreover, we use the cell embedding vector space obtained from the CTrans embedding model explained in Section 2.3. In the remainder of this section, we introduce our method for this evaluation and present the results.

In order to test the vector space for a cell property, we first categorize the cells according to the property. We then assign each cell's property category to its corresponding point in the vector space, for all the cells in the dataset. The vector space is then evaluated on how close the points belonging to one category are (category cohesion), and how far apart the points belonging to different categories are (category separation). The Euclidean distance between points is used as the distance measure, and the Silhouette score [40] is used as the cohesion and separation metric in our method. The Silhouette score ranges between -1 and 1, and the closer the score is to 1, the better the cohesion and separation of the clustering. Moreover, to visualize the cohesion and separation of categories in the vector space, we use a distance matrix of cell vectors, sorted by the categories. Supposing that there are n cells in the raw data {C_1, ..., C_n}, sorted by their categories, this sorted distance matrix (M) is a symmetric n×n matrix in which each element M_{i,j} is the distance between the vector representations of C_i and C_j. In a good clustering of the data, there will be squares of low values along the main diagonal of the sorted distance matrix representing the clusters (good cluster cohesion), and the other values in the matrix will be much larger than the values in these squares (good cluster separation). These metrics can be visually illustrated by plotting a heatmap of the sorted distance matrix: the cooler the color of the area inside the squares, the better the clustering cohesion; the warmer the color of the area outside the squares, the better the clustering separation. In our visualizations, we also use an averaged sorted distance matrix, where the distances between cell vectors are averaged for each category pair. Note that both the sorted distance matrix and the averaged sorted distance matrix are symmetric.
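A minimal sketch of this evaluation procedure follows, computing the silhouette score and the category-sorted distance matrix with scikit-learn and SciPy; the cell vectors and category labels are assumed to be given.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score

def evaluate_vector_space(vectors, categories):
    # vectors: (n_cells, d) cell vectors; categories: (n_cells,) category labels
    score = silhouette_score(vectors, categories, metric="euclidean")
    order = np.argsort(categories, kind="stable")  # group the points by category
    M = cdist(vectors[order], vectors[order])      # sorted distance matrix
    return score, M                                # plot M as a heatmap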
We perform the evaluation on spreadsheets from the Crime In the US (CIUS) data in 2007 and 2017 (https://ucr.fbi.gov/crime-in-the-u.s). Here, we refer to a single spreadsheet as a table; an Excel file may contain multiple spreadsheets. Also, empty cells (cells with no content) are removed in this experiment. Our experiments contain two scenarios. The first scenario evaluates the vector spaces on how much information they capture about the data types of cells. The second scenario evaluates the vector spaces on how much information they capture about cell role types in the data layout of the table.

In the first scenario, we put the tabular cells into four categories: date/time type (date), integer number (int), decimal number (float), and general text (str). We use high-precision pattern matching on cell textual values to identify integer and decimal numbers, and use the dateutil Python library to identify date/time values. We then assign all other cells to the general text category.

Figure 2.20: Categorizing by data type of cell textual value (first scenario). Each category contains 500 sample cells randomly selected from the annotated CIUS corpus. (a) cell embeddings, silhouette score: 0.08; (b) sentence encodings, silhouette score: -0.03.

Figure 2.20 shows the sorted distance matrices and the silhouette scores for this categorization for the cell embedding and sentence encoding vector spaces. The cell embedding vector space results in more cohesive separation of the cell data type categories, especially for date, int, and float. The str category is more scattered in the vector space and reduces the silhouette score, which we hypothesize is due to the diversity of textual values (from a single word to multiple sentences). The cell embedding vector space works much better at separating these categories compared with the sentence encoding vectors, which means our MCM framework is successful in adding structural information. Note that we are using a simple Euclidean distance measure in this experiment; the results in the previous section showed that a random forest classifier can help identify numeric cells using sentence encoding vectors.

Figure 2.21: Example of the five major categories for cell roles in tabular data layout: left attribute (LA), data (D), top attribute (TA), metadata (MD), and footnotes (N) [3, 5, 38]. MD presents meta-data information for the table and often explains what the content of a table is about. TA represents column headers. LA represents row headers. D cells are the core body of the table. N cells present additional information about the table.

For the second scenario, we use five major categories for cell roles in tabular data layout which have been introduced in previous work: left attribute (LA), data (D), top attribute (TA), metadata (MD), and footnotes (N) [3, 5, 38]. We use an annotated sample of the CIUS tabular documents for this experiment. Figure 2.21 shows an example of these cell categories. We describe these cell categories in more detail in Chapter 3.

Figure 2.22: Categorizing by role type (second scenario). Each category contains 500 sample cells randomly selected from the annotated CIUS corpus. (a) cell embeddings, silhouette score: 0.30; (b) stylistic features, silhouette score: 0.25; (c) sentence encodings, silhouette score: -0.13; (d) positional encodings, silhouette score: 0.18.

Figure 2.22 shows the sorted distance matrices and the silhouette scores for this categorization for different vector spaces: cell embeddings, stylistic features, sentence encodings, and positional encodings. We observe that the stylistic features vector space achieves a good category separation, with a silhouette score of 0.25. This result is expected, since the stylistic features are intended for separating elements of the tabular data layout for easier interpretation. Our MCM framework is successful in preserving cell information in the contextual cell embeddings vector space, and achieves the best separation of cells from the different categories, with a silhouette score of 0.3. It is noteworthy that the cell textual content alone does not carry enough information about the cell role in the data layout, and the sentence encodings vector space fails to achieve good category separation. Moreover, the positional encodings vector space shows good cluster cohesion for the left attribute cell category, which is expected given that left attributes often appear in the leftmost columns. The positional encodings vector space also shows good cluster cohesion for footnotes, which often appear in the top or bottom rows. Although the tables have a variable number of rows and our positional encodings only capture the distance of a cell from the top and left of the table, a large row index is a good indicator of footnote cells in our sampled subset; thus we observe the good cluster cohesion for the footnote cell category.
Cell semantic information

In this section, we evaluate the cell embeddings for the semantic information they contain about the textual value of the cell. We separately investigate numeric cells, as well as cells containing natural language text.

We focus on numeric cells containing decimal numbers, where the magnitude of the number carries important information. Numbers representing different properties often have different distributions; for example in the table in Figure 2.2, "Evaluation Score" takes values between 1.0 and 5.0, and "Number of Registered Students" often takes values between 20 and 700. To evaluate whether the cell embeddings carry information about the magnitude of the numeric cell values, we randomly selected ~60,000 numeric cells from our corpus of tables. We calculated the FastText sentence encodings and the cell embeddings for these sampled cells. Figure 2.23 shows the t-SNE plots for these vector spaces. The colors in these plots show the heat-map of the log-transformed value for each cell, i.e., for each numeric value x, the heat value is calculated as h = log(|x| + 1). We observe clear patterns in the sentence encodings, showing that information about the numeric data magnitude is present in the FastText vectors. The plot for the cell embeddings vector space also shows patterns in the distribution of the numeric data with different magnitudes. This result shows that our pre-training framework is successful in preserving information about numeric cells; a sketch of this visualization is given at the end of this subsection.

Figure 2.23: t-SNE plots for evaluating cell vectors for numerical cells: (a) cell embeddings, (b) sentence encodings. The plots show about 60,000 cells containing decimal numbers randomly selected from our table corpus. The color represents the log-transformed magnitude of the numbers.

To evaluate the semantic information in the cell embeddings space, we select a random sample of 100,000 cells from the WikiTables dataset, which contains tables on various topics. We calculate the cell embedding vectors for the sampled cells and visually inspect the vector space using the Google embedding projector (https://projector.tensorflow.org/) and t-SNE dimensionality reduction. Figure 2.24 shows two example cells and their nearest neighbors in this vector space. In Figure 2.24a, the selected cell contains "civil right advocate", and the sidebar shows the nearest cells in terms of the cosine distance measure. We observe that the cells nearest to the selected cell contain values related to politics. Note that these cells are randomly selected and are often from different tables. In Figure 2.24b, we select another cell containing "Russian speed skater", and we observe that the nearest neighbors in the vector space are related to sports. This result showcases that our MCM framework is successful in preserving semantic information about the cell textual value in the cell embedding vectors.

Figure 2.24: Semantic information in the cell embedding vector space. The figures show the t-SNE visualization of cell embeddings for a random sample of 100,000 cells from the WikiTables corpus using the Google embedding projector. The side-bar shows the nearest cells to the selected cell in the embeddings vector space.
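A minimal sketch of the numeric-cell visualization follows, assuming the cell vectors and their numeric values are given; the plotting details are illustrative.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_numeric_cells(vectors, values):
    heat = np.log(np.abs(values) + 1.0)            # h = log(|x| + 1)
    xy = TSNE(n_components=2).fit_transform(vectors)
    plt.scatter(xy[:, 0], xy[:, 1], c=heat, s=2)
    plt.colorbar(label="log-transformed magnitude")
    plt.show()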
2.5.3 Comparison of different cell embedding models

In the previous sections, we evaluated the cell vector space obtained from pre-training our proposed CTrans model, and compared it with the cell vector spaces from manual stylistic features and sentence encodings. In this section, we compare the cell embedding vector spaces obtained from the different cell embedding models we introduced, i.e., BoC, CGN, and CTrans. To this end, we pre-trained all of these models on the same 30,000 spreadsheets that we introduced earlier in this chapter. We evaluate the cell structure information captured by these cell embedding models, following the same scenarios as in Section 2.5.2. In the first scenario, we put the tabular cells into four categories: date/time type (date), integer number (int), decimal number (float), and general text (str). We use high-precision pattern matching on cell textual values to identify integer and decimal numbers. We then assign all other cells to the general text category. For the second scenario, we use the five major categories for cell roles in tabular data layout which have been introduced in previous work: left attribute (LA), data (D), top attribute (TA), metadata (MD), and footnotes (N).

Figure 2.25: Comparison of cell embedding models for categorizing by data type of cell textual value (first scenario): (a) BoC cell embeddings, silhouette score: 0.14; (b) CGN cell embeddings, silhouette score: 0.05; (c) CTrans cell embeddings, silhouette score: 0.08. Each category contains 500 sample cells randomly selected from the annotated CIUS corpus.

Figure 2.25 shows the sorted distance matrices and the silhouette scores for the data type categorization (first scenario). The results show that the BoC model is better at preserving cell data type information (with a silhouette score of 0.14), and that the CTrans model works better than CGN in this scenario. The cluster coherency of the str data type for the CTrans cell embedding vector space is worse compared with the BoC cell embedding vector space. Moreover, the cluster separation between the date and float clusters is better in BoC compared with CTrans. However, the CTrans model performs better at separating the str cluster from the other data types, compared with the BoC model.

Figure 2.26: Comparison of cell embedding models for categorizing by role type in the data layout of the table (second scenario): (a) BoC cell embeddings, silhouette score: 0.25; (b) CGN cell embeddings, silhouette score: 0.08; (c) CTrans cell embeddings, silhouette score: 0.30. Each category contains 500 sample cells randomly selected from the annotated CIUS corpus.

Figure 2.26 shows the sorted distance matrices and the silhouette scores for the role type categorization (second scenario). The CTrans model performs best in this scenario (with a silhouette score of 0.3), and the BoC model comes second. Note that in the CTrans model, the top attribute and left attribute clusters are not well separated, whereas in the BoC model they are well separated. On the other hand, in the BoC model, note, left attribute, and data cells are mixed and not well separated in the vector space.

Overall, the CTrans model shows more consistent results, and we also achieved better performance on the downstream tasks using the CTrans model compared with BoC and CGN in our development experiments. In the remainder of this thesis, we use the CTrans model for our experiments on the downstream tasks.

Chapter 3

Detecting Elements of Tabular Data Layout

3.1 Introduction

In Chapter 2, we introduced the MCM framework and showed its ability to capture semantic and structural cell information. In this chapter, we focus on understanding the data layout of tabular data at the cell level. In other words, we try to identify the role of each cell in the tabular data layout by classifying the cells into major role types. Solving this problem provides us with a fine-grained understanding of the tabular data layout, and existing methods are able to extract data relations from tabular data given the individual cell types [41, 21, 42]. There are different definitions and terminologies used for different roles in tabular data layouts in the literature [12, 42, 5]. In this chapter, we combine the terminologies and definitions used in [42, 5] and select five major cell types in tabular documents.
These cell types can be defined as follows:

1. Metadata (MD) cells present meta-data information for the document or part of the document. This meta-data information often explains what the content of a document (or part of a document) is about, and is necessary for understanding the relational information and semantic modeling [43] of the data within the document. For example, the top meta-data blocks in Figures 3.1c and 3.1b contain table titles and explanations of what the tables present. The inner meta-data blocks in Figure 3.1b state the categories of characteristics in the first column of the table.

2. Top Attribute (TA) cells are the headers for table columns, which can be hierarchical as in Figure 3.1a. These cells often represent the properties or attributes in the semantic model of the tabular data.

3. Left Attribute (LA) cells are the row headers; similarly to top attributes, they can be hierarchical. These cells also often represent the properties or attributes in the semantic model of the tabular data.

4. Data (D) cells are the core body of the table. These cells often represent the objects in the semantic model of the tabular data.

5. Footnote (N) cells present additional information about the document or part of the document. These cells differ from metadata cells in that they are often not necessary for understanding the relational information and semantic modeling.

Figure 3.1: Examples of table layout: (a) from DeEx, (b) from SAUS, (c) from CIUS.

The data layout of tabular documents can be complex. For example, there can be multiple levels of header rows or attribute columns, header rows may repeat throughout the document, and multiple tables may exist in the document (either separated by empty cells or glued together). Moreover, the tabular data layout allows users to present the same data in multiple ways, and users may follow different conventions when putting their data in tabular form. This results in heterogeneous data layout presentations across different datasets, and conventional cell classification models [21] fail to generalize to different datasets. Due to such complexities, the problem of cell type classification is a difficult one. In this chapter, we utilize our pre-trained cell embedding model to achieve a more generalizable cell classifier that identifies the different elements of tabular data layout.

We evaluate our method on three datasets: deexcelerator (DeEx) (https://wwwdb.inf.tu-dresden.de/research-projects/deexcelarator/), SAUS (http://dbgroup.eecs.umich.edu/project/sheets/datasets.html), and CIUS. The first two datasets have been used in previous work [5, 3]. These datasets contain tables with complex data layouts and contain data from different domains (such as financial, business, crime, agricultural, and health-care). The example documents shown in Figures 3.1a, 3.1b, and 3.1c are from the financial, crime, and health-care domains respectively. We compare the performance of our system with previous feature-based techniques [3, 5]. In our evaluations, we test our system under both in-domain and out-domain evaluation settings.
The in-domain setting investigates the trainability of our proposed representation learning and classification methods, and the out-domain setting investigates the generalizability of our system in a transfer learning scenario. In the in-domain evaluation setting, we train and test our system on each of our datasets separately. In the out-domain evaluation setting, we train the model on two of our datasets and test it on the other dataset. Our experiments show that our system performs better than the baseline systems in both of these settings.

Figure 3.2: Overview of our cell classification framework. The raw table is fed into our cell embedding model, which produces a tensor of cell embedding vectors for each cell in the table. A classification head uses the cell embedding vectors to predict the role of each cell in the tabular data layout.

3.2 Supervised Fine-Tuning of Cell Embeddings

We used our MCM framework to pre-train the cell embedding model using unsupervised objectives in Chapter 2. In this section, we fine-tune the pre-trained parameters of the cell embedding model for the cell classification task. To this end, we add a classification head which takes the cell embedding vectors and tries to predict the role of each cell in the tabular data layout. Figure 3.2 shows an overview of this fine-tuning framework. The classification head is a neural network module which produces cell label probabilities. More formally, given a tabular document containing cells C_{i,j} and a set of cell type labels L, we first use the cell embedding model to produce a cell embedding tensor E where each vector E_{i,j} corresponds to the cell C_{i,j}. Using the cell types we introduced above, we define the set of cell labels as L = {LA, D, TA, MD, N}.

We use annotated training documents to train the parameters of the newly added classification head, and to fine-tune the cell embedding model parameters. The classification head F then calculates label probabilities for each label k and cell C_{i,j}, i.e., p(l_{i,j} = k | E), where k ∈ L. We use a weighted categorical cross entropy loss function as described in equation (3.1). In this equation, D is a document in the training set, E_D is the cell embedding tensor for document D, and k_{i,j,D} is the true label for the cell in row i and column j of document D. Moreover, w_k is the weight for label k, which mitigates the class imbalance problem in training. The w_k's are calculated based on the number of cells of each class k in the training corpus.

\mathrm{loss} = -\sum_{D} \sum_{i,j} \sum_{k \in L} w_k \, \mathbb{1}_{k = k_{i,j,D}} \log p(l_{i,j} = k \mid E_D)    (3.1)

w_k = 1 - \frac{\mathrm{count}(k)}{\sum_{k'} \mathrm{count}(k')}
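A minimal PyTorch sketch of this objective follows, using the built-in per-class weighting of the cross-entropy loss; the label counts and tensors shown are illustrative stand-ins.

import torch
import torch.nn as nn

# illustrative label counts for L = {LA, D, TA, MD, N} in a training corpus
counts = torch.tensor([50000., 20000., 5000., 1000., 3000.])
weights = 1.0 - counts / counts.sum()     # w_k = 1 - count(k) / sum_k' count(k')
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 5)                # stand-in for classification-head outputs
labels = torch.randint(0, 5, (8,))        # stand-in gold cell-type labels
loss = criterion(logits, labels)          # weighted categorical cross entropy, as in (3.1)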
Prior work [4, 3] sought to capture these dependencies using graph- ical models such as CRFs, but such approaches were time-consuming to train and showed poor performance in our experiments. Our simple and elegant architecture uses two inde- pendent long short-term memory networks (LSTMs), one for rows and one for columns, each of which uses cell embeddings with a context learned from prior cells. Together, the output vectors of these LSTMs are fed into a multi-layer perceptron to calculate the label probabilities, i.e.p (l i;j =kjE). Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been successfully used to detect coarse-grain elements, such as tables and charts, in tabular documents [44]. In these works, CNNs and RNNs are used to encode tabular documents, or part of tabular documents (e.g. rows and columns). [45] uses RNN and CNN network for table type classification, and [46] uses RNNs for validating target relationships between can- didate cells in tabular data as part of their framework for knowledge base creation. However, they do not use RNN networks for classifying the cells in tabular documents. To the best of our knowledge, RNNs have not been investigated for cell level classification in tabular documents. An LSTM block observes a sequence of input vectors (x 1 ...x n ) and generates a hidden output for each vector in the input sequence (h 1 ...h n ). It also maintains an internal state, and for every vector in the input sequence, the hidden output of the LSTM is a function of its state, the input vector, and its previous output. An LSTM maintains information about 60 Figure 3.4: RNN-based cell classification head. lstm row observes the cell embeddings in table rows, andlstm col observes the cell embeddings in table columns. Outputs of these two LSTM networks are aggregated and used to predict the prediction probabilities. arbitrary points earlier in the sequence and is able to capture long-term dependencies, which is especially useful for tabular documents. For example a top attribute may be followed by a long sequence of data cells in its column and it is useful for the classification method to remember the top attribute when classifying the data cells. Tabular formats impose cell dependencies in both its rows and columns. To capture both of these dependencies, we couple two LSTM networks (with different parameters), one observing the sequence of cells in each row, and the other observing the sequence of cells in each column. This architecture gives the LSTM blocks the ability to consider the cells on the left and above the target cell, when generating the output for the target cell. For example, in Figure 3.1a, when classifying the cell F5 with value of Male, the row LSTM cell state has information about the cells on the left which are all header cells. 61 Figure 3.4 shows the overview of our cell classification framework. Given a document withN rows andM columns, we first generate embedding vectors for each cell in the doc- ument as explained in the previous section. We then pad the document with special vectors to distinguish borders of the document. We use 1 for left and right padding cells, and -1 for top and bottom padding cells. The result is a tensor (T D ) of size (N + 2) (M + 2)d 0 . There will beN + 2 row sequences andM + 2 column sequences for the document. 
3.2.2 RNN-based classification head

In this section, we introduce a novel cell classification head using recurrent neural networks (RNNs). The RNN model introduces additional long-range dependencies and context on top of the cell embeddings, which is especially useful for our BoC model (see Section 2.3.1), which only captures the local context of the cells. Our proposed CGN and CTrans cell embedding models (see Sections 2.3.2 and 2.3.3) capture these long-range dependencies within the cell embeddings. Prior work [4, 3] sought to capture these dependencies using graphical models such as CRFs, but such approaches were time-consuming to train and showed poor performance in our experiments. Our simple and elegant architecture uses two independent long short-term memory networks (LSTMs), one for rows and one for columns, each of which uses the cell embeddings with a context learned from prior cells. Together, the output vectors of these LSTMs are fed into a multi-layer perceptron to calculate the label probabilities, i.e., p(l_{i,j} = k | E).

Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been successfully used to detect coarse-grain elements, such as tables and charts, in tabular documents [44]. In these works, CNNs and RNNs are used to encode tabular documents, or parts of tabular documents (e.g., rows and columns). [45] uses RNN and CNN networks for table type classification, and [46] uses RNNs for validating target relationships between candidate cells in tabular data as part of their framework for knowledge base creation. However, these works do not use RNN networks for classifying the cells in tabular documents. To the best of our knowledge, RNNs have not been investigated for cell-level classification in tabular documents.

An LSTM block observes a sequence of input vectors (x_1 ... x_n) and generates a hidden output for each vector in the input sequence (h_1 ... h_n). It also maintains an internal state, and for every vector in the input sequence, the hidden output of the LSTM is a function of its state, the input vector, and its previous output. An LSTM maintains information about arbitrary points earlier in the sequence and is able to capture long-term dependencies, which is especially useful for tabular documents. For example, a top attribute may be followed by a long sequence of data cells in its column, and it is useful for the classification method to remember the top attribute when classifying the data cells.

Figure 3.4: RNN-based cell classification head. lstm_row observes the cell embeddings in table rows, and lstm_col observes the cell embeddings in table columns. The outputs of these two LSTM networks are aggregated and used to produce the prediction probabilities.

Tabular formats impose cell dependencies in both rows and columns. To capture both of these dependencies, we couple two LSTM networks (with different parameters), one observing the sequence of cells in each row, and the other observing the sequence of cells in each column. This architecture gives the LSTM blocks the ability to consider the cells to the left of and above the target cell when generating the output for the target cell. For example, in Figure 3.1a, when classifying the cell F5 with the value Male, the row LSTM cell state has information about the cells on the left, which are all header cells.

Figure 3.4 shows the overview of our cell classification framework. Given a document with N rows and M columns, we first generate embedding vectors for each cell in the document, as explained in the previous section. We then pad the document with special vectors to mark the borders of the document. We use 1 for left and right padding cells, and -1 for top and bottom padding cells. The result is a tensor (T_D) of size (N+2) × (M+2) × d'. There are N+2 row sequences and M+2 column sequences for the document. To explain how our classification framework works, let us focus on the cell in row i and column j of the tensor we created (this corresponds to the cell in row i-1 and column j-1 of the original document, because of the padding), and call it the target cell. In order to classify the target cell, the row LSTM network observes the ith row, and the column LSTM network observes the jth column of the table tensor E. Then the jth hidden output of the row LSTM (h_j^r) and the ith hidden output of the column LSTM (h_i^c) correspond to the target cell. We add these two vectors and use a 2-layer perceptron with ReLU activation (similar to the previous section) to produce the probabilities of the different types for the target cell. Equation (3.2) formally describes our RNN classification head; a sketch follows below. Here, W_1, b_1, W_2, and b_2 are the MLP parameters, and θ_r and θ_c are the row and column LSTM block parameters respectively.

p(l_{i,j} = k \mid E, \theta_r, \theta_c) = \frac{e^{\hat{y}_{i,j}^k}}{\sum_{k' \in L} e^{\hat{y}_{i,j}^{k'}}}    (3.2)
\hat{y}_{i,j} = W_2 z_{i,j} + b_2
z_{i,j} = \mathrm{ReLU}(W_1 (h_j^{r,\theta_r} + h_i^{c,\theta_c}) + b_1)
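A minimal PyTorch sketch of this head for a single document follows, matching equation (3.2); the border padding is omitted for brevity, and the shapes are illustrative.

import torch
import torch.nn as nn

class RowColLSTMHead(nn.Module):
    def __init__(self, d, hidden, n_labels=5):
        super().__init__()
        self.lstm_row = nn.LSTM(d, hidden, batch_first=True)
        self.lstm_col = nn.LSTM(d, hidden, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_labels))

    def forward(self, E):
        # E: (n_rows, n_cols, d) cell embedding tensor for one document
        h_row, _ = self.lstm_row(E)                  # scan each row left to right
        h_col, _ = self.lstm_col(E.transpose(0, 1))  # scan each column top to bottom
        h = h_row + h_col.transpose(0, 1)            # h_j^r + h_i^c for every cell
        return self.mlp(h)                           # (n_rows, n_cols, n_labels) logits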
3.3 Empirical Evaluation

3.3.1 Experimental setup

Datasets

We evaluate our system on three real-world spreadsheet datasets containing tables with a significant variety of data layouts. The first dataset (DeEx), used in the DeExcelerator project (https://wwwdb.inf.tu-dresden.de/research-projects/deexcelarator/), contains 216 annotated Excel files in the financial and business domain from ENRON (https://www.cs.cmu.edu/~enron/), FUSE [47], and EUSES [48]. The second dataset, used in [42], is the 2010 Statistical Abstract of the United States (SAUS), consisting of 1,369 Excel files downloaded from the U.S. Census Bureau (http://dbgroup.eecs.umich.edu/project/sheets/datasets.html). The third dataset is from the Crime In the US (CIUS) (https://ucr.fbi.gov/crime-in-the-u.s) in 2007 and 2017, consisting of 1,005 Excel files. We use the annotations provided with the DeEx dataset, and manually annotate 200 and 250 Excel files randomly selected from the SAUS and CIUS datasets respectively. We use the XCellAnnotator Tool (https://github.com/elviskoci/XCellAnnotator) for the annotation task. We put each spreadsheet from these Excel files into a single document. This leads to 457, 210, and 268 annotated documents in the DeEx, SAUS, and CIUS datasets respectively.

Train/test split:

We perform our evaluations using a 10-fold data split, i.e., we randomly split the documents in each dataset into 10 equal-size folds. We repeat all experiments 10 times, each time taking one of the folds as the test hold-out set and the other 9 folds as the train and validation sets. Therefore 90% of the documents in each dataset are used for training and validation, and 10% for testing and reporting the results. From the 90% training documents, we randomly select 5% for tuning the hyper-parameters (e.g., the number of trees in the Random Forest, and the number of epochs to train the neural network models). Note that Koci et al. [5] use a different method for the train/test split on the DeEx dataset. In their evaluations, they use a heuristic to downsample the cells from the DeEx dataset (in order to remove the class imbalance caused by the large number of data cells compared to other types), and their train/test split is at the cell level, i.e., they shuffle all the cells in the downsampled dataset and generate random stratified train/test splits [5]. We believe that splitting by document is more appropriate, as it leads to testing performance on unseen documents, where none of the cells in the test documents have been used for training.

Baseline systems:

We compare our system with two baseline methods that have been proposed in previous work. The first baseline, proposed by Koci et al. in [5], uses a set of manually crafted cell features which cover formatting, styling, and typographic features of tabular cells. This baseline uses a Random Forest classifier to classify the individual cells in tabular documents. The second baseline, proposed in [42], also uses manually crafted formatting, styling, and typographic cell features, but uses a Conditional Random Field (CRF) classifier for cell type classification, in order to take into account cell type dependencies. We refer to the first and second baselines as RF and CRF throughout the rest of this section.

We also introduce two vanilla neural-network-based baseline systems. These two baselines use parts of our proposed cell embedding model and are intended to evaluate the benefit of our pre-trained cell embedding model. The first vanilla baseline is CV, where we use the cell vectors we introduced in Section 2.2 followed by a multi-layer perceptron to classify the cells. The second vanilla baseline is E2E, where we follow the cell classification framework proposed in this chapter but do not pre-train the cell embedding network, training the whole classification model end-to-end instead. The E2E baseline helps us evaluate the effect of our MCM pre-training framework. Figure 3.5 shows an overview of these two vanilla baselines and compares them with our full classification model using pre-trained cell embeddings (CE).

Figure 3.5: Comparison of the vanilla neural-network models, E2E and CV, with our full classification model, CE. E2E trains both the cell embedding network and the classification head end-to-end, while CE uses pre-trained cell embeddings. CV uses a similar classification head, but uses the cell property encoding vectors as features.

Experimental details:

In the experiments in this chapter, we use the CTrans cell embedding model, which is pre-trained on a corpus of about 30,000 Excel sheets from various domains (see Section 2.5). In our experiments, the sentence vector dimension is determined by the FastText model and is 300, and we choose the cell embedding dimension to be d = 256, which showed good performance and efficient training run-time in our development experiments. We use the validation set for early stopping while training the cell classification network, with the F1-macro score as the stopping criterion. As suggested in [5], we use Weka [49] for the RF baseline. [5] suggests a heuristic method for downsampling the training set to resolve the class imbalance in the training data, which involves sampling cells from the first and last rows of blocks of data cells. In our preliminary experiments, mini-batch bagging achieved better results, and given that it is a more principled approach to address class imbalance, we use mini-batch bagging for the RF baseline in our experiments. We also followed the instructions in [3] to implement the CRF baseline. Since the feature set introduced by the RF baseline is more comprehensive and covers the CRF features, we use the RF feature set for both baselines in our experiments.

3.3.2 Experimental results

We investigate two research questions in our experiments. First, we investigate whether the proposed cell embeddings learn useful information in a given domain.
To this end, we compare the performance of our system with the baseline systems in an in-domain training setting, where we fine-tune our classification model using the training set from each of our datasets. Second, we investigate whether our proposed cell embeddings are able to capture domain-agnostic and general data layout regularities. To this end, we compare our system with the baseline systems in an out-domain training setting (a transfer learning scenario), where we fine-tune our classification model on two of our datasets and test it on the third one (the target dataset). In the out-domain training setting, we also experiment with the number of training samples required from the target dataset to achieve good performance. Figure 3.6 illustrates the overview of the workflow for these training settings.

Figure 3.6: Overview of the workflow for the in-domain and out-domain evaluation settings. In the in-domain evaluation setting, the classification model is trained on the whole training set from the target dataset. In the out-domain evaluation setting, the classification model is trained on out-domain datasets, plus sample documents from the training set of the target dataset.

We also investigate the performance of the classification models on documents that are not richly formatted, such as CSV files. To this end, we use a reduced set of cell features related to syntactic features of cell values (csv features). We refer to the complete set of features, i.e., the cell features only available in richly formatted documents plus the csv features, as excel features. We perform the experiments in both the in-domain and out-domain settings with csv as well as excel features, to evaluate how much the performance of the systems depends on rich styling features.

In-domain evaluation:

In order to investigate the ability of each system to learn data layout patterns from a dataset, in this training setting we evaluate the systems on each dataset separately. In our system, we pre-train our cell embedding model on a separate dataset of about 30,000 unlabeled spreadsheets from various domains. We then fine-tune our cell embedding model, and train the newly added classification head parameters, on the train set from the target dataset. We also use the validation set to check for early stopping in the training process. We repeat this evaluation 10 times on each dataset with different random train, test, and validation sets using 10-fold data splitting. We also repeat this experiment using the csv features. Note that in our system, we set the stylistic cell vector (V_s) to zero in the csv features evaluation. Table 3.1 (see page 78) shows the F1 scores for each cell type in this evaluation, averaged over the 10 experiments. To summarize some takeaways from this table, let us focus on the research questions we described above.

Do our proposed cell embeddings capture useful information? Figure 3.7 shows a 2D visualization, obtained using the t-SNE dimension reduction method, of the cell embedding vectors obtained by applying the pre-trained cell embedding model to the tables in the CIUS dataset. The plot shows clearly defined clusters, which confirms that the pre-trained cell embedding vectors capture meaningful information about the tabular data layout.

Figure 3.7: 2D visualization of cell embeddings for the CIUS dataset. The numbers of TA, D, MD, B, LA, and N points in this plot are 5486, 21210, 914, 6376, 58041, and 2854 respectively.
These vectors are the initial cell feature space for our classification model, and we will discuss the effect of the pre-training on cell classification performance later in this section.

How well does our proposed classifier perform? To answer this question, let us focus on the excel features evaluation results and the F1-macro scores in Table 3.1. We focus on the F1-macro scores since they are a better representative of the overall classification performance, given the significant class imbalance in our classification problem. We observe that the RF method outperforms CRF in our evaluations, which we hypothesize is because RF better handles the class imbalance: it uses mini-batch bagging, while the CRF model works on all the cells in a training table. Note that the DeEx dataset contains very complex tables, and all systems tend to perform worse on this dataset compared to SAUS and CIUS. Our results show that our model (CE) outperforms the RF model on the SAUS dataset with statistical significance, and performs close to RF on the DeEx and CIUS datasets.

How well do the systems perform on documents without rich styling? To answer this question, we first compare the performance of the baseline systems on csv and excel features. The scores for RF show that the performance of RF degrades when rich styling features are unavailable, especially on DeEx, where F1-macro is 13% lower for csv features compared with excel features. CRF performance also degrades on all three datasets. CE performs better than RF by a larger margin on all datasets, and with statistical significance on the DeEx and CIUS datasets. Note that, unlike RF, the performance of our system (CE) suffers much less without the rich stylistic features; in particular, the F1-macro for CE is only 3% lower for csv features compared with excel features. Overall, our proposed cell classification model performs better than previous feature-based methods on tabular documents which lack rich stylistic features.

Does the cell embedding layer help with classification performance? To answer this question, we focus on the performance of our full model, CE, compared with the CV baseline. The CV baseline uses the cell property encodings, which are the input to the cell embedding layer in our model (see Section 2.2). Although we observe slightly better performance for CV compared to CE with csv features on the SAUS dataset, and with excel features on the CIUS dataset, the other results suggest that our full CE model outperforms the CV model. In particular, in the case of csv features on the DeEx dataset, our full model (CE) performs 15% better than CV. This result shows that our cell embedding layer is successful in bringing extra contextual information about tabular cells into the cell embedding vectors.

Is the pre-training step necessary? In our model, we use a pre-trained cell embedding model. To investigate how much the pre-training step contributes to the results, we focus on the performance of our full model, CE, compared with the E2E baseline. The E2E baseline uses the same network architecture as CE, with the difference that it initializes the weights in the cell embedding model randomly instead of using pre-trained weights. The results in Table 3.1 show that the E2E baseline performs poorly across all settings and datasets. This is expected, since learning all the parameters in the E2E model requires a much larger amount of training data than is feasible for a human user to annotate.
Analysis for different cell types: In order to better illustrate the performance of our model on each cell type, Figure 3.8 shows the confusion matrices for the CE and RF models on the different datasets. These confusion matrices are normalized by the true label counts (row-wise); the more diagonal the confusion matrix, the better the model is performing on the classification task. We can see that both our model (CE) and the RF baseline perform well on the CIUS and SAUS datasets. Our model confuses metadata cells with note cells in the SAUS dataset, while the RF model confuses metadata and note cells with left attribute cells. Overall, our model performs better and more consistently than RF on the CIUS and SAUS datasets. On the DeEx dataset, both the RF and CE models make more mistakes; in particular, note cells are misclassified as data cells, though our model makes fewer such mistakes (71% for RF compared to 57% for CE). This is because many note cells in the DeEx dataset contain numeric values, which is often a characteristic of data cells. Our model, on the other hand, confuses note cells with metadata in the DeEx dataset, which is consistent with the pattern we observed in the SAUS dataset. This can be justified, since metadata and note cells both explain extra information about the data in the table. Moreover, both the CE and RF models confuse left attribute cells with data cells in the DeEx dataset, which we hypothesize is because left attribute cells in DeEx contain abbreviated text and numerical values, which are often characteristics of data cells. Overall, the confusion matrix for the CE model is more diagonal than that of RF, which indicates better model performance.

Figure 3.8: Confusion matrices for the case of training the systems within each dataset (in-domain setting): (a) RF on DeEx, (b) CE on DeEx, (c) RF on CIUS, (d) CE on CIUS, (e) RF on SAUS, (f) CE on SAUS.
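The row-wise normalization used in Figures 3.8 and 3.9 corresponds to scikit-learn's normalize="true" option. A minimal sketch, with hypothetical predictions:

```python
from sklearn.metrics import confusion_matrix

labels = ["LA", "D", "TA", "MD", "N"]
# Hypothetical gold and predicted cell types; normalize="true" divides each
# row by its true-label count, so a perfect model yields the identity matrix.
y_true = ["N", "N", "D", "D", "LA", "MD", "TA", "D"]
y_pred = ["D", "N", "D", "D", "LA", "N", "TA", "D"]

cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm)
```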
Out-domain evaluation: The purpose of the experiments in this section is to investigate whether the classification systems can learn general conventions for presenting data in tables; that is, whether models trained on some datasets are able to generalize to new and unseen datasets. In these experiments, we train the systems on two of our datasets (train set) and evaluate the obtained models on the third dataset (test set). Similar to the in-domain experiments, we perform experiments with both the excel and the reduced set of features. We only report the evaluation scores of our model (CE) and the RF baseline in this section, since the other systems were not competitive in the in-domain evaluation setting. Table 3.2 shows the F1-macro classification scores for this experiment. Let us focus on some research questions to summarize the takeaways from these results.

Does the cell embedding model result in better model generalizability? To answer this question, let us focus on the case of fully out-domain training, i.e. when 0 training documents from the target domain are used to fine-tune the classification model. We observe that the performance of CE and RF both degrade compared to the in-domain setting, especially on the DeEx dataset. This is expected because the tables in the different datasets are very different, and the layout patterns in DeEx in particular are more complex than those in CIUS and SAUS. Our results show that our proposed method (CE) performs much better than the RF baseline in this setting, for both the csv and excel feature settings, across all target datasets. In particular, in the case of excel features on the CIUS dataset, our proposed method (CE) performs 90% better than RF in terms of F1-macro score. This result confirms that our pre-trained cell embedding model captures data layout patterns which provide a generalizable cell feature space for the classification model. One interesting observation from the results in Table 3.2 is that the CE model suffers when rich stylistic features are not available, while the RF model performs better with csv features than with excel features. This can be justified by the fact that in the RF model the feature space consists only of the manual features, and since stylistic conventions often differ across datasets, not using the stylistic features reduces the chance of overfitting to dataset-specific stylistic conventions. The CE model, however, combines stylistic features with positional and textual features, and is able to capture more complex (and potentially generalizable) patterns.

Does the cell embedding model reduce manual annotation effort? Table 3.2 shows evaluation scores for various numbers of annotated training samples from the target dataset. The fewer the annotations needed from the target domain, the more easily the classification model can be applied to a new dataset with little manual effort. Our results show that the performance of both CE and RF improves when training samples from the target dataset are introduced to the classification model, and the classification scores generally improve as more such samples are introduced. Our results also show that our proposed method (CE) performs much better than the RF baseline with fewer annotated samples; in particular, the performance of CE improves significantly even with 5 training samples on the CIUS dataset. Another interesting observation from Table 3.2 is that although csv features are more generalizable for the RF baseline when few training samples from the target dataset are provided, with more training samples (200 for DeEx, 20 for CIUS, and 5 for SAUS) excel features result in better model performance. Overall, the results in Table 3.2 confirm that our proposed pre-trained cell embeddings reduce the manual effort needed to annotate many training samples from a new dataset in order to apply the cell classification model.

Analysis for different cell types: In order to better illustrate the performance of our model on each cell type, Figure 3.9 shows the confusion matrices for the CE and RF models on the different datasets, for the case of fully out-domain evaluation (0 training documents from the target dataset). These confusion matrices show some of the same patterns we observed in the in-domain setting; for example, our model (CE) confuses metadata and note cells in the CIUS and SAUS datasets. CE also confuses metadata cells with left attribute cells in the SAUS dataset, which is because of the group-level metadata cells in SAUS (see Figure 3.1). In the CIUS dataset, both RF and CE confuse left attribute cells with data cells, which is because of the abbreviated text in such cells, which is more often a characteristic of data cells. However, our model (CE) makes far fewer such mistakes than RF.
The confusion matrices for the DeEx dataset are much noisier than those for SAUS and CIUS, which shows that the data layout patterns in DeEx are complex. Similar to the in-domain setting, both systems confuse note cells with data cells, because note cells in the DeEx dataset often contain numeric values, which is usually a characteristic of data cells. Overall, our proposed classification model (CE) produces more diagonal confusion matrices than the RF baseline on all datasets, which confirms the effect of our pre-trained cell embedding model on the generalizability of the classification model. Moreover, our system makes more justifiable errors than the RF system, such as confusing metadata and note cells, confusing top attribute and metadata cells (in SAUS, for cases where metadata appears in between left attributes), and confusing left attribute and data cells (in SAUS, for multi-level left attributes).

Figure 3.9: Confusion matrices for the fully out-domain training setting, where no training documents from the target domain are introduced to the classifier: (a) RF on DeEx, (b) CE on DeEx, (c) RF on CIUS, (d) CE on CIUS, (e) RF on SAUS, (f) CE on SAUS.

                   Per-class F1
Dataset | System  | LA        | D         | TA        | MD        | N         | Macro F1  | Accuracy
excel features
DeEx    | RF [5]  | 69.1±21.8 | 98.5±0.4  | 81.4±3.3  | 85.2±2.5  | 14.3±13.7 | 69.7±6.5  | 96.8±0.7
DeEx    | CRF [3] | 54.4±22.2 | 91.4±3.4  | 63.4±2.7  | 74.0±1.7  | 10.5±1.1  | 58.7±6.6  | 95.2±1.3
DeEx    | CV      | 54.0±29.0 | 98.1±0.5  | 80.6±2.8  | 82.5±7.1  | 21.3±19.3 | 67.3±7.8  | 96.3±1.0
DeEx    | E2E     | 0.0±0.0   | 95.1±0.8  | 0.0±0.0   | 0.0±0.0   | 0.0±0.0   | 19.0±0.2  | 90.7±1.4
DeEx    | CE      | 64.3±18.6 | 98.6±0.3  | 86.2±4.1  | 86.2±4.8  | 11.6±10.1 | 69.4±4.8  | 96.9±0.6
CIUS    | RF [5]  | 96.2±3.6  | 99.6±0.4  | 99.7±0.1  | 99.3±0.6  | 99.9±0.3  | 98.9±0.7  | 99.3±0.6
CIUS    | CRF [3] | 86.4±4.6  | 99.3±0.3  | 97.1±2.7  | 98.3±0.7  | 96.4±0.2  | 95.5±0.8  | 99.1±0.5
CIUS    | CV      | 96.8±3.0  | 99.7±0.3  | 99.7±0.5  | 99.5±0.6  | 99.7±0.3  | 99.1±0.7  | 99.4±0.6
CIUS    | E2E     | 29.0±35.7 | 91.2±2.4  | 0.0±0.0   | 23.0±28.2 | 14.9±22.9 | 31.6±17.0 | 83.5±3.2
CIUS    | CE      | 96.1±3.4  | 99.5±0.4  | 99.9±0.1  | 99.5±0.6  | 99.8±0.3  | 99.0±0.8  | 99.2±0.6
SAUS    | RF [5]  | 94.8±2.7  | 99.5±0.2  | 97.8±2.9  | 86.7±6.8  | 87.3±4.4  | 93.2±2.4  | 98.6±0.8
SAUS    | CRF [3] | 91.2±2.9  | 98.6±0.3  | 95.1±2.5  | 79.2±6.6  | 80.1±5.1  | 88.9±2.5  | 98.1±0.9
SAUS    | CV      | 94.6±2.4  | 99.4±0.2  | 97.9±2.8  | 83.3±7.9  | 86.3±6.7  | 92.3±1.8  | 98.5±0.7
SAUS    | E2E     | 67.2±19.0 | 96.3±2.1  | 59.8±39.9 | 7.4±14.8  | 37.2±32.4 | 53.6±18.3 | 89.6±5.0
SAUS    | CE      | 96.0±2.7  | 99.6±0.2  | 97.8±2.9  | 87.6±1.9  | 94.2±2.0  | 95.0±1.1* | 98.9±0.8
csv features
DeEx    | RF [5]  | 47.7±11.9 | 97.1±0.6  | 61.4±5.1  | 74.7±6.9  | 23.1±13.7 | 60.8±3.2  | 94.2±1.0
DeEx    | CRF [3] | 39.2±15.8 | 96.6±0.8  | 55.3±6.2  | 62.1±7.1  | 7.9±4.3   | 52.2±2.8  | 91.3±0.9
DeEx    | CV      | 42.3±30.0 | 97.3±0.8  | 70.7±6.4  | 68.9±11.3 | 12.1±17.3 | 58.3±7.9  | 94.8±1.5
DeEx    | E2E     | 0.0±0.0   | 95.1±0.8  | 0.0±0.0   | 0.0±0.0   | 0.0±0.0   | 19.0±0.2  | 90.7±1.4
DeEx    | CE      | 52.6±25.1 | 98.5±0.4  | 86.1±4.7  | 84.7±5.1  | 13.4±7.7  | 67.0±6.0* | 96.6±0.8
CIUS    | RF [5]  | 94.1±3.9  | 99.5±0.3  | 95.6±2.3  | 84.7±10.2 | 99.8±0.3  | 94.8±2.9  | 98.6±0.7
CIUS    | CRF [3] | 85.1±4.0  | 98.9±0.3  | 91.3±2.7  | 91.1±1.3  | 96.7±0.6  | 92.6±0.9  | 98.6±0.7
CIUS    | CV      | 96.3±2.9  | 99.5±0.3  | 98.3±1.9  | 99.5±0.6  | 98.2±0.5  | 98.4±0.8  | 99.1±0.6
CIUS    | E2E     | 56.1±29.0 | 93.1±2.3  | 57.2±29.8 | 41.8±39.9 | 32.0±29.5 | 56.0±23.5 | 87.1±4.0
CIUS    | CE      | 96.5±3.5  | 99.6±0.4  | 98.7±1.2  | 98.2±0.9  | 97.1±2.3  | 98.0±1.0* | 99.1±0.6
SAUS    | RF [5]  | 87.9±3.3  | 98.0±0.7  | 92.9±3.9  | 80.9±8.4  | 88.1±4.1  | 89.6±2.9  | 96.1±1.4
SAUS    | CRF [3] | 83.2±5.1  | 97.8±0.9  | 91.1±2.9  | 80.1±6.8  | 81.6±5.8  | 86.8±2.8  | 95.9±1.5
SAUS    | CV      | 90.8±3.9  | 98.8±0.5  | 95.4±4.0  | 69.8±9.8  | 72.4±6.2  | 85.5±1.8  | 97.2±1.2
SAUS    | E2E     | 60.9±19.4 | 81.5±22.6 | 70.1±8.2  | 0.0±0.0   | 32.2±26.8 | 48.9±13.1 | 76.1±20.2
SAUS    | CE      | 95.3±3.6  | 99.3±0.4  | 96.8±3.6  | 76.1±5.3  | 88.8±2.2  | 91.3±2.5  | 98.4±1.1

Table 3.1: Evaluation scores for the in-domain evaluation setting. DeEx, SAUS, and CIUS are the datasets. RF and CRF are baseline methods from previous work. CV and E2E are vanilla baseline models built from parts of our cell embedding model. CE is our full model, using cell embeddings pre-trained with the CTrans model. ±std shows the standard deviation of the scores across the different folds, and results marked with an asterisk (*) are statistically significant compared with the RF baseline, using a two-tailed P-value threshold of 0.05.
        Macro F1 scores for different numbers of training tables
Dataset | System | 0        | 5        | 10       | 20       | 50       | 100      | 200
excel features
DeEx    | RF [5] | 30.2±2.0 | 35.3±4.8 | 38.2±3.2 | 40.2±3.4 | 40.9±2.4 | 42.7±2.1 | 54.8±5.3
DeEx    | CE     | 36.7±2.1 | 48.5±3.9 | 48.8±4.6 | 48.9±6.1 | 54.2±9.1 | 51.6±9.1 | 59.4±6.7
CIUS    | RF [5] | 46.3±1.7 | 55.6±2.9 | 54.4±2.5 | 81.3±3.2 | 87.6±1.3 | 94.0±1.9 | 97.7±0.8
CIUS    | CE     | 88.4±2.1 | 94.6±1.7 | 95.0±1.9 | 95.7±1.2 | 97.0±0.9 | 98.4±0.6 | 99.0±0.8
SAUS    | RF [5] | 49.6±2.1 | 68.8±3.4 | 71.8±2.6 | 78.5±1.4 | 87.2±1.6 | 89.9±2.2 | 93.3±2.4
SAUS    | CE     | 81.8±3.3 | 83.4±2.6 | 84.2±2.5 | 86.9±1.8 | 93.9±1.3 | 93.3±2.0 | 94.7±1.9
csv features
DeEx    | RF [5] | 34.8±2.1 | 37.9±3.3 | 39.0±3.9 | 40.2±4.0 | 42.1±3.8 | 43.2±4.1 | 48.5±4.3
DeEx    | CE     | 37.5±1.9 | 42.1±5.5 | 41.9±6.2 | 44.3±6.7 | 49.9±9.8 | 48.9±7.7 | 49.7±7.5
CIUS    | RF [5] | 58.0±2.0 | 59.6±1.6 | 59.8±2.5 | 62.1±3.0 | 69.5±6.8 | 85.8±3.2 | 91.0±2.5
CIUS    | CE     | 77.5±1.5 | 95.1±1.2 | 94.5±1.7 | 96.2±1.1 | 95.6±1.5 | 97.3±1.2 | 98.2±1.1
SAUS    | RF [5] | 50.3±2.4 | 57.0±3.1 | 58.7±2.7 | 65.2±1.9 | 73.0±2.0 | 77.7±1.1 | 81.8±2.0
SAUS    | CE     | 72.3±2.0 | 81.4±2.1 | 81.4±2.2 | 85.7±2.5 | 88.7±3.3 | 90.1±1.8 | 92.2±1.9

Table 3.2: Evaluation scores for the out-domain evaluation setting, for different numbers of training tables from the target dataset. DeEx, SAUS, and CIUS are the target datasets.

Chapter 4
Classifying Web Tables

4.1 Introduction

There are hundreds of millions of tables in web pages. Web tables present data in a semi-structured way, which makes them useful for data integration tasks, e.g. schema auto-completion, query answering, and knowledge base (KB) completion [50]. However, web tables are designed to be visually interpretable rather than machine interpretable: they usually contain short or abbreviated text, and little supporting context for the data. In addition, web tables are also used for page formatting, and less than 20% of them contain useful data (these are called data tables) [50]. Moreover, the alignment of data in tables varies, and some previous work defined different table types by the way they present the data (data layout). Crestan et al. introduced a table taxonomy for different table types, among which entity (E), relational (R), matrix (M), and list (L) are the major layout types for data tables [11]. Figure 4.1 shows examples of these table types. We consider all other web tables which do not belong to any of these data table types as non-data (ND) in this chapter.

Table type determines the organization of information in the table as well as the type of relation between the table cells. An entity table contains multiple property names and their corresponding values for a single entity. A matrix table represents the values of a single property, for a single entity, in different dimensions (e.g. the price of gold at different times and locations). A list table contains different values for a multi-valued property corresponding to a single entity (e.g. a list of cities in the US).

Figure 4.1: Examples of data table types on the Web from human trafficking advertisements: (a) entity table, (b) matrix table, (c) list table, (d) relational table.
A relational table has property values corresponding to an entity in each row, and usually the property names appear as column or row headers. Determining the type of data layout of web tables significantly helps with understanding the data they represent [11, 45].

In Chapter 2 we proposed various methods to generate embedding representations for tabular cells. In this chapter, we introduce a methodology that employs these representations to calculate a vector representation for web tables, with the aim of forming meaningful clusters over a large corpus of web tables from a specific domain. The goal is that the table clusters represent the data organization of the web tables, so that the tables in each cluster are structurally similar (and consequently belong to the same table type). After the table clusters are formed, user supervision is used to label the clusters, by asking the user to annotate a few tables from each cluster. More formally, given a set of table types $L = \{E, R, M, L, ND\}$ and a corpus of $N$ web tables $T_i$, we wish to form $M$ clusters of tables and assign a cluster number $c_i$, $1 \le c_i \le M$, to each table in the corpus. We then ask the user to assign a set of cluster labels $\{l_1, l_2, \ldots, l_k, \ldots, l_M\}$, where $\forall k,\ l_k \in L$. We wish to find a clustering and label assignment where all the tables belonging to cluster $k$ are of table type $l_k$. In the following sections, we first present our proposed methodology for solving this problem. We then evaluate our method on four real-world datasets of HTML pages from various domains, and compare the performance of our system against state-of-the-art systems which use manually engineered features.

4.2 Methodology

4.2.1 Table vector calculation

A table type defines how different concepts are organized in a table (within table rows, table columns, and the whole table). Our hypothesis is that the cell representations may be used to analyze this organization and consequently identify the table type. For example, if we consider the relational and matrix table types, the cells in every column contain similar concepts in both table types; however, the cells in every row contain similar concepts in matrix tables but not in relational tables. One simple way to measure consistency over a set of cells is by computing the deviation from the mean and the deviation from the median of the cell vectors (common measures of variability and consistency of a sample or population). Considering both of these measures over a collection of cell vectors gives us a better understanding of the distribution of data in the corresponding cells. For example, for the cell representations within a relational table column, it is expected that the deviation from the median is small while the deviation from the mean is large, since the header cell has a very different vector compared to the data cells. As another example, for the cells in a list table column, both the deviation from the mean and the deviation from the median are expected to be small.

Figure 4.2: An example of table vectors shown in 2D space, with entity, relational, matrix, list, and non-data tables forming distinguishable groups.

Our proposed method is as follows. Given a web table $T$ with $N$ rows and $M$ columns, we first calculate the cell vectors using the method explained in Chapter 2, so that $T = \{v^c_{ij},\ 1 \le i \le N,\ 1 \le j \le M\}$. We denote deviation from the median by $dev\_median$ and deviation from the mean by $dev\_mean$. We calculate six vectors for each data table: $dev\_mean\_rows$, $dev\_median\_rows$, $dev\_mean\_cols$, $dev\_median\_cols$, $dev\_mean\_all$, and $dev\_median\_all$.
We show how the vectors for the deviation from the median are calculated in equations (4.1)–(4.3); the vectors for the deviation from the mean are calculated similarly in equations (4.4)–(4.6). In these formulas, the element-wise square is used for calculating the deviation-from-mean/median vectors, and the resulting deviations are vectors with the same length as the word embeddings. Also, note that $\mathrm{median}_i(v^c_{ij})$, $\mathrm{median}_j(v^c_{ij})$, and $\mathrm{median}_{i,j}(v^c_{ij})$ denote the median of the cell vectors in row $i$, in column $j$, and in the whole table, respectively. Similarly, $\mathrm{mean}_i(v^c_{ij})$, $\mathrm{mean}_j(v^c_{ij})$, and $\mathrm{mean}_{i,j}(v^c_{ij})$ denote the mean of the cell vectors in row $i$, in column $j$, and in the whole table, respectively.

$dev\_median\_rows(\tilde{T}) = \frac{1}{NM} \sum_i \sum_j \left( v^c_{ij} - \mathrm{median}_i(v^c_{ij}) \right)^2$   (4.1)

$dev\_median\_cols(\tilde{T}) = \frac{1}{NM} \sum_j \sum_i \left( v^c_{ij} - \mathrm{median}_j(v^c_{ij}) \right)^2$   (4.2)

$dev\_median\_all(\tilde{T}) = \frac{1}{NM} \sum_{i,j} \left( v^c_{ij} - \mathrm{median}_{i,j}(v^c_{ij}) \right)^2$   (4.3)

$dev\_mean\_rows(\tilde{T}) = \frac{1}{NM} \sum_i \sum_j \left( v^c_{ij} - \mathrm{mean}_i(v^c_{ij}) \right)^2$   (4.4)

$dev\_mean\_cols(\tilde{T}) = \frac{1}{NM} \sum_j \sum_i \left( v^c_{ij} - \mathrm{mean}_j(v^c_{ij}) \right)^2$   (4.5)

$dev\_mean\_all(\tilde{T}) = \frac{1}{NM} \sum_{i,j} \left( v^c_{ij} - \mathrm{mean}_{i,j}(v^c_{ij}) \right)^2$   (4.6)

We define the table vector as the concatenation of these six vectors. Fig. 4.2 illustrates an example of the table vectors for different table types, reduced to 2D space using t-SNE [51]. As we can see, distinguishable clusters are formed for the different data table types.
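A minimal numpy sketch of the table-vector computation in equations (4.1)–(4.6); the tensor `V` is a synthetic stand-in for the $N \times M \times d$ cell-vector tensor of one table:

```python
import numpy as np

def table_vector(V):
    """V: (N, M, d) cell-vector tensor; returns the concatenated 6*d table vector."""
    N, M, _ = V.shape
    parts = []
    for stat in (np.mean, np.median):
        for axis in (1, 0, (0, 1)):  # per-row, per-column, whole-table center
            center = stat(V, axis=axis, keepdims=True)
            parts.append(((V - center) ** 2).sum(axis=(0, 1)) / (N * M))
    return np.concatenate(parts)

V = np.random.randn(8, 4, 16)  # synthetic stand-in for one table's cell vectors
print(table_vector(V).shape)   # (96,) = 6 * d
```

The concatenation order here (mean-based then median-based deviations) is illustrative; any fixed order works, since the clustering only requires consistency across tables.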
4.2.2 Table type classification

We cluster the table vectors from a specific domain using the k-means clustering algorithm [52]. We perform a linear search over different choices for the number of clusters for each domain, and use the Silhouette score [40] to pick the best number of clusters for the k-means algorithm. Note that this is done in an unsupervised manner, and can be performed efficiently on a large set of tables from each domain. The clustering algorithm yields clusters of tables, but does not assign a label to each cluster to identify the table type. We collect the 5 nearest neighboring points of each cluster centroid, and ask the user to assign a table type to the corresponding cluster (5 is chosen to keep the annotation effort minimal). When the tables sampled from a cluster are not all of one type, the user is advised to assign the majority type as the cluster label (in a tie situation, one label would be randomly chosen and assigned to the cluster; we did not encounter a tie situation in our evaluations). Fig. 4.3 shows our annotation tool for labeling the clusters.

Figure 4.3: Screenshot of our table cluster labeling tool (in Jupyter Notebook). For each cluster, the system selects 5 tables near the center of the cluster, and asks the user to assign a label to the cluster. In this example, the system is showing five tables from four different websites and the user is labeling the cluster as entity.

4.3 Empirical Evaluation

We evaluate our method on four real-world datasets, and compare it with three state-of-the-art systems for table type classification, using ground truth tables from each domain. Some of the baseline systems come with pre-trained classification models, so we first evaluate those models on our domains. We investigate whether the proposed unsupervised method for generating table vectors is capable of capturing information about table layout type, and whether it can be used to reduce the manual effort for table type classification on a large set of web pages. To this end, we evaluate our method, along with the baseline systems, in three settings: 1) weakly supervised classification (cluster, then label the clusters), 2) supervised with few training points, and 3) supervised with many training points. We evaluate using per-class and micro F1 scores.

4.3.1 Evaluation setup

Datasets: We use raw web pages extracted from three unusual domains: human trafficking advertisements (HT), firearms trading (ATF), and the microcap stock market (MC). We selected these unusual domains because of the social-good impact that accurate knowledge extraction can have by helping investigators identify fraud on the web. In addition, as Fig. 4.4 shows, these domains are technically interesting in that they contain unusually large or small tables. We also use a random subset of the July 2015 Common Crawl (WCC) as our fourth dataset, representing a generic domain. For the unusual domains, the web pages are all crawled from websites and forums on the open web; however, due to privacy agreements the raw web pages cannot be released to the public. Table 4.1 and Fig. 4.4 show statistics of the tables in these domains.

Figure 4.4: Number of rows and columns for web tables in each domain: (a) ATF, (b) HT, (c) MC, (d) WCC. Note that the ranges of the axes differ across domains.

      | pages | tables | tables/page | tables > 1×1
ATF   | 200K  | 496K   | 2.5         | 454K
HT    | 60K   | 30K    | 0.5         | 24K
MC    | 200K  | 337K   | 1.7         | 288K
WCC   | 9K    | 24K    | 2.6         | 20K

Table 4.1: Number of pages and tables in each dataset.

Constructing groundtruth: Collecting a uniform sample of all table types is challenging because, as mentioned before, more than 80% of web tables are non-data tables. To overcome this issue, we take uniform random sample batches from the corpus of web tables, and manually collect a balanced ground truth set for the various table types. We sample 500 tables from each domain, and manually collect tables of different types from these samples. We repeat the sampling until we have at least 80 tables for each type, and after each iteration, if we already have more than 100 tables for a type, we stop adding more to the groundtruth (to avoid biasing it). We did not find any list tables in the ATF and MC domains; also, for the WCC domain we could not find enough matrix or list tables after several iterations. Table 4.2 shows a summary of our groundtruth for the different domains.

      | R   | E   | M   | L   | ND  | all
ATF   | 114 | 94  | 81  | 0   | 106 | 395
HT    | 150 | 174 | 115 | 125 | 118 | 682
MC    | 159 | 134 | 81  | 0   | 97  | 471
WCC   | 96  | 95  | 39  | 35  | 121 | 386

Table 4.2: Number of tables of each table type in the groundtruth for each dataset. R, E, M, L, and ND stand for relational, entity, matrix, list, and non-data, respectively.

Baseline systems: We choose three state-of-the-art systems: the webcommons extraction framework (WebC, http://webdatacommons.org/), the Dresden Web Table Corpus extractor (DWTC, https://wwwdb.inf.tu-dresden.de/misc/dwtc/), and TabNet [45] (the authors of TabNet provided us with their implementation; however, we do not have access to the trained model used in their paper). WebC and DWTC both use hand-crafted features, and TabNet is a neural network based model. DWTC is an improvement on WebC, and both systems come with pre-trained models which we use in our evaluations.
We only consider DWTC and TabNet when comparing with our method (since DWTC contains the WebC features).

Choosing the number of clusters: We cluster the table vectors, varying the number of clusters, to determine the optimal number of clusters for each of our domains. We then calculate the mean Silhouette coefficient for each choice of cluster number. The Silhouette coefficient gives a measure of inter- vs. intra-cluster distance for the table vectors, and uses no annotations. The number of clusters determines the amount of manual effort needed to inspect the clusters and assign them the proper table type; therefore, we try to keep the number of clusters under 30 in this thesis. Fig. 4.5 shows the Silhouette score for the different domains, varying the number of clusters from 3 to 30. The best numbers of clusters are 28, 26, 25, and 26 for ATF, HT, MC, and WCC, respectively.

Figure 4.5: Silhouette score for ATF, HT, MC, and WCC, varying the number of clusters.

4.3.2 Evaluation results

Pre-trained models: Table 4.3 shows per-class and overall classification scores for the pre-trained models contained in the WebC and DWTC frameworks. Although DWTC performs better than WebC in most cases, both systems show poor results on our datasets. This means that models trained with these methods are not transferable to new domains, and in-domain training is needed to obtain good performance.

Domain | System | R    | E    | M    | L    | ND   | F1-M
ATF    | WebC   | 0.22 | 0.06 | 0.03 | -    | 0.05 | 0.09
ATF    | DWTC   | 0.20 | 0.16 | 0.03 | -    | 0.04 | 0.11
HT     | WebC   | 0.46 | 0.38 | 0.00 | 0.00 | 0.07 | 0.32
HT     | DWTC   | 0.35 | 0.42 | 0.00 | 0.00 | 0.25 | 0.32
MC     | WebC   | 0.44 | 0.45 | 0.03 | -    | 0.04 | 0.34
MC     | DWTC   | 0.41 | 0.47 | 0.07 | -    | 0.00 | 0.37
WCC    | WebC   | 0.27 | 0.13 | 0.09 | 0.00 | 0.11 | 0.15
WCC    | DWTC   | 0.22 | 0.31 | 0.00 | 0.00 | 0.06 | 0.17

Table 4.3: Results for pre-trained baseline models.

In the following, we compare the performance of our method (TabEmb) with the baseline methods (DWTC and TabNet) under two different scenarios: weakly supervised and supervised learning.

Weakly supervised: TabNet does not fit in this setting, since it does not produce anything without training data. We use the DWTC framework to produce a feature space for the tables in each domain. This feature space is then standardized (using scikit-learn's StandardScaler method) so that it can be used by the k-means algorithm. Again, the best number of clusters is found using the Silhouette coefficient method. Table 4.4 shows the results of the classification evaluation on the different domains. Our method performs better on the unusual domains, especially the HT domain; for the generic domain (WCC), it performs very close to DWTC.
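The standardize-then-cluster pipeline, together with the cluster-count search behind Fig. 4.5, can be sketched with scikit-learn as follows; the feature matrix is a synthetic stand-in for the table vectors (or the standardized DWTC feature space):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.randn(500, 12)            # synthetic stand-in for table vectors
X = StandardScaler().fit_transform(X)   # standardization used for DWTC features

best_k, best_score = None, -1.0
for k in range(3, 31):                   # search 3..30 clusters, as in Fig. 4.5
    cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, cluster_ids)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))
```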
Supervised: The table vectors produced by our method can be used to train a classifier when training data is available. We consider two settings: one using a small training set, where we use 10% of the groundtruth data for training, and one using a large training set, where we use 90% of the groundtruth data for training. In both settings, we use the portion of the groundtruth data that is not used for training as the test set. Implementation details: we used a random forest classifier (as used in [53]) with n = 20 for the features extracted by DWTC, and a linear logistic regression classifier for the table vectors from our method.

Small training set: Table 4.4 shows the results of our system and DWTC for the different domains. Our method and DWTC perform similarly in this setting. DWTC's performance improves significantly compared to the weakly supervised setting, while the performance of our method does not improve much. We did not include TabNet here because it produces very poor results.

Large training set: Table 4.4 shows the results of DWTC, TabNet, and our system for the different domains. TabNet performs poorly in our experiment, indicating that it needs many more training samples to learn a usable model. Both DWTC and our system perform better in this setting. Our method again is very close to DWTC, performing better than DWTC in two out of the four cases.

Domain | System | R    | E    | M    | L    | ND   | F1-M
weakly supervised
ATF    | DWTC   | 0.75 | 0.96 | 0.62 | -    | 0.86 | 0.79
ATF    | TabEmb | 0.84 | 0.77 | 0.90 | -    | 0.84 | 0.83
HT     | DWTC   | 0.95 | 0.87 | 0.48 | 0.83 | 0.55 | 0.76
HT     | TabEmb | 0.97 | 0.91 | 0.82 | 0.84 | 0.76 | 0.87
MC     | DWTC   | 0.60 | 0.77 | 0.56 | -    | 0.57 | 0.64
MC     | TabEmb | 0.74 | 0.71 | 0.46 | -    | 0.75 | 0.68
WCC    | DWTC   | 0.67 | 0.77 | 0.44 | -    | 0.86 | 0.75
WCC    | TabEmb | 0.68 | 0.72 | 0.29 | -    | 0.89 | 0.74
10% training
ATF    | DWTC   | 0.87 | 0.93 | 0.92 | -    | 0.85 | 0.87
ATF    | TabEmb | 0.84 | 0.78 | 0.86 | -    | 0.74 | 0.82
HT     | DWTC   | 0.95 | 0.93 | 0.90 | 0.83 | 0.58 | 0.86
HT     | TabEmb | 0.95 | 0.93 | 0.89 | 0.85 | 0.77 | 0.89
MC     | DWTC   | 0.70 | 0.89 | 0.45 | -    | 0.77 | 0.72
MC     | TabEmb | 0.72 | 0.81 | 0.53 | -    | 0.72 | 0.71
WCC    | DWTC   | 0.70 | 0.76 | 0.45 | -    | 0.85 | 0.76
WCC    | TabEmb | 0.84 | 0.81 | 0.79 | -    | 0.64 | 0.79
90% training
ATF    | DWTC   | 0.94 | 0.97 | 0.96 | -    | 0.94 | 0.95
ATF    | TabNet | 0.24 | 0.12 | 0.34 | -    | 0.18 | 0.25
ATF    | TabEmb | 0.90 | 0.93 | 0.92 | -    | 0.88 | 0.91
HT     | DWTC   | 0.98 | 0.98 | 0.97 | 0.84 | 0.79 | 0.93
HT     | TabNet | 0.06 | 0.40 | 0.03 | 0.21 | 0.18 | 0.26
HT     | TabEmb | 0.99 | 0.98 | 0.97 | 0.94 | 0.92 | 0.97
MC     | DWTC   | 0.82 | 0.93 | 0.71 | -    | 0.89 | 0.84
MC     | TabNet | 0.30 | 0.26 | 0.21 | -    | 0.06 | 0.24
MC     | TabEmb | 0.80 | 0.91 | 0.74 | -    | 0.86 | 0.83
WCC    | DWTC   | 0.81 | 0.88 | 0.62 | -    | 0.89 | 0.84
WCC    | TabNet | 0.09 | 0.41 | 0.13 | -    | 0.25 | 0.27
WCC    | TabEmb | 0.93 | 0.93 | 0.93 | -    | 0.87 | 0.92

Table 4.4: Evaluation results. TabEmb is our method. R, E, M, L, ND, and F1-M stand for relational, entity, matrix, list, non-data, and F1-micro score, respectively.

Evaluation discussion: Our system achieves competitive performance compared to the baseline systems in our evaluations, which showcases the quality of our proposed cell embeddings. Our method is particularly valuable for unknown domains, since it can simply be run on a corpus of web pages to form meaningful table clusters. Users are then able to inspect a small number of clusters and assign a table type to each cluster. In contrast, to train the classification models of the other systems, users are required to find a large number of tables and label them. Given that the majority of tables in web pages are non-data tables, finding a balanced training set requires a lot of user effort.

Chapter 5
Retrieving Tables for Natural Language Query

5.1 Introduction

Finding relevant documents for a keyword query has a long history of research and has been at the core of Web search engines [54]. In this problem, the user provides a keyword query and the goal is to return a ranked list of relevant documents. Figure 5.1 shows an overview of the document retrieval problem. The majority of previous work in this area focuses on documents containing unstructured text, such as web pages [55]. Conventional document retrieval methods use the terms in the documents and the user query, creating term-frequency and document-frequency vectors for all the terms in the user query as well as the terms in each document. They then use various scoring methods, for example BM25 [55], to predict the relevance of each document to the query. In recent years, pre-trained language models (such as BERT) have shown state-of-the-art results for the document retrieval problem [56].
Tabular documents contain useful domain information which can help answer user queries. Tables enable a compact representation of disparate data which is efficient for human interpretation; however, it is challenging for machines to interpret tabular documents. Conventional document retrieval methods fail to perform well on tables because tables often contain short text and abbreviations, which differ from unstructured text. Previous research on table retrieval focuses on relational tables with rather simple data layouts, and tries to apply different scoring methods to various parts of the table [57, 58]. More recently, researchers have introduced methods for applying pre-trained language models to retrieve such relational tables relevant to a user query [59].

Figure 5.1: Document retrieval overview. Given a keyword query and a set of documents, the goal is to calculate a relevance score for each document and rank them accordingly. The top-ranked documents are ideally the most relevant to the user query.

In Chapter 2, we introduced the MCM framework and showed its ability to capture semantic and structural cell information. In this chapter, we propose a method for aggregating the cell embedding vectors in a tabular document and generating a table vector representation to tackle the table retrieval problem. Our method uses a deep neural network on top of the cell embedding model to calculate query and table vectors and produce a relevance score for each query-table pair. We then use a supervised pointwise learning to rank (LTR) framework to train this network, and fine-tune the parameters of our cell embedding model for this task. To formally state the table retrieval problem, assume there is a set of candidate tabular documents $cand = \{D_1, D_2, \ldots, D_n\}$ and a given keyword query $q$. The goal is to calculate a relevance score $s_i$ for each document $D_i$ in the candidate set. These scores are then used to rank the candidate documents with respect to their relevance to the user query $q$.

We evaluate our proposed method on a dataset of keyword queries and their corresponding candidate tables annotated with relevance scores, which has been used in previous work [57]. The candidate tables in this dataset are selected from Wikipedia pages and are a subset of the WikiTables dataset (http://websail-fe.cs.northwestern.edu/TabEL/). Figure 5.3 shows an example of a query and candidate tables from this dataset. All of the tables in this dataset are relational tables (see Chapter 4), and there is metadata information from the corresponding HTML page associated with each table. The metadata information includes the table caption, section title, and page title, which are often important for predicting the relevance score of the table to the query. Our cell embedding model does not handle extra metadata information; however, in Chapter 4 we showed that it can identify elements of complex tabular data layouts. To be able to use our cell embedding model on the Wikipedia tables while also incorporating the metadata information, we add these extra fields as metadata cells to the tabular document itself. In the remainder of this chapter, we explain our proposed method for calculating the relevance score of a tabular document in detail, and present evaluation results.

5.2 Methodology

The cell embedding model produces an embedding vector for each tabular cell, which contains semantic and structural information about the tabular cell and its context (see Chapter 2).
We propose to use our pre-trained cell embedding model to tackle the table retrieval task. To this end, we use a pointwise learning to rank (LTR) framework where we treat the problem as a binary classification problem and try to generate a relevance probability for each tabular document [60]. We then rank the candidate documents for each query in descending order of the associated relevance probabilities. The inputs to the classification network are two vectors, one representing the tabular document and the other representing the user query. Figure 5.2 shows an overview of our method.

Figure 5.2: Overview of our pointwise learning to rank framework for table retrieval. The query is encoded into a vector $q$ using the FastText model. The cell embedding tensor is aggregated into a table vector $t$ using an attention mechanism. A multi-layer perceptron classifier then takes the concatenation of the $q$ and $t$ vectors to produce the relevance probability.

Given a query and a candidate tabular document $D$, we first use FastText [20] to encode the query into a vector representation $q$. We then use the pre-trained CTrans model to generate the cell embedding tensor $E$. A tabular document may contain disparate information within its cells, and not all the tabular cells are relevant to a user query. We therefore use a global attention mechanism to generate attention weights ($a_{ij}$'s) for the different cells in the table ($C_{ij}$'s), using the cell embedding vectors ($E_{ij}$'s) and the query vector $q$. We then use the cell attention weights to aggregate the cell embedding tensor $E$ into a table vector $t$. Equation (5.1) formally describes the calculation of the table vector $t$. In this equation, $a_{ij}$ is the attention weight for the cell $C_{ij}$ in document $D$, and $score$ is the content-based function of the attention mechanism, which generates a scalar value [35]; $W_a$ is a learnable parameter matrix in this function.

$t = \sum_{i,j} a_{ij} E_{ij}$   (5.1)

$a_{ij} = \frac{\exp(score(E_{ij}, q))}{\sum_{i',j'} \exp(score(E_{i'j'}, q))}$

$score(E_{ij}, q) = E_{ij} W_a q^T$

We use a 2-layer perceptron classifier to predict the relevance probability for the query-table pair. This classification module takes the concatenation of the query vector $q$ and the table vector $t$, and produces the relevance probability $p(y = 1 \mid q, t)$, which we use as the relevance score $s_i$ to rank the candidate documents for the query. Equation (5.2) formally describes this classification network. Here, $h$ denotes the hidden layer activation, $\sigma$ is the sigmoid function, and $W_1$, $b_1$, $W_2$, $b_2$ are learnable parameters.

$p(y = 1 \mid q, t) = \sigma(W_2 h + b_2)$   (5.2)

$h = \mathrm{ReLU}(W_1 [q; t] + b_1)$

To train our learning to rank framework, we use a training corpus of user queries, candidate tabular documents for each query, and a binary annotation for each query-document pair indicating the document's relevance to the query. We use a binary cross entropy loss function to train the newly introduced parameters, as well as to fine-tune the CTrans model parameters for this task. Equation (5.3) formally describes this loss function. Here, $y_{D,q}$ is the true label for the query-table pair (1 if relevant and 0 otherwise), and $0 < w < 1$ is a hyper-parameter to mitigate the class imbalance problem in the training dataset (the number of irrelevant candidates is much larger than the number of relevant candidates).

$loss = -\sum_{q} \sum_{D \in cand(q)} \Big[ w\, y_{D,q} \log p(y = 1 \mid q, t_D) + (1 - w)(1 - y_{D,q}) \log\big(1 - p(y = 1 \mid q, t_D)\big) \Big]$   (5.3)
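A minimal PyTorch sketch of equations (5.1)–(5.3); the class name, dimensions, and single-table inputs are illustrative assumptions, not our exact implementation:

```python
import torch
import torch.nn as nn

class TableScorer(nn.Module):
    def __init__(self, d_cell=256, d_query=300, d_hidden=128):
        super().__init__()
        # score(E_ij, q) = E_ij W_a q^T, realized as a bias-free linear map of q
        self.W_a = nn.Linear(d_query, d_cell, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_cell + d_query, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 1), nn.Sigmoid())

    def forward(self, E, q):
        # E: (N, M, d_cell) cell embedding tensor; q: (d_query,) query vector
        scores = (E * self.W_a(q)).sum(-1)                 # (N, M) attention scores
        a = torch.softmax(scores.flatten(), dim=0).view_as(scores)
        t = (a.unsqueeze(-1) * E).sum(dim=(0, 1))          # table vector, eq. (5.1)
        return self.mlp(torch.cat([q, t]))                 # p(y=1 | q, t), eq. (5.2)

model = TableScorer()
E, q, y = torch.randn(8, 5, 256), torch.randn(300), torch.tensor([1.0])
p = model(E, q)
w = 0.7  # relevant-class weight, the hyper-parameter of eq. (5.3)
loss = -(w * y * torch.log(p) + (1 - w) * (1 - y) * torch.log(1 - p)).sum()
loss.backward()
```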
Figure 5.3: Example of annotations for a keyword query and its candidate tables. Each table occurs in a Wikipedia page and includes metadata information from the page, i.e. the page title, section title, and table caption. The annotated relevance score determines how relevant the table is to the query: 2 means very relevant, 1 means somewhat relevant, and 0 means not relevant. Note that this figure shows only a sample of the candidate tables for the query; the dataset contains 60 candidate tables for this query.

5.3 Empirical Evaluation

5.3.1 Data preparation

We use a dataset of keyword queries and their corresponding candidate tables annotated with relevance scores, which has been used in previous work [57]. In this dataset there are 60 keyword queries, and for each query there are about 60 candidate HTML tables. The candidate tables are selected from Wikipedia pages and are a subset of the WikiTables dataset (http://websail-fe.cs.northwestern.edu/TabEL/). Each candidate table is annotated with a relevance score: 2 meaning relevant, 1 meaning somewhat relevant, and 0 meaning irrelevant. Figure 5.3 shows an example of a query and candidate tables from this dataset. All of the tables in this dataset are relational tables (see Chapter 4), and there is metadata information from the corresponding HTML page associated with each table, including the table caption, section title, and page title, which are often important for predicting the relevance score of the table to the query. Our cell embedding model does not handle extra metadata information; however, in Chapter 4 we showed that it can identify elements of complex tabular data layouts. To be able to use our cell embedding model on the Wikipedia tables while also incorporating the metadata information, we add these extra fields as metadata cells to the tabular document itself. Figure 5.4 shows an example of this process.

Figure 5.4: Example of combining a Web table and its metadata information from the corresponding Wikipedia page into a single tabular document. We set the styling of the top attribute cells as bold-faced, and set the font size of the metadata cells larger than that of the other cells.

5.3.2 Experiment details

We use our CTrans cell embedding model, pre-trained on our table corpus of about 30,000 tables from various domains (see Section 2.5). We choose the cell embedding dimension as $d = 256$, and the FastText encoding vector for the query has a dimension of 300. To mitigate the class imbalance issue in our pointwise learning to rank framework, we use a positive (relevant document) class weight of $w = 0.7$ in our loss function.

We compare our method with the recent state-of-the-art system, which is based on pre-trained language models [59]. We also run the experiments with a baseline employing conventional document retrieval techniques using BM25 [55]. For this baseline, we create a textual document by concatenating all the cell textual contents and metadata of a table into a single unstructured document, and use the rank-bm25 package (https://pypi.org/project/rank-bm25/) to implement the baseline.
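A minimal sketch of this BM25 baseline with the rank-bm25 package; the two flattened pseudo-documents are fabricated examples of tables whose cell text and metadata have been concatenated and tokenized:

```python
from rank_bm25 import BM25Okapi

# Hypothetical flattened tables: all cell text and page metadata concatenated,
# then whitespace-tokenized into one pseudo-document per candidate table.
docs = [
    "world interest rates country rate year central bank".split(),
    "fastest animals land speed cheetah pronghorn springbok".split(),
]
bm25 = BM25Okapi(docs)
print(bm25.get_scores("world interest rates".split()))  # one score per candidate
```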
We use common scoring functions in informa- tion retrieval, normalized discounted cumulative gain (NDCG) and mean average precision (MAP) to evaluate the systems. We introduce these scoring methods in the following. normalized discounted cumulative gain (NDCG): NDCG score measures the ranking of candidate documents for a query, by penalizing the relevant documents being ranked low. NDCG score is between 0 and 1, and the closer the score is to 1, the better the predicted ranking. NDCG is calculated given an specified rank cut-off point (NDCG@N) for the predicted candidate ranking, and the scoring function only considers documents up to the cut- off in the predicted ranking. (5.4) formally describes the formula for calculating NDCG@N, given a query and predicted ranked list of candidate documents. In this formula, (rel) i is the true relevance score for the document at ranki, and IDCG is calculated on a perfect ranking of candidates and is used for normalizing the score between 0 and 1. NDCG@N = DCG@N IDCG@N (5.4) DCG@N = N X i=1 (rel) i log 2 (i+1) IDCG@N =DCG@N given the perfect ranking mean average precision (MAP): MAP measures the average precision (AP) for differ- ent test queries and averages them together to calculate a single score. Average precision is calculated for a ranked set of candidate documents. AP calculates the precision of doc- ument retrieval at different cut-off points in the predicted ranking of candidate documents, 103 and aggregates the average precision at different cut-offs into a single score. AP is a score between 0 and 1, and the closer the score is to 1, the better the predicted ranking. (5.5) formally describes the MAP score function. In this formulation, GTP q is the number of relevant documents in the candidate pool of documents for the queryq. MAP = 1 N N X q=1 AP q (5.5) AP q = 1 GTP q mq X k=1 P@krel k P@k = P k i=1 rel i k rel i = 8 > > < > > : 0 ; if document at ranki is not relevant 1 ; if document at ranki is relevant 5.3.3 Evaluation results Table 5.1 shows the evaluation results for different methods. This experiment confirms our hypothesis that conventional document retrieval methods are not very suitable for tabular data, and BM25 method performs poorly in our experiment. Our method based on the pre- trained cell embeddings (CE) achieves higher scores across all metrics, compared with the BM25 baseline. We observe that for all approaches, NDCG@N score improves as the cut- off point pushed towards largerN. Although CE performs better than the BM25 baseline, it performs worse than the state of the art (SOTA) method. It is important to note that the SOTA method is specialized to the table retrieval task, and uses much larger language models. One 104 NDCG@5 NDCG@10 NDCG@15 NDCG@20 MAP SOTA [59] 60.5 62.7 63.9 66.4 61.9 BM25 27.4 30.1 34.1 36.0 35.5 CE 47.5 49.1 52.1 54.4 47.5 Table 5.1: Evaluation results for the table retrieval task. CE is our system using the pre- trained cell embeddings. potential limitation of our model is the lack of support for incorporating metadata informa- tion, and known data layout of table in cell embedding vectors. We combined the metadata information into the table presentation, and relied our cell embedding model to identify the important data elements. However, the SOTA system leverages the metadata information, and the fact that the candidate tables are relational tables with known column headers. 
5.3.3 Evaluation results

Table 5.1 shows the evaluation results for the different methods. This experiment confirms our hypothesis that conventional document retrieval methods are not very suitable for tabular data: the BM25 method performs poorly in our experiment. Our method based on the pre-trained cell embeddings (CE) achieves higher scores across all metrics compared with the BM25 baseline. We observe that for all approaches, the NDCG@N score improves as the cut-off point is pushed towards larger $N$. Although CE performs better than the BM25 baseline, it performs worse than the state-of-the-art (SOTA) method.

          | NDCG@5 | NDCG@10 | NDCG@15 | NDCG@20 | MAP
SOTA [59] | 60.5   | 62.7    | 63.9    | 66.4    | 61.9
BM25      | 27.4   | 30.1    | 34.1    | 36.0    | 35.5
CE        | 47.5   | 49.1    | 52.1    | 54.4    | 47.5

Table 5.1: Evaluation results for the table retrieval task. CE is our system using the pre-trained cell embeddings.

It is important to note that the SOTA method is specialized to the table retrieval task, and uses much larger language models. One potential limitation of our model is the lack of support for incorporating metadata information and the known data layout of a table into the cell embedding vectors. We combined the metadata information into the table presentation, and relied on our cell embedding model to identify the important data elements. However, the SOTA system directly leverages the metadata information, as well as the fact that the candidate tables are relational tables with known column headers. A potential research direction is to investigate how to incorporate metadata and known data layout information to improve the quality of the cell embedding vectors.

Chapter 6
Related work

We categorize the work related to this thesis into two major categories: previous work on unsupervised representation learning, and previous work on downstream problems that utilize tabular data in various tasks, such as knowledge graph completion, information retrieval, and question answering. In the remainder of this chapter we introduce some of this previous work.

6.1 Unsupervised Representation Learning

Representation learning has a long history in the field of natural language processing, in the form of vector space models (VSMs) [61]. The core VSM hypothesis is that tokens or patterns of tokens co-occur across documents in a semantically meaningful way. This hypothesis originates from the distributional hypothesis in the semantic theory of language usage, which states that words that are used and occur in the same contexts tend to purport similar meanings [62]. Accordingly, vector representations derived from these co-occurrences can implicitly capture semantics. Earlier VSM approaches construct a co-occurrence matrix for all the words in the vocabulary, subsequently using algebraic methods such as latent semantic analysis to construct denser word vector representations and identify patterns in text [61].

Word embedding techniques employ artificial neural nets to learn word vector representations (embeddings) [1, 63]. Such techniques proved to be more scalable, since they do not require costly algebraic transformations, and perform better than previous VSMs. Word embeddings showcase the distributional hypothesis by learning a vector space in which words with similar semantics are close, and in which semantic relationships between pairs of words result in geometric relationships between them [17]. Word embeddings assign a vector representation to each word in their training corpus, which has two limitations. The first limitation is the problem of out-of-vocabulary words: words that do not appear in the training corpus do not have any representation. Bojanowski et al. addressed this problem by learning sub-word embeddings [20]. The second limitation is that words with the same spelling but different meanings get the same vector representation. Researchers introduced contextualized language models to resolve this issue. In these methods, the context of the word is captured in its embedding representation by employing recurrent neural networks [64, 36] or, more recently, transformer networks which use dot-product self-attention mechanisms [9, 65, 66].

Employing representation learning methods for tabular data is rather unexplored and, to the best of our knowledge, there are only a few works in this area. Gentile et al. [67] used table embeddings for the blocking step in the entity matching problem. They assume that the header rows and the attribute-value relationship (for the entity model) are known, and they train a word embedding model based on this information. Yin et al. [68] used a transformer-based model to learn cell embedding representations to answer specific questions about a given table. Unlike our proposed MCM framework, which does not make any assumptions about the table data layout, these methods assume a relational data layout for the tables.
Wu et al. proposed Fonduer, a system for automatic knowledge construction from tabular documents [46]. Their technique has three phases and uses styling, structural, and semantic information to form relations between values in cells. They use an RNN-based method in the last phase of their system to validate the candidate relations. Unlike our proposed cell classification method in Chapter 3 [38], they do not use the RNN to directly classify all the cells in the document.

6.2 Tabular Data Downstream Problems

Research from Google (the Octopus system) [50, 69, 70] pioneered web-scale knowledge extraction from web tables. Their research aims to extract relations from Web tables, and their system combines search, extraction, data cleaning, and data integration. Web tables have since been considered for relation extraction [71, 72], knowledge base (KB) creation and enhancement [73, 74, 75], data augmentation [76, 77], and query answering [78, 79, 80, 81].

There is also a large amount of recent work investigating spreadsheets for different tasks, such as data transformation, relational data extraction, and query answering. Data transformation methods focus on transforming spreadsheets with arbitrary data layouts into more formal database tables. These techniques often use rule-based methods for the transformation, where the rules are engineered [82, 83, 84, 85], user-provided [86], or automatically inferred [87, 88]. While some of these techniques use the semantic information of tabular cells [87], these methods often rely on formatting, styling, and syntactic features of tabular cells.

Some techniques proposed in previous work tried to understand table cells and their relationships by extracting relational data from tabular documents. Ahsan et al. [89] proposed a Data Integration through Object Modeling (DIOM) framework for spatial-temporal spreadsheets. Eberius et al. [90] introduced a framework for extracting relational data from spreadsheets. Chen et al. [42] presented a framework to automatically extract data using annotations (annotation-to-data mapping) in spreadsheets; they introduced a semi-automatic approach using an undirected graphical model to automatically infer parent-child relationships between annotations. The work of Eberius et al. and Chen et al. [42, 90] used manually crafted styling, typographic, and formatting features, and used supervised classification methods to infer the data layout of tabular documents.

In order to understand the relationships between tabular cells, it is useful to understand the data layout of the tables, and some previous work focused on detecting elements of the data layout in tabular documents. Chen et al. [42] used manually crafted features and a conditional random field classifier to identify different elements of the tabular data layout. In later work, Chen et al. [91] used rules and active learning to detect properties of spreadsheets, such as aggregated columns and merged cells, in a framework where users can provide rules to save human labeling effort; these methods are tuned for special types of tables, called dataframes. There are also methods for inferring the full data layout of tabular documents by identifying blocks of cells with the same cell type. Koci et al. [5] used formatting and typographic features for cell classification, and used the classification result for layout inference.
We used their method as a baseline in our experiments in Chapter 3 to evaluate our cell classification framework. Such cell blocks can also be used to detect tables in documents that contain multiple tables, as proposed in [41].

Although Web tables have proven to be useful data sources, detecting useful information among millions of Web tables is challenging. Therefore, table classification by layout type has received attention in previous work, and researchers have been working on methods to detect useful data tables on the Web [92]. Some research has focused on labeling table cells [93, 94] or table columns [95]; these works often use an existing KB to infer the label of cells or columns. Crestan et al. [11] introduced seven different table types for web tables (listings, attribute/value, matrix, enumeration, form, navigational, and formatting), and presented a feature-based classification technique. They used structural features and syntactic content features (e.g. the number of empty cells, the number of cells containing numbers, etc.) to identify the table type. Later, Eberius et al. [53] enhanced this model by introducing more features, and used their system to build a corpus of Web tables from 3.6 billion Web pages. More recently, Nishida et al. [45] proposed a supervised deep learning method (TabNet) using a hybrid deep neural network architecture based on hierarchical attention networks. They reported significant improvements compared to previous feature-based methods; however, they used more than 60,000 annotated web tables to train their model. In Chapter 4 we proposed a weakly supervised method for table classification which does not require manually annotated training data [96]. We use user annotations for examining the clusters, but the effort of doing this task is minimal compared to collecting an annotated training dataset.

Finding relevant documents for a keyword query has a long history of research and has been at the core of Web search engines [54]. Tabular documents contain useful domain information which can help answer user queries. Regular document retrieval methods [55, 56] fail to perform well on tables because tables often contain short text and abbreviations, and are structured differently from unstructured text. Previous research on table retrieval focused on relational tables with rather simple data layouts, and tried to apply different scoring methods to various parts of the table [57, 58]. More recently, researchers have introduced methods for applying pre-trained language models to retrieve relational tables relevant to a user query [59].

There has also been some effort in previous work to use tabular data for answering specific queries from tables. Jauhar et al. [97] proposed a framework to create a corpus of crowd-sourced multiple-choice questions from tables containing general knowledge facts, and then introduced a reasoning technique for answering multiple-choice questions from tables. Their technique uses heuristics to detect column header cells, and is tuned for specific data layouts. More recently, Yin et al. [68] used a pre-trained language model to answer specific questions about a given relational table. In their method, they first identify the rows related to the user query, and then use cell and column representations combined with a reinforcement learning framework to find the answer to the user's question in the table.
Chapter 7
Conclusion and Future Research Directions

Tabular data enables a dense presentation of multi-dimensional data, and a large amount of useful information from various domains, such as the environment, finance, business, and socio-politics, is available in semi-structured tabular form. Tabular data is presented in a two-dimensional matrix of cells which often contain natural language text, and it incorporates rich cell stylistic features to help human users interpret the data more easily; nevertheless, it is challenging for machines to understand the complex data relations in many tabular documents. Previous approaches for understanding the complex data layout of tables relied on manually engineered stylistic, formatting, and typographic cell features to detect tabular data layout elements. Such features were often dependent on special data formats with rich styling, were prone to overfitting on domain-specific conventions and thus required extensive re-training for domains where different conventions are used, and ignored the cell content beyond simple featurization.

Representation learning methods such as language models have been used to achieve state-of-the-art results on complex tasks in different fields of research, such as information retrieval and natural language processing. Inspired by the success of those models, in this thesis I investigated deep neural models that learn cell representations (cell embeddings) from unlabeled tabular documents, capturing general patterns and regularities about tabular cells and their context in a dense vector space model. I also presented the MCM framework to pre-train these cell embedding models on a corpus of unlabeled tabular documents. I then presented methods that use the pre-trained cell embedding model to achieve three downstream tasks on tabular documents. The main hypothesis of this thesis was that pre-trained cell embeddings are able to capture general patterns and regularities in tabular data, and thus provide a rich cell feature space which can help achieve downstream tasks on tabular documents with less annotated training data and manual effort. Therefore, in this thesis I investigated the generalizability of our proposed models for achieving the downstream tasks on tabular documents, and showed that the cell embeddings are especially useful for tables with complex data layouts. In the remainder of this chapter, I summarize the contributions of this thesis, describe the limitations of our proposed models, and finally suggest future research directions.

7.1 Contributions of This Thesis

The main idea we investigated in this thesis was to devise an unsupervised representation learning framework to capture the regularities, general data patterns, and relationships between tabular cells, and to show that employing this framework can improve the state-of-the-art results for automated tabular understanding as well as reduce the manual effort required for this process. In Chapter 2 of this thesis we introduced our proposed methods for the unsupervised learning of tabular cell representations. Chapters 3, 4, and 5 introduced our proposed methods, which used pre-trained cell representations to tackle three downstream problems in tabular data understanding: cell classification by role type in the data layout of the table, table classification by data layout type, and table retrieval and ranking.
Although tabular cells contain natural language text, there are major differences between tabular data and unstructured text that prevent off-the-shelf language models from being directly applicable to tables. The building elements of tabular data are cell values, which may vary from a single number to multiple sentences. Also, language models assume a one-dimensional sequence of words, while tables represent a two-dimensional matrix of cells with non-local relational dependencies between tabular cells. Moreover, tabular cells present stylistic features in addition to their natural language content. We proposed deep neural architectures for capturing multi-channel cell properties as well as important cell context in the tabular document. We used FastText, a pre-trained text embedding model, to encode the textual content of cells, and introduced stylistic and positional encodings to present important cell properties to the cell embedding model. We then introduced three cell embedding architectures: the bag of cells (BoC) model, which captures the local cell context; the cell graph network (CGN), which uses graph neural networks to capture distant cell context; and the cell transformer (CTrans), which uses the transformer architecture to capture distant cell context within the corresponding row and column. We also proposed the masked cell model (MCM) as a general framework to pre-train cell embedding models on a corpus of unlabeled tabular documents. Our empirical results from pre-training the cell embedding models on a corpus of about 30,000 spreadsheets and 200,000 Web tables showed that the pre-trained cell embedding vector space captures useful semantic and structural information about tabular cells.

Chapter 3 presented our approach for cell classification by the cell's role in the data layout of the table. This task detects the elements of complex tabular data layouts, which is an important step toward automated understanding of the information in tables and extracting relational information from them. We focused on five major cell types: left attribute, data, top attribute, metadata, and note. Our approach comprised using the pre-trained cell embedding model and adding a classification head which predicts cell types from the cell embedding vectors. We used a supervised learning framework to train the classification head and fine-tune the cell embedding model using annotated training tables. We evaluated our method on three real-world datasets from various domains: DeEx, CIUS, and SAUS. We compared our model with different baselines, including previous methods that used manually engineered cell features. Our experiments were performed in four settings: in-domain, out-domain, using Excel features, and using only CSV features. The in-domain setting evaluated the classification models in terms of their ability to learn useful data layout features. The out-domain setting evaluated the generalizability of the trained models to unseen datasets, an important aspect to investigate since our goal was to reduce the manual effort needed to curate an annotated training set for each new dataset. Moreover, while Excel features are available in spreadsheets, rich stylistic features are absent from many tabular documents (such as CSV files), and the CSV features setting simulated the performance of the classification models on such data formats.
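Concretely, the classification setup described above can be sketched as follows. This is a minimal, hypothetical illustration: CellClassifier, the placeholder encoder, and the dimensions are assumptions rather than the exact Chapter 3 model; the point is that a single linear head sits on top of the pre-trained cell embeddings, and both are updated during fine-tuning.

```python
import torch
import torch.nn as nn

CELL_ROLES = ["left attribute", "data", "top attribute", "metadata", "note"]

class CellClassifier(nn.Module):
    def __init__(self, encoder, embed_dim=128, n_classes=len(CELL_ROLES)):
        super().__init__()
        self.encoder = encoder                       # pre-trained cell embedding model
        self.head = nn.Linear(embed_dim, n_classes)  # classification head

    def forward(self, cells):
        emb = self.encoder(cells)  # (batch, n_cells, embed_dim) cell embeddings
        return self.head(emb)      # per-cell logits over the five role types

encoder = nn.Linear(300, 128)     # placeholder for the real pre-trained encoder
clf = CellClassifier(encoder)
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-4)  # updates head and encoder

cells = torch.randn(4, 50, 300)                      # encoded cell properties
labels = torch.randint(0, len(CELL_ROLES), (4, 50))  # annotated cell roles
loss = nn.functional.cross_entropy(
    clf(cells).reshape(-1, len(CELL_ROLES)), labels.reshape(-1))
loss.backward()
optimizer.step()
```

Because the encoder parameters receive gradients as well, fine-tuning adapts the general-purpose embedding space to the annotated role labels.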
Our proposed model showed superior or competitive performance across all evaluation settings for the cell classification task. In particular, our model outperformed previous methods by a large margin in the out-domain evaluation setting, which illustrated that our pre-trained cell embeddings provide a generalizable cell feature space, yielding a classification model that can be applied to unseen datasets.

Web tables are not only used to present data but are also used for page formatting. Fewer than 20% of Web tables are data tables which contain useful information. In Chapter 4, we introduced a semi-supervised approach to identify data tables and to categorize them according to their data organization. This is an important first step in extracting relational information from Web tables. We used the table taxonomy from previous work to categorize data tables into four major categories: entity (E), relational (R), matrix (M), and list (L). We assigned all other Web tables, which do not belong to any of these data table types, to a non-data (ND) category. In our proposed approach, we employed pre-trained cell embedding vectors to calculate a vector representation for Web tables that captures the data organizational characteristics of tables. We then used these table vectors to cluster the Web tables in a corpus of HTML pages into meaningful clusters. The goal was for the table clusters to represent the data organization of Web tables, so that the tables in each cluster are structurally similar and consequently belong to the same table type. After the table clusters were formed, user supervision was used to label the clusters by asking the user to annotate a few tables from each cluster. We then assigned the cluster label to all the tables within that cluster. We evaluated our system on four real-world datasets of Web tables and compared its performance with two previous methods, DWTC and TabNet. The evaluation results showed that our proposed table vectors are able to capture data organizational patterns and reduce the manual effort for annotating training samples.

In Chapter 5, we proposed a method for aggregating the cell embedding vectors in a tabular document into a table vector representation to tackle the table retrieval and ranking problem. In our approach, the user query was first encoded into a vector representation using the FastText language model. We then combined each candidate Web table and its metadata into a single table representation. Our method then used an attention mechanism on top of the cell embedding model to calculate a table vector and produce a relevance score for each query-table pair. We used a supervised pointwise learning-to-rank (LTR) framework to train this network and fine-tune the parameters of our cell embedding model for the table retrieval task. We evaluated our proposed method on a dataset of keyword queries and their corresponding candidate tables, and compared its performance with a BM25 baseline and the state-of-the-art method for this task.
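Before turning to the results, the aggregation and scoring step can be sketched as follows. This is a minimal illustration under assumed names and dimensions (QueryTableScorer, a 128-dimensional embedding space), not the exact Chapter 5 architecture: the query vector attends over the cell embeddings to form a table vector, which is then scored against the query.

```python
import torch
import torch.nn as nn

class QueryTableScorer(nn.Module):
    """Attention-weighted table vector scored against the query vector."""
    def __init__(self, dim=128):
        super().__init__()
        self.attn = nn.Linear(2 * dim, 1)      # attention score per (query, cell) pair
        self.score = nn.Bilinear(dim, dim, 1)  # relevance of the (query, table) pair

    def forward(self, query_vec, cell_embs):
        # query_vec: (dim,), e.g. a FastText encoding of the keyword query
        # cell_embs: (n_cells, dim), pre-trained cell embeddings of the table
        q = query_vec.expand(cell_embs.size(0), -1)
        attn_logits = self.attn(torch.cat([q, cell_embs], dim=-1)).squeeze(-1)
        weights = torch.softmax(attn_logits, dim=0)
        table_vec = (weights.unsqueeze(-1) * cell_embs).sum(dim=0)  # table vector
        return self.score(query_vec, table_vec)                     # relevance score

scorer = QueryTableScorer()
relevance = scorer(torch.randn(128), torch.randn(60, 128))
# Pointwise LTR: regress the score toward a graded relevance label, e.g.
# nn.functional.mse_loss(relevance.squeeze(), torch.tensor(1.0))
```

In the pointwise LTR setup, this relevance score is trained against graded relevance labels, and the gradients also fine-tune the underlying cell embedding model.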
Our method performed worse than the state-of-the-art method, which used pre-trained language models, on this task. We hypothesize this is for two reasons. First, our model is best suited for tables with complex data layouts, while the evaluation dataset contains only relational tables. Second, the state-of-the-art method is specialized for this task and treats different elements of tables (such as column headers) and table metadata differently, while our method relies on the cell attention network to distinguish these elements.

7.2 Future Research Directions

Our proposed architectures in Chapter 2 focused on embedding cell context within the tabular document. However, tabular data may be associated with external metadata and context. For example, links to spreadsheets are mentioned in HTML pages which may contain extra contextual information explaining the data in the tabular document. Also, as we saw in Chapter 5, Web tables are associated with extra metadata from the HTML page. Exploring additional techniques to incorporate such information into the cell embeddings is an interesting future research direction.

In Chapter 5, we used pre-trained cell embeddings to rank tables by relevance to a given user query. Our evaluation results did not show improvement compared to previous methods, which we hypothesized is due to the dataset characteristics. One potential research direction is to curate a dataset of queries and candidate tables with more complex data layouts (such as spreadsheets). Another potential research direction concerns how cell embedding vectors are aggregated to generate a table vector representation. We used a simple attention mechanism for the table retrieval task in Chapter 5, and a heuristic method for the table classification task in Chapter 4. The cell graph network (CGN) architecture introduced in Chapter 2 can potentially generate vector representations for table components at various levels of granularity (cell, row, column, and table). In this thesis, we experimented with graph attention networks (GAT), which showed lower performance compared to our cell transformer (CTrans) model. A potential future research direction is to explore various other graph neural network models, such as gated attention.

Because of computational limitations, our experiments in this thesis were based on cell embeddings pre-trained on about 30,000 spreadsheets and 200,000 Web tables. I believe pre-training the cell embeddings on a much larger table corpus can improve the quality of the cell embedding vectors and subsequently the performance of the models on the downstream tasks. This would be an interesting future experiment.

Reference List

[1] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[3] Z. Chen and M. Cafarella, “Automatic web spreadsheet data extraction,” in Proceedings of the 3rd International Workshop on Semantic Search over the Web. ACM, 2013, p. 1.
[4] M. D. Adelfio and H. Samet, “Schema extraction for tabular data on the web,” Proceedings of the VLDB Endowment, vol. 6, no. 6, pp. 421–432, 2013.
[5] E. Koci, M. Thiele, Ó. Romero Moral, and W. Lehner, “A machine learning approach for layout inference in spreadsheets,” in IC3K 2016: Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management: Volume 1: KDIR. SciTePress, 2016, pp. 77–88.
[6] Y. Li and T. Yang, “Word embedding for understanding natural language: A survey,” in Guide to Big Data Applications. Springer, 2018, pp. 83–104.
[7] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu, “Learning entity and relation embeddings for knowledge graph completion,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[8] N. Wang and D.-Y. Yeung, “Learning a deep compact image representation for visual tracking,” in Advances in Neural Information Processing Systems, 2013, pp. 809–817.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[10] P. Wright and K. Fox, “Presenting information in tables,” Applied Ergonomics, vol. 1, no. 4, pp. 234–242, 1970.
[11] E. Crestan and P. Pantel, “Web-scale table census and classification,” in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. ACM, 2011, pp. 545–554.
[12] X. Wang, “Tabular abstraction, editing, and formatting,” Ph.D. dissertation, University of Waterloo, 1996.
[13] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” arXiv preprint arXiv:1603.01360, 2016.
[14] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.
[15] M. Neishi, J. Sakuma, S. Tohda, S. Ishiwatari, N. Yoshinaga, and M. Toyoda, “A bag of useful tricks for practical neural machine translation: Embedding layer initialization and large batch size,” in Proceedings of the 4th Workshop on Asian Translation (WAT2017), 2017, pp. 99–109.
[16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[18] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar et al., “Universal sentence encoder,” arXiv preprint arXiv:1803.11175, 2018.
[19] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” arXiv preprint arXiv:1705.02364, 2017.
[20] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[21] E. Koci, M. Thiele, O. Romero, and W. Lehner, “Cell classification for layout recognition in spreadsheets,” in International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management. Springer, 2016, pp. 78–100.
[22] T. Mikolov, Q. V. Le, and I. Sutskever, “Exploiting similarities among languages for machine translation,” arXiv preprint arXiv:1309.4168, 2013.
[23] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings to document distances,” in International Conference on Machine Learning, 2015, pp. 957–966.
[24] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[25] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, “Quantum chemistry structures and properties of 134 kilo molecules,” Scientific Data, vol. 1, no. 1, pp. 1–7, 2014.
[26] T. Hamaguchi, H. Oiwa, M. Shimbo, and Y. Matsumoto, “Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach,” arXiv preprint arXiv:1706.05674, 2017.
[27] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
[28] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in ICML, 2017.
[29] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, “A comprehensive survey on graph neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2020.
[30] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A review of methods and applications,” arXiv preprint arXiv:1812.08434, 2018.
[31] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, 2018.
[32] Y. Belinkov and J. Glass, “Analysis methods in neural language processing: A survey,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 49–72, 2019.
[33] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[34] J. Cheng, L. Dong, and M. Lapata, “Long short-term memory-networks for machine reading,” arXiv preprint arXiv:1601.06733, 2016.
[35] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[36] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[37] R. Caruana, “Multitask learning: A knowledge-based source of inductive bias,” in ICML, vol. 1. Citeseer, 1993, p. 2.
[38] M. G. Gol, J. Pujara, and P. Szekely, “Tabular cell classification using pre-trained cell embeddings,” in 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 2019, pp. 230–239.
[39] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” arXiv preprint arXiv:1607.01759, 2016.
[40] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[41] E. Koci, M. Thiele, W. Lehner, and O. Romero, “Table recognition in spreadsheets via a graph representation,” in 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). IEEE, 2018, pp. 139–144.
[42] Z. Chen and M. Cafarella, “Integrating spreadsheet data via accurate and low-effort extraction,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 1126–1135.
[43] M. Taheriyan, C. A. Knoblock, P. Szekely, and J. L. Ambite, “A scalable approach to learn semantic models of structured sources,” in 2014 IEEE International Conference on Semantic Computing. IEEE, 2014, pp. 183–190.
[44] P. Azunre, C. Corcoran, N. Dhamani, J. Gleason, G. Honke, D. Sullivan, R. Ruppel, S. Verma, and J. Morgan, “Semantic classification of tabular datasets via character-level convolutional neural networks,” arXiv preprint arXiv:1901.08456, 2019.
[45] K. Nishida, K. Sadamitsu, R. Higashinaka, and Y. Matsuo, “Understanding the semantic structures of tables with a hybrid deep neural network architecture,” in AAAI, 2017, pp. 168–174.
[46] S. Wu, L. Hsiao, X. Cheng, B. Hancock, T. Rekatsinas, P. Levis, and C. Ré, “Fonduer: Knowledge base construction from richly formatted data,” in Proceedings of the 2018 International Conference on Management of Data. ACM, 2018, pp. 1301–1316.
[47] T. Barik, K. Lubick, J. Smith, J. Slankas, and E. Murphy-Hill, “Fuse: A reproducible, extendable, internet-scale corpus of spreadsheets,” in 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories. IEEE, 2015, pp. 486–489.
[48] M. Fisher and G. Rothermel, “The EUSES spreadsheet corpus: A shared resource for supporting experimentation with spreadsheet dependability mechanisms,” in Proceedings of the First Workshop on End-User Software Engineering, 2005, pp. 1–5.
[49] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: An update,” ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[50] M. J. Cafarella, A. Halevy, D. Z. Wang, U. C. Berkeley, and E. Wu, “Uncovering the relational web,” Proceedings of the 11th International Workshop on Web and Databases (WebDB 2008), 2008.
[51] L. Van Der Maaten, “Accelerating t-SNE using tree-based algorithms,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 3221–3245, 2014.
[52] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, March 1982.
[53] J. Eberius, K. Braunschweig, and M. Hentsch, “Building the Dresden web table corpus: A classification approach,” Big Data Computing, 2015.
[54] C. D. Manning, H. Schütze, and P. Raghavan, Introduction to Information Retrieval. Cambridge University Press, 2008.
[55] S. Robertson and H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond. Now Publishers Inc, 2009.
[56] Z. A. Yilmaz, S. Wang, W. Yang, H. Zhang, and J. Lin, “Applying BERT to document retrieval with Birch,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, 2019, pp. 19–24.
[57] S. Zhang and K. Balog, “Ad hoc table retrieval using semantic similarity,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1553–1562.
[58] L. Zhang, S. Zhang, and K. Balog, “Table2Vec: Neural word and entity embeddings for table population and retrieval,” in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 1029–1032.
[59] Z. Chen, M. Trabelsi, J. Heflin, Y. Xu, and B. D. Davison, “Table search using a deep contextualized language model,” arXiv preprint arXiv:2005.09207, 2020.
[60] T.-Y. Liu, Learning to Rank for Information Retrieval. Springer Science & Business Media, 2011.
[61] P. D. Turney and P. Pantel, “From frequency to meaning: Vector space models of semantics,” Journal of Artificial Intelligence Research, vol. 37, pp. 141–188, 2010.
[62] Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2-3, pp. 146–162, 1954.
[63] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[64] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[65] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[66] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems, 2019, pp. 5753–5763.
[67] A. L. Gentile, P. Ristoski, S. Eckel, D. Ritze, and H. Paulheim, “Entity matching on web tables: A table embeddings approach for blocking,” in Advances in Database Technology - EDBT 2017: 20th International Conference on Extending Database Technology, Venice, Italy, March 21–24, 2017, Proceedings. Konstanz: OpenProceedings, 2017, pp. 510–513.
[68] P. Yin, G. Neubig, W.-t. Yih, and S. Riedel, “TaBERT: Pretraining for joint understanding of textual and tabular data,” arXiv preprint arXiv:2005.08314, 2020.
[69] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, “WebTables: Exploring the power of tables on the web,” Proceedings of the VLDB Endowment, vol. 1, no. 1, pp. 538–549, Aug. 2008.
[70] M. J. Cafarella, A. Halevy, and N. Khoussainova, “Data integration for the relational web,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 1090–1101, Aug. 2009.
[71] H. Wang, A. Liu, J. Wang, B. D. Ziebart, C. T. Yu, and W. Shen, “Context retrieval for web tables,” in Proceedings of the 2015 International Conference on Theory of Information Retrieval - ICTIR '15. ACM Press, 2015, pp. 251–260.
[72] Y. Wang and Y. He, “Synthesizing mapping relationships using table corpus,” in Proceedings of the 2017 ACM International Conference on Management of Data, ser. SIGMOD '17. New York, NY, USA: ACM, 2017, pp. 1117–1132.
[73] E. Muñoz, A. Hogan, and A. Mileo, “Triplifying Wikipedia's tables,” in Proceedings of the First International Conference on Linked Data for Information Extraction - Volume 1057. CEUR-WS.org, 2013, pp. 26–37.
[74] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, “Knowledge Vault: A web-scale approach to probabilistic knowledge fusion,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '14, 2014, pp. 601–610.
[75] C. Ran, W. Shen, J. Wang, and X. Zhu, “Domain-specific knowledge base enrichment using Wikipedia tables,” in 2015 IEEE International Conference on Data Mining, Nov. 2015, pp. 349–358.
[76] D. Ritze, O. Lehmberg, Y. Oulabi, and C. Bizer, “Profiling the potential of web tables for augmenting cross-domain knowledge bases,” in Proceedings of the 25th International Conference on World Wide Web, ser. WWW '16. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee, 2016, pp. 251–261.
[77] A. Ahmadov, M. Thiele, J. Eberius, W. Lehner, and R. Wrembel, “Towards a hybrid imputation approach using web tables,” in Proceedings - 2015 2nd IEEE/ACM International Symposium on Big Data Computing, BDC 2015, 2016, pp. 21–30.
[78] R. Pimplikar and S. Sarawagi, “Answering table queries on the web using column keywords,” Proceedings of the VLDB Endowment, vol. 5, no. 10, pp. 908–919, 2012.
[79] S. Sarawagi and S. Chakrabarti, “Open-domain quantity queries on web tables,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '14, 2014, pp. 711–720.
[80] F. Chirigati, J. Liu, F. Korn, Y. W. Wu, C. Yu, and H. Zhang, “Knowledge exploration using tables on the web,” Proceedings of the VLDB Endowment, vol. 10, no. 3, pp. 193–204, 2016.
[81] H. Sun, H. Ma, X. He, W.-t. Yih, Y. Su, and X. Yan, “Table cell search for question answering,” in Proceedings of the 25th International Conference on World Wide Web, ser. WWW '16. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee, 2016, pp. 771–782.
[82] J. Cunha, J. Saraiva, and J. Visser, “From spreadsheets to relational databases and back,” in Proceedings of the 2009 ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation. ACM, 2009, pp. 179–188.
[83] A. O. Shigarov, V. V. Paramonov, P. V. Belykh, and A. I. Bondarev, “Rule-based canonicalization of arbitrary tables in spreadsheets,” in International Conference on Information and Software Technologies. Springer, 2016, pp. 78–91.
[84] A. O. Shigarov, “Table understanding using a rule engine,” Expert Systems with Applications, vol. 42, no. 2, pp. 929–937, 2015.
[85] H. Su, Y. Li, X. Wang, G. Hao, Y. Lai, and W. Wang, “Transforming a nonstandard table into formalized tables,” in Web Information Systems and Applications Conference, 2017 14th. IEEE, 2017, pp. 311–316.
[86] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer, “Wrangler: Interactive visual specification of data transformation scripts,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2011, pp. 3363–3372.
[87] W. Dou, S. Han, L. Xu, D. Zhang, and J. Wei, “Expandable group identification in spreadsheets,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. ACM, 2018, pp. 498–508.
[88] R. Abraham and M. Erwig, “Inferring templates from spreadsheets,” in Proceedings of the 28th International Conference on Software Engineering. ACM, 2006, pp. 182–191.
[89] R. Ahsan, R. Neamtu, and E. Rundensteiner, “Towards spreadsheet integration using entity identification driven by a spatial-temporal model,” in Proceedings of the 31st Annual ACM Symposium on Applied Computing. ACM, 2016, pp. 1083–1085.
[90] J. Eberius, C. Werner, M. Thiele, K. Braunschweig, L. Dannecker, and W. Lehner, “DeExcelerator: A framework for extracting relational data from partially structured documents,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2013, pp. 2477–2480.
[91] Z. Chen, S. Dadiomov, R. Wesley, G. Xiao, D. Cory, M. Cafarella, and J. Mackinlay, “Spreadsheet property detection with rule-assisted active learning,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2017, pp. 999–1008.
[92] Y. Wang and J. Hu, “A machine learning based approach for table detection on the web,” in Proceedings of the 11th International Conference on World Wide Web, 2002, pp. 242–250.
[93] J. Fang, P. Mitra, Z. Tang, and C. L. Giles, “Table header detection and classification,” in Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, ser. AAAI'12. AAAI Press, 2012, pp. 599–605.
[94] D. Deng, Y. Jiang, G. Li, J. Li, and C. Yu, “Scalable column concept determination for web tables using large knowledge bases,” Proceedings of the VLDB Endowment, vol. 6, no. 13, pp. 1606–1617, 2013.
[95] P. Venetis, A. Halevy, J. Madhavan, M. Paşca, W. Shen, F. Wu, G. Miao, and C. Wu, “Recovering semantics of tables on the web,” Proceedings of the VLDB Endowment, vol. 4, no. 9, pp. 528–538, 2011.
[96] M. Ghasemi-Gol and P. Szekely, “TabVec: Table vectors for classification of web tables,” arXiv preprint arXiv:1802.06290, 2018.
[97] S. K. Jauhar, P. Turney, and E. Hovy, “Tables as semi-structured knowledge for question answering,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2016, pp. 474–483.
Abstract
A vast amount of useful information from various domains such as environment, finance, business, and socio-politics is available in semi-structured tabular form. Often, tabular data is meant for human consumption, using data layouts with complex multi-dimensional data relations that are difficult for machines to interpret automatically. In this thesis, we propose three deep neural network models to embed semantic and contextual information about tabular cells in a low-dimensional cell embedding space. We also propose a framework to pre-train these cell embedding models on a large corpus of unlabeled tabular documents from various domains. Pre-trained cell embedding models capture general patterns and relationships about tabular cells, which can be applied to unseen tabular datasets and help achieve downstream tasks in tabular data understanding with better performance and less manual annotation effort. We present our proposed methods for fine-tuning our pre-trained cell embedding models for three important problems in tabular data understanding: Web data table detection and classification, detecting elements of complex tabular data layouts, and table retrieval and ranking.