ADVANCES IN UNDERSTANDING AND LEVERAGING STRUCTURED DATA FOR KNOWLEDGE-INTENSIVE TASKS

by

Kexuan Sun

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2024

Copyright 2024 Kexuan Sun

Dedication

To my family.

Acknowledgements

I am deeply grateful to my advisor, Prof. Jay Pujara, whose guidance and mentorship have been invaluable throughout my PhD journey. Jay's insightful feedback, constructive criticism, and consistent support have shaped me as a researcher and contributed to my personal growth. Beyond his role as an advisor, Jay has fostered an exceptional research environment within our group, providing opportunities for collaboration and friendship. I am thankful for Jay's mentorship and constant belief in my potential, qualities I will always cherish and remember.

I am immensely grateful to my dissertation committee members, Prof. Aiichiro Nakano and Prof. Gerard Hoberg, and my qualifying committee members, Prof. Bistra Dilkina, Prof. Pedro Szekely, and Prof. Muhao Chen, for their invaluable suggestions and encouragement. Their knowledge has been crucial in improving this work. We've explored interesting research paths during discussions on this dissertation and related topics. I extend special thanks to Prof. Craig A. Knoblock for his guidance during our research meetings on structured data, which also significantly influenced this dissertation.

Throughout my master's degree and the initial phase of my Ph.D. journey, I had the privilege of collaborating with exceptional individuals who profoundly influenced my academic and professional trajectory. I am sincerely grateful for the invaluable mentorship provided by Prof. Sven Koenig, Prof. T. K. Satish Kumar, Dr. Hong Xu, and Dr. Jiaoyang Li. In particular, I extend my heartfelt thanks to Dr. Hong Xu for providing essential insights into project initiation and scientific experiment design from both research and engineering perspectives. Working with these respected mentors helped me learn more and inspired me to start my Ph.D. with fresh motivation.

I want to express my gratitude to my internship mentors from IBM Almaden Research - Dr. Yunyao Li, Dr. Lucian Popa, Dr. Prithviraj Sen, and Dr. Yannis Katsis - as well as from Amazon Alexa AI - Dr. Pedro Szekely, Dr. Alessandro Moschitti, Nicolaas Jedema, Ruben Janssen, and Dr. Karishma Sharma. During these internships, I gained valuable knowledge in natural language processing techniques, particularly generative models. Moreover, I had the opportunity to apply research concepts to real-world products. I appreciate their guidance, which has contributed significantly to my career growth and skill development. These internship experiences have played a significant role in shaping this dissertation.

I want to extend a heartfelt thank you to my colleagues at the Information Sciences Institute. I have cherished every moment spent with my lab mates Pegah Jandaghi, Pei Zhou, Avijit Thawani, Lee Kezar, Kian Ahrabian, Dong-Ho Lee, Yifan Jiang, Eric Boxer, as well as my esteemed colleagues Fei Wang, Basel Shbita, Minh Pham, Binh Vu, and Yixiang Yao. Our research discussions and collaborations have been invaluable, and I will always cherish the bond we've formed, supporting each other. Their support has made our work environment both inspiring and enjoyable.
I also appreciate the help from Amy Feng, Alma Nava, and Karen Rawlins for making travel and meetings easy. I am grateful for the support, understanding, and encouragement from my dear friends Di Huang, Tingting Tang, and Xi Li throughout this journey. Special appreciation goes to Di Huang and Yozen Liu for their comforting dinners during challenging times. I also want to express my heartfelt thanks to Buer Qi; her visits over the past few years have meant much to me despite our distance. These friendships have been a constant source of strength and joy, making overcoming challenges easier.

Finally, I am grateful to my parents, Xiuling Lu and Haifeng Sun, and my fiancé, Juntao Tan, for their consistent love and belief in me. Their unconditional support and sacrifices have been crucial to shaping this PhD journey. I am eternally grateful for their presence in my life.

Table of Contents

Dedication  ii
Acknowledgements  iii
List of Tables  vii
List of Figures  ix
Abstract  xi
Chapter 1: Introduction  1
  1.1 Challenges  2
  1.2 Approaches  3
Chapter 2: Table-based Structural Analysis using Probabilistic Soft Logic  6
  2.1 Introduction  6
  2.2 Preliminaries  8
    2.2.1 PSL  8
    2.2.2 Cell Embeddings  8
  2.3 Problem Definition  9
    2.3.1 Cell Classification  9
    2.3.2 Block Detection  10
    2.3.3 Layout Prediction  10
  2.4 The Proposed System  11
  2.5 Data Type Cell Classification  11
  2.6 Functional Block Detection  13
    2.6.1 Generating Candidate Blocks with MCMC  13
    2.6.2 Generating Candidate Blocks with Agglomerative Clustering  15
    2.6.3 Enforcing Probabilistic Constraints  23
    2.6.4 Block Coalescing  25
  2.7 Layout Prediction  26
  2.8 Datasets for Table Structural Analysis  26
    2.8.1 Prior Datasets  26
    2.8.2 A New Dataset from Data.gov  27
  2.9 Experimental Evaluation  27
    2.9.1 Cell Classification  28
    2.9.2 Block Detection  29
    2.9.3 Layout Prediction  37
Chapter 3: Financial Table-based Question Answering with Case-based Reasoning  39
  3.1 Introduction  39
  3.2 CBR for Financial QA  42
    3.2.1 Data Pre-processing  42
    3.2.2 Case Retriever  43
    3.2.3 Fact Retriever  46
    3.2.4 Program Generator  47
  3.3 Experimental Evaluation  48
    3.3.1 Experiment Settings  48
    3.3.2 Evaluation Datasets  49
    3.3.3 Evaluation Results  50
  3.4 Related Work  53
Chapter 4: Scientific Knowledge Graph Construction and Representations  56
  4.1 Introduction  56
  4.2 The Proposed Approach  58
    4.2.1 Micro and Macro Information  59
    4.2.2 Scientific Knowledge Graph Construction  59
    4.2.3 Scoring Publications with KG Embeddings  61
  4.3 Experimental Evaluation  63
    4.3.1 Evaluation Datasets  63
    4.3.2 Experimental Settings  64
    4.3.3 Evaluation Results  64
Chapter 5: Knowledge Graph-based Question Answering with Contextual Ranking  68
  5.1 Introduction  68
  5.2 Retrieve-Rerank-Generate Pipeline for KGQA  71
    5.2.1 KGQA Task Formulation  71
    5.2.2 Modules  71
  5.3 Experimental Evaluation  75
    5.3.1 Knowledge Graph  75
    5.3.2 Datasets  76
    5.3.3 Experiments  76
  5.4 Related Work  81
Chapter 6: Related Work  83
  6.1 Table Processing and Analyzing  84
  6.2 Knowledge-intensive Tasks with Tables and KGs  86
Chapter 7: Conclusion and Future Work  89
  7.1 Future Work  90
Bibliography  92

List of Tables

2.1 The data type classification results on DG. Avg is the macro F1 score (%) ± the standard deviation.  28
2.2 The block-level evaluation results in terms of Precision (Pr), Recall (Re), and F-score (F1).  32
2.3 The cell-level evaluation results.  32
2.4 The ablation study on different components in the agglomerative clustering-based approach.  33
2.5 The layout prediction results on the DG dataset.  37
3.1 The comparison between the FiD-based system and prior systems with different generation modules. FiD-base is based on T5-base, which is a smaller-sized model.  50
3.2 The main results of the low-resource scenarios. 5, 10, and 20 indicate training systems with 5, 10, and 20 questions per program pattern, respectively. Each number represents the exact match performance.  50
4.1 The statistics of the KGs.  64
4.2 The main results on two datasets. Each number represents an RMSE/KT score. The best-performing scores are highlighted, and the second-best scores are underlined.  66
4.3 The ablation study results on different text and number encoders.  66
5.1 The overall performance of our framework. The column LF? indicates whether the model uses logic forms. FreebaseQA does not have logic form annotations. DecAF - Answer only is a variant of DecAF that does not leverage logic forms. For each category (use or ignore LF), results with the best performance are highlighted in bold font, while those with the second-best performance are underlined.  77
5.2 The re-ranker ablation studies on FreebaseQA and WebQSP. We show the retriever, re-ranker, and generator performance regarding Hit@K. We additionally report the GT Triple hit rate following the triple-level labeling strategy defined above. We report the generator performance using FiD-base with 20 passages per question to reduce the computation required.  77

List of Figures

1.1 The illustration of different tables from different topic domains or with different structures.  2
2.1 The architecture of the proposed neuro-symbolic structural analysis system.  11
2.2 The top-down block generation approach. T is the original table; B1, B2, B3, and B4 are non-overlapping blocks in the table (left) and leaf nodes in the decision tree (right).  13
2.3 The adjacent cell pairs with the same data types from an example table.  17
2.4 The two solutions for merging unaligned blocks, i.e., extending and shrinking.  20
2.5 The potential infinite loop issue with the min-merging strategy.  21
2.6 The distance distributions of sampled and ground-truth block pairs.  22
2.7 The ablation study on the effect of the sample size and the probability threshold on the DG dataset.  34
2.8 The comparison between different block detection results on an example table.  36
3.1 The illustration of applying case-based reasoning techniques for answering financial questions.  40
3.2 The workflow of the proposed approach. It consists of three main modules: case retrieval, fact retrieval, and program generation.  41
3.3 The training process of the dense case retriever. The model is guided to make positive questions close to and negative questions far away from the anchor question.  44
3.4 The question augmentation process that identifies attributes and replaces them with new attributes.  45
3.5 The ablation results on the number of facts in FiD models. For each dataset and each number of facts, we take the average over the 3 low-resource results (i.e., 5, 10, and 20 questions per pattern).  51
3.6 The illustration of how the case retriever affects the overall QA performance. The oracle variant uses the ground-truth pattern as the input.  52
4.1 The illustration of a scientific KG with additional information-associated entities. PA and PB are paper entities; A, U, and V are author, organization, and venue entities, respectively.  57
4.2 The workflow of representing entities with both graph structures and the different types of information associated with them, and using the representations to predict reproducibility scores.  61
5.1 The comparison between our proposed ranking process with 1) the classic retrieve-then-generate pipeline, and 2) ranking with the coarse-grained document-level labeling strategy used during training.  69
5.2 The proposed framework for KBQA. The framework contains three modules: Retriever, contextual Re-ranker, and Generator.  70
5.3 The architecture of the contextual re-ranker. Each input has a question, a candidate triple, and context. The output is either 0 (irrelevant) or 1 (relevant).  72
5.4 The error analysis with example questions sampled from both the FreebaseQA and WebQSP datasets. Each example includes the raw question, the gold answers, the predicted answers from the best-performing model, the error type, and a detailed rationale for the error.  80
Abstract

Over the past few decades, the Web has evolved into an essential information hub. Among the vast repository of information, structured data, including well-organized tables, charts, and knowledge graphs, distinguishes itself as a valuable source of knowledge. This dissertation investigates techniques for understanding and leveraging such structured data to enhance knowledge-intensive applications. To effectively harness information from structured data for downstream applications, artificial intelligence systems typically require two key capabilities: a) the automated understanding of complex data structures, and b) the precise selection and practical usage of relevant information from the data for generating task-specific outputs. In this dissertation, we research these two capabilities for two essential types of structured data: tabular data and knowledge graphs (KGs).

The first part of the dissertation focuses on tabular data. Given the diversity and complexity of table structures, I first investigate approaches for understanding these structures, capability a) above, by introducing an automated hybrid probabilistic system. The system identifies sub-structures within tables and their relationships, offering potential benefits for downstream tasks like data integration. I then explore approaches for selecting valuable information from various sources to answer questions that rely heavily on financial tables. We approach this task by leveraging case-based reasoning, adapting solutions from existing questions to answer new questions effectively.

The second part of the dissertation explores the domain of KGs. Most KGs benefit from curation by domain experts, leading to well-defined structures. Consequently, this part primarily focuses on investigating techniques to select and exploit helpful information, capability b) above, from KGs and external resources for various KG-based tasks. I begin by investigating the construction of domain-specific KGs and entity representation learning over these KGs that combines inherent graph structures and external entity-associated information. Additionally, I introduce a novel approach for accurately selecting important information from existing KGs to answer general-domain questions.

Chapter 1: Introduction

Structured data is one of the most vital sources of information, empowering decision-making and enabling deep understanding across various domains. For instance, tabular presentations of financial statements and stock market data are valuable tools for financial decision-making and economic analysis. These financial tables provide detailed information about companies that financial experts and analysts rely heavily on to evaluate market trends and investment opportunities. Similarly, knowledge graphs (KGs), another type of structured data, play an important role in various knowledge-intensive tasks such as information retrieval and question answering. With their rich and high-quality knowledge of the interconnections among entities, KGs have become crucial tools for researchers and knowledge seekers. In this dissertation, we focus on tables and KGs, two essential structured data types.
As a result of the data explosion on the World Wide Web, a substantial volume of structured data spanning diverse subject domains has become available online. For example, Wikidata [125] - one of the most popular KGs - includes billions of triples, and the sec.gov website contains financial tables from numerous companies every year. The total volume of such structured data has been consistently increasing. This situation highlights the need for automated systems that can efficiently handle significant volumes of data. These systems are essential for understanding the complex structures of this data, extracting important knowledge from the abundant information, and then putting that knowledge to use in various applications.

Figure 1.1: The illustration of different tables from different topic domains or with different structures.

1.1 Challenges

In recent years, numerous efforts have led to advancements in these research areas. Nonetheless, significant challenges remain that necessitate attention and resolution.

The first main challenge concerns the inherent structural complexity of structured data. For example, the diversity of tables' domains, formats, and layouts makes it difficult to automatically parse, extract, and relate content in different tables. Figure 1.1 presents three tables from different domains that convey different types of information. Table (a) consists primarily of textual values, while table (b) conveys mostly numerical values, and table (c) includes meta information and has nested headers or attributes. A practical understanding of complex table layouts could benefit many downstream tasks that rely on the relationships from table sub-structures.

The second main challenge pertains to selecting and using relevant knowledge from structured data or from external sources that are closely connected to the data. Whether dealing with tables or KGs, the sheer volume of available knowledge can be overwhelming. When confronted with a task, the major sub-task is identifying the most essential knowledge from all available data. For instance, the critical initial step in responding to a given question involves selecting the most relevant triples or cells from KGs or tables, respectively. If the question involves specific entities, additional entity information can also serve as helpful knowledge for models to exploit. This selected knowledge is the foundation for tackling a range of tasks, while irrelevant knowledge can adversely affect system performance and introduce hallucinations.

This dissertation seeks to bridge the gap between prior research and the development of advanced AI systems by enhancing two critical aspects: 1) advancing structural understanding, with a specific emphasis on tabular data because table structures are generally more diverse and complex, and 2) identifying and exploiting relevant information, including both the selection of relevant structured data and the identification of auxiliary knowledge sources. By advancing in these two directions, we aim to improve our ability to understand and effectively utilize structured data for various downstream tasks. The subsequent sections of this dissertation detail our research efforts and findings within these aspects and offer insights into promising directions for future research.

1.2 Approaches

In this section, I introduce the details of the subsequent chapters, which follow the two research directions. These chapters are divided into two parts based on the structured data types they address.
The first part involves our investigations into structural analysis and knowledge identification for table-related tasks. In the second part, I describe our studies in knowledge identification from both KGs and external resources, spanning domain-specific and general KGs, alongside exploring two downstream tasks.

In Chapter 2, I present a hybrid probabilistic approach for understanding the structure of tables [118, 119]. We envision this system analyzing table structures at three levels of information: individual cells, subsets of a table, and the entire table. Our proposed approach combines hidden representations of tables with high-level rules commonly employed for table organization. Furthermore, we introduce a new dataset comprising hundreds of tables from diverse domains and with varying structures. Finally, we conduct experimental analyses of the system using this novel dataset.

In Chapter 3, I focus on the research direction of knowledge selection and usage for the financial QA task. Unlike classic extractive text QA, financial questions usually involve multi-step mathematical operations. In addition, as most numerical information is stored in financial tables, systems are required to select essential knowledge from tables and textual descriptions. We introduce a new pipeline with case-based reasoning (CBR) that retrieves similar questions, together with their associated solutions and abstract mathematical programs, from a case memory and uses them to answer unseen questions. The retrieved solutions and the selected information from financial reports are leveraged in the generation module for producing new solutions. This study is primarily grounded in the observation that similar financial questions apply similar mathematical computation steps for deriving answers. We observe that, during program generation, the abstract programs retrieved using CBR serve as valuable auxiliary knowledge, leading to performance improvements, especially under low-resource scenarios [116].

Our investigations on KGs start in Chapter 4. Within this chapter, we explore techniques to combine various sources of knowledge to represent entities in scientific KGs [117]. We first describe the KG construction process using the features extracted from multiple scientific graphs. Subsequently, using these KGs, we extend existing general-domain KG representation approaches tailored to our KGs. These approaches integrate the structural information of KGs and auxiliary entity-level features, such as text descriptions and category-specific features, as context. Finally, we conduct experimental investigations on the reproducibility prediction task, demonstrating that merging information from both macro and micro perspectives benefits our task. These findings align with prior studies that advocate the fusion of diverse information types for completing tasks in general-domain KGs.

In Chapter 5, I delve deeper into the knowledge selection task, focusing on the QA domain using general-domain KGs. Building upon prior research, our new pipeline follows the 'retrieve then generate' paradigm. Extracting the most relevant triples from a pool of billions is a crucial step for such systems. To enhance this process, we introduce innovative approaches by integrating a contextual ranking module between the retriever and the generator [115]. Once the retriever identifies candidate triples from the entire KG, the ranking module re-evaluates these triples using the question, candidate triples, and one-hop additional triples as context.
The top-K re-ranked triples then serve as the final context for answer generation. Our findings show that the ranking system accurately identifies informative triples, improving end-to-end QA performance even with a relatively smaller generation model.

This dissertation encapsulates our endeavors to understand and exploit structured data for diverse knowledge-intensive tasks. Our exploration encompasses two studies on tabular data presented in Chapter 2 and Chapter 3, alongside two studies dedicated to KGs in Chapter 4 and Chapter 5. Building upon these advancements, we discuss future directions in Chapter 7. We believe that our findings can contribute to developing integrated systems utilizing various forms of structured data in the future.

Chapter 2: Table-based Structural Analysis using Probabilistic Soft Logic

2.1 Introduction

In this chapter, I introduce our approach for automatically analyzing complex table structures. Tables, a standard tool for storing information, generally contain relational information between different types of values, such as entities and quantitative measurements. To better understand such relational information and leverage it for downstream tasks, the first step is to understand their complex structures. Toward practical structural analysis, an intelligent system must understand the data types within a table and their relationships. Tables are composed of individual cells, each of which contains a particular kind of information. Cells are spatially organized into regions (or blocks) that share a common function. The relationship between the values in a table can be expressed as a relationship between these blocks. Our table structural analysis system adopts the paradigm of [98], which decomposes this problem into three tasks: cell classification, block detection, and layout prediction.

Many prior works have addressed the problems of structural analysis in different, piecemeal ways. Traditionally, semantic typing approaches have sought to understand the data types within tables but only operate when data are organized into well-defined, homogeneous columns. Prior cell classification approaches have focused on identifying functional roles at a cell level rather than capturing broader spatial relationships between tabular regions. These methods range from using handcrafted stylistic, formatting, and typographic features [67, 23] to using embedded vector representations to classify cells [47]. Koci et al. [68] have investigated finding larger block structures in tables; however, the primary goal of this approach is to correct imperfect cell classification results as a post-processing step. To our knowledge, the task of layout prediction in tables has not been studied in prior work, although the broader problem of semantic modeling [103] is an active area of research. Our work is the first to combine all three levels of structural analysis tasks in a single, end-to-end system. Furthermore, our novel neuro-symbolic method can capture the stylistic features and statistical patterns used by prior approaches, as well as spatial relationships that encode how humans organize tabular data.

Despite the availability of vast amounts of tabular data on the Web, annotations for data types, functional regions, and relationships within these tables are very rare. Our approach is designed for this environment by combining unsupervised representation learning and higher-level symbolic constraints that capture frequent patterns in data.
Specifically, we use high-dimensional table embeddings learned from thousands of tables in concert with a probabilistic graphical model, probabilistic soft logic (PSL), that captures structural patterns, such as consistency of types among contiguous cells. Together, these two techniques can harness vast amounts of data while explicitly incorporating the intuitions humans use when constructing tables. Since previous work does not have a rich enough learning resource that covers all aspects of these tasks, we introduce a new benchmark dataset comprised of 431 tables downloaded from the U.S. Government's open data portal (www.data.gov). These tables are in different formats, such as spreadsheets and comma-separated values, and from various topic domains. Most existing benchmark datasets (such as DeEx [35], SAUS [104] and CIUS [28]) consist of only Excel files, are from narrow domains, and cover only cell functional types. Cross-validation shows that the proposed system outperforms other baseline approaches in most cases. The results align with our hypotheses: the pre-trained cell embeddings encapsulate useful information about cells, and probabilistic constraints can help enforce the logical consistency of predictions. Accordingly, the neuro-symbolic approach gets the best of both worlds. The idea of a neuro-symbolic combination could potentially be applied to investigate other tasks for tables, such as data transformation and relation extraction. In addition, the blocks and the relationships extracted by our system could potentially be used for automated tabular analysis, such as finding important patterns in a table [142].

2.2 Preliminaries

2.2.1 PSL

Probabilistic Soft Logic [6] is a probabilistic inference framework. A PSL model is defined on a set of predicates and a set of first-order logic rules over those predicates. An example PSL rule is

w : P1(X, Y) ∧ P2(X, Z) ⇒ P3(Y, Z)

where w is the weight of this rule, P1, P2, and P3 are three predicates, and X, Y, and Z are variables. During inference, the variables are grounded by constants. PSL uses hinge-loss Markov random fields (MRFs) to solve the resulting convex optimization problem efficiently. PSL has been successfully applied to different tasks such as knowledge base completion [22], recommender systems [69], and entity resolution [70].
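To make the hinge-loss semantics concrete, the short sketch below computes the weighted distance to satisfaction of one grounded rule under Łukasiewicz logic, which is the quantity a hinge-loss MRF penalizes. It is a minimal illustration of PSL's general machinery rather than the implementation used in our system; the weight and truth values are invented for the example.

```python
def lukasiewicz_and(truths):
    """Soft conjunction of body atom truth values, each in [0, 1]."""
    return max(0.0, sum(truths) - (len(truths) - 1))

def rule_penalty(weight, body_truths, head_truth, squared=False):
    """Weighted distance to satisfaction of a grounded rule: body => head."""
    distance = max(0.0, lukasiewicz_and(body_truths) - head_truth)
    return weight * (distance ** 2 if squared else distance)

# One grounding of  w : P1(X, Y) & P2(X, Z) => P3(Y, Z)
# with truth values P1 = 0.9, P2 = 0.8, P3 = 0.4 and weight w = 2.0.
print(rule_penalty(2.0, [0.9, 0.8], 0.4))   # 2.0 * max(0, 0.7 - 0.4) ~= 0.6
```

MAP inference in PSL then chooses the unobserved truth values that minimize the sum of such penalties over all groundings, which is a convex problem.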
This process should consider both the cell information and functional information of neighboring cells or even the whole table. In this work, we classify cells based on types of cell content rather than functional roles within the table. We leave the task of identifying cell functional roles to the block detection task. There are different ways to classify cell contents. Since the content usually is a sequence of characters, the most basic types can be Number, String, Datetime and Empty. However, these types are not informative enough. For example, knowing a cell contains a string is insufficient for understanding the table structure. We use more fine-grained semantic data types. A number can be a Nominal, Cardinal or Ordinal, and a string can be the name of a Person, an Organization, a Location, or Other string. Mainly, this task can be challenging for humans. For example, a zip code is a Nominal. Still, it can be easily identified as a Cardinal, and a year is a Datetime but can also be classified as a Cardinal. 9 2.3.2 Block Detection The goal of block detection is to identify regions of cells playing the same functional role. We refer to each part as a block. We define a block as a rectangle denoted as ⟨T, L, B, R⟩ where the four indices represent the Top row, Left column, Bottom row and Right column of the block, respectively. The smallest block is a single cell, and the largest union is the table itself. When we treat each cell as a block, the system eventually solves the cell functional type classification. However, our goal is to identify regions as large as possible such that cells in the same area indeed play the same functional role. As shown in [68], considering blocks rather than individual cells could potentially reduce the number of misclassifications. We consider the following functional roles: Metadata denotes the global information of the table (such as the title and source); Header indicates attribute names of the table columns; Attribute presents attribute names of the table rows, and Data shows the main content of the table. 2.3.3 Layout Prediction Given the blocks identified from the previous task, the goal of layout prediction is to determine the relationships between blocks. Each pair of blocks will be assigned a directed relationship. We consider the following four relationships: Subset_of denotes that a block shows the subcategory information of another block (usually between two Header or Attribute blocks); Header_of indicates a block to contain the header information of another block (usually between a Header and a Data block); Attribute_of marks a block to record the attribute information of another block (usually between an Attribute and a Data block); and Global_Attribute marks a block containing the global knowledge of another block (usually between a Metadata block and another block). 10 Figure 2.1: The architecture of the proposed neuro-symbolic structural analysis system. 2.4 The Proposed System We introduce our system for table structural analysis, illustrated in Figure 2.1. Specifically, the system takes a table as the input and uses a pre-trained cell embedding model to represent each cell that is to be processed by several downstream predictors. More specifically, the cell classifier leverages the embeddings to predict cell data types, and the block detector uses both the embeddings and the data types to generate blocks. The layout predictor then predicts relationships between blocks. 
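These definitions map directly onto simple data structures. The sketch below is purely illustrative; the class and member names are our own shorthand, not identifiers from the actual system.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Role(Enum):
    """Functional roles of blocks (Section 2.3.2)."""
    METADATA = auto()
    HEADER = auto()
    ATTRIBUTE = auto()
    DATA = auto()

class Relation(Enum):
    """Directed relationships between blocks (Section 2.3.3)."""
    SUBSET_OF = auto()
    HEADER_OF = auto()
    ATTRIBUTE_OF = auto()
    GLOBAL_ATTRIBUTE = auto()

@dataclass(frozen=True)
class Block:
    """A rectangular region <T, L, B, R> with inclusive row/column indices."""
    top: int
    left: int
    bottom: int
    right: int

    def contains(self, i: int, j: int) -> bool:
        return self.top <= i <= self.bottom and self.left <= j <= self.right

    @property
    def area(self) -> int:
        return (self.bottom - self.top + 1) * (self.right - self.left + 1)
```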
Figure 2.1: The architecture of the proposed neuro-symbolic structural analysis system.

2.4 The Proposed System

We introduce our system for table structural analysis, illustrated in Figure 2.1. The system takes a table as input and uses a pre-trained cell embedding model to represent each cell, which is then processed by several downstream predictors. More specifically, the cell classifier leverages the embeddings to predict cell data types, and the block detector uses both the embeddings and the data types to generate blocks. The layout predictor then predicts relationships between blocks. The cell classifier, block detector, and layout predictor model the three corresponding tasks as PSL problems.

2.5 Data Type Cell Classification

The cell classifier operates in two phases. In the training phase, a statistical learning model learns to predict a candidate data type using cell embeddings. In the inference phase, a PSL model enforces constraints between the data types. To model this task, we first identify several simple features based on cell content, such as IsNumber, IsDate, and IsEmpty, indicating whether the content can be parsed as a number, a date/time, or other special cases such as empty cells or reserved values like "n/a", respectively. We then extract dependencies between the features and data types. The PSL model consists of a set of rules expressing such dependencies. In our model, we are interested in predicting the cell data types (DataType). Some example rules are:

IsDate(C) ⇒ DataType(C, "datetime")
CELabel(C, T) ⇒ DataType(C, T)

The first rule expresses that if the cell content C is successfully parsed as a datetime, we can confidently label it as Datetime. CELabel denotes the candidate data type predicted by the statistical learning model. These implication rules are introduced to prevent the cell embeddings from overfitting and making mistakes in obvious situations. However, the above rules are not always sufficient to differentiate data types from each other, especially for numbers. For example, "2020" could indicate a quantity or a specific year. To alleviate this issue, we also introduce several conjunctive rules. For example:

IsNum(C) ∧ !IsInt(C) ⇒ DataType(C, "cardinal")
HasAlpha(C) ∧ !HasNum(C) ⇒ !DataType(C, "cardinal")
OneWord(C) ∧ HasNum(C) ⇒ !DataType(C, "person")

These rules are derived from constraints on how humans express different types of cell content. The first two rules reflect how numbers appear in tables. Usually, floating-point numbers only express quantities, such as average numbers of people or average scores of classes, which are cardinals. In addition, cardinals always contain some numeric characters. The third rule states that a person's name usually is not a single token containing numeric values. We note that the above rules are only a subset of the rules used in the system.
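Content-based predicates of this kind can be realized with lightweight parsers. The sketch below shows one plausible way to implement a few of them; the exact parsing logic, reserved-value list, and date formats used in our system are assumptions for the sake of illustration.

```python
from datetime import datetime

RESERVED = {"", "n/a", "na", "-", "null"}   # assumed list of reserved values

def is_empty(text: str) -> bool:
    """IsEmpty: empty cells or reserved placeholder values."""
    return text.strip().lower() in RESERVED

def is_number(text: str) -> bool:
    """IsNumber: content parses as a (possibly comma-grouped) number."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False

def is_date(text: str) -> bool:
    """IsDate: content parses under a few illustrative date formats."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%Y"):
        try:
            datetime.strptime(text.strip(), fmt)
            return True
        except ValueError:
            pass
    return False

def has_alpha(text: str) -> bool:
    """HasAlpha: the content contains at least one alphabetic character."""
    return any(c.isalpha() for c in text)
```

The truth values of these predicates, together with the candidate label CELabel from the statistical model, form the observed atoms that ground the rules above.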
Figure 2.2: The top-down block generation approach. T is the original table; B1, B2, B3, and B4 are non-overlapping blocks in the table (left) and leaf nodes in the decision tree (right).

2.6 Functional Block Detection

The block detector takes the cell data types and cell embeddings as inputs to identify blocks and assign each block a functional label. This component operates in three steps: generating candidate blocks, enforcing probabilistic constraints, and coalescing blocks. Specifically, we introduce two different algorithms for generating candidate blocks: the first is a top-down approach using Markov chain Monte Carlo (MCMC), and the second is based on agglomerative clustering, which leverages cell embeddings to generate blocks in a bottom-up manner.

2.6.1 Generating Candidate Blocks with MCMC

Unlike the region-based approach [68] that groups adjacent cells into rectangular regions, we introduce a top-down approach. It starts from the whole table and recursively splits the table into smaller blocks. This idea is inspired by the Bayesian CART model, which constructs a decision tree by recursively partitioning a space into subsets [27]. Figure 2.2 demonstrates such a decision tree. Instead of constructing the tree greedily, Chipman, George, and McCulloch [27] introduced a Markov chain Monte Carlo (MCMC) approach to finding the tree. Formally, we start from the table T; at each step i, for each active node $B_{ij}$, we choose to either split this node into two children nodes or stop at this node, with splitting probability $p_{\text{split}}$. If a node is chosen to be split, we select the row/column to split following the rule distribution $p_{\text{rule}}$; otherwise, we stop splitting this node, which becomes a leaf node. For example, in Figure 2.2, T is split into B1 and another node. B1 is not split, so it becomes a leaf node, while the other node is split into two further nodes. Each leaf node becomes a candidate block in our problem. Following this process, the system generates N such trees and randomly selects one tree, with the weight function $W_{\text{ent}}$, to finalize candidate blocks. Algorithm 1 shows the details; we call the function SampleATree to generate candidate blocks.

Similar to the Bayesian CART model, we set the splitting probability to be

$p_{\text{split}}(d) = \frac{1}{(1 + d)^{\beta}}$

where β is a hyperparameter and d is the depth of the node. A larger β indicates a smaller tree. We make use of the cell data types provided by the cell classifier to decide the rule distribution. At node $B_i$, suppose it can be split into two children nodes $B_{i1}$ and $B_{i2}$; the data type distributions of $B_{i1}$ and $B_{i2}$ are $D_{i1} = d^1_1, d^1_2, \cdots, d^1_k$ and $D_{i2} = d^2_1, d^2_2, \cdots, d^2_k$, where k is the number of data types. We set the rule distribution to be

$p_{\text{rule}}(B_{i1}, B_{i2}) = \lambda \cdot e^{-\lambda \|D_{i1} - D_{i2}\|_2^2}$

where λ is a hyperparameter. This is designed based on the assumption that a split which makes the distributions more diverse is more likely to be chosen. We apply the exponential family $\lambda \cdot e^{-\lambda(\cdot)}$ to make the difference between the distributions more significant. We leverage entropy to determine the weight $W_{\text{ent}}$ of a tree:

$W_{\text{ent}}(\mathcal{B}) = \lambda \cdot e^{-\lambda \cdot \sum_{b \in \mathcal{B}} \frac{|b|}{\sum_{b' \in \mathcal{B}} |b'|} \left( -\sum_{t} p^t_b \log p^t_b \right)}$

where $\mathcal{B}$ is a set of blocks, b is a block, |b| is the size (area) of b, and $p^t_b$ is the ratio of the cells with data type t in b. The λ here is the same as the λ in $p_{\text{rule}}$.

Algorithm 1: Candidate Block Generation
  Function Split(block):
    queue ← {(block, 0)}; blocks ← {}
    while queue ≠ ∅ do
      (⟨t, b, l, r⟩, d) ← queue.get()        // t, b, l, and r are indices
      Randomly select a number v within [0, 1]
      if v < p_split(d) then
        Randomly split ⟨t, b, l, r⟩ into B1 and B2 using p_rule
        queue.push((B1, d + 1)); queue.push((B2, d + 1))
      else
        blocks.add(⟨t, b, l, r⟩)
    return blocks
  Function GenerateATree(T, types):
    row_blocks ← Split(T)                    // Row-wise
    blocks ← ∅
    foreach B in row_blocks do
      blocks.union(Split(B))                 // Column-wise
    return blocks
  Function SampleATree(T, types, N):
    trees ← ∅
    foreach 1 ≤ i ≤ N do
      trees.add(GenerateATree(T, types))
    return Sample a tree from trees using W_ent
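The sketch below illustrates the core of this top-down sampling: the depth-dependent splitting probability and one recursive splitting pass over a block. For brevity it draws the split point uniformly, whereas Algorithm 1 draws it from the data-type-driven distribution $p_{\text{rule}}$; the β value is arbitrary.

```python
import random

def p_split(depth, beta=1.5):
    """Splitting probability at a given tree depth: 1 / (1 + depth)^beta."""
    return 1.0 / (1.0 + depth) ** beta

def split_block(block, depth=0, beta=1.5, axis="row"):
    """Recursively split <top, left, bottom, right> into leaf blocks (candidate blocks)."""
    top, left, bottom, right = block
    extent = (bottom - top) if axis == "row" else (right - left)
    if extent < 1 or random.random() >= p_split(depth, beta):
        return [block]                                  # stop: this node becomes a leaf
    if axis == "row":                                   # uniform cut stands in for p_rule
        cut = random.randint(top, bottom - 1)
        children = [(top, left, cut, right), (cut + 1, left, bottom, right)]
    else:
        cut = random.randint(left, right - 1)
        children = [(top, left, bottom, cut), (top, cut + 1, bottom, right)]
    return [leaf for child in children for leaf in split_block(child, depth + 1, beta, axis)]

# Row-wise pass over a 10 x 6 table, as in the first stage of GenerateATree.
print(split_block((0, 0, 9, 5), axis="row"))
```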
2.6.2 Generating Candidate Blocks with Agglomerative Clustering

The main drawbacks of the MCMC method are two-fold. First, it only leverages data types, whereas global spatial constraints are ignored. Second, the two steps (row-wise and column-wise splitting) are inflexible for complex tables and may further propagate errors. To address these issues, we introduce an agglomerative clustering-based method for generating candidate blocks. Agglomerative clustering [86, 131, 90] is a class of algorithms used for cluster analysis in data mining. These algorithms recursively merge small clusters into larger clusters based on a dissimilarity measure. In functional block detection, each block is treated as a cluster: the algorithm treats individual cells as blocks and recursively merges them into larger blocks. An agglomerative clustering method contains three components: (1) a cluster representation method capturing essential features, (2) a measure of dissimilarity between clusters based on the representation, and (3) a termination condition to stop clustering. We describe the details of this method as follows.

Dissimilarity Measure. An essential part of dissimilarity computation is a function that determines the closeness between two blocks based on their feature representations. Experiments by Gol, Pujara, and Szekely [47] have shown that cell embeddings capture rich information about table cells and can be used for cell functional role classification. Based on the ability of cell embeddings to capture functional roles, we adopt cell embeddings as the foundation for our block representation. In cell embeddings, each cell is represented as a continuous-valued vector. In functional block detection, we design the dissimilarity function by combining four types of information: local information between blocks, margin dissimilarity between adjacent rows or columns, coherence of the new block, and domain knowledge about data types. With the dissimilarity function, at each step the algorithm proceeds by merging a pair of adjacent blocks with minimum dissimilarity.

Figure 2.3: The adjacent cell pairs with the same data types from an example table.

1) Block Dissimilarity. To merge two blocks, we first consider directly measuring the block-level dissimilarity. We leverage a block embedding, V(B), to represent each block in a vector space. Given that a block is composed of a region of cells, where each cell can be defined as a vector, we represent the block using an aggregation of the vectors of its constituent cells. In this work, we use average aggregation. Let CE be the function that maps cells into a vector space; a block B = ⟨t, l, b, r⟩ can be represented as

$V(B) = \frac{\sum_{i=t}^{b} \sum_{j=l}^{r} CE(T_{i,j})}{|B|}$

where $T_{i,j}$ is the cell at the i-th row and j-th column of T. Using this mapping, blocks B1 and B2 can be represented as $V_{B_1} = V(B_1)$ and $V_{B_2} = V(B_2)$, respectively. The dissimilarity between B1 and B2 is

$D_B(B_1, B_2) = \mathrm{Dist}(V_{B_1}, V_{B_2})$

where Dist is a distance metric (e.g., Euclidean or cosine distance).
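A direct reading of the block embedding and block dissimilarity definitions in NumPy might look like the following. The Euclidean metric is just one of the distance choices mentioned above, and the array layout (rows × columns × embedding dimension) is an assumption.

```python
import numpy as np

def block_embedding(cell_embeddings, block):
    """V(B): average of the cell vectors inside <top, left, bottom, right> (inclusive).
    cell_embeddings is an array of shape (n_rows, n_cols, dim)."""
    t, l, b, r = block
    cells = cell_embeddings[t:b + 1, l:r + 1]
    return cells.reshape(-1, cell_embeddings.shape[-1]).mean(axis=0)

def block_dissimilarity(cell_embeddings, b1, b2):
    """D_B: distance between the two aggregated block vectors (Euclidean shown here)."""
    v1 = block_embedding(cell_embeddings, b1)
    v2 = block_embedding(cell_embeddings, b2)
    return float(np.linalg.norm(v1 - v2))

# Example with random embeddings for a 5 x 4 table and 16-dimensional cell vectors.
emb = np.random.rand(5, 4, 16)
print(block_dissimilarity(emb, (0, 0, 0, 3), (1, 0, 4, 3)))
```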
2) Margin Dissimilarity. Measuring dissimilarities between blocks only takes the local information of the overall block into account. However, higher-order information, such as adjacent rows or columns, is also important. For example, the table in Figure 2.3 presents R&D employment information. Since "All industries" does not have an NAICS code, the corresponding cell is empty. In addition, since the data for 2020 is unavailable, all cells in that column are empty. If only local information is considered, the two empty cells on the third row are likely to be merged because they have the same textual information and are spatially close. However, they have different functional roles. This issue can potentially be solved if a higher-order measure is also incorporated into the dissimilarity function. One way to measure the higher-order dissimilarity is to consider the dissimilarity between cells on adjacent rows or columns, which we refer to as margin dissimilarity. A margin dissimilarity can be computed using the pairwise dissimilarity between vectors of adjacent cells. For example, if B1 and B2 are row-wise adjacent, where the bottom row of B1 is row i and the top row of B2 is row i + 1, the margin dissimilarity between them is

$D_M(B_1, B_2) = \frac{\sum_{k=1}^{n} \mathrm{Dist}(CE(T_{i,k}), CE(T_{i+1,k}))}{n}$

where Dist is the same metric used for computing local dissimilarities. Likewise, if B1 and B2 are column-wise adjacent blocks,

$D_M(B_1, B_2) = \frac{\sum_{k=1}^{m} \mathrm{Dist}(CE(T_{k,i}), CE(T_{k,i+1}))}{m}.$

3) Coherence. In addition to minimizing the block and margin dissimilarities, another way to measure dissimilarity is coherence. Coherence captures the homogeneity of cell content in a block. A highly coherent block will have all cell vectors close to each other in the embedding space, and a block of functionally equivalent cells will have higher coherence than a random block. One way to measure the coherence is to consider the maximum distance between the cell vectors and the mean vector of all cells. Formally, given a block B = ⟨t, l, b, r⟩, the coherence of B is

$C_B(B) = -\max_{i=t}^{b} \max_{j=l}^{r} \mathrm{Dist}(CE(T_{i,j}), V(B)).$

For every block pair, the coherence is measured for the new block produced from merging the two blocks. For example, given adjacent blocks B1 and B2, suppose merging B1 and B2 results in a new block B′; then $C_B(B')$ is used for scoring the merge of B1 and B2.
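Continuing the same sketch, the margin dissimilarity and coherence terms can be written as below. Only the row-wise case of D_M is shown, and the column range it averages over is an assumption, since the two blocks need not be perfectly aligned.

```python
import numpy as np

def margin_dissimilarity(cell_embeddings, b1, b2):
    """D_M for row-wise adjacent blocks: average distance between the cells on the
    bottom row of b1 and the cells directly below them on the top row of b2."""
    t1, l1, bot1, r1 = b1
    t2, l2, bot2, r2 = b2                      # assumes t2 == bot1 + 1 (b2 directly below b1)
    cols = range(min(l1, l2), max(r1, r2) + 1)
    dists = [np.linalg.norm(cell_embeddings[bot1, j] - cell_embeddings[t2, j]) for j in cols]
    return float(np.mean(dists))

def coherence(cell_embeddings, block):
    """C_B: negated maximum distance between any cell vector and the block's mean vector."""
    t, l, b, r = block
    cells = cell_embeddings[t:b + 1, l:r + 1].reshape(-1, cell_embeddings.shape[-1])
    center = cells.mean(axis=0)
    return -float(np.max(np.linalg.norm(cells - center, axis=1)))
```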
4) Domain Knowledge about Data Types. Besides the measures above, another helpful piece of information for assessing block dissimilarity is data types. Different data types can be a valuable signal for separating blocks. For example, cells with the same data type are more likely to be in the same block than cells with different data types. In addition, different pairs of data types may have differing impedance depending on the context. For example, considering fine-grained data types, a pair of cells with types <cardinal, location> is more likely to fall in different blocks than a pair with types <organization, location>. To capture the difference between data types, we develop a learning process that estimates the confidence of separation for pairs of data types. Given a table where each cell has an associated data type, each pair of adjacent cells can be indexed based on their data types. For each pair of data types, we create two collections that separate the adjacent cell pairs into positive and negative sets, where pairs in the positive set appear in the same block and pairs in the negative set appear in different blocks. Figure 2.3 shows an example: for the pair of data types <string, string>, the pair of cells <Industry, NAICS Code> is put into the positive set, and the pair <Industry, All Industries> is put into the negative set. Each set is associated with a distribution over the dissimilarities of the cell pairs contained in the set, and we estimate the confidence of separation based on the difference between the two distributions. Formally, for each pair of data types ⟨t, t′⟩, the two collections with pairs of cells in the same blocks and in different blocks are denoted by S+ and S−, respectively. The confidence of separation is defined as

$w_{t,t'} = \max(D_- - D_+, 0)$

where $D_- = \mathrm{Avg}_{(c,c') \in S_-} \mathrm{Dist}(CE(c), CE(c'))$ and $D_+$ is computed in the same way. This equation is designed based on the assumption that the larger the difference between the two collections is, the more separable adjacent cells with this pair of data types are. We use this confidence of separation as a weighting factor for assessing the dissimilarity between blocks. We then normalize all weighting factors such that

$w_{t,t'} = 1 + \frac{w_{t,t'}}{\max_{t^*, t^{*\prime}} w_{t^*, t^{*\prime}}}$

(if a pair of data types is not seen in the training set, its weighting factor is 1). Accordingly, for adjacent blocks B1 and B2, the average weighting factor is

$w_{\text{prior}}(B_1, B_2) = \mathrm{Avg}_{(c,c') \in S_{B_1,B_2}} w_{t_c, t_{c'}}$

where $S_{B_1,B_2} = \{(c, c') \mid c \in B_1, c' \in B_2, \text{ and } c, c' \text{ are adjacent}\}$, and $t_c$ and $t_{c'}$ represent the data types of c and c′, respectively.

Figure 2.4: The two solutions for merging unaligned blocks, i.e., extending and shrinking. Panels: (a) column-wise, (b) row-wise, (c) extending, (d) shrinking.

Overall Dissimilarity Function. We combine the block dissimilarity ($D_B$), the margin dissimilarity ($D_M$), the block coherence ($C_B$), and a domain-specific weight ($w_{\text{prior}}$) into an overall dissimilarity function. Given two adjacent blocks B1 and B2, and a new block B′ generated after merging B1 and B2, their dissimilarity is

$D_{\text{overall}}(B_1, B_2) = w_{\text{prior}}(B_1, B_2) \cdot \left( D_B(B_1, B_2) + D_M(B_1, B_2) - C_B(B') \right)$  (2.1)

Block Merging. The next step is to design merging strategies based on the aforementioned task-specific dissimilarity measure. In our problem setting, one goal is to identify non-overlapping rectangular blocks. This means that block merging strategies should always satisfy the rectangular constraint. However, merging two adjacent blocks does not always result in a rectangular block. For example, Figure 2.4a and Figure 2.4b present two scenarios in which the orange and blue blocks are not perfectly aligned, so merging them would lead to a non-rectangular block. Since such pairs can be prevalent, avoiding them would lead to early stopping and produce insufficient merging results. Accordingly, we introduce two new merging strategies to address this issue.

1) Max-Merging. Given B1 = ⟨t1, l1, b1, r1⟩ and B2 = ⟨t2, l2, b2, r2⟩, the two blocks are extended into the two smallest possible properly aligned blocks. Specifically, if B1 and B2 are row-wise adjacent, let $l_{\min} = \min(l_1, l_2)$ and $r_{\max} = \max(r_1, r_2)$; the new blocks are $B'_1 = \langle t_1, l_{\min}, b_1, r_{\max} \rangle$ and $B'_2 = \langle t_2, l_{\min}, b_2, r_{\max} \rangle$. Similarly, if B1 and B2 are column-wise adjacent, $B'_1 = \langle t_{\min}, l_1, b_{\max}, r_1 \rangle$ and $B'_2 = \langle t_{\min}, l_2, b_{\max}, r_2 \rangle$, where $t_{\min} = \min(t_1, t_2)$ and $b_{\max} = \max(b_1, b_2)$. Figure 2.4c shows new blocks extended from the blocks in Figure 2.4a; merging the new blocks leads to a valid rectangular block.

2) Min-Merging. In addition to max-merging, we introduce a second strategy, min-merging, to handle situations where max-merging would split existing blocks. Given B1 and B2, this strategy shrinks them into the two largest possible properly aligned blocks. Specifically, if B1 and B2 are row-wise adjacent, $B'_1 = \langle t_1, l_{\max}, b_1, r_{\min} \rangle$ and $B'_2 = \langle t_2, l_{\max}, b_2, r_{\min} \rangle$, where $l_{\max} = \max(l_1, l_2)$ and $r_{\min} = \min(r_1, r_2)$. If they are column-wise adjacent, $B'_1 = \langle t_{\max}, l_1, b_{\min}, r_1 \rangle$ and $B'_2 = \langle t_{\max}, l_2, b_{\min}, r_2 \rangle$, where $t_{\max} = \max(t_1, t_2)$ and $b_{\min} = \min(b_1, b_2)$. Figure 2.4d shows the resulting blocks from shrinking the two blocks in Figure 2.4b.

Figure 2.5: The potential infinite loop issue with the min-merging strategy.

A potential shortcoming of this strategy is that it may lead to infinite loops. Figure 2.5 presents an example: merging blocks A and B leads to A′ and B′, and merging A′ and B′ yields A′′ and B′′, which are exactly the same as A and B. To avoid infinite loops, the algorithm keeps track of block pairs merged in previous steps. We refer to such block pairs as invalid block pairs; the algorithm ignores invalid block pairs in future steps.
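The overall score and the two alignment strategies reduce to simple index arithmetic. The sketch below mirrors the definitions above for blocks given as (top, left, bottom, right) tuples, with the row-wise and column-wise cases selected by a flag; it is a direct restatement of the formulas rather than the system's actual code.

```python
def overall_dissimilarity(w_prior, d_block, d_margin, c_merged):
    """Equation 2.1: w_prior(B1, B2) * (D_B + D_M - C_B(B'))."""
    return w_prior * (d_block + d_margin - c_merged)

def max_merge(b1, b2, row_wise=True):
    """Extend two adjacent blocks into the smallest properly aligned pair."""
    t1, l1, bo1, r1 = b1
    t2, l2, bo2, r2 = b2
    if row_wise:
        l, r = min(l1, l2), max(r1, r2)
        return (t1, l, bo1, r), (t2, l, bo2, r)
    t, b = min(t1, t2), max(bo1, bo2)
    return (t, l1, b, r1), (t, l2, b, r2)

def min_merge(b1, b2, row_wise=True):
    """Shrink two adjacent blocks into the largest properly aligned pair."""
    t1, l1, bo1, r1 = b1
    t2, l2, bo2, r2 = b2
    if row_wise:
        l, r = max(l1, l2), min(r1, r2)
        return (t1, l, bo1, r), (t2, l, bo2, r)
    t, b = max(t1, t2), min(bo1, bo2)
    return (t, l1, b, r1), (t, l2, b, r2)
```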
We note that directly applying min-merging may sometimes increase the number of remaining blocks. To keep the algorithm as efficient as possible, we also treat such block pairs as invalid block pairs.

Sampling-based Threshold Selection. Another critical component in agglomerative clustering algorithms is the stopping criterion. In our problem, the stopping criterion is a dissimilarity threshold such that two blocks will not be merged if their overall dissimilarity (Section 2.6.2) is greater than the threshold. However, since tables can be very diverse, a fixed dissimilarity threshold may not be suitable for all tables. For example, Figure 2.6a shows the distance distributions of block pairs sampled from two different tables; the ranges of the two distributions illustrate that different tables may call for different thresholds. To address this problem, we introduce a sampling-based threshold selection method that provides a personalized threshold for each table. The details of this process are shown in Algorithm 2. Given a table, the algorithm first randomly samples a set of synthetic block pairs such that they are either row-wise or column-wise adjacent. With the dissimilarity function, each block pair has an associated dissimilarity. We set the threshold for this specific table to be the value at the p quantile (0 ≤ p ≤ 1) of the frequency distribution, where p is a hyperparameter.

Figure 2.6: The distance distributions of sampled and ground-truth block pairs.

Figure 2.6b shows two distance distributions: block pairs sampled across different adjacent ground-truth blocks and block pairs sampled from within the same ground-truth blocks. Block pairs from different ground-truth blocks generally have larger distances than those from the same ground-truth blocks. In addition, the range of the distribution of table B in Figure 2.6a is similar to that in Figure 2.6b, which indicates that the algorithm can potentially identify reasonable thresholds.

Algorithm 2: Threshold Selection
  Function select_thresholds(table, p, k = 5000):
    pairs ← sample k adjacent block pairs
    dists ← []                                 // Keep track of all dissimilarity values
    for b1, b2 ∈ pairs do
      dists.append(D_overall(b1, b2))
    thre ← the distance at the p quantile of the sorted dists
    return thre
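A simplified version of Algorithm 2 is sketched below. The shapes of the synthetic block pairs are not fully specified above, so this version samples single-row adjacent pairs as a cheap stand-in, and dissimilarity_fn is assumed to wrap D_overall for the table at hand.

```python
import random

def select_threshold(n_rows, n_cols, dissimilarity_fn, p=0.5, k=5000):
    """Sample k row-wise adjacent synthetic block pairs and return the value at the
    p quantile of their sorted dissimilarities as the table-specific threshold."""
    dists = []
    for _ in range(k):
        i = random.randint(0, n_rows - 2)            # row above the margin
        left = random.randint(0, n_cols - 1)
        right = random.randint(left, n_cols - 1)
        b1 = (i, left, i, right)                     # single-row block
        b2 = (i + 1, left, i + 1, right)             # the block directly below it
        dists.append(dissimilarity_fn(b1, b2))
    dists.sort()
    return dists[int(p * (len(dists) - 1))]
```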
Like the cell classifier, we perform this step with two components: a statistical learning model that 23 Algorithm 3: Agglomerative Clustering 1 Function clustering(table, p): // table is a n × m matrix // Keep track of all active blocks 2 active ←− {} 3 invalid ←− {} 4 thre ←− select_thresholds(table, p) 5 for i < n do 6 for j < m do // Four indices: top, left, bottom, right 7 active.add((i, j, i, j)) 8 while |active| > 1 do // Select the block pair with minimum dissimilarity from all valid pairs 9 b ∗ 1 , b∗ 2 = argminb1,b2∈active & (b1,b2)̸∈invalid Doverall(b1, b2) 10 if Doverall(b1, b2) ≥ thre then 11 break // Check if max-merge without breaking blocks 12 if valid_max_merge(b ∗ 1 , b∗ 2 ) then // Extend b ∗ 1 and b ∗ 2 and update active 13 max_merge(b ∗ 1 , b∗ 2 , active) 14 else // Shrink b ∗ 1 and b ∗ 2 and update active 15 min_merge(b ∗ 1 , b∗ 2 , active) // Make the pair invalid 16 invalid.add((b ∗ 1 , b∗ 2 )) // Return all blocks that are still valid 17 return active; takes a cell embedding as an input to predict a functional label for each cell and a PSL model that enforces probabilistic constraints. A set of example rules are listed below: CELabel(B, L) ⇒ BT(B, L) FirstRow(B) ⇒ BT(B, ”header”) BT(B, L) indicates the possibility that block B is assigned label L. Like the cell classifier, CELabel shows how the block detector uses cell embeddings. The statistical learning model assigns each cell a label. Since a block may have more than one cell, CELabel(B, L) reveals the possibility that block B 24 is assigned label L. This corresponds to the ratio of the number of cells assigned label L over the total number of cells in B. The second rule expresses blocks on the first row usually are a Header block. In addition to the simple rules listed above, we apply several conjunctive rules that fuse the inherent positional constraints within the table layouts. For example: SameRow(B1, B2) ∧ BT(B1, ”header”) ⇒ BT(B2, ”header”) Abv(B1, B2) ∧ Abv(B2, B3) ∧ BT(B1, C) ∧ BT(B3, C) ⇒ BT(B2, C) The first rule indicates that if a block is a Header block, blocks on the same row as this block are also Header blocks. The last rule considers neighboring blocks. If the neighbor above and the below have the same label, it should also be assigned this label. These conjunctive constraints exploit the power of collective classification that probabilistic models perform. 2.6.4 Block Coalescing After assigning labels to the candidate blocks, we apply a post-processing step to merge small blocks into large blocks. In this step, we first join neighboring blocks if they have the same top row, bottom row, and labels. Similarly, we then merge neighboring blocks if they have the same left column, right column, and labels. The block detector finally passes the merged blocks to the layout predictor. We designed this step to resolve the issue of over-partitioning and produce better blocks. 25 2.7 Layout Prediction The layout predictor is the last component of our system. It predicts a relationship between each pair of blocks identified by the block detector. We model the task as a PSL problem utilizing relative positional relationships between the blocks. A set of example rules are listed below: Adj(B1, B2) ∧ BT(B1, ”data”) ∧ BT(B2, ”data”) ⇒ Rel(B1, B2, ”empty”) Hrz(B1, B2) ∧ BT(B1, ”attr”) ∧ BT(B2, ”data”) ⇒ Rel(B1, B2, ”attribute”) These rules illustrate our hypotheses about positional relationships between blocks. If two data blocks are neighbors, they might not have any special relationship. 
If two blocks are horizontally aligned, one is an attribute block, and the other is a data block, the attribute block might reveal attributes of the data within the data block. 2.8 Datasets for Table Structural Analysis 2.8.1 Prior Datasets There are three main datasets DeEx, CIUS and SAUS designed for evaluating cell functional type classification. The DeEx dataset was collected in the DeExcelerator project [41] and contains 457 annotated sheets. The CIUS dataset was originally from the Crime In the US (CIUS) database and contains 268 sheets. The SAUS dataset was downloaded from the U.S. Census Burea [24] and contains 210 sheets. Both the CIUS and SAUS datasets were annotated by [47]. We evaluate our approach on all these three datasets for the task of block detection. These three datasets are cell-level annotations based on functional roles. 26 2.8.2 A New Dataset from Data.gov The existing datasets, designed for evaluating cell functional type classification, have relatively narrow domains and focus only on Excel files. To evaluate the three aforementioned table understanding tasks, we introduce a new dataset DG. We downloaded 1837 files from the U.S. Open Data website (data.gov). These files are from different topic domains such as agriculture, climate, ocean, and ecosystem, and are in different formats (i.e. CSV and Excel). We sampled 431 tables from these files and annotated them for the three table understanding tasks. To show inter-annotator agreements, we ask two annotators to independently annotate 25 tables and evaluate the Cohen’s kappa coefficients [87] for three tasks. The results are 0.937 for cell classification, 0.960 for block detection, and 0.936 for layout prediction, which indicate the good reliability of the annotations. Specifically, for block detection, we align the blocks from annotator A with those from annotator B and then compare the labels. 2.9 Experimental Evaluation In this section, we first introduce four datasets (including DG) and experiment settings. We then show the experimental results of cell classification, block detection, and layout prediction. Specifically, for block detection, we show several metrics designed for evaluating block quality. After that, we present several experiments evaluating the performance of different methods on boundary detection and functional role classification. We finally provide further analysis of the proposed method. All experiments are run on an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz. We run a 5-fold cross-validation evaluation in all experiments where the train and develop sets have a ratio of 9:1. Datasets We use four datasets: CIUS, SAUS, DeEx, and DG. CIUS was originally collected from the Crime in the US [47] † containing 269 annotated sheets. SAUS was downloaded from the U.S. Census † https://ucr.fbi.gov/crime-in-the-u.s 27 Table 2.1: The data type classification results on DG. Avg is the macro F1 score (%) ±, the standard deviation. Emp Crd Str Dat Loc Org Ord Nom Per Avg CRF 81.9 82.5 42.4 56.2 34.4 16.8 0.0 36.0 1.3 39.1±1.8 MLP 84.5 85.6 69.1 59.3 54.9 46.8 0.0 52.0 1.2 50.4±5.4 RF 85.0 84.4 73.2 61.4 65.2 55.5 0.3 53.4 39.3 57.5±4.7 PSL (MLP) 96.5 88.3 70.2 77.8 55.8 43.3 0.3 52.4 1.0 54.0±3.1 PSL (RF) 96.8 87.8 74.3 78.4 66.1 52.5 0.2 53.0 31.7 60.1±3.2 Burea ‡ . It contains 223 annotated sheets. DeEx was created collected in the DeExcelerator project [41] § which has 444 annotated sheets. These three datasets provide cell-level functional role annotations. 
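The inter-annotator agreement reported above for the DG annotations can be computed with scikit-learn's cohen_kappa_score; the following is a minimal sketch in which the two annotators' cell-level labels are hypothetical:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical functional-role labels from two annotators, flattened over
# the cells of the doubly-annotated tables (for block detection, blocks
# from annotator A are first aligned with those from annotator B).
annotator_a = ["header", "data", "data", "metadata", "attribute", "data"]
annotator_b = ["header", "data", "data", "metadata", "data", "data"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```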
2.9.1 Cell Classification As described in Section 2.3.1, we evaluate our cell classifier on the DG dataset. Each cell is classified into 1 of the 9 data types: empty (Emp), cardinal (Card), Nominal (Nom), ordinal (Ord), datetime (Date), location (Loc), organization (Org), person (Per), and other string (Str). We compare our system with the following baselines: 1. Random Forest (RF): We use the RandomForest classifier in the scikit-learn library [15]. It takes cell embeddings as input to predict a cell data type. We select n_estimator among [100, 300], max_depth among [5, 50, None], min_sample_split among [2, 10] and min_samples_leaf among [1, 10]. We use the bootstrap mode with balanced sub-sampling. 2. Conditional Random Field (CRF): CRFs are a type of probabilistic graphical model that takes the context (neighboring cells in tabular data) into consideration. This experiment uses a feature set introduced in [23] to make predictions. We choose 2-dimensional CRF to represent row-wise and column-wise neighborhood interactions. We use GridCRF class from the pystruct library [89]. We set the max_iter to be 500, tolerance to be 0.01, and select c_range among [0.01, 0.1, 1.0]. ‡ http://dbgroup.eecs.umich.edu/project/sheets/datasets.htm § https://wwwdb.inf.tu-dresden.de/research-projects/deexcelarator/ 28 3. Multi-layer Perceptron (MLP) We use the pytorch library [96] to create a two-layer neural network with the Rectified Linear activation function (ReLU). It also takes cell embeddings as input. We set batch size to be 32, learning rate to be 0.0001, and epoch to be 50. We use cross-entropy loss. Main Results Table 2.1 shows the results of this experiment. For both RF and MLP, the PSL model improves over their results. The results demonstrate that the logical rules can provide useful high-level constraints between data types and cell-level features. Compared to CRF, it is more flexible to enforce explicit constraints in PSL. 2.9.2 Block Detection The proposed method leverages a cell embedding model to provide cell vector representations and cell data types. We use the cell embedding model introduced in [47] as the default model, and we run the cell classifier to generate cell data types. We use the most fine-grained data types: cardinal, nominal, ordinal, person, location, organization, other string, datetime, and empty. For the agglomerative clustering-based block generation method, we use the supervised Neighborhood Components Analysis (NCA) algorithm implemented in the metric-learn library¶ [34] to learn task-specific distance metric. For SAUS, CIUS, and DeEx datasets, we select the parameter p among [0.2, 0.3, 0.4, 0.5] according to their cell-level macro F1 scores. For the DG dataset, we choose the parameter p among [0.2, 0.4, 0.6, 0.8] based on the block-level macro F1 scores. We compare the proposed method with the following baseline methods: 1. Conditional Random Field (CRF) was originally used in [23]. The implementation uses the pystruct library with max_iter set to be 1000, tol set to be 0.01, and C_range selected from [0.1, 0.3, 0.5, 0.7, 1.0] and uses the stylistic and formatting features, which are automatically extracted. 2. Random Forest (RF) was used as a classifier to evaluate the cell embedding model in previous papers and also as a base model in the block detector in [118]. We use the implementation from ¶ http://contrib.scikit-learn.org/metric-learn/index.html 29 scikit-learn ∥ . 
The parameter n_estimators is selected among [100, 300], max_depth is selected among [5, 50, None], and min_samples_split is selected among [1, 10]. 3. Recurrent Neural Network (RNN) is another classifier introduced in [47]. It uses LSTM blocks to encode neighborhood information. We set the number of epochs to 50 and the learning rate to 0.0001. For the three baseline methods, we use the region-based approach from [68] to create blocks. It first merges adjacent cells on the same row to build row intervals and then merges adjacent row intervals into rectangular blocks.

∥ https://scikit-learn.org

Evaluation Metrics
To evaluate the performance of different methods on functional block detection, we use two main types of metrics.

1) Error-of-Boundary (EoB). EoB was originally introduced in [38] for evaluating table detection models. It measures how precisely a predicted rectangular region aligns with a ground-truth rectangular region. Given a ground-truth block B = ⟨t, l, b, r⟩ and a predicted block B' = ⟨t', l', b', r'⟩, the EoB between them is

EoB(B, B') = max(|t − t'|, |b − b'|, |l − l'|, |r − r'|).

To evaluate the EoB over all blocks, we use a variant that measures the table-level EoB:

EoB_t = Σ_{1 ≤ i ≤ N, 1 ≤ j ≤ M} EoB(B_ij, B'_ij) / |B_ij ∩ B'_ij|,

where B_ij is the ground-truth block that the cell at the i-th row and j-th column belongs to, and B'_ij is the predicted block that this cell belongs to. A smaller EoB_t indicates better performance. In addition to such table-level EoB, to avoid the effect of the number of blocks in a table, we also evaluate the pairwise EoB:

EoB_p = Avg_{B, B', |B ∩ B'| ≥ 1} EoB(B, B').

2) Precision and Recall. Although the two variants of EoB are useful for evaluating block boundary quality, they are unbounded and not designed for evaluating classification results. We therefore borrow metrics from the multi-class object detection task in computer vision [43] to simultaneously measure the detection and classification challenges. For each table, given a set of ground-truth blocks B = {B1, B2, · · · } and a set of predicted blocks B' = {B'1, B'2, · · · }, each predicted block B' is assigned to a ground-truth block B according to the overlap ratio (IoU):

IoU(B, B') = area(B ∩ B') / area(B ∪ B').

There are the following situations: 1) If multiple predicted blocks are assigned to the same ground-truth block, the one with the same label and the highest IoU is a true positive, and the remaining blocks are false positives. 2) If a predicted block has the same label as the ground-truth block but the IoU is below 0.5, it is a false positive. 3) If a ground-truth block has not been correctly identified, it is a false negative. 4) If a predicted block cannot be matched to any ground-truth block, it is considered a false positive. Given this criterion, we can compute a Precision, a Recall, and the corresponding F1 measure for each class. To assess overall prediction performance, we evaluate the macro-average F1 over all classes.

Main Results
We first evaluate the proposed methods (the MCMC-based approach is called MCMC, and the agglomerative-clustering-based approach is AC) using the metrics above on the DG dataset. Table 2.2 presents the results of the different methods. Note that AC uses the same labeling process as the MCMC
Table 2.2: The block-level evaluation results in terms of Precision (Pr), Recall (Re), and F-score (F1).
Method EoBt EoBp Metadata Data Header Attribute Average F1 F1 F1 F1 Pr Re F1 CRF 5357 172 41.0 17.6 82.5 7.8 42.9 38.5 37.2 RF 63565 321 59.6 0.5 25.6 0.5 17.1 43.2 21.6 RNN 20999 289 24.9 5.1 54.3 3.4 21.6 43.8 21.9 MCMC(RF) 3904 161 42.9 30.0 70.0 14.6 36.8 52.1 39.4 MCMC(RNN) 2146 151 36.2 38.9 78.3 23.1 41.5 55.9 44.1 AC(RF) 417 60 42.3 58.8 77.7 36.9 52.9 56.6 54.0 AC(RNN) 245 39 37.1 70.1 77.4 41.0 60.1 56.2 56.4 Table 2.3: The cell-level evaluations results. Method CIUS SAUS Metadata Data Header Attribute Macro F1 Metadata Data Header Attribute Macro F1 CRF 96.5 67.6 94.9 36.8 73.9±8.9 80.7 82.2 95.7 38.2 74.2±5.8 RNN 99.5 99.2 98.2 89.2 96.5±4.0 91.9 97.0 77.6 79.5 86.5±3.7 RF 95.9 99.7 88.9 97.0 95.4±0.6 79.1 98.6 78.8 91.1 86.9±4.0 MCMC(RNN) 94.3 99.2 97.0 89.2 94.9±4.0 85.6 97.7 85.3 80.8 87.4±2.9 MCMC(RF) 93.6 99.7 96.0 97.6 96.7±1.1 80.6 99.0 85.4 92.8 89.4±2.5 AC(RNN) 96.4 99.2 98.7 89.2 95.9±3.9 93.4 97.5 84.5 80.2 88.9±3.2 AC(RF) 94.6 99.8 96.8 97.8 97.2±0.8 83.3 99.0 87.3 92.5 90.5±2.1 Method DeEx DG Metadata Data Header Attribute Macro F1 Metadata Data Header Attribute Macro F1 CRF 35.6 55.7 48.0 1.7 35.3±6.9 42.3 53.0 95.2 32.9 55.8±7.0 RNN 33.8 96.1 47.2 39.5 54.2±5.9 25.2 96.5 85.1 80.0 71.7±3.1 RF 53.4 98.4 51.0 26.5 57.3±2.0 71.9 96.0 80.6 78.0 81.6±2.4 MCMC(RNN) 38.5 97.2 53.5 44.9 58.5±8.0 62.0 96.3 89.1 78.4 81.5±4.0 MCMC(RF) 65.4 98.8 60.5 26.0 62.7±3.9 73.8 95.8 91.7 76.0 84.3±4.6 AC(RNN) 49.7 98.4 57.0 18.3 55.8±5.1 68.0 96.2 90.7 77.0 83.0±2.4 AC(RF) 64.3 98.7 59.7 27.9 62.7±3.0 81.0 95.9 93.1 76.0 86.5±4.6 method. The main difference between the MCMC and AC methods is how they identify blocks. Regarding two variants of EoB that do not consider functional role labels, the proposed AC method significantly outperforms the MCMC method, indicating AC has better alignments to the ground-truth blocks. Regarding macro-average Precision, Recall, and F1 scores, the AC method with both RF and RNN classifiers significantly improves over the MCMC method. Only F1 scores on Metadata of AC(RF) and Header of AC(RNN) perform worse than the corresponding MCMC method among the four block function classes. In general, AC could detect better functional blocks compared to the previous methods with an improvement of over absolute 12.5% F1 score on average. 32 Table 2.4: The ablation study on different components in the agglomerative clustering-based approach. Method EoBt EoBp Pr Re F1 F1c AC Full 417 60 52.9 56.6 54.0 86.5 - w/o Margin 529 66 50.2 53.5 50.6 84.7 - w/o Coherence 3212 191 31.1 54.2 37.7 85.1 - w/o Domain 519 59 51.4 49.1 48.7 81.3 Cell-level Functional Type Experiment In addition to the block-level evaluation, we also conducted an auxiliary experiment for evaluating cell-level functional roles. This is based on the assumption that better blocks could also assist the classification of cells within the blocks. Given a labeled block, all cells within this block are also automatically assigned the same label of functional role. In this experiment, we use all four datasets: CIUS, SAUS, DeEx, and DG. For CIUS, SAUS, and DG datasets, AC (RF) performs the best, and for both RF and RNN classifiers, AC improves over the corresponding MCMC methods by a relatively large margin. The only exception is the DeEx dataset; AC (RF) shows similar performance, and AC (RNN) performs worse than MCMC (RNN). In the proposed AC method, we learn the domain knowledge about data types, which uses data types predicted from a cell classifier and ground-truth blocks. 
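A minimal sketch of the block-boundary metrics used in these evaluations (EoB and IoU), assuming blocks are (top, left, bottom, right) tuples with inclusive indices; the helper names are illustrative:

```python
def eob(b, b_pred):
    """Error-of-Boundary: maximum deviation over the four boundaries."""
    t, l, bo, r = b
    tp, lp, bp, rp = b_pred
    return max(abs(t - tp), abs(bo - bp), abs(l - lp), abs(r - rp))

def area(b):
    t, l, bo, r = b
    return (bo - t + 1) * (r - l + 1)

def intersection(b1, b2):
    t, l = max(b1[0], b2[0]), max(b1[1], b2[1])
    bo, r = min(b1[2], b2[2]), min(b1[3], b2[3])
    return (t, l, bo, r) if t <= bo and l <= r else None

def iou(b1, b2):
    """Overlap ratio used to match predicted blocks to ground-truth blocks."""
    inter = intersection(b1, b2)
    if inter is None:
        return 0.0
    ai = area(inter)
    return ai / (area(b1) + area(b2) - ai)
```

Pairwise EoB then averages eob over all overlapping ground-truth/predicted pairs, and a predicted block counts as a true positive only when its label matches and its IoU with the assigned ground-truth block is at least 0.5.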
We automatically create ground-truth blocks for SAUS, CIUS, and DeEx by merging same-label adjacent cells. The main possible reason for the worse performance of the DeEx dataset is the cell data type errors propagated from the cell classifier and the mistakes of the automatically created ground-truth blocks. For example, some cells should be in the same ground-truth block but are separated by empty cells that were not annotated. Ablation Studies In this section, we investigate different components of the AC method. 1) Effect of Dissimilarity Function Components As is introduced in Section 2.6.2, in the proposed method, besides the block dissimilarity, we also leverage three types of information: margin dissimilarity, coherence of the merged block, and the domain knowledge of data types. We remove these three components and provide the results in table 2.4. In general, after removing these components, the performance 33 (a) Effect of the sample size (b) Effect of the parameter p. Figure 2.7: The ablation study on the effect of the sample size and the probability threshold on the DG dataset. on all metrics becomes worse. Removing the margin dissimilarity and data type knowledge, although the performance on EoBp does not differ, cell-level and block-level F1 scores decrease significantly. Removing coherence leads to much worse EoB which indicates that coherence can make dissimilarities between block pairs more diverse such that it becomes easier to find a better threshold. 2) Effect of the number of samples in the AC approach We introduced a sampling-based method to sample synthetic block pairs and create a synthetic distance distribution to determine a personalized threshold. In this experiment, we show the Effect of the sample size. We select the dissimilarity value at 80% of the distribution (i.e., the parameter p = 0.8). We choose values ranging from 10 to 5000 and show the cell-level and block-level F1 scores. Figure 2.7a presents the results. As the sample size increases, F1 34 scores on both levels increase and stabilize. This is reasonable because when the sample size is small, the synthetic block pairs are not representative enough to determine a good threshold. 3) Effect of parameter p With a synthetic dissimilarity distribution, we still need a parameter to determine the threshold. In our method, we use a parameter p such that the dissimilarity value at the p% of the dissimilarity distribution is selected as the threshold. In this experiment, we investigate the value of p. We run experiments with p ranging from 0.1 to 1.0 and show the results on two F1 scores in Figure 2.7b. Before 0.8, as p increases, the block-level F1 score increases, and the cell-level F1 score is relatively stable, which is reasonable because when p is small, the algorithm is more accessible to meet the stopping criterion and the number of blocks will be significant and the block-level F1 score will increase. As cells within the same predicted block do have the same role, the cell-level F1 score remains stable. When p exceeds 0.8, the blocks become over-merged and both cell-level and block-level F1 scores decrease. 4) Effect of Embedding Since the AC method highly depends on the cell embedding, in this experiment, we remove the influence of the cell embedding and show how the method will perform in the ideal case. We use ground-truth blocks to generate synthetic dissimilarity instead of the default dissimilarity function. The synthetic distance generation is completed using the following steps. 
1) We assume the ideal dissimilarity distribution follows a normal distribution. 2) For each table, we sample synthetic block pairs using ground-truth blocks and constitute two distributions for block pairs from the same ground-truth block and block pairs from different ground-truth blocks. We can then compute the mean and the variance of the two distributions. 3) We learn the average variances σ 2 s and σ 2 d for distributions of the same ground-truth block and different ground-truth block, and the mean µd and µs of them using a training set. We then construct the two distributions Ds = (0, σs) and Dd = (µd − µs, σd). 4) During inference, when two blocks are in the same ground-truth block, we sample a distance from Ds, otherwise, from Dd. With ideal dissimilarity distributions, the agglomerative clustering algorithm could identify nearly perfect block boundaries (with EoBt = 6, EoBp = 1, block P r = 71.1%, Re = 66.3% and F1 = 68.0%) and better 35 (a) RF (b) MCMC (c) AC Figure 2.8: The comparison between different block detection results on an example table. classification results (macro F1=92.5%). The not-good-enough F1 scores are attributed to the limitation of the current classifiers. This is also why all Precision, Recall, and F1 scores in Table 2.2 are generally relatively low. Case Study For detailed analysis, we show the output blocks of the MCMC, the AC, and a baseline method on an example table. Figure 2.8 presents the results. The result of AC is the same as gold labels. Since MCMC only depends on data types and some numeric values on the second column were not correctly classified, the second column is incorrectly separated from the rest of the data block. While the RF 36 Table 2.5: The layout prediction results on the DG dataset. EP HO AO GA SC Avg RF 81.7 1.1 2.1 22.7 0.0 21.5±1.2 CRF 88.5 33.7 32.2 40.0 0.0 38.9±3.1 PSL 89.6 70.3 32.8 43.0 25.6 52.3±3.4 classifier predicts many cells in the second column as an “attribute,” the large block is eventually classified as an “attribute” block. Compared to MCMC, AC does not only depend on data types; small distances in the embedding space make the second column merge into the rest of the data block. This also indicates that the AC method is more stable and more error-tolerant. 2.9.3 Layout Prediction We use 15 logical rules in the system. We evaluate the performance of the layout predictor on the DG dataset. We use five aforementioned relationships: empty (EP), header of (HO), attribute of (AO), global attribute (GA) and supercategory of (SC). We compare the predictor with the following two baselines: 1. Random Forest (RF): For every two blocks, we use a few manually crafted features: their functional labels (predicted by the block detector) and several relationships between blocks ( below, above, left right, adjacent, and overlap). We use the RandomForest from the scikit-learn library. We select n_- estimators among [100, 300], max_depth among [5, 50, None], min_samples_split among [2, 10] and min_samples_leaf among [1, 10]. 2. Conditional Random Field (CRF) We construct a graph CRF for this task. If we treat each block as a node and the relationship between two blocks as an edge, the above features are associated with edges. We use the EdgeFeatureGraphCRF from the pystruct library. We set max_iter to be 50, tol to be 0.01, and select C_range among [0.01, 0.1]. 37 Main Results We use the blocks generated by the block detector to run different layout prediction methods. 
For each block b, we match it with the ground-truth block b ′ , which shares the most overlapping cells with b. All predicted relationships for b are added to b ′ and compared with the ground-truth relationships. Table 2.5 shows the results. The PSL model performs better than the compared models in most cases. For each table, the average running time of PSL is 0.4 seconds. 38 Chapter 3 Financial Table-based Question Answering with Case-based Reasoning 3.1 Introduction Financial statements, containing rich tabular and textual data, provide essential information for investors and analysts to understand a company or industry and make better financial decisions. For example, the growth rate of net revenue in previous years helps predict the net revenue for the upcoming year. Specific expenses offer insights into the company’s plans while comparing net income differences among companies in the same industry aids in understanding the industry’s dynamics and competitiveness. With the increasing number of companies operating in various fields, the volume and complexity of financial statements have grown significantly. As a result, manually reviewing and analyzing these statements has become more challenging and time-consuming for individuals and financial teams. There are two significant challenges in financial QA. Firstly, financial questions often require multistep numerical reasoning, selecting information from tabular and textual data and demanding precise annotations of mathematical programs for accurate answers. For example, to answer the growth rate of a company’s revenue, one needs to calculate the difference between the revenue figures of two consecutive years and then determine the ratio between the results. Secondly, annotating financial questions needs domain expertise, as financial topics and terminologies are domain-specific. Towards accurate financial QA, recent studies either 1) require a sufficient amount of training data of all types of financial questions, 39 Figure 3.1: The illustration of applying case-based reasoning techniques for answering financial questions. or 2) apply a significant large-scale model with a few examples. The prior one is time-consuming, whereas the latter one is resource-intensive. In this work, based on the challenge above, we investigate financial QA within the context of lowresource settings. The focus of our study is to enhance the performance of QA systems while working within the constraints of limited annotation resources. As is commonly acknowledged, having access to the solutions of previously solved questions can significantly simplify the process of answering unseen questions. For instance, as shown in Figure 3.1, given a test question (top) regarding the growth rate, we can retrieve a similar annotated question (bottom) from the database. The test question can be answered by extracting the program pattern associated with the annotated question, i.e., subtracting, then dividing, and changing the numbers using the associated information. The program pattern associated with the train question can be treated as auxiliary knowledge that helps answer the test question. Based on this 40 finding, we propose to solve the low-resource financial QA problem with case-based reasoning (CBR) [107]. CBR is a class of approaches that solve new problems based on the solutions to the questions encountered before. 
A CBR system usually retrieves similar cases given an unsolved problem and then reuses or revises the keys associated with the retrieved cases to solve the new problem. Inspired by the existing “retrieve then generate" paradigm commonly used in TextQA, we propose a CBR-based framework consisting of a case retriever, fact retriever and generator. Given a question and financial statements associated with the question, our framework uses a neural case retriever to retrieve other similar questions and their related programs from training data. Each retrieved program is then revised into a pattern such that actual numbers are masked, and only operations are kept for generalization. At the same time, based on the question, a fact retriever retrieves the most relevant pieces of information from both text descriptions and tables, where tables are also verbalized into sentences. Finally, each relevant fact is combined with the retrieved pattern and the question and passed to a reader, which fuses three parts of information and produces the final program. The final programs can be executed into answers. In summary, we explore low-resource financial QA by 1) introducing a CBR-based framework that retrieves similar questions for answering an unseen question, 2) introducing a weak question augmentation model for improving the case retriever performance, and 3) investigating fusion-in-decoder style program generators for answering questions, and conducting experiments on public datasets. Figure 3.2: The workflow of the proposed approach. It consists of three main modules: case retrieval, fact retrieval, and program generation. 41 3.2 CBR for Financial QA In this section, we describe the details of our proposed CBR-based system. As is shown in Figure 3.2, it works in three steps: case retrieval, fact retrieval, and program generation. 3.2.1 Data Pre-processing To apply CBR, we assume there is a case memory containing a set of training questions. Each question is associated with a group of facts and an answer. An answer can either be a free-form string (e.g., a year “2016", a value of a metric “$100", or a choice “yes") or a program that involves a series of mathematical operations such that each operation involves some numbers(e.g., divide(100, 200), divide(subtract(100, 200), 200). Formally, let the training set be Ttr. Each example ei ∈ Ttr is represented as a (qi , Fi , ai , pi) where qi is the question, Fi = {fi1 · · · fin} is the fact set, and ai is the answer. As is shown in Figure 3.1, although similar questions use the same computation process to produce answers, the numbers involved during computations are usually different. To generalize existing annotations, besides the answer annotations, we also create a pattern for each question. Specifically, for each ai , the pattern pi is generated by keeping the operators and replacing numeric values in ai with special tokens. For example, if ai is divide(subtract(100, 200), 200) pi will be divide(subtract(#, #), #) where 100 and 200 are masked. Such patterns provide auxiliary knowledge during case reusing. 42 3.2.2 Case Retriever Given a test question, the case retriever aims to retrieve similar questions and associated patterns from the case memory to provide prior knowledge to answer the question. This module retrieves questions based on the similarities between the test question and all available training questions. Specifically, an encoding model encodes training questions and indexes them based on their representations. 
At the same time, the test question is also converted into a vector representation using the same encoder. We then use FAISS [62] to efficiently retrieve cases from the index with the vector representation of the test question. Model Fine-tuning Neural retrievers, such as DPR [63], have been widely used in general-domain question answering through passage retrieval. Inspired by these prior investigations, we fine-tune such a neural retriever to serve as the case retriever. In the context of passage retrieval, each question is associated with a collection of passages, and the relevance of these passages is determined based on the presence of the answer to the question within them. However, the goal of case retrieval is to find similar questions rather than relevant context, which makes it impossible to follow the same labeling approach as passage retrieval. Instead, we leverage patterns created during the data processing phase as the basis for labeling in case retrieval. Specifically, a pair of questions is considered positive if they share the same associated patterns; otherwise, they are negative pairs. Following DPR, we use a BERT-based model [37], SBERT [101], as the encoder and fine-tune it using the training questions. For each question q, we randomly sample a positive question q + and a negative question q −. For example, in Figure 3.3, given the question “what percent of net interest revenue in total expenses in 2009", we prefer the model to retrieve the positive question “what percentage of leases under noncancelable leases" than the negative question “what is the growth rate in net revenue" as both the anchor 43 Figure 3.3: The training process of the dense case retriever. The model is guided to make positive questions close to and negative questions far away from the anchor question. and positive questions focus on “percent." We optimize the loss function as the negative log-likelihood of the positive question. L(q, q+, q−) = − log e sim(q,q +) e sim(q,q+) + e sim(q,q−) where q, q + and q − are vector representations of q, q + and q −, respectively. sim() is the dot product similarity function. For each question, we sampled positive and negative questions ten times to create final training data. Weak Question Augmentation Since, in this paper, we focus on the low-resource scenarios, the training data for fine-tuning the case retriever is limited. We investigate a weak question augmentation approach to alleviate this issue to enrich the training question set. The intuition of question augmentation is that the same type of questions follow the same computation process, even if the main entities or attributes differ. For the two questions “what is the growth rate in operating profit for mst 2012?" and “what is the growth rate in the r&d in 2019?", both of them follow the same process of predicting the growth rate but with different attributes (i.e., operating profit v.s. r&d, 2012 v.s 2019). Based on this finding, question augmentation can be achieved by changing the characteristics of the questions. The augmentation is split into the following three steps: 44 Figure 3.4: The question augmentation process that identifies attributes and replaces them with new attributes. • Attributes collecting We collect a bag of attributes from the tables in the financial reports associated with the question as the candidates for both training the attribute identification model and serving as the candidates during attribute replacement. 
As the initial attempt, for each table, all cell values on the first row and the left column are treated as candidate attributes. • Attribute identification: We train a sequence-to-sequence, i.e., T5-base, a model that generates attributes given the question as the input. As is shown in Figure 3.4, given the raw question “I =what percent of net interest revenue in total expenses", the expected outputs are two attributes “O =net interest revenue, total expenses." During training, the attributes that appear in the raw question are treated as the target output. Specifically, for the same question, O = T5Decoder(T5Encoder(I)) 45 where T5Encoder and T5Decoder are the encoder and decoder modules in the T5 model. • Attribute Replacement: To augment questions, we replace identified attributes with other attributes collected in the first step. Each augmented question is obtained by randomly sampling an attribute from the collection and replacing the identified attribute with the randomly selected one. We note that the augmented questions are not always valid. For example, we may generate a new question “what percent of net-interest revenue in non-interest revenue" if “total expenses" is identified and replaced with “non-interest revenue." This question is invalid because net interest revenue is not part of non-interest revenue. However, it may still be helpful for case retrieval because the case retriever is expected to focus on the question pattern “what percent of A in B". 3.2.3 Fact Retriever Due to the substantial volume of information typically present in financial reports, i.e., numerous tables and lengthy passages of textual data, it is usually impractical to utilize all information from the reports to produce answers. An effective way to address this challenge is to apply a fact retriever to score and only select top-N relevant facts for final utilization. To use both structured and unstructured data in the financial reports, following prior approaches [25, 140], we convert tables into a set of sentences. More specifically, given a table, we convert the table into facts by concatenating each cell value with the corresponding top and left attributes. For instance, when the initial row (top attributes) of a table contains the cell "number of shares," and the first column (left attributes) has the cell "granted," the cell that has value "100" and aligns with both "granted" in the row and "number of shares" in the column can be expressed as "granted number of shares is 100." By merging the textual data with linearized tables, we can establish a definitive set of candidate facts. 46 Following previous studies, we apply a BERT-based model and model fact retrieval as a classification task. Given a question q and the set of candidate facts F = {f1, f2, · · · , fn}, The model produces a relevance score for each 1 ≤ i ≤ n si = BERT(concat(q, fi)) where q and fi are two parts of the input with separate token type embeddings, and si indicates the relevance between the question and this fact. During the fine-tuning process, every training question is associated with a label "1" for all relevant facts and a label "0" for irrelevant facts. We randomly sampled 20 negative facts for each question. During inference, for each test question, the fine-tuned model scores each fact. Subsequently, the facts are ranked based on their scores, and the top-N facts are chosen for usage in the generation module. 
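The table verbalization step described above can be sketched as follows; the assumed layout (first row holds the column attributes, first column holds the row attributes) mirrors the example, and the template wording is illustrative rather than the exact implementation:

```python
def verbalize_table(table):
    """Convert a table (list of rows) into sentence-like candidate facts."""
    facts = []
    col_attrs = table[0]
    for row in table[1:]:
        row_attr = row[0]
        for col_attr, value in zip(col_attrs[1:], row[1:]):
            if value not in (None, ""):
                facts.append(f"{row_attr} {col_attr} is {value}")
    return facts

# Hypothetical example: the cell "100" under "number of shares" in the "granted" row.
table = [
    ["", "number of shares", "weighted average price"],
    ["granted", "100", "12.5"],
]
print(verbalize_table(table))
# ['granted number of shares is 100', 'granted weighted average price is 12.5']
```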
3.2.4 Program Generator The generator takes the given question, the retrieved pattern, and the selected facts as the input to either extract a span of text or generate a mathematical program. In most recent general-domain QA systems, notable advancements have been made using Transformer-based sequence-to-sequence models like BART [79] and T5 [99], resulting in remarkable performance achievements. Since answering questions often requires the retrieval of numerous relevant passages from external knowledge sources to establish context, concatenating all passages as input to the generator is impractical due to inefficient attention computations. This challenge is also relevant to our task: a single financial report contains many sentences, including verbalized tables. Moreover, complicated financial questions might involve multi-step computations, thus requiring retrieving a significant amount of sentences for program generation. To enable the possibility of utilizing more context without losing efficiency, recent studies have started using Fusion-in-Decoder (FiD) [57] based on T5 as the generation model. In general-domain QA, retrieved 47 passages are concatenated with the question and encoded by FiD separately. Following a similar process, let the question be q, the retrieved pattern be p, and the retrieved facts be [f1, f2, · · · fm]. The encoding of each fact (1 ≤ i ≤ m) is Fi = Encoder([q; p; fi ]) where Encoder is the T5 encoder and Fi is the i th encoding. The FiD model then concatenates all encodings and generates the program prog = Decoder([F1, · · · , Fm]) where Decoder is the T5-Decoder, and prog is the final program. Figure 3.2 shows the example. The question “what percent of net interest revenue in total expenses in 2009" is concatenated with the pattern “Divide # #", and each individual retrieved facts. The generator is expected to produce the program “Divide 988 3137" such that each special token # is replaced with a number extracted from the facts. 3.3 Experimental Evaluation 3.3.1 Experiment Settings In our experiments, for case retrieval, we use the a version of SBERT [101] model pre-trained on multiple QA tasks as the base model. For each training question, we randomly sample 10 positive and negative pairs of questions. We set the learning rate to 1e-5, the batch size to 16, and the training epoch to 10. During inference, the system retrieves 20 similar questions and apply majority vote (the tie is broke based on their similarity scores) on them based on their associated patterns. The most frequent pattern is selected to use during program generation. 48 For fact retrieval, we fine-tune a cross-encoder version of SBERT. We set the max sequence length to 128, the learning rate to 1e-5, the batch size to 32, and the training epoch to 10. We treat all given relevant facts as positive, and randomly sample other facts as negative instances. For program generation, we fine-tune a FiD model [57] initialized with T5-base [99]. We use the original implementation from FiD library. We set the learning rate to 5e-5 with a weight decay 0.01. By default, we set the max sequence length to 250, the batch size to 2, the number of context to 50 and fine-tune the model for 10000 steps. We run the experiments on a Quadro RTX 8000 using a single GPU. As highlighted in earlier sections, our experiments are designed on the scenarios that the training and computation resources are limited. 
To simulate this scenario, we adopt a strategy of sub-sampling training questions based on program patterns, regardless of the specific numbers involved in computations. For each pattern within the training dataset, we randomly sample subsets of 5, 10, and 20 questions from the entire pool of training questions. Subsequently, during the inference phase, we evaluate the performance of our system using the complete set of test or development questions provided by the dataset. 3.3.2 Evaluation Datasets We evaluate the effectiveness of CBR on two public financial QA datasets: • FinQA [25] The original dataset contains contains 6251, 883 and 1147 questions for training, validation and testing, respectively. Each question is associated with a table and a set of sentences around that table. To answer the question, a system is expected to extract information fro m both tabular and textual information. In our experiments, the sub-sampling process leads to 598, 818, and 1133 questions for 5, 10 and 20 questions per pattern experiments, respectively. • MultiHierTT [140] The original dataset contains 7830, and 1044 training and validation questions, respectively. As the test questions are not public, we evaluate and report the performance on the validation set in most of our experiments. In this dataset, each question is associated with multiple 49 MultiHierTT FinQA (dev) (test) Longformer 2.71 21.90 FinQANet (RoBERTa-large) 32.41 61.24 MT2Net (RoBERTa-large) 37.05 - TAGOP (RoBERTa-large) 19.16 - NAPG (RoBERTa-large) 48.20 - DyRRen (RoBERTa-large) - 63.30 PoT (GPT-3 Codex) - 68.00 Ours (FiD-base, w/o CBR) 44.40 62.50 Table 3.1: The comparison between the FiD-based system with prior systems with different generation modules. FiD-base is based on T5-base which is a smaller-sized model. MultiHierTT 5 10 20 w/o CBR 5.46 10.74 16.47 w/ CBR 5.56 15.42 18.20 w/ CBR + Aug 7.85 15.80 20.59 FinQA 5 10 20 w/o CBR 9.68 18.48 22.67 w/ CBR 13.30 21.71 24.01 w/ CBR + Aug 12.64 17.61 23.10 Table 3.2: The main results of the low-resource scenarios. 5, 10, and 20 indicate training systems with 5, 10, and 20 questions per program pattern, respectively. Each number represents the exact match performance. tables and sentences around these tables, it is necessary to leverage information from both tables and paragraphs. This dataset is more challenging and the prior performance is generally lower than FinQA. In our experiments, the sub-sampling process leads to 540, 751, and 1009 questions for 5, 10 and 20 questions per pattern experiments, respectively. 3.3.3 Evaluation Results Main Experiments: Investigate the effectiveness of CBR We investigate CBR for low-resource financial QA. Our experiments mainly focus on the effect of retrieved program patterns during program generation. We split the whole process into two sub-experiments: 50 Figure 3.5: The ablation results on the number of facts in FiD models. For each dataset and each number of facts, we take the average over the 3 low-resource results (i.e. 5, 10, and 20 questions per pattern). 1) Since our generation model is different from prior studies, we initiate by establishing the credibility of our FiD-based module as a valid and comparable foundational model for delving into CBR. To achieve this, we conduct a comparative analysis between our system and previous studies, utilizing the complete training datasets (6251 questions for FinQA and 7830 questions for MultiHierTT). The results of this investigation are presented in Table 3.1. 
It is important to note that, for the sake of simplicity, our experiment only employs a BERT-based fact retriever and a FiD-based generator initialized with the T5-base model without integrating the case retrieval module. Comparing with prior models, we observe that our system achieves competitive or close performance to prior systems that use large-sized generators (i.e. RoBERTalarge and GPT-3). These results validate that our system can serve as the foundational framework for exploring low-resource scenarios. 2) Based on these results, we conduct a set of low-resource experiments using our system. On both datasets, we run three low-resource settings: we use 5, 10 and 20 questions per pattern for training all modules. Table 3.2 presents the main results. We compare three variants of our system: w/o CBR removes the case retrieval module, w/ CBR uses the case retrieval module, and w/ CBR + Aug also leverages the weak question augmentation module. For both datasets and all three experiments, extra program patterns during program generation result in significant improvement. For example, for MultiHierTT, w/CBR + Aug 51 (a) MultiHierTT (b) FinQA Figure 3.6: The illustration of how the case retriever affects the overall QA performance. The oracle variant uses the ground-truth pattern as the input. performs better than w/o CBR by 43%, 47%, and 25%. In addition, comparing w/CBR and w/CBR + Aug, we observe that the weak augmentation module is significantly beneficial for MultiHierTT but not FinQA. One potential reason is that the weak augmentation module generates questions that are not always valid. Invalid augmented questions of this nature could potentially introduce error signals while training the case retriever, consequently leading to a degradation in overall performance in the FinQA experiments. Ablation Experiments on Number of Facts An advantageous feature of the FiD-based generator is its capacity to expand the contextual scope. Different from other models that are potentially limited by the 52 maximum sequence length, FiD leverages more context by increasing the number of passages and fuse the information during decoding. In the context of our system, each passage is a sentence from a paragraph or a verbalized cell from a financial table. In this experiment, we show that the overall performance is improved by increasing the number facts used during generation. Figure 3.5 shows the comparison. Increasing the number of facts during generation leads to improved performance across both datasets. Specifically, before 20 facts, the the performance is increased significantly for both dataset, and after 20 facts, the performance is further improved for MultiHierTT but less significant for FinQA. One potential reason is that MulHierTT has more candidate facts (i.e. more tables and paragraphs around these tables) associated to each question, which makes it more challenging to select most relevant facts from all candidates. Analysis on the Quality of Retrieved Cases We extend our experiments to illustrate how the quality of retrieved patterns influences the overall performance of the QA system. In our study, beyond employing a case retriever to select program patterns, we also conduct experiments utilizing oracle patterns during inference. The results are presented for two datasets in Figure 3.6. In line with prior experiments, the Retrieved Pattern variant employs results from w/ CBR + Aug for MultiHierTT and results from w/ CBR for FinQA. 
While the retrieved patterns already provide valuable insights for program generation, a comparison between the Oracle Pattern and Retrieved Pattern reveals that the utilization of Oracle patterns consistently yields significantly better performance across all runs, without necessitating module retraining. This observation suggests that enhancing case retrieval performance could potentially pave the way for enhanced overall QA performance, offering a promising direction or future research in financial QA. 3.4 Related Work There are two lines of research closely related to this paper. 53 Financial Question Answering The task of financial QA is sub-area of table-based QA. As financial data is presented in both textual and tabular formats, financial QA involves both tables and text. A few benchmark datasets [25, 143, 140] along with novel approaches were introduced to solve the problem. These datasets usually include two types of questions: span selection and program generation. The prior one identifies a span of text from financial statements (i.e.what is the income in 2020?), and the later one uses numerical values from the statements to produce multi-step mathematical operations which lead to the final answer (i.e.How much did the cost increase from 2020 to 2021?). Existing works mostly follow the retriever-generator paradigm to solve the problem. For example, [25] uses a neural retriever to retrieve sentences from linearized tables and text, and learn a RNN-based generator; [78] applies a graph-based encoder to select information pieces and a tree-based decoder to generate programs; to differentiate different types of questions from each other, [143] and [140] also incorporate a classifier to predict question type and learn separate generators to solve different types of questions. Besides the traditional retrieve-generate process, our system also integrates a case retrieval step that leverages answers/programs to annotated questions for answering unseen questions. In addition, for the first time, we propose to use a fusion-in-decoder as the generator which makes it possible to incorporate richer relevant information during generation. More recently, large language models such as GPT have brought remarkable advancements in numerical reasoning tasks with few-shot learning [20]. In this paper, we focus on using smaller-sized generators, as they are easier to further fine-tune on private data. Nonetheless, our approach, which utilizes program patterns as a part of the input to T5, is highly adaptable and can readily be extended to larger-sized seq-to-seq models with minimal effort. Case-based Reasoning CBR [76, 1, 107] is a class of approaches that solve new problems based on the solutions to existing problems. CBR has been used in different domains/tasks. For example, [33] uses CBR to retrieve sub-graph patterns and answer knowledge graph-based factual questions. [73] applied CBR techniques during both executing automatically as an algorithm and presenting visually in a user 54 interface to provide visual explanations. They were evaluated on the breast cancer management task and showed better performance compared to pure nearest neighbor classification with better explanations. [30] proposed to use CBR for data cleaning algorithms in classification and regression tasks. As the main component of the systems, the case retrieval model is composed of filter and similarity phases. 
Following existing CBR techniques, our framework retrieves programs of existing annotated examples to serve as auxiliary supervision during program generation. 55 Chapter 4 Scientific Knowledge Graph Construction and Representations 4.1 Introduction In this chapter, I introduce how we create domain-specific, i.e., scientific, KGs, and, more importantly, how different features influence the representation ability for the reproducibility prediction task. In the past decades, there has been an explosion in the number of scientific articles published in journals and conferences and posted on pre-print servers. For the conclusions of scientific publications to be trusted and accepted by the research community, the underlying methods and techniques must be reproducible [72, 92]. Since new research findings build upon prior results, reproducibility is an essential component of scientific research. Unfortunately, a growing body of research suggests that developments in the scientific literature are not as reproducible as expected [55, 8, 97, 7]. Some other researchers from a range of disciplines such as psychology [94], biomedicine [42], economics [17] and social sciences [4], revisited a variety of published scientific papers, manually assessed the credibility of them by conducting direct replication studies, and further confirmed the reproducibility issue. The underlying difficulties of reproducible research coupled with the growing rate of new publications motivate the urgent need for large-scale models to curate information about research methods and assess the reproducibility of scientific results. Although the replication studies have provided ways to identify credible publications, they also showed the difficulty of assessing publications at scale. This is because, in many research fields, such as social 56 PA PB A V U Affiliated Author Published in Cites Abstract P-values Method Num of Papers Rank Num of Papers Rank Affiliation Citations Figure 4.1: The illustration of a scientific KG with additional information-associated entities. PA, PB are paper entities, A, U, and V are author, organization, and venue entities, respectively. and behavioral sciences, reproducing experimental results is resource-intensive, and researchers are deincentivized from running replication studies since novel results advance careers. For example, replicating a social psychology study requires domain experts to understand the experimental design and different groups of participants for comparison. Since researchers usually have limited resources, it is impractical to evaluate all related work manually. Therefore, it becomes increasingly important to have automated systems to perform the assessments and provide insights for fellow researchers. In the spirit of automatically assessing papers, researchers have started applying advanced machinelearning techniques. For example, [5] train predictive machine learning models to study the effect of different variables in predicting reproducibility. Their experiments identified several basic experimental features, such as the sample and effect sizes of the original papers, that are useful for predicting reproducibility. In addition, [133] collects paper abstracts and trains a word embedding model [88] to capture textual information of the papers for making predictions. Despite the usefulness of these existing approaches, higher-order information is ignored. For example, a finding might be hard to reproduce if purely based on another irreproducible result. 
Such kind of information could potentially be captured if relationships between papers are considered. Based on this 57 intuition, in this paper, we propose a novel approach for assessing scientific research using knowledge graphs (KGs) which have been successfully used in many applications such as search, question answering, and data integration [77, 11, 85]. Our approach incorporates information from two different perspectives: micro-level and macro-level. Specifically, micro features include explicit inter-paper features such as sample sizes and effect sizes and implicit features that encode paper content with pre-trained language models. Macro features include high-level intra-paper relationships between author-paper, paper-paper, and paper-venue relationships. We then construct KGs such that entities represent different elements (i.e., papers, authors) and edges represent other relationships between the components. In addition, each entity may also have additional associated features. We then improve and apply KG embedding methods to learn hidden representations for entities. Finally, a neural network is trained to assess papers using their hidden representations. Figure 4.1 shows an example KG with task-relevant entities and relations. We propose to incorporate features from two perspectives for assessing scientific papers. We construct KGs with micro- and macro-features to encode rich information for papers. To integrate different types of features associated with entities, we then adjust the existing KG embedding methods (i.e., LiteralE) to learn hidden representations for papers. We finally experimentally demonstrate the usefulness of our approach on two benchmark datasets. 4.2 The Proposed Approach In this section, we introduce our approach in detail. We first show two levels of information (i.e., micro and macro), which are leveraged in our approach. We then describe how a knowledge graph (KG) is constructed with additional information associated with our task. We finally demonstrate the extended KG embedding method that supports our KGs. 58 4.2.1 Micro and Macro Information We consider micro- and macro-level information in our task. Specifically, micro-level information includes features of entities themselves. For example, models, P-values, and sample sizes are features within papers; years and series are features of conferences; counts of citations and articles are features of authors, etc. For assessing papers, these features could be helpful. For example, a significant P-value may indicate a low credibility, while a high citation count may suggest a high reproducibility. In addition to micro-level features, we also incorporate macro-level information. We refer to macrolevel information as information that captures relationships between base entities, such as the citation relationships between papers, affiliated relationships between authors and organizations, authorship between papers and authors, etc. We believe such information can potentially be helpful due to the intuition that social influence may exist in research studies as well. For example, papers with robust methods are likely to cite other papers with powerful techniques, papers from a higher-prestige author or institute may be more reproducible, papers published in top-tier conferences may be more reliable, etc. 
4.2.2 Scientific Knowledge Graph Construction

To incorporate both micro and macro information, we construct knowledge graphs such that micro features serve as additional information associated with entities, while macro relationships are used to construct the graph structure. Formally, a KG can be represented as G = {⟨e_i, r_k, e_j⟩ | e_i ∈ E, e_j ∈ E, r_k ∈ R}, where E is a set of entities and R is a set of relations. ⟨e_i, r_k, e_j⟩ indicates that the relation r_k exists between entities e_i and e_j. In the graph, each node represents an entity, and each edge represents a directed relation between two entities. Besides the triples, in our task each entity e_i has two types of associated information, d_i and n_i, which represent the encoded description and the numerical features of e_i, respectively. Our KG schema includes the following six main types of entities:

1. Affiliation: Entities of universities, companies, and other organizations that authors are affiliated with. Their names are used as descriptions, and rank, paper count, and citation count are numeric features.
2. Author: Entities of authors of the publications. Similarly, their names are descriptions, and rank, paper count, and citation count are numeric features.
3. Field of Study: Entities of research fields that publications belong to. They have rank as a numeric feature and their names as descriptions.
4. Publication: Entities of papers. Abstracts (or titles, if abstracts are not available) are used as descriptions. There are 24 numeric features, including experimental features (e.g., P-values, sample sizes, and number of studies), transparency-related features (e.g., whether the data and code are open and whether there are pre-registrations), and network measures (e.g., the paper rank among all papers, and the network authority and clustering coefficient of the paper in the citation network).
5. Venue: Both journal and conference entities are included. They have numeric features similar to those of Affiliation entities.
6. Constant: All other entities are constants, for example, the year of a publication or the type and sub-type of a Venue. No numeric features are considered for these entities.

To connect entities, we consider ten different relations. For example, an author entity is affiliated with an affiliation entity; a publication entity is published in a venue entity; a publication entity cites another publication entity; an author entity is the author of a publication entity. We use the Microsoft Academic Graph (MAG)∗ to collect such relationship information. Given a set of origin publications, we traverse the MAG and keep the publications within two hops of the origin publications, together with their author and venue entities.

∗ https://www.microsoft.com/en-us/research/project/microsoft-academic-graph
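To make the construction step concrete, the short sketch below shows one way the schema and the two-hop MAG traversal described above could be materialized in code. The class and field names (Entity, description, numerics) are illustrative assumptions for exposition, not the data structures of our actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """A KG node with optional literal information attached."""
    entity_id: str
    entity_type: str                               # Affiliation, Author, Publication, ...
    description: str = ""                          # name or abstract, encoded later by SciBERT
    numerics: list = field(default_factory=list)   # e.g., [rank, paper_count, citation_count]

def within_two_hops(origin_ids, citation_edges):
    """Return publication ids reachable within two citation hops of the origin papers."""
    frontier, kept = set(origin_ids), set(origin_ids)
    for _ in range(2):                             # two-hop traversal over the citation graph
        frontier = {t for (s, t) in citation_edges if s in frontier} - kept
        kept |= frontier
    return kept

# A toy fragment mirroring Figure 4.1 (P_B is referenced only as a triple target here).
entities = {
    "P_A": Entity("P_A", "Publication", "Abstract of paper A ...", [0.03, 120, 2]),
    "A":   Entity("A", "Author", "Michael", [152, 34, 1200]),
    "V":   Entity("V", "Venue", "Example Conference", [10, 5000, 90000]),
}
triples = [
    ("A", "author_of", "P_A"),
    ("P_A", "published_in", "V"),
    ("P_A", "cites", "P_B"),
]
```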
Figure 4.2: The workflow of representing entities with both graph structures and the different types of information associated with them, and using the representations to predict reproducibility scores.

4.2.3 Scoring Publications with KG Embeddings

We use the constructed KG to learn representations (also called embeddings) for publications and score them using these representations. Most KG embedding methods focus on capturing KG structures and relation patterns. For example, the TransX family of models [12, 83, 59] treats relations as translations, such that the embedding of a source entity is translated to the target entity using the relation embedding. DistMult [132] applies three-way interactions between entities and relations using matrices. ComplEx [123] uses complex-valued embeddings (with real and imaginary parts) for both entities and relations to handle antisymmetric relations. More recently, researchers have considered incorporating additional information about entities to assist representation learning. For example, [129, 130] leverage both structural and textual information to learn representations. LiteralE [71] combines literals and structural embeddings with a learnable function to form the final embeddings. In this work, since LiteralE is a general framework that accepts classic KG embedding methods (i.e., TransE, ComplEx, DistMult, etc.) as base models, we adopt it as our base framework.

Figure 4.2 demonstrates the workflow of our approach. Let ⟨e_s, r, e_o⟩ be a triple, let d_s and d_o be the descriptions associated with e_s and e_o, and let n_s and n_o be the lists of numerical features associated with e_s and e_o, respectively. We apply SciBERT [9], a BERT-based [37] model pre-trained on scientific text, to encode the descriptions:

v^s_d = SciBERT(d_s),   v^o_d = SciBERT(d_o),

where v^s_d, v^o_d ∈ R^N and N is the dimension of the hidden states. The numerical feature lists are converted by an element-wise Exponent Number Converter, which collapses numbers into bins [10]:

v^s_n = Exp(n_s),   v^o_n = Exp(n_o),

where v^s_n, v^o_n ∈ R^M and M is the number of numerical features. For example, if the initial feature list is [1.1, 10.5], the number converter may take the logarithm of the individual numbers and convert the list into the integer list [0, 2]. LiteralE learns a function g that converts an original entity embedding into a new embedding incorporating both textual and numeric features:

v^s_e = g(v^s, v^s_d ∥ v^s_n),   v^o_e = g(v^o, v^o_d ∥ v^o_n),

where v^s and v^o are the original entity embeddings. With these literal-enriched vectors, g is learned according to a scoring function f(v^s_e, v_r, v^o_e), where f is determined by the base model used in LiteralE. For TransE,

f = |v^s_e + v_r − v^o_e|;

for DistMult,

f = (v^s_e)^T M_r v^o_e,

where M_r is a diagonal matrix for r; and for ComplEx,

f = Re((v^s_e)^T M_r v^o_e),

where Re(·) denotes the real part. After training the KG embedding model, we take the learned static publication representations and apply a Multi-Layer Perceptron (MLP) to them to predict continuous scores. The MLP is trained with a Mean Squared Error loss.
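As a minimal illustration of the scoring pipeline above, the sketch below implements an exponent-style number converter, a linear fusion function g (Section 4.3.2 notes that we use a simple linear layer in LiteralE), and a DistMult scorer in PyTorch. The dimensions, the exact binning rule, and the random stand-ins for SciBERT outputs and structural embeddings are assumptions made for brevity, not our released implementation.

```python
import math
import torch
import torch.nn as nn

def exp_bin(values):
    """Element-wise exponent converter: map each number to an order-of-magnitude bin.
    The binning used in the dissertation follows [10]; this log10 variant is illustrative."""
    return torch.tensor(
        [int(math.floor(math.log10(abs(v)))) if v != 0 else 0 for v in values],
        dtype=torch.float)

class LinearLiteralFusion(nn.Module):
    """g(v_e, v_d || v_n): fuse a structural embedding with text and numeric literals."""
    def __init__(self, emb_dim, text_dim, num_feats):
        super().__init__()
        self.g = nn.Linear(emb_dim + text_dim + num_feats, emb_dim)

    def forward(self, v_e, v_d, v_n):
        return self.g(torch.cat([v_e, v_d, v_n], dim=-1))

def distmult_score(v_s, v_r, v_o):
    """DistMult scoring: sum_i v_s[i] * v_r[i] * v_o[i] (diagonal relation matrix)."""
    return (v_s * v_r * v_o).sum(dim=-1)

# Toy usage with random stand-ins for SciBERT outputs and structural embeddings.
emb_dim, text_dim, num_feats = 100, 768, 3
fuse = LinearLiteralFusion(emb_dim, text_dim, num_feats)
v_s = fuse(torch.randn(emb_dim), torch.randn(text_dim), exp_bin([0.03, 120.0, 2.0]))
v_o = fuse(torch.randn(emb_dim), torch.randn(text_dim), exp_bin([10.5, 1.1, 0.0]))
score = distmult_score(v_s, torch.randn(emb_dim), v_o)
```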
4.3 Experimental Evaluation

In this section, we first provide statistics of the KGs constructed for two different datasets. We then compare our method with several baseline methods and analyze the results.

4.3.1 Evaluation Datasets

We use two datasets in our experiments. The first dataset is from the Reproducibility Project: Psychology [94]. To determine a successful replication, we use the significance of the meta-analytic combination of the original and replication studies (coded as a binary value). It contains 70 papers in total from the psychology domain. The second dataset, SCORE, is from the SCORE project [4]. It includes 2362 papers from several different social and behavioral disciplines; the scores indicate the credibility of the claims in the papers. For both datasets, we consider continuous scores between 0 and 1 such that a larger value indicates higher credibility/reproducibility.

For both datasets, we create the KG by starting from the papers within the dataset and traversing the paper citation and author academic graphs. The papers within two hops of the root papers are kept. All information (authors, affiliations, venues, etc.) directly related to the selected papers is held in the KG. Table 4.1 shows more details of the constructed KGs.

Table 4.1: The statistics of the KGs

          # Nodes      # Edges
RPP       148,983      1,769,883
SCORE     2,287,066    36,144,015

4.3.2 Experimental Settings

All our experiments are run on a single Nvidia Quadro RTX 8000 GPU. We use Python to implement the models†. We use the PyKEEN‡ [3] library as the backbone of the KG embedding methods. For both KGs, we set the embedding dimension to 100, the batch size to 2048, the negative sample size to 16, and the learning rate to 0.0001. We run the KG embedding models for ten epochs on RPP and three epochs on SCORE. For our methods, we apply a simple linear layer in LiteralE to fuse description and numeric features. For the downstream task, since different methods may have different input dimensions, we freeze the embedding models and use an MLP that first projects the input into a 50-dimensional vector and then predicts a continuous score. For all methods, we run 5-fold cross-validation. For each experiment, we set the batch size to 4 and the learning rate to 1e-5, and run 100 epochs. We set the random seed to 42. We report the average performance.

† https://github.com/kianasun/kg4rr
‡ https://pykeen.readthedocs.io

4.3.3 Evaluation Results

Compared Methods We evaluate the following models on the two benchmark datasets:

• Random: Each score is a continuous value randomly sampled between 0 and 1.
• SciBERT: Entity descriptions are encoded by SciBERT [9]. The hidden states are used as inputs for scoring papers; they are also a part of our method.
• Numeric: Only numeric features are used as inputs.
• Yang: The approach proposed in [133], which leverages word embeddings trained on paper abstracts collected from MAG.
• TransE [12]: A classic translation-based KG embedding method. Only graph structures are leveraged.
• DistMult [132]: A KG embedding method based on bilinear interactions between entity and relation representations.
• ComplEx [123]: A KG embedding method representing each entity from real and imaginary perspectives.
• Ours (TransE): Our LiteralE-based [71] method, which incorporates description features, numeric features, and KG structures. TransE is used as the base model in LiteralE.
• Ours (DistMult): DistMult is the base model in LiteralE.
• Ours (ComplEx): ComplEx is the base model in LiteralE.

Metrics We consider continuous scores and leverage two metrics: Root Mean Squared Error (RMSE) and Kendall's Tau (KT) [66]. Given a list of ground-truth scores and a list of predicted scores, KT measures the correlation between the two lists, and RMSE measures the difference between them. KT scores lie in the range [−1, 1], the larger the better. RMSE scores are non-negative, the smaller the better.
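The two metrics can be computed directly with NumPy and SciPy; the snippet below is a minimal illustration for one cross-validation fold, using made-up scores.

```python
import numpy as np
from scipy.stats import kendalltau

def rmse(y_true, y_pred):
    """Root Mean Squared Error between ground-truth and predicted scores (lower is better)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def kendall_tau(y_true, y_pred):
    """Kendall's Tau rank correlation between the two score lists (higher is better)."""
    tau, _p_value = kendalltau(y_true, y_pred)
    return float(tau)

# Example: scores for a handful of held-out papers in one fold (illustrative values).
gold = [1.0, 0.0, 1.0, 0.3, 0.8]
pred = [0.7, 0.2, 0.9, 0.4, 0.6]
print(f"RMSE = {rmse(gold, pred):.4f}, KT = {kendall_tau(gold, pred):.4f}")
```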
Table 4.2: The main results on the two datasets. Each number represents an RMSE/KT score. The best-performing scores are highlighted, and the second-best scores are underlined.

                    RPP                  SCORE
                 RMSE↓     KT↑        RMSE↓     KT↑
Random           0.6080   -0.0642     0.3233    0.0040
SciBERT          0.4765    0.0694     0.1344    0.1675
Numeric          0.4731    0.0697     0.1379    0.1032
Yang             0.4404    0.1191     0.1326    0.1926
TransE           0.4931   -0.0856     0.1619   -0.0210
DistMult         0.5010   -0.0508     0.1674   -0.0127
ComplEx          0.5131    0.0573     0.1526   -0.0283
Ours (TransE)    0.4041    0.2454     0.1323    0.1809
Ours (DistMult)  0.4215    0.2437     0.1314    0.1894
Ours (ComplEx)   0.4420    0.1846     0.1316    0.1852

Table 4.3: The ablation study results on different text and number encoders.

                          RPP                  SCORE
                       RMSE↓     KT↑        RMSE↓     KT↑
Full (TransE)          0.4041    0.2454     0.1323    0.1809
 - Remove Num Encoder  0.4665    0.0536     0.1396    0.0651
 - Use BERT            0.4015    0.3299     0.1330    0.1664
 - Use LongFormer      0.4528    0.1320     0.1341    0.1588
Full (DistMult)        0.4215    0.2437     0.1314    0.1894
 - Remove Num Encoder  0.4661   -0.0342     0.1387    0.0674
 - Use BERT            0.4258    0.1912     0.1316    0.1907
 - Use LongFormer      0.4519    0.1553     0.1329    0.1614
Full (ComplEx)         0.4420    0.1846     0.1316    0.1852
 - Remove Num Encoder  0.4691   -0.1669     0.1351    0.1427
 - Use BERT            0.4420    0.1054     0.1309    0.1957
 - Use LongFormer      0.4606    0.0788     0.1340    0.1557

Main Results We report the main results in Table 4.2. We find: 1) Both micro- and macro-features perform better than random guessing. 2) Although SciBERT and Numeric perform better than pure KG embedding methods, the literal-enriched KG embedding methods achieve better performance by jointly learning useful information from both the micro and macro perspectives, and they generally outperform the previously published method. 3) Comparing the different KG embedding methods, DistMult performs better than TransE and ComplEx by a small margin. 4) The RMSEs on SCORE are generally smaller than those on RPP because the ground-truth scores in RPP are all binary, while the scores in SCORE are continuous values.

Ablation Study We show the ablation studies in Table 4.3. We first show the effectiveness of the Exponent number encoder: we remove the number encoder, use the original numeric features to learn the embeddings, and compare the results. In all cases, RMSE increases and KT decreases after removing the encoder. This shows that dividing numbers into bins can make it easier for models to learn representations. In the second experiment, we replace SciBERT with two other language models, BERT and LongFormer. After using BERT, RMSE increases in 3 out of 5 cases (for ComplEx, the two variants achieve the same RMSE on RPP), and KT decreases in 3 out of 6 cases. This experiment shows that SciBERT achieves slightly better performance than vanilla BERT. After using LongFormer, the performance decreases by a small margin in all cases. Comparing the different KG embedding methods, ComplEx performs the best on SCORE, and TransE performs the best on RPP.

Chapter 5
Knowledge Graph-based Question Answering with Contextual Ranking

5.1 Introduction

In Chapter 3, we applied a retrieval-generation-based pipeline to financial QA. In this chapter, we extend this paradigm and study the QA task with Knowledge Graphs (KGs) as knowledge sources. As discussed in previous chapters, KGs [11, 125] are high-quality, richly structured sources that contain unique information not easily found in unstructured text. As such, QA over KGs (KGQA) remains a popular parallel QA challenge. While semantic parsing approaches have traditionally dominated KGQA leaderboards, recent studies like Unik-QA [93] and DecAF [137] have sought to extend the retrieve-then-generate approach to KGQA by converting <subject, relation, object> triples into natural language sentences.
The primary challenge of this line of research is the substantial scale and complexity of KGs, which make it challenging to identify the relevant triples necessary to generate a correct answer. Prior works like DecAF [137] attempt to overcome this challenge with a specialized "reader" module that produces a logical form executed against the KG to retrieve relevant triples before answer generation. While effective, this approach is limited: 1) requires extensive logical form training data, which is not always available, and 2) introduces latency due to logical form execution. In this chapter, we increase the relevance of retrieved triples by employing a carefully designed ranker. Our ranker improves the salience of retrieved candidate triples by exploiting "context": triples that share a 68 Figure 5.1: The comparison between our proposed ranking process with 1) classic retrieve-then-generate pipeline, and 2) ranking with coarse-grained document-level labeling strategy used during training. subject entity with the retrieved candidate triple. This rich context helps disambiguate similar candidates by uncovering dependencies between questions and the subgraph surrounding retrieved triples, enhancing the accuracy of rankings. Specifically, we 1) retrieve over "entity-centric" documents that contain triples with the same subject entity, 2) re-rank triples by relevance using context obtained from the one-hop neighborhood of retrieved subject entities, and 3) generate a final answer using top-K re-ranked triples. Existing KGQA benchmarks do not include labels for ranker training. In this work, we study two labeling strategies that exploit KG structure to derive these labels automatically: 1) a naive "document-level" labeling strategy that coarsely labels triples based on their membership in a relevant document and 2) a novel "triple-level" labeling strategy that leverages the co-occurrence of topic entities and their corresponding answers to granular, higher quality labels. For example, given the question "Who is Justin Bieber’s 69 brother," "document-level" labeling considers all triples containing "Jaxon Bieber" as positives. At the same time, the "triple-level" strategy only finds the most relevant triple <Jaxon Bieber, sibling, Justin Bieber> as positive. We extensively study the ability of our contextual re-ranker to improve the relevance of retrieved triples and overall KGQA performance on the popular FreebaseQA [60] and WebQSP [135] benchmarks. Our study makes the following contributions: 1) We show that incorporating a contextual ranker is an efficient and accurate way to increase the relevance of the information provided to the generator and improve KGQA performance, increasing the Exact Match score on FreebaseQA by 5.56% over DecAF [137], the prior state of the art. 2) We investigate two novel labeling strategies to infer labels for contextual ranker training, as shown in Figure 5.1, and study its benefits on ranker and overall KGQA performance. 3) We demonstrate how to construct "entity-centric" retrieval documents to produce context that improves ranker performance for the KGQA task. Figure 5.2: The proposed framework for KBQA. The framework contains three modules: Retriever, contextual Re-ranker, and Generator. 70 5.2 Retrieve-Rerank-Generate Pipeline for KGQA In this section, we formalize the KGQA task and provide details of the approach consisting of Retriever, Re-ranker, and Generator, as shown in Figure 5.2. 
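Before detailing each module, the following sketch outlines how the three stages fit together end to end. The module interfaces (retriever, reranker, generator callables) and the document structure are illustrative placeholders rather than the exact APIs of our implementation.

```python
from dataclasses import dataclass

@dataclass
class Document:
    """An entity-centric retrieval document: verbalized triples sharing one subject entity."""
    doc_id: str
    triples: list  # each triple is already verbalized as a short factoid sentence

def answer(question, retriever, reranker, generator, n_docs=100, top_k=50):
    """Retrieve -> re-rank -> generate, mirroring Figure 5.2.

    Assumed interfaces:
      retriever(question, n) -> list[Document]
      reranker(question, triple, context) -> relevance score in [0, 1]
      generator(question, triples) -> answer string
    """
    docs = retriever(question, n_docs)

    scored = []
    for doc in docs:
        for i, triple in enumerate(doc.triples):
            # Context = the other triples in the same document (same subject entity).
            context = " ".join(doc.triples[:i] + doc.triples[i + 1:])
            scored.append((reranker(question, triple, context), triple))

    top_triples = [t for _, t in sorted(scored, key=lambda x: x[0], reverse=True)[:top_k]]
    return generator(question, top_triples)
```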
5.2.1 KGQA Task Formulation

A KG can be denoted as G = {⟨e_s, r, e_t⟩ | e_s, e_t ∈ E, r ∈ R}, where E and R represent the entity set and the relation set, respectively. Each triple ⟨e_s, r, e_t⟩ ∈ G indicates the existence of the relation r between the source entity e_s and the target entity e_t. Given a natural language question q represented by a sequence of word tokens, the task of KGQA is to find an answer a to q using triples from G. The answer a can be a natural language sequence or an entity in E.

5.2.2 Modules

Our proposed retriever-reranker-generator pipeline works in three steps, as shown in Figure 5.2: 1) the retriever takes a user-given question as input and retrieves N relevant documents from the pre-defined KB, 2) the re-ranker selects the most informative triples from these documents, and 3) the generator leverages both the question and the top-ranked triples from the previous step to generate the final answer.

Retriever In line with prior works [137, 93, 50], we employ both BM25 [102] (i.e., sparse) retrieval and DPR [63] (dense) retrieval to obtain relevant documents from our indexed KG. BM25 [102] leverages TF-IDF [100] scores for word matches between a query and our indexed KG. DPR [63] consists of two BERT-base [37] models that encode questions and documents into low-dimensional embedding spaces. The two models are trained with a contrastive objective such that the similarity between the encodings of a question and its relevant documents is maximized.

The "verbalization" of KG relations to text is a well-studied problem that can be addressed with both templates [93] and generative models [2]. For example, the relation <Will Smith, Age, 54> can easily be verbalized with a template to produce "Will Smith's Age is 54". We use this template strategy to verbalize our KG as single factoid sentences. However, the grouping of verbalized relations into retrieval documents with meaningful structure remains an open problem. Prior works either treat each relation as its own retrieval document, as in Unik-QA [93], or apply length-based splitting of randomly grouped relations of the same subject entity, as in DecAF [137]. The former option results in a large number of retrieval documents that lack significant contextual information, such as other relations about the same entity, while the latter potentially splits a triple across separate documents. We employ an alternative approach to document generation, where relations sharing the same subject are consolidated into a single document in an arbitrary order, with the constraint that each triple is contained entirely within a single document. Our document generation method decreases the number of documents in our index, enhancing retrieval efficiency. Simultaneously, it ensures that each relation is situated within the context of other relations featuring the same subject entity and that no triple is split across documents. A sketch of this document construction step is given below.
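The sketch assumes triples are given as (subject, relation, object) string tuples and caps each document at ten triples, matching the setting reported in Section 5.3.1; unlike our implementation, it groups triples in a fixed rather than random order, and the helper names are hypothetical.

```python
def verbalize(triple):
    """Turn a <subject, relation, object> triple into a single factoid sentence,
    e.g. ("Will Smith", "Age", "54") -> "Will Smith's Age is 54"."""
    subject, relation, obj = triple
    return f"{subject}'s {relation} is {obj}"

def build_documents(triples, max_triples_per_doc=10):
    """Group verbalized triples by subject entity so that each document carries
    contextual facts about one entity and no triple is split across documents."""
    by_subject = {}
    for triple in triples:
        by_subject.setdefault(triple[0], []).append(verbalize(triple))

    documents = []
    for subject, sentences in by_subject.items():
        for start in range(0, len(sentences), max_triples_per_doc):
            documents.append({
                "subject": subject,
                "text": " ".join(sentences[start:start + max_triples_per_doc]),
            })
    return documents

docs = build_documents([
    ("Will Smith", "Age", "54"),
    ("Will Smith", "Profession", "Actor"),
    ("Jaxon Bieber", "sibling", "Justin Bieber"),
])
```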
Figure 5.3: The architecture of the contextual re-ranker. Each input consists of a question, a candidate triple, and its context. The output is either 0 (irrelevant) or 1 (relevant).

Contextual Re-ranker Recent studies on answer re-ranking over text documents have demonstrated the significant benefits of contextual information, such as sentences within the same paragraph as a target sentence [75]. Inspired by this prior work, we hypothesize that such contextual information can also be helpful for triple re-ranking in the KGQA setting. The notion of context used in these prior works assumes a natural ordering of information; that is, it assumes that sentences within the same paragraph as a target sentence contain helpful contextual information. Our document generation strategy is designed to fulfill this assumption by ensuring that relations within the same document share the same subject and therefore provide helpful contextual information.

Contextual Re-ranking Model We extend the ELECTRA-large [29] contextual re-ranker proposed by Lauriola and Moschitti [75] as our re-ranker backbone. Given a question q, the retriever retrieves a list of documents [D_1, D_2, · · ·, D_m], where each document D_i = [T_1, T_2, · · ·, T_n] contains a list of triples. For each triple T_j ∈ D_i, the context C_j of T_j is the concatenation of the other triples in the same document, T_{1···j−1} and T_{j+1···n}. The input to the re-ranker is the concatenation of q, T_j, and C_j, where the three input types have different token type embeddings. Figure 5.3 shows an example. In line with prior work, the re-ranker is trained with a classification objective such that positive triples have label 1 and negative triples have label 0.

Labeling Strategy The process of assigning high-quality relation labels is essential for re-ranker training. Prior studies [46] use document-level labels for re-ranking documents, in which all documents containing the gold answer are given positive labels. In our framework, this approach degrades label quality in that it effectively teaches the model that every triple from a correct document is relevant to the question. For example, consider the question "who is Justin Bieber's brother" and its answer, "Jaxon Bieber". Suppose we retrieve two documents: "Justin Bieber sibling Jaxon Bieber. Justin Bieber people person ethnicity Canadian" and "Jaxon Bieber sibling Justin Bieber". Using the "document-level" labeling strategy, the irrelevant triple "Justin Bieber ethnicity Canadian" receives the same label as the highly relevant triple "Jaxon Bieber sibling Justin Bieber", only because the document containing it also mentions the gold answer. We hypothesize that the "false positive" cases introduced by this labeling strategy impede the ability of the re-ranker to differentiate between highly relevant and largely irrelevant triples.

We propose to mitigate this shortcoming with a fine-grained "triple-level" labeling strategy, which gives a positive label to triples that contain both the gold answer and a topic entity. If no such triple exists, it removes the topic-entity constraint and checks only for the gold answer. We add this fallback check because, although topic entities are always provided in the dataset annotations, the answer entity and topic entity are not always within a single hop of each other. Before training the re-ranker, we first infer gold documents from the KB. For each training question, we traverse the one-hop subgraphs of both the topic entity and the answer entity; if both entities appear in a triple, the corresponding document is identified as a gold document. During training, we process both the inferred gold documents and the retrieved documents to create positive and negative samples for the re-ranker using either the "document-level" or the "triple-level" strategy.
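The contrast between the two strategies can be made concrete with a short sketch. Triples are represented here as plain verbalized strings, and relevance is checked by substring matching, which is a simplification of the actual labeling code.

```python
def document_level_labels(document_triples, answer):
    """Document-level: every triple in a document that contains the gold answer is positive."""
    doc_is_gold = any(answer.lower() in t.lower() for t in document_triples)
    return [1 if doc_is_gold else 0 for _ in document_triples]

def triple_level_labels(document_triples, answer, topic_entity):
    """Triple-level: a triple is positive only if it mentions both the gold answer and the
    topic entity; if no such triple exists, fall back to checking the answer alone."""
    def contains(triple, span):
        return span.lower() in triple.lower()

    labels = [1 if contains(t, answer) and contains(t, topic_entity) else 0
              for t in document_triples]
    if not any(labels):  # fallback: answer and topic entity may be more than one hop apart
        labels = [1 if contains(t, answer) else 0 for t in document_triples]
    return labels

doc = ["Justin Bieber sibling Jaxon Bieber",
       "Justin Bieber ethnicity Canadian"]
print(document_level_labels(doc, "Jaxon Bieber"))                 # [1, 1]
print(triple_level_labels(doc, "Jaxon Bieber", "Justin Bieber"))  # [1, 0]
```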
Generator Given a question q, the Retriever retrieves N documents D_1, D_2, · · ·, D_N, and the re-ranker produces a confidence score for each triple T_j ∈ D_i. The triples are re-ranked based on their confidence scores, and only the top-K triples are selected and used for answer generation. For better information aggregation, we use Fusion-in-Decoder (FiD) [57] as the Generator. FiD uses the sequence-to-sequence Transformer model T5 [99] as its backbone. For each selected triple T_j, FiD encodes the question together with the triple:

P_j = Encoder(q; T_j),

where P_j is the hidden representation of T_j. The token embeddings of all passages in the last layer of the encoder are concatenated and passed to the decoder. The decoder then generates the answer:

a = Decoder(P_1, · · ·, P_K),

where a is the generated answer represented as a sequence of word tokens.

5.3 Experimental Evaluation

5.3.1 Knowledge Graph

As most public benchmark KGQA datasets are created based on Freebase [11], in line with prior work [137], we use the full Freebase pre-processed by Wang et al. [128] as the KG in our experiments. It contains about 88 million entities, 20 thousand relations, and 472 million triples. We apply our document generation strategy as described above with at most ten triples per document, resulting in an index of 107 million documents.

Retriever We evaluate three retrieval techniques: BM25 [102], DPR [63], and a balanced hybrid that sums BM25 and DPR scores, all implemented using Pyserini [82]. To fine-tune DPR, we label all documents with at least one triple containing both the topic entities and the answer entities as positive and leverage BM25 to retrieve hard negatives. Yu et al. [137] showed that while BM25 and DPR produce comparable results on FreebaseQA, DPR works better on WebQSP. We thus use BM25 to retrieve documents for FreebaseQA questions and the hybrid retriever for WebQSP questions.

Re-ranker For each dataset, we extend the ELECTRA-large architecture proposed by Clark et al. [29] and adapted to the contextual sentence selection task by [75]. Our re-ranker is trained with a maximum sequence length of 256 and a learning rate of 1e-5 for at most five epochs. The best training checkpoint is selected using the answer Hit@5 on the dev split of each dataset.

Generator We fine-tune Fusion-in-Decoder (FiD) as our generator using the original FiD implementation [57]. We report our performance with both T5-base and T5-large [99] as the base model. We set the maximum sequence length to 400 and the maximum answer length to 128 for training and evaluation. We report results using 50 and 100 triples as input to the generator.

5.3.2 Datasets

We study our approach on two popular KGQA benchmarks: WebQSP and FreebaseQA. FreebaseQA [60] consists of questions whose answers are Freebase entities. The train, dev, and test splits contain 20358, 3994, and 3996 questions, respectively. The dataset only contains annotations for topic and answer entities, not logical forms. We use Hit@1 / Exact Match as the evaluation metric. WebQSP [135] is also based on Freebase. The original data only contains train and test splits; for training, we further split the training data into train and dev splits with a ratio of 9:1. The final train, dev, and test splits contain 2789, 309, and 1639 questions, respectively. Besides topic and answer entities, the data also contain logical form annotations; we do not use logical forms in our pipeline. We report Hit@1 as calculated by the official WebQSP evaluation script.
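Before turning to the results, the balanced hybrid retrieval described in Section 5.3.1, which sums BM25 and DPR scores per document, can be sketched as follows. The min-max normalization is an assumption added here so that the two score scales are comparable; it is not necessarily the scaling used in our implementation.

```python
def normalize(scores):
    """Min-max normalize a {doc_id: score} dict so BM25 and DPR scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

def hybrid_ranking(bm25_scores, dpr_scores, n_docs=100):
    """Sum normalized sparse and dense scores and return the top-n document ids."""
    bm25, dpr = normalize(bm25_scores), normalize(dpr_scores)
    doc_ids = set(bm25) | set(dpr)
    fused = {d: bm25.get(d, 0.0) + dpr.get(d, 0.0) for d in doc_ids}
    return sorted(fused, key=fused.get, reverse=True)[:n_docs]

# Toy example: the two retrievers disagree on the best document.
print(hybrid_ranking({"d1": 12.3, "d2": 9.1}, {"d2": 0.82, "d3": 0.79}, n_docs=2))
```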
5.3.3 Experiments

End-to-end KGQA Performance To demonstrate the efficacy of our proposed pipeline, we compare the end-to-end KGQA performance with prior works in Table 5.1. Our approach significantly outperforms all prior work on FreebaseQA, including DecAF, the prior state of the art, by 5.56% Hit@1. We also note that our (base, 100) and (base, 50) models outperform the prior SOTA while using 550M fewer parameters.

Table 5.1: The overall performance of our framework. The column LF? indicates whether the model uses logical forms. FreebaseQA does not have logical form annotations. DecAF - Answer only is a variant of DecAF that does not leverage logical forms. For each category (using or ignoring LF), the best-performing results are highlighted in bold font, and the second-best results are underlined.

Model                              FreebaseQA Hit@1   LF?   WebQSP Hit@1
FILM [124]                         63.3               -     -
CBR-SUBG [32]                      52.1               Yes   72.1
PullNet [113]                      -                  Yes   67.8
ReTrack [19]                       -                  Yes   74.7
DecAF (large, 100) [137]           79.0               Yes   80.7
Unik-QA [93] (base)                -                  No    76.7
Unik-QA [93] (large)               -                  No    79.1
DecAF - Answer only (large, 100)   79.0               No    74.7
Ours (base, 50)                    80.9               No    71.8
Ours (large, 50)                   84.3               No    76.9
Ours (base, 100)                   80.2               No    77.2
Ours (large, 100)                  84.2               No    77.8

Table 5.2: The re-ranker ablation studies on FreebaseQA and WebQSP. We show the retriever, re-ranker, and generator performance in terms of Hit@K. We additionally report the GT Triple hit rate following the triple-level labeling strategy defined above. We report the generator performance using FiD-base with 20 passages per question to reduce the computation required.

(a) Results on FreebaseQA
                     Retriever   Re-ranker                                        Generator
Ranking Method       Hit@500     Hit@1   Hit@10   Hit@100   GT Triple Hit@100     Hit@1
No ranker            98.2        37.8    68.1     90.1      33.6                  45.5
Doc-level Label      98.2        39.7    67.4     81.2      77.3                  75.4
Triple-level Label   98.2        54.0    84.0     95.2      78.3                  80.0

(b) Results on WebQSP
                     Retriever   Re-ranker                                        Generator
Ranking Method       Hit@500     Hit@1   Hit@10   Hit@100   GT Triple Hit@100     Hit@1
No ranker            98.1        30.0    53.9     83.7      44.4                  57.4
Doc-level Label      98.1        50.3    70.4     79.2      68.9                  67.6
Triple-level Label   98.1        73.0    86.0     91.0      74.0                  70.5

On WebQSP, our approach exceeds the performance of PullNet [113], ReTrack [19], and CBR-SUBG [32] but falls short of the results of Unik-QA and DecAF. We argue that DecAF maintains a slight advantage on WebQSP by exploiting the significant volume of logical form training data not used in our approach; we note a significant 3.1% improvement over the directly comparable DecAF Answer-Only setting, which does not exploit this additional data. While Unik-QA outperforms our approach when using FiD-large as the generator, it relies on an entity-linking module and only creates an index over question-specific subgraphs for retrieval. As explored in prior studies [111], entity linking modules tend to be dataset-dependent, making retrieval-based approaches more practical. Furthermore, rather than generating an index for each question or entity, our method builds a single index that can be conveniently re-used for any question, potentially enhancing efficiency. Additionally, when compared to Unik-QA with FiD-base, the marginally superior performance of our approach suggests that the re-ranking process offers more significant advantages for smaller generation modules.

Ablation Studies on the Re-ranker To demonstrate the benefits of our re-ranker within the overall pipeline, we perform a no-ranker ablation and a doc-level labeling ablation that uses document-level labeling instead of our proposed triple-level labeling. Table 5.2 reports the performance of the different components in the resulting ablated pipelines in terms of hit rates.
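The Hit@K numbers reported in Tables 5.1 and 5.2 follow the usual definition: a question counts as a hit if at least one gold answer appears among the top-K candidates. A minimal sketch is shown below, using simple string matching rather than the official evaluation scripts.

```python
def hit_at_k(ranked_candidates, gold_answers, k):
    """Return 1 if any gold answer appears in the top-k ranked candidates, else 0."""
    top_k = [c.lower() for c in ranked_candidates[:k]]
    return int(any(g.lower() in top_k for g in gold_answers))

def mean_hit_at_k(all_rankings, all_gold, k):
    """Average Hit@K over a dataset, reported as a percentage."""
    hits = [hit_at_k(r, g, k) for r, g in zip(all_rankings, all_gold)]
    return 100.0 * sum(hits) / len(hits)

# Toy example with two questions.
rankings = [["Jaxon Bieber", "Canada"], ["Los Angeles", "California"]]
gold = [["Jaxon Bieber"], ["San Francisco"]]
print(mean_hit_at_k(rankings, gold, k=1))  # 50.0
```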
The retriever Hit@500 shows that at least one correct document is retrieved within the top 500 for nearly all questions in both datasets. As shown Table 5.2, both re-ranker approaches substantially improve over the no-ranker setting in terms of end-to-end KGQA performance on both datasets, validating the benefit of incorporating a re-ranker in this setting. However, including a re-ranker trained with "doc-level" labels only improves the Hit@K performance for K values less than 100; this trend does not hold at K>=100, and the performance degrades below the no-ranker setting. This result supports the claim that document-level labeling produces noisy labels. A higher-quality labeling strategy (e.g., our triple-level tags) has a much more powerful ranker. In other words, both re-ranker and end-to-end KGQA performance is highly influenced by the choice of the labeling strategy: our "triple-level" strategy guides the model to differentiate relevant and irrelevant triples better and thus provide the most pertinent information to the generator. We believe that extending and further improving re-ranker label quality is a promising future direction of this work. 78 We additionally report the gold/ground-truth (GT) triple hit rate, which measures the hit rate of "triplelevel labels" in the ranker output. We note that while the answer hit rate@100 varies by only 15%, the GT triple hit@100 and end-to-end KGQA performance vary by a far more significant amount. We additionally note that end-to-end KGQA performance is within 2-4% of the GT Triple hit rate, indicating the utility of our inferred labels as a highly correlated indicator of the overall KGQA performance. Effect of Context The term "context" is an essential component of the input in our contextual re-ranker. To have a better understanding of the effect of context, we conduct a no context ablation, wherein the reranker excludes contextual triples and only considers a candidate triple and the question as input. As with the default setting, the no-context re-ranker is trained to classify the encoding of the concatenated string. We utilize the same training data as in our default setting and fine-tune the model based on a pre-trained AS2 model [75]. We evaluated the default setting and the no-context ablation using the same retrieved documents and the same generator (FiD-base), focusing solely on the top-20 re-ranked triples during the final answer generation. The results demonstrate that the contextual re-ranker consistently outperforms its no-context counterpart, achieving 82.4 vs. 81.3 on FreebaseQA and 76.8 vs. 75.9 on WebQSP. These findings suggest that the "context" offers valuable insights that assist in distinguishing between candidates more effectively, allowing for a more precise identification of the relevant triples. Error Analysis To gain deeper insights into our processing pipeline, we performed a randomized selection of examples where our model produced incorrect predictions. As is shown in Figure 5.4, these examples were subsequently divided into five primary categories: 1) Confusing Triples: In certain instances, the selected triples, while relevant, were incorrect. These confusing triples have the potential to misdirect the model, therefore degrading the overall system performance. This emphasizes the necessity for designing an effective labeling strategy. A precise approach with the proper granularity is essential for improving the performance. 
79 Figure 5.4: The error analysis with example questions sampled from both the FreebaseQA and WebQSP datasets. Each example includes the raw question, the gold answers, the predicted answers from the bestperforming model, the error type, and a detailed rationale for the error. 2) Strict Evaluation: In some scenarios, predictions semantically align with the gold answers but are still treated as wrong. Moving forward, it would be beneficial to incorporate auxiliary metrics that evaluate semantic equivalence, ensuring a more comprehensive assessment of our task. 3) Incomplete Labels: There exist questions for which the predictions are accurate but are not included in the gold answer set. 4) Complex Constraints: Certain questions need the answer to satisfy all specified constraints. However, our model does not consistently meet these requirements. This can potentially be addressed by enriching the training dataset. 80 5) Relative Information: A subset of questions needs the understanding of sequential or temporal information. Our current system mainly focuses on answer extraction rather than complex reasoning. To address this, plans could integrate an enhanced generation module, focusing on complex reasoning over multiple triples/documents. From our analysis, we believe that integrating a ranking module for the KGQA task is beneficial. The key to this approach is the choice of an appropriate ranking model and developing a precise labeling strategy to train the model, enhancing its capability to select salient information while avoiding ambiguities. Besides the research direction of improving information selection, there is also potential in exploring advanced generation modules that reason within diverse constraints. 5.4 Related Work In this section, we discuss prior studies closely related to this work. KGQA Semantic parsing and Information retrieval are two primary approaches to KGQA, with semantic parsing receiving the bulk of study historically. Two prior studies are closely related to our approach. Unik-QA [93] first uses an entity linking model to identify a subgraph, then applies a dense retriever to retrieve relevant facts, and finally applies the sequence-to-sequence model to generate answers using the retrieved facts. Since most prior approaches are based on entity-linking systems, another recent paper, DecAF [137], proposed to get rid of the entity linker and directly retrieve data from an indexed KB. With the retriever, DecAF is trained to generate logic forms and direct answers and combine their results. Both Unik-QA and DecAF are closely related to this paper. Similar to DecAF, we remove the entity linking model for more efficiently solving the real-world problem. Like Unik-QA, our system focuses on direct answer generation without using logic form annotations. 81 Retriever-Reranker-Generator The retriever-ranker pipeline has been successfully used in many previous studies in different natural language tasks [80, 13]. Typically, an efficient retriever obtains relevant documents from an extensive collection, and a more powerful cross-encoder ranker refines results concerning a downstream task. Unlike these prior works, we focus on using a KG instead of natural language text, significantly changing the techniques required to develop a powerful ranker. Triple Selection Most factual questions require specific KG triples to provide answers; triple selection is thus an integral component in the development of KGQA systems. 
Since KG triples can be converted into natural language sentences, this task can be construed as a subgenre of the prevalent answer sentence selection task [44, 58]. Recent studies [75, 81] have shown that contextual information can significantly improve AS2 accuracy. 82 Chapter 6 Related Work Data is stored in various formats. Generally, data can be categorized into two major types: unstructured and structured. Structured data refers to data organized in well-defined structures and easily understood by humans, while unstructured data is usually in free form and lacks organization. Examples of structured data include tabular data in Excel spreadsheets or CSV formats, XML (eXtensible Markup Language) defined with hierarchical relationships, JSON (JavaScript Object Notation) represented in nested key-value pairs, and graph data, including relationships between nodes such as entities. Studying structured data has been an essential line of research in past decades, which is due to the following several reasons. First, structured data is highly formatted, making it easier to store, retrieve, and manage. Therefore, a large volume of information is organized in structured data, making it a helpful knowledge source. Second, some structured data types, such as Knowledge graphs (KGs), have predefined schema and are curated by domain experts, providing high-quality knowledge and preventing errors in the data. Lastly, structured data formats, such as XML and JSON, contribute to standardization and interoperability. They allow for the exchange of data between different systems and applications. In this dissertation, we focus on two popular types of structured data: tabular data and KGs. Tabular data is one of the most challenging types of structured data due to variations in schema and format across domains, making it difficult for analysis and knowledge integration. KGs, conversely characterized by 83 precise schemas and ontologies, offer a valuable source of rich and high-quality knowledge. Advances in inaccurate table understanding structural analysis and information selection from both tables and KGs can benefit various downstream knowledge-intensive tasks. 6.1 Table Processing and Analyzing There are several lines of research efforts seeking to solve different tasks for automated processing and analyzing tabular data. Since many tables are stored in other media such as images and PDFs, table detection becomes an important task towards automated table analysis [138, 45, 144]. Table detection seeks to recognize table layouts in documents as minimal bounding boxes. Many research works are aiming at solving this problem. The earliest studies focused on rule-based methods. For example, [53] distinguished tables from text areas according to horizontal and vertical lines, and they used a dynamic programming matching algorithm to arrange character strings. [54] partitions a document into several tables based on table quality measures. [109] detected tables from more documents with more diverse layouts using an OCR-based method, including detecting page columns, locating table columns, and marking table regions. Most recent approaches are based on machine learning algorithms, Once the tables are extracted, the next step is understanding table structures automatically. One recent paper [98] introduces a table understanding pipeline consisting of three main stages: cell classification, functional block detection, and layout prediction. 
These three steps aim to understand structures from three granularitys: a single cell, a small region of the table, and the whole table. Chapter 2 also studies table structural analysis from these three aspects. As introduced in Chapter 2, cell classification requires cell-level representation learning. Among the studies of representation learning for tables, few efforts have been made to structural embeddings, while more have focused on representing the semantics of tabular content. For example, TAPAS [52] employs BERT’s architecture [37] to encode tables to handle table-aided NLP problems. For the same purpose, 84 TaBERT [136], another model built on top of BERT, is jointly trained over textual and tabular data. [36] also designed a Transformer-based model to learn representations for tabular data, though its focus is on knowledge extraction from tables. On the other hand, structural representation learning is more necessary than capturing semantic features in functional block detection. [47] proposed a cell embedding model to encode individual cells considering contextual and stylistic features. [126] introduced a multi-granular representation learning model that captures semantic and structural information while the model requires supervised training. Our clustering method uses the cell embedding model to represent blocks to determine similarities. The second task, functional block detection, being different from table detection, focuses on identifying functionally consistent subunit blocks of each table. As a first attempt, [118] applied a top-down BayesianCART-based model, which performs randomly row-wise and column-wise splits to produce blocks according to data types of cells. In this context, [68] discussed building functional blocks in tables despite their goal to correct imperfect cell classification results. The proposed method leverages rich knowledge encapsulated in cell vectors and constructs blocks in a more general way. In addition, the technique builds blocks without knowing cell labels. Many other related works aim to process tabular data automatically. [16], [31] and [40] introduced content, linguistic, global, and local features, and [91] proposed a hybrid architecture TabNet using recurrent neural networks and CNN to encode tabular data, to perform table type classification. [112, 110, 39] used engineered or automatically inferred features to construct rule-based engines to transform tabular data into more formal tables, i.e., relational databases. 85 6.2 Knowledge-intensive Tasks with Tables and KGs Besides the structural analysis designed for complex tables, another vital research direction is to select and leverage knowledge from structured data for downstream tasks. Example downstream tasks include question answering, fact verification, and classification. This dissertation involves QA with KG in Chapter 5, QA with tables in Chapter 3, and KG representation learning and classification in Chapter 4. Table-based QA In the past decade, many research studies have focused on curating table-based datasets and designing novel approaches for QA. Several datasets are created solely using wiki tables. For example, WikiTable-Questions [95], WikiSQL [141], and SQA [56]. Recently, a few new datasets using tables and textual descriptions have been introduced. These datasets aim to evaluate the model’s ability to fuse different sources of information. 
For example, HybridQA [21] leverages WikiTables and Wikipedia text to answer natural questions; FinQA [25], TAT-QA [143] and MultiHierTT [140] uses financial tables and descriptions to answer financial questions involving numerical reasoning; AIT-QA [64] uses airline information for domain-specific questions. Towards improving QA performance, one line of studies focuses on improving Transformer-based model architectures and pre-training strategies such as TAPAS [52], TaBERT [136], TAPEX [84], and OmniTab [61]. Since the development of GPT, several studies, e.g., Binder [26], began to use models the same size as GPT to explore few-shot scenarios. For domain-specific QA, most of the prior approaches are either based on retriever-generator pipelines [25] or large language models with in-context examples [20]. These models provide promising results in directly generating answers and generating executable SQL or mathematical programs. Our study in Chapter 3 concentrates on financial QA based on case-based reasoning. KGQA This task aims to answer questions using information from given KGs. Most of the existing datasets, such as WebQSP [135], FreebaseQA [60], ComplexQuestions [122], and GrailQA [48], are created using general-domain KGs such as Freebase [11] and WikiData [125] . In Chapter 5, we explore novel 86 approaches for KGQA. Generally, Semantic parsing and Information retrieval are two primary approaches to KGQA, with semantic parsing receiving the bulk of study historically. In general, given the question, these approaches aim to rank or generate SPARQL queries that can answer the question using the entity relationships. For example, QGG [74] and SPARQA [120] use predefined templates and constraints to identify logic forms. RnG-KBQA [134] enumerates and ranks a pool of candidate logic forms and applies a sequence-to-sequence generator to provide final ones. Another recent approach, ArcaneQA [49], reduces the search space with dynamic program induction. These approaches usually rely on the SPARQL executor to provide final answers. Another primary class of approaches is based on information retrieval. For example, PullNet [113] iteratively retrieves subgraphs around the topic entity and identifies answer entities using graph convolutional networks. Besides KGs, Graft-Net [114] also considers textual information and retrieves heterogenous subgraphs to answer questions. EmbedKGQA [106] learns representations of all entities and ranks entities based on scores between the question and entity representations. Recently, a few studies have applied sequence-to-sequence language models to generate answers directly. For example, KGT5 [105] trains a sequence-to-sequence model to create solutions instantly; Unik-QA [93] first uses an entity linking model to identify a subgraph, then applies a dense retriever to retrieve relevant facts and finally apply the sequence-to-sequence model to generate answers using the retrieved facts. Since most prior approaches are based on entity-linking systems, another recent paper, DecAF [137], proposed to get rid of the entity linker and directly retrieve data from an indexed KB. With the retriever, DecAF is trained to generate logic forms and direct answers and combine their results. Both Unik-QA and DecAF are closely related to this paper. Similar to DecAF, we remove the entity linking model for more efficiently solving the realworld problem. Like Unik-QA, our system focuses on direct answer generation without using logic form annotations. 
87 KG Representation Learning A KG consists of triples showing relationships between entities where each entity is represented as a node, and each relation is defined as a directed edge between two nodes. KG representation learning aims to represent each entity and relation with a continuous vector. In the past decades, many research studies have been conducted in this area. The most popular models can be divided into translation-based models, rotation-based models, and vector decomposition-based models. Translation-based models, such as TransE [12], TransD [59], and TransH [127], represent tail vectors using a translation function between head vectors and relation vectors. These models are designed to capture the structural semantics of a KG, emphasizing the relationships between entities. Unlike translation-based models, rotation-based models, such as RotatE [121], QuatE [139], and DualE [18], focus on capturing the geometric transformations between entities and relations. For example, in RotatE, entities are embedded in the complex number space, and rotations are applied to capture the relationships between entities. Vector decomposition-based models typically represent entities and relationships in a continuous vector space and leverage decomposition techniques to capture various aspects of the interactions between entities. A score is calculated based on the interactions between entity and relation vectors. Example models include DistMult [132], SimplE [65], and ComplEx [123]. In recent years, prior approaches mainly aim to capture structural knowledge of KGs. Based on these studies, more research efforts have been made to capture structural and additional semantic information. By incorporating structured embeddings and auxiliary features associated with entities, hybrid models aim to capture a more contextually rich representation of entities. Example models include LiteralE [71], R-GCN [108] and KG2E [51]. Based on the experiments, hybrid models, potentially producing better entity representations, lead to performance improvement in downstream tasks such as entity classification and KG completion. Following these findings, in Chapter 4, we apply one of these hybrid models on domainspecific KGs and show better performance on scientific paper reproducibility prediction, which is an entity classification task. 88 Chapter 7 Conclusion and Future Work In this dissertation, we conduct several research studies for addressing distinct downstream tasks using two significant types of structured data - tables and knowledge graphs. In the first study, we investigate automated structural understanding for complex tables from various topic domains. We introduced a hybrid probabilistic approach for addressing three structural understanding tasks: classifying table cells, identifying rectangular functional blocks, and predicting relationships between functional blocks. We experimentally evaluate the system on a new dataset containing hundreds of tables crawled from public websites. We then presented a novel approach for answering financial questions involving tabular and textual data from financial reports. During answer generation, the new method leverages three significant types of contextual information: questions, relevant facts, and abstract program patterns retrieved from annotated questions using Case-based Reasoning. 
We conducted experiments on public financial QA datasets to show that the auxiliary contextual information, retrieved program patterns, is beneficial for generating accurate programs, especially when the training data is limited. In the second part of the thesis, we delve into another type of structured data - knowledge graphs. In the first study, we briefly introduced a process of constructing scientific KGs that encapsulate different 89 entities and features. More importantly, we apply prior KG representation methods that fuse structural information and entity-specific features. We experimentally evaluate these approaches on the reproducibility prediction task to show that both macro and micro features contribute to the prediction performance. We finally introduce another study for QA using general domain KGs. We extend the existing retrievergenerator paradigm to a new retriever-ranker-generator pipeline. Specifically, the ranker module leverages a cross-encoder that re-evaluates retrieved candidate triples using one-hop neighborhood as context. The experiments on public KGQA datasets show that the ranking process is capable of identifying more informative contexts and generating answers more efficiently. 7.1 Future Work We have presented several advancements in understanding and leveraging structured data for distinct knowledge-intensive tasks. In this section, we envision future research directions with structured data. The first direction is the exploration of verbalizing tables using the outputs from the structural understanding system introduced in Chapter 2. In the existing financial QA pipeline proposed in Chapter 3, each table is verbalized into natural language sentences. The verbalized tables are combined with textual descriptions to serve as candidate facts such that the fact retriever retrieves the most relevant facts from all candidates. As the initial attempt, we verbalize tables following the same pre-processing step as prior studies. Although the current structural understanding system is learned with tables from other domains, it can potentially be extended to the financial field with 1) several training tables and 2) a few high-level domain-specific rules. Such a system is capable of accurately identifying relationships between individual blocks or cells. These relationships can be verbalized into sentences that serve as candidate facts in the financial QA pipeline. Investigating the best verbalization strategy can be an interesting research question. 90 The second direction is to extend the retriever-ranker-generator paradigm presented in Chapter 5 to accept distinct data modalities. As is shown in Unik-QA [93], combining information from multiple modalities of data generally improves QA performance. One natural next step is to follow a similar knowledge integration process as Unik-QA and apply the retriever-ranker-generator paradigm to improve QA performance further. Along this direction, this process can also be used for domain-specific QA. For example, besides the financial reports, information from KGs also encapsulates essential knowledge for answering general financial questions. By combining textual descriptions, financial tables, and KGs, it is possible to produce better answers for all types of financial questions. Many recent studies in natural language processing have demonstrated the efficacy of large language models such as GPT [14]. 
These approaches often focus on the paradigm of few-shot learning, where a minimal set of examples is chosen to serve as context during answer generation. The third future direction is to adapt the proposed QA systems to leverage the capabilities of such large language models. One potential strategy is to delve into Case-Based Reasoning (CBR) techniques to select in-context examples. There are two sub-directions: firstly, we can explore methods of applying CBR to develop an active learning system capable of setting diverse and most important examples for annotation. Secondly, we can utilize CBR to select the top-K most analogous questions and their associated answers to serve as a context for generating responses to new questions. 91 Bibliography [1] Agnar Aamodt and Enric Plaza. “Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches”. In: AI Communications 7 (Aug. 2001), pp. 39–59. doi: 10.3233/AIC-1994-7104. [2] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. 2021. arXiv: 2010.12688 [cs.CL]. [3] Mehdi Ali, Max Berrendorf, Charles Tapley Hoyt, Laurent Vermue, Sahand Sharifzadeh, Volker Tresp, and Jens Lehmann. “PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings”. In: Journal of Machine Learning Research 22.82 (2021), pp. 1–6. [4] Nazanin Alipourfard, Beatrix Arendt, Daniel M Benjamin, Noam Benkler, Michael M Bishop, Mark Burstein, Martin Bush, James Caverlee, Yiling Chen, Chae Clark, and et al. Systematizing Confidence in Open Research and Evidence (SCORE). May 2021. doi: 10.31235/osf.io/46mnb. [5] Adam Altmejd, Anna Dreber, Eskil Forsell, Juergen Huber, Taisuke Imai, Magnus Johannesson, Michael Kirchler, Gideon Nave, and Colin Camerer. “Predicting the replicability of social science lab experiments”. In: PLOS ONE 14.12 (Dec. 5, 2019). issn: 1932-6203. doi: 10.1371/journal.pone.0225826. [6] Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. “Hinge-Loss Markov Random Fields and Probabilistic Soft Logic”. In: Journal of Machine Learning Research 18.1 (2017), pp. 3846–3912. [7] Monya Baker. “IS THERE A REPRODUCIBILITY CRISIS?” In: Nature 533 (May 2016), pp. 452–454. [8] C. Glenn Begley and Lee M. Ellis. “Raise standards for preclinical cancer research | Nature”. In: Nature 83.7391 (2012), pp. 531–533. doi: doi.org/10.1038/483531a. [9] Iz Beltagy, Kyle Lo, and Arman Cohan. “SciBERT: Pretrained Language Model for Scientific Text”. In: EMNLP. 2019. [10] Taylor Berg-Kirkpatrick and Daniel Spokoyny. “An Empirical Investigation of Contextualized Number Prediction”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020, pp. 4754–4764. doi: 10.18653/v1/2020.emnlp-main.385. 92 [11] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. “Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge”. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. 2008, pp. 1247–1250. isbn: 9781605581026. doi: 10.1145/1376616.1376746. [12] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. “Translating Embeddings for Modeling Multi-relational Data”. In: Advances in Neural Information Processing Systems. Vol. 26. 2013. 
[13] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack Rae, Erich Elsen, and Laurent Sifre. “Improving Language Models by Retrieving from Trillions of Tokens”. In: Proceedings of the 39th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato. Vol. 162. Proceedings of Machine Learning Research. PMLR, July 2022, pp. 2206–2240. [14] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. “Language Models are Few-Shot Learners”. In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin. Vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901. url: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64aPaper.pdf. [15] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. “API design for machine learning software: experiences from the scikit-learn project”. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning. 2013, pp. 108–122. [16] Michael J Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. “Webtables: exploring the power of tables on the web”. In: VLDB Endowment 1.1 (2008), pp. 538–549. [17] Colin F. Camerer, Anna Dreber, Eskil Forsell, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Johan Almenberg, Adam Altmejd, Taizan Chan, Emma Heikensten, Felix Holzmeister, Taisuke Imai, Siri Isaksson, Gideon Nave, Thomas Pfeiffer, Michael Razen, and Hang Wu. “Evaluating replicability of laboratory experiments in economics”. In: Science 351.6280 (2016), pp. 1433–1436. doi: 10.1126/science.aaf0918. [18] Zongsheng Cao, Qianqian Xu, Zhiyong Yang, Xiaochun Cao, and Qingming Huang. “Dual Quaternion Knowledge Graph Embeddings”. In: AAAI Conference on Artificial Intelligence. 2021. 93 [19] Shuang Chen, Qian Liu, Zhiwei Yu, Chin-Yew Lin, Jian-Guang Lou, and Feng Jiang. “ReTraCk: A Flexible and Efficient Framework for Knowledge Base Question Answering”. In: Jan. 2021, pp. 325–336. doi: 10.18653/v1/2021.acl-demo.39. [20] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. “Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks”. In: arXiv preprint arXiv:2211.12588 (2022). [21] Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Yang Wang. “HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, Nov. 
2020, pp. 1026–1036. doi: 10.18653/v1/2020.findings-emnlp.91. [22] Xuelu Chen, Muhao Chen, Weijia Shi, Yizhou Sun, and Carlo Zaniolo. “Embedding Uncertain Knowledge Graph”. In: AAAI Conference on Artificial Intelligence. 2019. [23] Zhe Chen and Michael Cafarella. “Automatic Web Spreadsheet Data Extraction”. In: International Workshop on Semantic Search Over the Web. 2013. [24] Zhe Chen and Michael Cafarella. “Integrating Spreadsheet Data via Accurate and Low-Effort Extraction”. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014, pp. 1126–1135. isbn: 9781450329569. doi: 10.1145/2623330.2623617. [25] Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. “FinQA: A Dataset of Numerical Reasoning over Financial Data”. In: Proceedings of EMNLP 2021 (2021). [26] Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. “Binding Language Models in Symbolic Languages”. In: ICLR abs/2210.02875 (2023). [27] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. “Bayesian CART Model Search”. In: Journal of the American Statistical Association 93.443 (1998), pp. 935–948. [28] CIUS. CIUS. https://ucr.fbi.gov/crime-in-the-u.s. 2019. [29] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. “ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators”. In: ICLR. 2020. [30] David Camilo Corrales, Agapito Ledezma, and Juan Carlos Corrales. “A case-based reasoning system for recommendation of data cleaning algorithms in classification and regression tasks”. In: Applied Soft Computing 90 (2020), p. 106180. issn: 1568-4946. doi: https://doi.org/10.1016/j.asoc.2020.106180. [31] Eric Crestan and Patrick Pantel. “A Fine-Grained Taxonomy of Tables on the Web”. In: ACM International Conference on Information and Knowledge Management. 2010, pp. 1405–1408. 94 [32] Rajarshi Das, Ameya Godbole, Ankita Naik, Elliot Tower, Robin Jia, Manzil Zaheer, Hannaneh Hajishirzi, and Andrew McCallum. “Knowledge Base Question Answering by Case-based Reasoning over Subgraphs”. In: ICML. 2022. [33] Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay Yoon Lee, Lizhen Tan, Lazaros Polymenakos, and Andrew McCallum. “Case-based Reasoning for Natural Language Queries over Knowledge Bases”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Nov. 2021, pp. 9594–9611. doi: 10.18653/v1/2021.emnlp-main.755. [34] William de Vazelhes, CJ Carey, Yuan Tang, Nathalie Vauquier, and Aurélien Bellet. “metric-learn: Metric Learning Algorithms in Python”. In: Journal of Machine Learning Research 21.138 (2020), pp. 1–6. [35] DeEx. DeExcelarator. https://wwwdb.inf.tu-dresden.de/research-projects/deexcelarator/. 2013. [36] Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. “TURL: Table Understanding through Representation Learning”. In: Proceedings of the VLDB Endowment (PVLDB) 14.3 (2020), pp. 307–319. [37] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. 
In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). Ed. by Jill Burstein, Christy Doran, and Thamar Solorio. Association for Computational Linguistics, 2019, pp. 4171–4186. doi: 10.18653/v1/n19-1423. [38] Haoyu Dong, S. Liu, S. Han, Z. Fu, and D. Zhang. “TableSense: Spreadsheet Table Detection with Convolutional Neural Networks”. In: AAAI Conference on Artificial Intelligence. 2019. [39] Wensheng Dou, Shi Han, Liang Xu, Dongmei Zhang, and Jun Wei. “Expandable Group Identification in Spreadsheets”. In: ACM/IEEE International Conference on Automated Software Engineering. 2018, pp. 498–508. [40] J. Eberius, K. Braunschweig, M. Hentsch, M. Thiele, A. Ahmadov, and W. Lehner. “Building the Dresden Web Table Corpus: A Classification Approach”. In: International Symposium on Big Data Computing. 2015, pp. 41–50. [41] Julian Eberius, Christoper Werner, Maik Thiele, Katrin Braunschweig, Lars Dannecker, and Wolfgang Lehner. “DeExcelerator: a framework for extracting relational data from partially structured documents”. In: ACM international conference on Information & Knowledge Management. 2013, pp. 2477–2480. [42] Timothy M Errington, Courtney K Soderberg Maya Mathur, Alexandria Denis, Nicole Perfito, Elizabeth Iorns, and Brian A Nosek. “Investigating the replicability of preclinical cancer biology”. In: eLife (2021). doi: 10.7554/eLife.71601. [43] Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. “The Pascal Visual Object Classes (VOC) Challenge”. In: Int. J. Comput. Vision 88.2 (June 2010), pp. 303–338. issn: 0920-5691. doi: 10.1007/s11263-009-0275-4. 95 [44] Siddhant Garg, Thuy Vu, and Alessandro Moschitti. TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection. 2019. arXiv: 1911.04118 [cs.CL]. [45] Basilios Gatos, Dimitrios Danatsas, Ioannis Pratikakis, and Stavros J. Perantonis. “Automatic Table Detection in Document Images”. In: Proceedings of the Third International Conference on Advances in Pattern Recognition - Volume Part I. ICAPR’05. 2005, pp. 609–618. isbn: 3540287574. doi: 10.1007/11551188_67. [46] Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Naik, Pengshan Cai, and Alfio Gliozzo. “Re2G: Retrieve, Rerank, Generate”. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. July 2022, pp. 2701–2715. doi: 10.18653/v1/2022.naacl-main.194. [47] Majid Ghasemi Gol, Jay Pujara, and Pedro Szekely. “Tabular Cell Classification Using Pre-Trained Cell Embeddings”. In: 2019 IEEE International Conference on Data Mining (ICDM). IEEE. 2019, pp. 230–239. [48] Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su. “Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases”. In: Proceedings of the Web Conference 2021. WWW ’21. Ljubljana, Slovenia: Association for Computing Machinery, 2021, pp. 3477–3488. isbn: 9781450383127. doi: 10.1145/3442381.3449992. [49] Yu Gu and Yu Su. “ArcaneQA: Dynamic Program Induction and Contextualized Encoding for Knowledge Base Question Answering”. In: Proceedings of the 29th International Conference on Computational Linguistics. Oct. 2022, pp. 1718–1731. [50] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 
“Retrieval Augmented Language Model Pre-Training”. In: Proceedings of the 37th International Conference on Machine Learning. Ed. by Hal Daumé III and Aarti Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR, July 2020, pp. 3929–3938. url: https://proceedings.mlr.press/v119/guu20a.html. [51] Shizhu He, Kang Liu, Guoliang Ji, and Jun Zhao. “Learning to Represent Knowledge Graphs with Gaussian Embedding”. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. CIKM ’15. Melbourne, Australia: Association for Computing Machinery, 2015, pp. 623–632. isbn: 9781450337946. doi: 10.1145/2806416.2806502. [52] Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos. “TaPas: Weakly Supervised Table Parsing via Pre-training”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. July 2020, pp. 4320–4333. doi: 10.18653/v1/2020.acl-main.398. [53] Y. Hirayama. “A method for table structure analysis using DP matching”. In: Proceedings of 3rd International Conference on Document Analysis and Recognition. Vol. 2. 1995, 583–586 vol.2. doi: 10.1109/ICDAR.1995.601964. [54] Jianying Hu, Ramanujan S. Kashi, Daniel P. Lopresti, and Gordon Wilfong. “Medium-independent table detection”. In: Document Recognition and Retrieval VII. Vol. 3967. International Society for Optics and Photonics. SPIE, 1999, pp. 291–302. doi: 10.1117/12.373506. 96 [55] John P. A. Ioannidis. “Why Most Published Research Findings Are False”. In: PLOS Medicine 2 (Aug. 2005), null. doi: 10.1371/journal.pmed.0020124. [56] Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. “Search-based Neural Structured Learning for Sequential Question Answering”. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, July 2017, pp. 1821–1831. doi: 10.18653/v1/P17-1167. [57] Gautier Izacard and Edouard Grave. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. 2020. url: https://arxiv.org/abs/2007.0128. [58] Nic Jedema, Thuy Vu, Manish Gupta, and Alessandro Moschitti. DP-KB: Data Programming with Knowledge Bases Improves Transformer Fine Tuning for Answer Sentence Selection. 2022. arXiv: 2203.09598 [cs.CL]. [59] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. “Knowledge Graph Embedding via Dynamic Mapping Matrix”. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China: Association for Computational Linguistics, July 2015, pp. 687–696. doi: 10.3115/v1/P15-1067. [60] Kelvin Jiang, Dekun Wu, and Hui Jiang. “FreebaseQA: A New Factoid QA Data Set Matching Trivia-Style Question-Answer Pairs with Freebase”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). June 2019, pp. 318–323. doi: 10.18653/v1/N19-1028. [61] Zhengbao Jiang, Yi Mao, Pengcheng He, Graham Neubig, and Weizhu Chen. “OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering”. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 
Seattle, United States: Association for Computational Linguistics, July 2022, pp. 932–942. doi: 10.18653/v1/2022.naacl-main.68. [62] Jeff Johnson, Matthijs Douze, and Hervé Jégou. “Billion-scale similarity search with GPUs”. In: IEEE Transactions on Big Data 7.3 (2019), pp. 535–547. [63] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. “Dense Passage Retrieval for Open-Domain Question Answering”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Nov. 2020, pp. 6769–6781. doi: 10.18653/v1/2020.emnlp-main.550. [64] Yannis Katsis, Saneem Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, Mustafa Canim, Michael Glass, Alfio Gliozzo, Feifei Pan, Jaydeep Sen, Karthik Sankaranarayanan, and Soumen Chakrabarti. “AIT-QA: Question Answering Dataset over Complex Tables in the Airline Industry”. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track. Hybrid: Seattle, Washington + Online: Association for Computational Linguistics, July 2022, pp. 305–314. doi: 10.18653/v1/2022.naacl-industry.34. 97 [65] Seyed Mehran Kazemi and David Poole. “SimplE Embedding for Link Prediction in Knowledge Graphs”. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18. Montréal, Canada: Curran Associates Inc., 2018, pp. 4289–4300. [66] M. G. Kendall. “A New Measure of Rank Correlation”. In: Biometrika 30.1/2 (1938), pp. 81–93. issn: 00063444. [67] Elvis Koci, Maik Thiele, Oscar Romero, and Wolfgang Lehner. “A Machine Learning Approach for Layout Inference in Spreadsheets”. In: International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management. 2016, pp. 77–88. [68] Elvis Koci, Maik Thiele, Oscar Romero, and Wolfgang Lehner. “Cell Classification for Layout Recognition in Spreadsheets”. In: Knowledge Discovery, Knowledge Engineering and Knowledge Management. 2019, pp. 78–100. [69] Pigi Kouki, Shobeir Fakhraei, James Foulds, Magdalini Eirinaki, and Lise Getoor. “HyPER: A Flexible and Extensible Probabilistic Framework for Hybrid Recommender Systems”. In: ACM Conference on Recommender Systems. 2015, pp. 99–106. [70] Pigi Kouki, Jay Pujara, Christopher Marcum, Laura Koehly, and Lise Getoor. “Collective Entity Resolution in Familial Networks”. In: IEEE International Conference on Data Mining. 2017. [71] Agustinus Kristiadi, Mohammad Asif Khan, Denis Lukovnikov, Jens Lehmann, and Asja Fischer. “Incorporating Literals into Knowledge Graph Embeddings”. In: The Semantic Web, pp. 347–363. isbn: 978-3-030-30792-9. doi: 10.1007/978-3-030-30793-6_20. [72] Imre Lakatos. “Criticism and the Growth of Knowledge (Proceedings of the International Colloquium in the Philosophy of Science, London 1965, Volume 4)”. 1970. [73] Jean-Baptiste Lamy, Boomadevi Sekar, Gilles Guezennec, Jacques Bouaud, and Brigitte Séroussi. “Explainable artificial intelligence for breast cancer: A visual case-based reasoning approach”. In: Artificial Intelligence in Medicine 94 (2019), pp. 42–53. issn: 0933-3657. doi: https://doi.org/10.1016/j.artmed.2019.01.001. [74] Yunshi Lan and Jing Jiang. “Query Graph Generation for Answering Multi-hop Complex Questions from Knowledge Bases”. In: Jan. 2020, pp. 969–974. doi: 10.18653/v1/2020.acl-main.91. [75] Ivano Lauriola and Alessandro Moschitti. 
“Answer sentence selection using local and global context in transformer models”. In: ECIR. 2021. [76] David B. Leake. Case-Based Reasoning: Experiences, Lessons and Future Directions. 1996. isbn: 026262110X. [77] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, and Christian Bizer. “DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia”. In: Semantic Web Journal 6 (Jan. 2014). doi: 10.3233/SW-140134. 98 [78] Fangyu Lei, Shizhu He, Xiang Li, Jun Zhao, and Kang Liu. Answering Numerical Reasoning Questions in Table-Text Hybrid Contents with Graph-based Encoder and Tree-based Decoder. 2022. arXiv: 2209.07692. [79] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, July 2020, pp. 7871–7880. doi: 10.18653/v1/2020.acl-main.703. [80] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. “Retrieval-augmented generation for knowledge-intensive nlp tasks”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 9459–9474. [81] Luca Di Liello, Siddhant Garg, and Alessandro Moschitti. Context-Aware Transformer Pre-Training for Answer Sentence Selection. 2023. arXiv: 2305.15358 [cs.CL]. [82] Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. “Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations”. In: Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021). 2021, pp. 2356–2362. [83] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. “Learning Entity and Relation Embeddings for Knowledge Graph Completion”. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015, pp. 2181–2187. isbn: 0262511290. [84] Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, and Jian-Guang Lou. “TAPEX: Table Pre-training via Learning a Neural SQL Executor”. In: International Conference on Learning Representations. 2022. [85] Farzaneh Mahdisoltani, Joanna Asia Biega, and Fabian M. Suchanek. “YAGO3: A Knowledge Base from Multilingual Wikipedias”. In: CIDR. 2015. [86] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. doi: 10.1017/CBO9780511809071. [87] Mary McHugh. “Interrater reliability: The kappa statistic”. In: Biochemia medica : časopis Hrvatskoga društva medicinskih biokemičara 22 (Oct. 2012), pp. 276–82. [88] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space”. In: arXiv:1301.3781 [cs] (Sept. 6, 2013). arXiv: 1301.3781. [89] Andreas C. Müller and Sven Behnke. “PyStruct: Learning Structured Prediction in Python”. In: J. Mach. Learn. Res. 15.1 (Jan. 2014), pp. 2055–2060. [90] Daniel Müllner. Modern hierarchical, agglomerative clustering algorithms. 2011. arXiv: 1109.2378. 
99 [91] Kyosuke Nishida, Kugatsu Sadamitsu, Ryuichiro Higashinaka, and Yoshihiro Matsuo. “Understanding the Semantic Structures of Tables with a Hybrid Deep Neural Network Architecture”. In: AAAI Conference on Artificial Intelligence. 2017, pp. 168–174. [92] Brian A. Nosek and Timothy M. Errington. “What is replication?” In: PLOS Biology 18 (Mar. 2020), pp. 1–8. doi: 10.1371/journal.pbio.3000691. [93] Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, and Scott Yih. “UniK-QA: Unified Representations of Structured and Unstructured Knowledge for Open-Domain Question Answering”. In: Findings of the Association for Computational Linguistics: NAACL 2022. July 2022, pp. 1535–1546. doi: 10.18653/v1/2022.findings-naacl.115. [94] Open Science Collaboration. “Estimating the reproducibility of psychological science | Science”. In: Science 349.6251 (2015). doi: 10.1126/science.aac4716. (Visited on 01/20/2021). [95] Panupong Pasupat and Percy Liang. “Compositional Semantic Parsing on Semi-Structured Tables”. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2015, pp. 1470–1480. [96] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. In: Advances in Neural Information Processing Systems 32. 2019, pp. 8024–8035. [97] Florian Prinz, Thomas Schlange, and Khusru Asadullah. “Believe it or not: how much can we rely on published data on potential drug targets?” In: Nature Reviews Drug Discovery 10.9 (Sept. 2011), pp. 712–712. issn: 1474-1784. doi: 10.1038/nrd3439-c1. [98] Jay Pujara, Arunkumar Rajendran, Majid Ghasemi-Gol, and Pedro Szekely. “A Common Framework for Developing Table Understanding Models”. In: International Semantic Web Conference. 2019. [99] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. In: Journal of Machine Learning Research 21.140 (2020), pp. 1–67. [100] Juan Enrique Ramos. “Using TF-IDF to Determine Word Relevance in Document Queries”. In: 2003. [101] Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Nov. 2019. url: https://arxiv.org/abs/1908.10084. 100 [102] Stephen Robertson and Hugo Zaragoza. “The Probabilistic Relevance Framework: BM25 and Beyond”. In: Found. Trends Inf. Retr. 3.4 (Apr. 2009), pp. 333–389. issn: 1554-0669. doi: 10.1561/1500000019. [103] Nataliia Rümmele, Yuriy Tyshetskiy, and Alex Collins. “Evaluating approaches for supervised semantic labeling”. In: CoRR abs/1801.09788 (2018). [104] SAUS. SAUS. http://dbgroup.eecs.umich.edu/project/sheets/datasets.htm. 2014. [105] Apoorv Saxena, Adrian Kochsiek, and Rainer Gemulla. “Sequence-to-Sequence Knowledge Graph Completion and Question Answering”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 
2022. [106] Apoorv Saxena, Aditay Tripathi, and Partha Talukdar. “Improving multi-hop question answering over knowledge graphs using knowledge base embeddings”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, pp. 4498–4507. [107] Roger C. Schank. Dynamic Memory: A Theory of Reminding and Learning in Computers and People. USA: Cambridge University Press, 1983. isbn: 0521248582. [108] Michael Schlichtkrull, Thomas Kipf, Peter Bloem, Rianne Berg, Ivan Titov, and Max Welling. “Modeling Relational Data with Graph Convolutional Networks”. In: June 2018, pp. 593–607. isbn: 978-3-319-93416-7. doi: 10.1007/978-3-319-93417-4_38. [109] Faisal Shafait and Ray Smith. “Table Detection in Heterogeneous Documents”. In: Document Analysis Systems 2010. 2010. url: http://doi.acm.org/10.1145/1815330.1815339. [110] Alexey O. Shigarov, Viacheslav V. Paramonov, Polina V. Belykh, and Alexander I. Bondarev. “Rule-Based Canonicalization of Arbitrary Tables in Spreadsheets”. In: Information and Software Technologies. 2016, pp. 78–91. [111] Hassan Soliman, Heike Adel, Mohamed H. Gad-Elrab, Dragan Milchevski, and Jannik Strötgen. “A Study on Entity Linking Across Domains: Which Data is Best for Fine-Tuning?” In: Proceedings of the 7th Workshop on Representation Learning for NLP. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 184–190. doi: 10.18653/v1/2022.repl4nlp-1.19. [112] Huili Su, Yukun Li, Xiaoye Wang, Gang Hao, Yongxuan Lai, and Weiwei Wang. “Transforming a Nonstandard Table into Formalized Tables”. In: Web Information Systems and Applications Conference. Nov. 2017, pp. 311–316. [113] Haitian Sun, Tania Bedrax-Weiss, and William W. Cohen. “PullNet: Open Domain Question Answering with Iterative Retrieval on Knowledge Bases and Text”. In: ArXiv abs/1904.09537 (2019). [114] Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William W Cohen. “Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text”. In: EMNLP (2018). 101 [115] Kexuan Sun, Nicolaas Jedema, Karishma Sharma, Ruben Janssen, Jay Pujara, Pedro Szekely, and Alessandro Moschitti. “Efficient and Accurate Contextual Re-Ranking for Knowledge Graph Question Answering”. In: 2024. [116] Kexuan Sun and Jay Pujara. “Low-Resource Financial QA with Case-based Reasoning”. In: The RobustFin Workshop at 29TH ACM SIGKDD conference on Knowledge Discovery and Data Mining. 2023. [117] Kexuan Sun, Zhiqiang Qiu, Abel Salinas, Yuzhong Huang, Dong-Ho Lee, Daniel Benjamin, Fred Morstatter, Xiang Ren, Kristina Lerman, and Jay Pujara. “Assessing Scientific Research Papers with Knowledge Graphs”. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2022, pp. 2467–2472. [118] Kexuan Sun, Harsha Rayudu, and Jay Pujara. “A Hybrid Probabilistic Approach for Table Understanding”. In: Thirty-Fifth AAAI Conference on Artificial Intelligence. 2021. [119] Kexuan Sun, Fei Wang, Muhao Chen, and Jay Pujara. “Tabular Functional Block Detection with Embedding-based Agglomerative Cell Clustering”. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021, pp. 1744–1753. [120] Yawei Sun, Lingling Zhang, Gong Cheng, and Yuzhong Qu. “SPARQA: Skeleton-Based Semantic Parsing for Complex Questions over Knowledge Bases”. In: Proceedings of the AAAI Conference on Artificial Intelligence 34 (Apr. 2020), pp. 8952–8959. doi: 10.1609/aaai.v34i05.6426. 
[121] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. “RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space”. In: International Conference on Learning Representations. 2019. url: https://openreview.net/forum?id=HkgEQnRqYQ. [122] Alon Talmor and Jonathan Berant. “The Web as a Knowledge-Base for Answering Complex Questions”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, June 2018, pp. 641–651. doi: 10.18653/v1/N18-1059. [123] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Eric Gaussier, and Guillaume Bouchard. “Complex Embeddings for Simple Link Prediction”. In: Proceedings of The 33rd International Conference on Machine Learning. Vol. 48. Proceedings of Machine Learning Research. 2016, pp. 2071–2080. [124] Pat Verga, Haitian Sun, Livio Baldini Soares, and William Cohen. “Adaptable and Interpretable Neural MemoryOver Symbolic Knowledge”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online, June 2021, pp. 3678–3691. doi: 10.18653/v1/2021.naacl-main.288. [125] Denny Vrandečić and Markus Krötzsch. “Wikidata: A Free Collaborative Knowledgebase”. In: Commun. ACM 57.10 (Sept. 2014), pp. 78–85. issn: 0001-0782. doi: 10.1145/2629489. 102 [126] Fei Wang, Kexuan Sun, Muhao Chen, Jay Pujara, and Pedro A. Szekely. “Retrieving Complex Tables with Multi-Granular Graph Representation Learning”. In: ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2021. [127] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. “Knowledge Graph Embedding by Translating on Hyperplanes”. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. AAAI’14. Québec City, Québec, Canada: AAAI Press, 2014, pp. 1112–1119. [128] Zhiguo Wang, Patrick Ng, Ramesh Nallapati, and Bing Xiang. “Retrieval, Re-ranking and Multi-task Learning for Knowledge-Base Question Answering”. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online: Association for Computational Linguistics, Apr. 2021, pp. 347–357. doi: 10.18653/v1/2021.eacl-main.26. [129] Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. “Representation Learning of Knowledge Graphs with Entity Descriptions”. In: Proceedings of the AAAI Conference on Artificial Intelligence 30.1 (2016). [130] Jiacheng Xu, Xipeng Qiu, Kan Chen, and Xuanjing Huang. “Knowledge Graph Representation with Jointly Structural and Textual Encoding”. In: IJCAI. 2017. [131] Rui Xu and D. Wunsch. “Survey of clustering algorithms”. In: IEEE Transactions on Neural Networks 16.3 (2005), pp. 645–678. doi: 10.1109/TNN.2005.845141. [132] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. “Embedding Entities and Relations for Learning and Inference in Knowledge Bases”. In: 3rd International Conference on Learning Representations, ICLR 2015. 2015. [133] Yang Yang, Wu Youyou, and Brian Uzzi. “Estimating the deep replicability of scientific findings using human and artificial intelligence”. In: Proceedings of the National Academy of Sciences 117.20 (May 19, 2020), pp. 10762–10768. issn: 0027-8424, 1091-6490. doi: 10.1073/pnas.1909046117. [134] Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou, and Caiming Xiong. 
“RNG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 6032–6043. doi: 10.18653/v1/2022.acl-long.417. [135] Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. “The Value of Semantic Parse Labeling for Knowledge Base Question Answering”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Aug. 2016, pp. 201–206. doi: 10.18653/v1/P16-2033. [136] Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. “TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. July 2020, pp. 8413–8426. 103 [137] Donghan Yu, Sheng Zhang, Patrick Ng, Henghui Zhu, Alexander Hanbo Li, Jun Wang, Yiqun Hu, William Yang Wang, Zhiguo Wang, and Bing Xiang. “DecAF: Joint Decoding of Answers and Logical Forms for Question Answering over Knowledge Bases”. In: The Eleventh International Conference on Learning Representations. 2023. [138] Yanhong Zhai and Bing Liu. “Web Data Extraction Based on Partial Tree Alignment”. In: Proceedings of the 14th International Conference on World Wide Web. Association for Computing Machinery, 2005, pp. 76–85. isbn: 1595930469. doi: 10.1145/1060745.1060761. [139] Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. Quaternion Knowledge Graph Embeddings. 2019. [140] Yilun Zhao, Yunxiang Li, Chenying Li, and Rui Zhang. “MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). May 2022, pp. 6588–6600. url: https://aclanthology.org/2022.acl-long.454. [141] Victor Zhong, Caiming Xiong, and Richard Socher. “Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning”. In: CoRR abs/1709.00103 (2017). [142] Mengyu Zhou, Wang Tao, Pengxin Ji, Han Shi, and Dongmei Zhang. “Table2Analysis: Modeling and Recommendation of Common Analysis Patterns for Multi-Dimensional Data”. In: AAAI Conference on Artificial Intelligence. 2020. [143] Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. “TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021, pp. 3277–3287. doi: 10.18653/v1/2021.acl-long.254. [144] K. Zuyev. “Table image segmentation”. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. Vol. 2. 1997, 705–708 vol.2. doi: 10.1109/ICDAR.1997.620599. 104