GREEN KNOWLEDGE GRAPH COMPLETION
AND SCALABLE GENERATIVE CONTENT DELIVERY
by
Yun-Cheng Wang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2023
Copyright 2023 Yun-Cheng Wang
Acknowledgements
First and foremost, I am forever thankful to my doctoral advisor, Prof. C.-C. Jay Kuo, for providing me
the opportunity to collaborate with him and for his guidance and support throughout my PhD journey.
We started the project on knowledge graphs in the summer of 2020, when I knew very little about it. We
started by studying a survey paper together, and we finally had our first major breakthrough in the summer of 2022. The process could be extremely lonely, but I am fortunate to have Prof. Kuo’s full support and
companionship. I also learned a lot from my interactions with Prof. Kuo, including his writing and presentation
skills, his vision for future research, and his persistence and efficiency in conducting research. I would
also like to extend my gratitude to Prof. Antonio Ortega and Prof. Robin Jia for serving on my defense
committee and to Prof. Aiichiro Nakano and Prof. Keith Chugg for serving on my qualifying exam committee. They provided valuable feedback on my research and pushed me to think about the problems
more comprehensively. Their feedback was essential to completing this thesis. I would also like to
thank the talented collaborators in my PhD journey. I enjoyed the countless nights spent with Xiou Ge and
Bin Wang discussing knowledge graph research. Many of the ideas in this thesis were developed in discussions
with Xiou and Bin. We also encouraged and motivated each other to conduct impactful research.
Chengwei Wei, Jintang Xue, and I worked on scalable generative content delivery. It is a challenging topic
since it requires background in communications, distributed computing, and natural language processing.
I am thankful that they were willing to work with me. Zhanxuan Mei and I worked on blind perceptual quality assessment. Discussions with him exposed me to computer vision and stimulated
interdisciplinary thinking in my thesis. I also appreciate Hong-Shou (Max) Chen for constantly discussing
research and sharing life updates with me.
My family has played an important role in shaping my personality and has been an inspiration for me to pursue a PhD.
My parents’ unconditional love and support are the fuel for my PhD journey. They are also a good audience
for my research ideas. I still remember the time when I called them and tried to explain my research to
them. They share every sweet and bitter moment in my life and always remind me to stay humble, patient, and
positive. They are the strongest support and motivation in my life, and I can never truly express my
gratitude for their contributions. I am also deeply appreciative of the environment that surrounds me.
When I first arrived in Los Angeles in 2018, I was warmly embraced by its culture and weather. LA is a
place where anyone with a dream can flourish. I am thankful to USC for its good education and research
environment. I am thankful to all my friends for supporting me. I am thankful to all MCL members for
always being supportive and helpful. This thesis is a collective effort. I am profoundly thankful to everyone who has played a part in my life.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Knowledge Graphs and Their Applications . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Incompleteness of Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Challenges and Needs for Scalable Generative Content Delivery . . . . . . . . . . . 6
1.2 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 A Classification Framework for Knowledge Graph Completion with Hard
Negatives Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 A Lightweight Knowledge Graph Completion Method in Low Dimensions . . . . . 8
1.2.3 Improving Knowledge Graph Embeddings with Entity Types and Auxiliary Relations 9
1.2.4 Scalable Generative Content Delivery on Demand . . . . . . . . . . . . . . . . . . . 10
1.3 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2: Research Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Knowledge Graph Embedding (KGE) Methods . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Distance-based KGE Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Semantic Matching KGE Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Deep Neural Network KGE Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.4 KGE Methods in Low Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Knowledge Graph Entity Typing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Embedding-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 Deep Neural Network Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Negative Sampling in Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Entity Type Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Scalable Generative Content Delivery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Generative AI (GenAI) Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Scalable Computing Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Chapter 3: A Classification Framework for Knowledge Graph Completion with Hard Negatives
Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Constructing Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.2 Encoding Entities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Training Relational Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 4: A Lightweight Knowledge Graph Completion Method in Low Dimensions . . . . . . . . 44
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.1 Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.2 Feature Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.3 Decision Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.4 Time Analysis on Feature Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.5 Prediction Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.6 Comparison with NN-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.7 Performance as Training Progresses . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.8 Triple Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Chapter 5: Improving Knowledge Graph Embeddings with Entity Types and Auxiliary Relations . 68
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.2 Auxiliary Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.3 Asynchronous Embedding Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3.3 Visualization of the Embedding Space . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.5 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.6 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Chapter 6: Scalable Generative Content Delivery on Demand . . . . . . . . . . . . . . . . . . . . . 87
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Bottlenecks for Scalable GenAI Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.1 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.2 Network Bandwidth & Concurrent Connections . . . . . . . . . . . . . . . . . . . . 90
6.2.3 Computation & Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 Technical Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3.1 Increased Output Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3.2 Growth in Model Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3.3 Power Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3.4 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3.5 Infrastructure Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.4 Application Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.4.1 Metaverse System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.4.2 Artificial Intelligence of Things . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.5 System Design Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.5.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.5.1.1 Computation offloading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.5.1.2 Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.5.1.3 Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.5.1.4 Incremental Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.5.2 Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.5.2.1 Lightweight Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.5.2.2 Minimizing latency through edge-cloud collaboration . . . . . . . . . . . 106
6.5.2.3 Multi-modality Content Generation and Interface . . . . . . . . . . . . . 108
6.6 Research Outlooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.6.1 Generic versus Domain-specific GenAI Models . . . . . . . . . . . . . . . . . . . . 109
6.6.2 Decomposition of Large Language Models . . . . . . . . . . . . . . . . . . . . . . . 109
6.6.3 Quality Assurance for AIGC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.6.4 Green GenAI Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.6.5 Attacks and Defense . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.6.6 Hierarchical Knowledge System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.6.7 Collaboration among Different Agencies . . . . . . . . . . . . . . . . . . . . . . . . 112
6.6.8 Bias and Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Chapter 7: Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.2.1 Domain-Specific Knowledge Graph Construction and Applications . . . . . . . . . 117
7.2.2 Retrieval-Augmented Generation (RAG) with Knowledge Graphs . . . . . . . . . . 118
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
List of Tables
1.1 Publicly accessible KGs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Comparison between our work and other related papers. . . . . . . . . . . . . . . . . . . . 11
2.1 Summary of distance-based KGE methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Summary of semantic matching KGE methods. . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Comparison of hardware and performance specifications of three computational resources,
namely cloud servers, edge servers, and user devices. . . . . . . . . . . . . . . . . . . . . . 25
3.1 Statistics of link prediction datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Link prediction results on FB15K and WN18. . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Link prediction results on FB15k-237 and WN18RR. . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Model performance for FB15k-237 under different negative sampling settings. . . . . . . . 41
3.5 Ablation study evaluated in MRR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1 Popular KGE methods and their scoring functions, where h, r, and t denote embeddings
for a given triple (h, r, t), d is the embedding dimension. ◦ denotes the Hadamard product,
and ⟨·, ·, ·⟩ is the generalized dot product. ne is the number of entity variables in one
dimension, nr is the number of relation variables in one dimension, and nv is the number
of triple variables in one dimension. nv = 2ne + nr. . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Relation grouping results on WN18RR when applying k-Means on relation embeddings
when k = 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Dataset statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Results of link prediction in low dimensions (d = 32), where the best and the second best
numbers are in bold and with an underbar, respectively. . . . . . . . . . . . . . . . . . . . . 55
4.5 Results on the link prediction task, where we show the performance gain (or loss) in terms
of percentages with an up (or down) arrow and the ratio of the model size within the
parentheses against those of respective 500-dimensional models. . . . . . . . . . . . . . . . 57
4.6 Link prediction performance on the ogbl-wikikg2 dataset. . . . . . . . . . . . . . . . . . 57
4.7 Performance on different relation categories in FB15k-237 under 32 dimensions. . . . . . . 58
4.8 Performance for RotatE + GreenKGC in 32 dimensions with different feature pruning
schemes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.9 Ablation study on different negative sampling methods for classifier training in 32
dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.10 Comparison of required training time (Hour : Minute : Second) to reduce the feature
dimensions from 512 to 100 for TransE between DualDE, a knowledge-distillation method,
and GreenKGC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.11 Comparison on performance, number of model parameters, and total inference time (batch
size = 8) with other classification-based methods in 128 dimensions. We adopt TransE as
the baseline for fair comparison in the number of model parameters. The best numbers
are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.12 Statistics for triple classification datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.13 Triple classification results. GreenKGC adopts TransE as the baseline. . . . . . . . . . . . 65
5.1 Examples of auxiliary relations and the corresponding entity types using the proposed
efficient assignment, where anchor types are marked in boldface. . . . . . . . . . . . . . . 74
5.2 Dataset statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Results on KGET datasets, where the best performance in each column is shown in
boldface, and the second-best performance is underlined. . . . . . . . . . . . . . . . . . . . 78
5.4 Ablation study on asynchronous representation learning and different auxiliary relations.
The MRR performance is reported. The best performance in each column is shown in
boldface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5 Inference time and memory complexity of KGET methods. . . . . . . . . . . . . . . . . . . 84
5.6 Top 3 predicted entity types by AsyncET for entities in YAGO43kET. Groundtruth is
marked in boldface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.1 Comparison of power consumption, carbon emission, and cloud computational cost in the
training of large GenAI models in different modalities. . . . . . . . . . . . . . . . . . . . . 93
List of Figures
1.1 An example KG∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 A screenshot of a Google search result that is empowered by KGs. . . . . . . . . . . . . . . 3
1.3 The causes of the incompleteness problem during KG construction. . . . . . . . . . . . . . 4
1.4 Growth in entity size of KGs in recent years† . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Architectures of three popular GenAI model categories: VAE, GAN, and Transformers. . . 21
2.2 Three basic computing paradigms in support of large-scale computing systems. . . . . . . 23
2.3 Roles and suitable applications for edge nodes and cloud nodes in edge-cloud computing. . 25
2.4 Implementation of GenAI systems with the edge-cloud computing paradigm. . . . . . . . . 27
3.1 An illustration of modeling link prediction as a binary classification problem. . . . . . . . 29
3.2 An overview of the KGBoost pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Illustrations of sub-relation scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 The performance curves as a function of the embedding dimension for (a) WN18 and (b)
FB15k-237 in MRR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 MRR versus the number of free parameters in KGE methods on FB15k-237 (left) and
YAGO3-10 dataset (right). When a model has fewer parameters, its performance is poorer.
Also, the larger dataset, YAGO3-10, demands more parameters than the smaller dataset,
FB15k-237, to achieve satisfactory results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 An overview of GreenKGC, which consists of three modules: (a) representation learning,
(b) feature pruning, and (c) decision learning. . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Histograms of PCA-transformed 1D triple variables in two feature dimensions with (a)
low and (b) high cross-entropy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 t-SNE visualization of the KG partitioning result in FB15k-237. . . . . . . . . . . . . . . . . 51
4.5 Average cross-entropy for different numbers of KG partitions in FB15k-237. . . . . . . . . 52
4.6 Embedding dimension d to MRR curves in log-scale for various methods on FB15k-237. d
= 8, 16, 32, 64, 128, 256. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.7 Sorted discriminability for each feature dimension in different feature pruning schemes.
For cross-entropy and 1/variance, a lower value indicates a more discriminant feature. For
feature importance, a higher value indicates a more discriminant feature. . . . . . . . . . . 60
4.8 Ablation study on the number of relation groups k versus MRR. . . . . . . . . . . . . . 60
4.9 Prediction distribution of a query (38th Grammy Awards, award_winner, ?) in FB15k-237.
A higher predicted score implies a higher chance of being a valid triple. . . . . . . . . . . 63
4.10 Training/evaluation AUC-PR and testing MRR versus the number of training iterations. . 64
4.11 Scatter plot of predictions from GreenKGC (the y-axis) versus KGE (the x-axis). . . . . . . 65
5.1 An example KG with missing entity types. . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Illustration of using multiple auxiliary relations to model the relationship between entities
and entity types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 A diagram of the training process in AsyncET. . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 The MRR performance as a function of the number of alternating rounds between two
stages in asynchronous representation learning. . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 Visualization of the entity embeddings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Training loss curves with respect to different numbers of alternating rounds. . . . . . . . . 83
6.1 The significant amount of data generated in the AIGC era poses an unprecedented
challenge in computer networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 The development of generative LLMs and their model sizes as a function of time. The
vertical axis is in log scale. Models in the figure include GPT-2 [149], T5 [150], Turing-NLG
[169], GPT-3 [21], LaMDA [180], MT-NLG [169], and PaLM [34]. . . . . . . . . . . . . . . . 92
6.3 Illustration of latency in different computation frameworks. . . . . . . . . . . . . . . . . . 94
6.4 Illustration of exemplary service of the Metaverse system with GenAI under edge-cloud
computing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.5 Illustration of exemplary service of the AIoT system with GenAI in the edge-cloud
computing environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.6 The roadmap of designing GenAI services at scale. Computation offloading, latency,
privacy, and data offloading are the major considerations. . . . . . . . . . . . . . . . . . . . 100
6.7 Three parallelism strategies for computation and data offloading in DNN model training [116]. 101
6.8 Personalization of GenAI services. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.9 Privacy preservation through federated learning. . . . . . . . . . . . . . . . . . . . . . . . . 103
6.10 Online optimization in edge-cloud computing. . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.11 Existing technologies to obtain lightweight GenAI models. . . . . . . . . . . . . . . . . . . 105
7.1 Medical KG construction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.2 General pipeline of retrieval augmented generation with knowledge graphs. . . . . . . . . 118
Abstract
To build an advanced artificial intelligence (AI) system, it is important to incorporate explainable and
lightweight reasoning modules for better trustworthiness and scalability. Knowledge graphs (KGs) and
Generative AI (GenAI) models are promising categories for developing a reasoning module in AI systems.
They are the main focus of this thesis. There are two objectives in this thesis: 1) developing explainable
and scalable approaches for KGs, and 2) identifying and quantifying the key bottlenecks for scalable generative content delivery on demand and proposing design considerations for training and deployment. More
specifically, we focus on solving four fundamental research problems: 1) designing a novel and explainable
KGC model; 2) improving the proposed explainable KGC model such that it is lightweight and efficient; 3)
improving the embeddings of KGs through incorporating entity typing information; and 4) quantifying the
bottlenecks for scalable generative content delivery. In addition, we envision future research directions
on how to incorporate KGs to achieve better controllability, explainability, and efficiency of generative
models.
Knowledge graph completion (KGC) is a fundamental task in KG research. In this thesis, to improve the
explainability of KGC methods, we formulate link prediction in KGs as a binary classification problem, where an XGBoost binary classifier is trained for each relation using relevant links in KGs. The
new method, named KGBoost, adopts a modularized design and attempts to find hard negative samples
so as to train a powerful classifier to predict missing links. We conduct experiments on multiple datasets
and demonstrate that KGBoost outperforms state-of-the-art methods across most datasets. Then, to improve the scalability of the previous classification pipeline, a lightweight modularized KGC solution, called
GreenKGC, is proposed in this thesis. GreenKGC consists of three modules, namely representation learning, feature pruning, and decision learning, which extract discriminant KG features and make accurate predictions on
missing relationships using classifiers and negative sampling. Experimental results demonstrate that, in
low dimensions, GreenKGC can outperform state-of-the-art methods on most datasets. In addition, low-dimensional
GreenKGC can achieve competitive or even better performance against high-dimensional models with a
much smaller model size. Analysis of training and inference time shows the advantages of GreenKGC over
other low-dimensional and classification-based methods, respectively.
In addition to link prediction in KGs, entity type prediction is also important. The typing information
can also help improve the quality of the knowledge graph embeddings (KGE). In this thesis, we discover
that by introducing multiple auxiliary relations, where similar types share the same auxiliary relations,
we are able to substantially improve the performance of KGE methods on the entity type prediction task. In
addition, an iterative training scheme is proposed to asynchronously update entity and type embeddings
for better quality in capturing the semantic meanings in the original KG. Experiments on two datasets show
that our method can achieve state-of-the-art performance. Meanwhile, through a complexity analysis, we
demonstrate that our proposed method has a significant advantage in time and space complexity over
other existing methods.
The rapid development of GenAI systems has created a huge amount of new data on the Internet,
posing new challenges to current computing frameworks. It is attractive to build systems for scalable
generative content delivery by leveraging a distributed and collaborative computing paradigm. In this
thesis, we identify the key challenges and bottlenecks for scaling up GenAI models. Then, we list design
considerations for training and deploying scalable GenAI models.
Chapter 1
Introduction
1.1 Significance of the Research
1.1.1 Knowledge Graphs and Their Applications
Knowledge graphs (KGs) are structured representations of human knowledge. Data are stored in a KG as a multi-relational graph, where the nodes, called entities, represent objects and concepts, and the edges
represent the relationships or properties between two entities. The basic unit in KGs is a (head entity,
relation, tail entity) triple that describes the relationship between two entities. Due to their structured format,
KGs are useful in many applications, such as information retrieval, recommendation, and question answering
[243]. An example of a general-purpose KG is shown in Fig. 1.1.
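The triple representation described above can be sketched in a few lines of code. The following is a minimal illustration, not part of the thesis; the entity and relation names are hypothetical examples in the spirit of Fig. 1.1.

```python
# Minimal sketch of a KG stored as a set of (head, relation, tail) triples.
# Entity and relation names below are hypothetical examples.
triples = {
    ("Mona Lisa", "createdBy", "Leonardo da Vinci"),
    ("Leonardo da Vinci", "isA", "Person"),
    ("Mona Lisa", "isLocatedIn", "Louvre"),
}

def neighbors(entity):
    """Return (relation, other_entity) pairs for every edge touching `entity`."""
    pairs = []
    for h, r, t in triples:
        if h == entity:
            pairs.append((r, t))
        if t == entity:
            pairs.append((r, h))
    return sorted(pairs)

print(neighbors("Mona Lisa"))
# → [('createdBy', 'Leonardo da Vinci'), ('isLocatedIn', 'Louvre')]
```

Because the graph is multi-relational, each edge carries a relation label, so the same pair of entities can be connected by several distinct relations.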
As shown in the figure, the entities can be real-world objects, such as the painting Mona Lisa, or high-level abstractions of objects, such as the entity type Person, and even dates and images. Thus, KGs are
highly heterogeneous but can integrate data from different sources well. Such a characteristic is especially
useful in many multi-modality applications such as visual question answering [40, 164], image captioning
[212], and image tagging [29]. In addition, KGs are also widely used in many information retrieval tasks.
For example, Google constructs a KG, named Google Knowledge Graph, that can greatly enhance the
search results of their search engine. Fig. 1.2 shows an example of how the search results can be improved
∗ Source: https://aws.amazon.com/neptune/knowledge-graphs-on-aws/
Figure 1.1: An example KG∗.
Knowledge Graph #Entities #Facts Construction Category
DBpedia [7] 4.3M 70M Automatically extracted. Instance
YAGO [122] 10M 120M Automatically extracted. Instance
Wikidata [190] 50M 500M Crowdsourced. Instance
WordNet [127] 155K 207K Curated by linguists. Concept
ConceptNet [118] 44M 2.4B Crowdsourced. Concept
OpenCyc [107] 2.4M 240K Curated by domain experts. Concept
Table 1.1: Publicly accessible KGs.
through a KG. First, when a textual query is input, the entities and relations are identified in order to find
the best matches. In this example, the entity los angeles lakers is identified in the query “who plays for the los
angeles lakers”. In addition, a specific relation playsFor is also identified. As a result, the textual query can
be converted to a link prediction problem (?, playsFor, los angeles lakers). The retrieval system returns all
the neighboring entities in the KG that fulfill the query. A knowledge panel on the right-hand side also
provides additional information on the entities you are interested in. Thus, we can observe that KGs are
crucial for modern information retrieval [17, 43] and recommendation systems [199, 213]. In addition, KGs
can also be useful for plenty of natural language processing (NLP) applications, such as natural language
understanding [2, 3], question answering [79, 161], and information extraction [39, 74].
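The conversion from a textual query to a link prediction query described above can be sketched as a toy lookup over a triple store; the triples, entity names, and the answer routine below are illustrative and not part of any production system:

```python
# Toy triple store (illustrative; not Google's actual system).
triples = [
    ("lebron_james", "playsFor", "los_angeles_lakers"),
    ("anthony_davis", "playsFor", "los_angeles_lakers"),
    ("stephen_curry", "playsFor", "golden_state_warriors"),
]

def answer(head, relation, tail):
    """Answer a link prediction query (?, r, t) or (h, r, ?); the '?' slot
    is passed as None."""
    if head is None:
        return [h for (h, r, t) in triples if r == relation and t == tail]
    return [t for (h, r, t) in triples if h == head and r == relation]

# "who plays for the los angeles lakers" -> (?, playsFor, los_angeles_lakers)
print(answer(None, "playsFor", "los_angeles_lakers"))
# ['lebron_james', 'anthony_davis']
```

In a real KG the lookup would be served by an index or a learned link predictor rather than a linear scan, but the query form is the same.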
Figure 1.2: A screenshot of a Google search result that is empowered by KGs.
There are many publicly accessible KGs that can serve as useful resources for many knowledge-centric
applications. Table 1.1 summarizes the existing open KGs. Based on the formats of entities, KGs can be
categorized into instance-view KGs and concept-view KGs [69]. In instance-view KGs, the entities are
mostly real-world objects, and the relations are used to describe the attributes of the entities. For example,
an entity can be a person, and the relations include several attributes for a person, such as birthPlace,
spouse, and employer. On the other hand, entities in concept-view KGs are mostly high-level abstractions,
such as concepts and descriptions. For example, WordNet [127] is a concept-view KG in which the entities
are words and phrases. Relations in WordNet are also more abstract, such as synonym and relatedForm.
There is often a hierarchy or taxonomy in concept-view KGs. Concept-view KGs are also sometimes called
commonsense knowledge graphs (CKGs).
1.1.2 Incompleteness of Knowledge Graphs
From Table 1.1, we can also see that KGs can be constructed with different methods. However, most
large-scale KGs rely on an automation pipeline to automatically extract the information from the web.
Such a construction process usually contains several stages, including knowledge extraction and graph
construction, as shown in Fig. 1.3. Each stage contains several trained models. However, it is not practical
to expect that all the models achieve 100% accuracy. As a result, there is some information loss from the
sources to the KGs. Such a loss causes a severe incompleteness problem for modern KGs. For example,
around 71% of persons in Freebase [14] do not have place-of-birth information. Such an incompleteness
problem can also be harmful to the downstream applications [145] and might cause possible performance
degradation. Therefore, knowledge graph completion (KGC), which aims to discover the missing relationships between entities, is one of the most important and fundamental tasks for KGs.
Figure 1.3: The causes of the incompleteness problem during KG construction.
Previous work learns knowledge graph embeddings (KGE) to encode entities and relations into a vector
space. The link prediction is performed on the vector space with a well-defined scoring function. However,
KGE methods are mostly randomly initialized. The random initialization makes them not explainable, as
different initializations may lead to different results. In addition, the scoring functions are usually designed
to model a particular relation pattern, such as translation [16], in the embedding space. Such a pre-defined
transformation cannot model the diverse patterns in KGs well. Therefore, an explainable method that can
accurately predict the missing links in KGs is desired. Such an explainable framework is especially crucial
for KGs as KGs are widely used in explainable AI applications. In addition, KGE models usually require
high dimensions to be expressive due to the over-simplified scoring functions. Normally, 500 dimensions
are required to encode a single entity and relation. As a result, the space complexity grows rapidly as
Figure 1.4: Growth in entity size of KGs in recent years†.
more and more information is added to the KGs. Fig. 1.4 shows a trend of the growth in entity sizes of
modern KGs. KGE models could gradually become inapplicable in resource-constrained settings when
applied to large-scale KGs [100]. A time- and memory-efficient KGC model is thus desired to address this
issue.
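A back-of-envelope calculation makes the memory issue concrete; the sketch below assumes a Wikidata-scale entity set (Table 1.1) and 500-dimensional float32 embeddings, which are illustrative but typical numbers:

```python
# Rough memory footprint of a KGE entity-embedding table.
num_entities = 50_000_000    # Wikidata-scale entity count (Table 1.1)
dim = 500                    # a typical dimension for expressive KGE
bytes_per_float = 4          # float32

entity_table_gb = num_entities * dim * bytes_per_float / 1e9
print(f"{entity_table_gb:.0f} GB")  # 100 GB for the entity table alone
```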
In addition to the link prediction task, completing the entity type information is also important and
is discussed in this thesis. Entity types could be crucial for many AI and NLP applications, such as drug
discovery [115], entity alignment [78, 214], and entity linking [67]. Knowledge graph entity typing (KGET)
is the task of predicting the missing entity types in KGs. The state-of-the-art KGET methods often adopt deep
neural networks [77] and attention mechanisms [140] to solve the problem. Although such methods can
achieve good performance, the forward and backward processes are both time- and space-consuming. As
a result, the large time and space complexity makes those methods hard to apply to large-scale KGs under
constrained resources and in real-time inference scenarios. Thus, a KGET method that can achieve good
performance with small time and space complexity is highly desirable.
1.1.3 Challenges and Needs for Scalable Generative Content Delivery
As a specific category of artificial intelligence (AI), generative AI (GenAI) generates new content that
resembles what humans create. The rapid development of GenAI systems has created a huge amount of
new data on the Internet, posing new challenges to current computing and communication frameworks.
The historical development of GenAI suggests the necessity of deploying scalable GenAI services today
[205]. It can be roughly divided into the following four stages:
• 1950 ∼ 1990: Expert systems;
• 1990 ∼ 2020: Deep neural networks;
• 2020 ∼ 2023: Proprietary cloud computing;
• 2023 ∼ present: Public edge-cloud computing.
When the concept of AI was introduced in the earliest stage (1950 ∼ 1990), people were fascinated
by the GenAI idea since it could model human-like interactions. However, unlike discriminative AI, where
only low-dimensional decision vectors are predicted, GenAI technology was not mature enough to
offer powerful GenAI services at that time. Most human-like interactions were configured in the form of
rule-based expert systems [82] and/or template fillings [5].
Through persistent efforts over the three decades of the second stage (1990 ∼ 2020), deep neural
networks became more powerful and popular. Researchers applied them to GenAI and made several breakthroughs [61]. However, since GenAI was still in the development and prototyping stage within the research
community, its scalability was not a concern.
† Source: https://www.dbpedia.org/resources/knowledge-graphs/
Recently, several commercial companies have started to develop their own GenAI services using large
language/image/video models and proprietary data. The performance of such services is impressive due to
the adopted large model sizes and a huge amount of training data. The services are typically deployed on
cloud computing systems with powerful computation resources. Proprietary GenAI systems raise concerns
in various aspects, such as privacy, power consumption, and model efficiency. First, since most services are
closed-source and proprietary, user privacy protection cannot be well enforced. The model development
process and the final developed models are not transparent. Second, the power consumption of cloud
servers for running deep neural networks and transformers [188] is high. Third, since all computations
are conducted in centralized computing facilities, long physical distances between data sources and end
users tend to yield high latency. Real-time applications are difficult to achieve.
GenAI has entered the commercial usage stage with large exposure to the general public. Due to the
proliferation of mobile and edge devices, the data sources and computation should be placed as close to the
user as possible to reduce communication latency and improve user privacy. We envision the next stage of
GenAI should be open-source services that adopt an edge-cloud computing paradigm. To accommodate
an increasing number of daily users, scalability and sustainability are serious technical and business issues
in deploying future GenAI services. Here, we target such accessible, affordable, and sustainable GenAI
services, providing feasible solutions to domain-specific GenAI applications.
1.2 Contributions of the Research
1.2.1 A Classification Framework for Knowledge Graph Completion with Hard Negative Mining
In this work, we solve the knowledge graph completion (KGC) problem, which aims to predict the triples
that are previously unobserved based on the existing ones in the KGs. We try to improve the explainability
of the KGC methods by modeling each relation as a binary classification problem instead of a single embedding, which can only model certain geometrical patterns in the vector space. In addition, we leverage
knowledge graph embeddings (KGE) as the entity features for the binary classification task. Hard negatives are also considered during classifier training so as to train a powerful classifier. The method is named
KGBoost. Experiments on four link prediction datasets are conducted and the results show that KGBoost
can outperform existing methods in the instance-view KGs. Several unique characteristics of this work are
summarized below.
• Embedding learning and link prediction are separated in KGBoost. Such a modularized design is less
sensitive to different embedding dimension settings than end-to-end models. It allows performance
improvement with the provision of hard negative samples.
• Instead of using a single scoring function, a binary classifier is assigned to each relation to better
model the unique pattern for each relation.
• Two different negative sampling strategies to draw hard negative samples are explored and integrated into KGBoost.
1.2.2 A Lightweight Knowledge Graph Completion Method in Low Dimensions
In this work, we attempt to solve the KGC task in low dimensions so as to have a much smaller model size
and more efficient inference. The proposed lightweight method is called GreenKGC. GreenKGC consists
of three modules that can be trained individually: 1) representation learning, 2) feature pruning, and 3)
decision learning. Initial features for entities and relations are obtained in the first module. We leverage
KGE methods to obtain the initial features. In this stage, the embedding dimensions can be high to ensure
the features are expressive. In the second module, we largely prune the initial features through a supervised
feature selection process. We also devise a method to partition KGs into more homogeneous relation
groups in order to further reduce the required feature dimensions in each relation group. Finally, in the
third module, a tree-based classifier is adopted to introduce non-linearities to the model. Experimental
results in low dimensions demonstrate that GreenKGC can achieve satisfactory performance in as low as
8 dimensions. Contributions of this work are summarized as follows.
• We propose a modularized framework to solve the KGC task in low dimensions. The results show
that GreenKGC can outperform all methods in most datasets in low dimensions. Several future
extensions are also enabled due to the modularized design.
• Ours is one of the earliest works to investigate model pruning for KGs.
• GreenKGC has advantages in time and space complexity compared to other methods, including
neural network methods and knowledge distillation methods.
• A detailed ablation study is carried out to understand the effects of each module.
1.2.3 Improving Knowledge Graph Embeddings with Entity Types and Auxiliary Relations
In this work, we solve the knowledge graph entity typing (KGET) task, which aims to predict the missing
types based on the links and known types in KGs. We observe that a single auxiliary relation hasType that
was previously used in the KGE methods when solving the KGET task is not adequate to model the diverse
relations between entities and types. Thus, we try to add more such auxiliary relations based on the types’
context. We define the context of the types as a collection of attributes of the entities that the type is
associated with. As such, the neighborhood information is implicitly encoded when the auxiliary relations
are added. Then, we propose an iterative training scheme, named KGE-iter, for the KGET task. The entity
embeddings are first well-initialized by training with factual triples. Typing information is then encoded
using typing triples. We experimented on two KGET datasets. The results demonstrate that KGE-iter can
outperform all other methods while retaining much smaller asymptotic time and space complexity.
Contributions of this work are summarized as follows.
• We propose two solutions to add auxiliary relations to convert typing tuples into triples that can
preserve the context in the KGs.
• We propose an iterative training scheme, as well as a self-adversarial loss for entity typing, to train
powerful embeddings for entity type prediction.
• Time and space complexity analyses are carried out to evaluate the applicability of KGET methods
under constrained resources.
1.2.4 Scalable Generative Content Delivery on Demand
GenAI poses unprecedented challenges to scalable computing systems and creates the need for edge-cloud
computing for three main reasons: 1) a significant amount of generated data, 2) consumer-centric
applications, and 3) the high cost of maintaining centralized GenAI services. Due to the above-mentioned three
emerging challenges, the collaboration of edge and cloud computing resources will mitigate the burden of
cloud servers, especially under the high volume of requests, or “at scale”. In this work, we examine four
important aspects of deploying GenAI under edge-cloud computing: 1) computation and data offloading,
2) low latency, 3) personalization, and 4) privacy. Our main contributions are summarized below:
• Provision of a comprehensive overview of recent developments in both GenAI models and edge-cloud computing;
• Identification of technical challenges in training and deploying large-scale GenAI services using
today’s solutions;
• Presentation of design considerations for training and deploying GenAI that target computational
efficiency (i.e., lower power consumption), low latency, personalization, and privacy;
Year  Reference  Contributions  (GenAI / Edge Intelligence / System Design)
2020  [168]  Introduce communication-efficient techniques from both algorithmic and system perspectives.  (✗ / ✓ / ✓)
2021  [131]  Introduce communication-efficient techniques from both algorithmic and system perspectives.  (✗ / ✓ / ✗)
2022  [225]  Summarize major research efforts where machine learning systems have been deployed at the edge of computer networks.  (✗ / ✓ / ✗)
2023  [228]  Review fundamental GenAI techniques and applications in different modalities.  (✓ / ✗ / ✗)
2023  [24]  Survey the basic components of GenAI, recent advances, and applications of uni-modality and multi-modality GenAI models.  (✓ / ✗ / ✗)
2023  [218]  Deployment of AIGC networks and mobile applications via collaborative edge-cloud infrastructure.  (✓ / ✓ / ✗)
2023  Ours  Review both GenAI models and edge intelligence; point out challenges and bottlenecks in current GenAI services; propose design considerations to address the issues; provide future directions on how edge-cloud computing can benefit GenAI.  (✓ / ✓ / ✓)
Table 1.2: Comparison between our work and other related papers.
• Visualization of two large-scale GenAI applications as concrete examples to support our discussion;
• Future research directions on GenAI systems based on edge-cloud computing.
In addition, we summarize related overview papers and compare them with this work in Table 1.2.
This work is the first one devoted to GenAI services at scale in the edge-cloud computing paradigm, and
it includes network design considerations and guidance for future research.
1.3 Organization of the Thesis
The rest of the thesis is organized as follows. In Chapter 2, we review the existing work for KGC and KGET
tasks, including embedding-based methods and deep neural network methods, as well as the preliminaries
of GenAI at scale. In Chapter 3, we propose an explainable KGC method that combines KGE, binary classifiers, and hard negative mining. In Chapter 4, we improve the time and space efficiency of a classification-based
KGC method by pruning the embedding dimensions for different relation groups. In Chapter 5,
we propose an iterative training scheme and introduce auxiliary relations to solve the KGET task while
retaining the time and space efficiency of the embedding methods. In Chapter 6, we identify the bottlenecks for scalable content generation on demand and propose design considerations in the training and
deployment of large-scale models. Finally, we provide concluding remarks and point out future research
directions on how to integrate KGs into generative models in Chapter 7.
Chapter 2
Research Background
2.1 Knowledge Graph Embedding (KGE) Methods
Knowledge graph embedding (KGE) methods learn entity and relation embeddings in a low-dimensional
vector space. The embeddings are able to model certain relation patterns in the KGs. They are useful in
many machine-learning models [204, 236] as the features for entities and relations. In knowledge graph
completion (KGC), KGE methods adopt a well-defined scoring function f(h, r, t) to evaluate the plausibility
of an unseen triple, where h, r, and t are the embeddings for the head entities, relations, and tail entities,
respectively. The scoring function also serves as the optimization objective during training. Based on the
design of the loss function, KGE methods can be categorized into the following three categories [198]: 1)
distance-based methods, 2) semantic matching methods, and 3) deep neural network methods. We will
review the existing work in these three categories in this section. In addition, since most existing KGE
methods require a high-dimensional embedding space to be expressive, KGE methods that aim at learning a
powerful low-dimensional embedding space are also popular. They are parameter-efficient for large-scale
KGs and under resource-constrained scenarios. Thus, KGE methods in low dimensions are also reviewed
in this section.
Model  Scoring function
TransE [16]  −∥h + r − t∥
TransR [113]  −∥M_r h + r − M_r t∥
TransH [206]  −∥(h − w_r^⊤ h w_r) + r − (t − w_r^⊤ t w_r)∥
RotatE [172]  −∥h ◦ r − t∥
PairRE [27]  −∥h ◦ r^H − t ◦ r^T∥
ReflectE [232]  −∥h_c − M_r^⊤ t∥
CompoundE [58]  −∥T · R(θ) · S · h − T̂ · R̂(θ) · Ŝ · t∥
Table 2.1: Summary of distance-based KGE methods.
2.1.1 Distance-based KGE Methods
Distance-based methods model entities as vectors in the embedding space and relations as affine transformations from the head entity to the tail entity. One famous example is TransE [16], which models
relations as translational distances and minimizes the distance ∥h + r − t∥, where h, r, and t are vectors
representing a valid triple (h, r, t). Although TransE can capture the compositionality of relations, it fails to
model symmetric relations since r ≈ 0 for symmetric relations. TransH [206] and TransR [113] model
symmetric relations by projecting entities to another hyperplane and vector space, respectively. While
these models can capture symmetric relations, they fail to preserve the compositional patterns. RotatE
[172] extends embeddings to a complex space so that it can model symmetry, asymmetry, inversion, and
compositional patterns of relations at the same time. OTE [177] extends KGE into a hypercomplex space.
Recent work has tried to model relations as more advanced geometric transformations to model particular
relation patterns. For example, PairRE [27] models relations as scaling, and ReflectE [232] models relations
as reflection. HousE [109] adopts the Householder transformation to model triples as higher-order rotations and reflections simultaneously. CompoundE [58] is a recent work that models relations as compound
geometric operations, including translation, rotation, and scaling, applied sequentially. The order of the geometric
operations may vary given different relation types. It is shown in [58] that some existing distance-based
methods are special cases of CompoundE. Since the model expressiveness is largely improved compared
Model  Scoring function
RESCAL [139]  h^⊤ M_r t
DistMult [222]  h^⊤ diag(r) t
HolE [138]  r^⊤ (h ⋆ t)
ANALOGY [117]  h^⊤ M̂_r t
ComplEx [184]  Re(⟨r, h, t̄⟩)
SimplE [89]  (⟨h, r, t⟩ + ⟨t, r′, h⟩) / 2
TuckER [12]  W ×₁ h ×₂ r ×₃ t
QuatE [233]  Q_h ⊗ W_r^◁ · Q_t
Table 2.2: Summary of semantic matching KGE methods.
to the previous distance-based methods, CompoundE works well in low dimensions. Table 2.1 summarizes
some well-known distance-based KGE methods.
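As a concrete instance of a distance-based scoring function, the sketch below evaluates the TransE score −∥h + r − t∥ on toy three-dimensional embeddings (the vectors are made up for illustration):

```python
import math

def transe_score(h, r, t):
    """TransE plausibility score: -||h + r - t||; higher is more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

h = [1.0, 0.0, 2.0]
r = [0.5, 1.0, -1.0]
t = [1.5, 1.0, 1.0]          # t = h + r exactly, so the distance is zero
print(transe_score(h, r, t))
print(transe_score(h, r, [0.0, 0.0, 0.0]))  # more negative: less plausible
```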
2.1.2 Semantic Matching KGE Methods
Semantic matching methods [15] calculate the semantic similarities among triples (h, r, t). The scoring
function is often in the form of f(h, r, t) = h^⊤ M_r t, where a matrix M_r is used to model a relation. RESCAL
[139] suffers from model complexity and numerical instability caused by the dense relation matrices. DistMult [222] confines the relation matrices to be diagonal to reduce model complexity, but it fails to model
asymmetric relations. ComplEx [184] extends DistMult from the real space to the complex space so that
asymmetric relations can be modeled by the imaginary parts of embeddings. QuatE [233] further extends semantic matching methods into a hypercomplex space. Recently, TuckER [12] and AutoSF [235]
allow more flexibility in modeling similarities. Though KGE methods are simple, they often require a
high-dimensional embedding space to be expressive. Table 2.2 summarizes some well-known semantic
matching KGE methods, where ⟨∗, ∗, ∗⟩ denotes the generalized dot product, ⋆ denotes circular correlation, ×₁, ×₂, ×₃ denote the mode-wise products in the Tucker decomposition of tensors, ⊗ denotes the Hamilton product, and · denotes
the Hadamard product.
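The symmetry limitation of DistMult mentioned above is easy to verify numerically; the sketch below computes h^⊤ diag(r) t on toy two-dimensional embeddings (values are illustrative):

```python
def distmult_score(h, r, t):
    """DistMult score h^T diag(r) t: a trilinear product over the triple."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

h, r, t = [1.0, 2.0], [0.5, -1.0], [2.0, 1.0]
print(distmult_score(h, r, t))  # -1.0
# Swapping head and tail leaves the score unchanged, which is why DistMult
# cannot distinguish (h, r, t) from (t, r, h) for asymmetric relations.
print(distmult_score(t, r, h))  # -1.0
```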
2.1.3 Deep Neural Network KGE Methods
Though distance-based and semantic matching KGE methods are simple, they have limited expressiveness
in low-dimensional space due to the simplicity of the scoring functions [42]. Therefore, classification-based
methods exploit multi-layer neural networks to increase the model’s expressiveness. NTN [170] adopts a
neural tensor network combined with textual representations of entities. ConvE [42] reshapes entity and
relation embeddings into 2D images, and uses 3 × 3 convolutional filters followed by several fully connected
(FC) layers to predict the scores of triples. ConvKB [135] uses 1 × 3 convolutional filters followed by several
FC layers to predict triple scores. InteractE [186] improves the performance by increasing
feature interactions in convolutional layers. SACN [165] adopts a graph convolutional network encoder
to capture the structural information in KGs and a convolution-based decoder for link prediction.
Graph convolutional network (GCN) [95] is another kind of neural network that is widely used to
learn entity and relation embeddings. GCNs were initially designed for homogeneous graphs and the node
classification task, so they are not directly suitable for the KGC task. R-GCN [162] modifies GCN to process
multi-relational graphs by assigning different information propagation matrices for different relation types
in the forward pass process. WGCN [242] uses a shared information propagation matrix for all relations but
assigns different weighted coefficients for different relations in the forward pass process. CompGCN [187]
adopts a variety of entity-relation composition operations from distance-based and semantic matching
KGE methods. It also generalizes several of the existing multi-relational GCN methods.
2.1.4 KGE Methods in Low Dimensions
Recently, research on the design of low-dimensional KGE methods has received attention. MuRP [11]
embeds entities and relations in a hyperbolic space due to its effectiveness in modeling hierarchies in KGs.
AttH [26] improves hyperbolic KGE by leveraging hyperbolic isometries to model logical patterns. MulDE
[194] applies knowledge distillation [71] with a set of hyperbolic KGEs as teacher embeddings to supervise
powerful embeddings in low dimensions. However, embeddings in hyperbolic space are hard to use
in other downstream tasks. In Euclidean space, DualDE [244] adopts knowledge distillation to learn low-dimensional embeddings from high-dimensional ones for smaller model sizes and faster inference time.
Yet, it requires a long training time to reduce the feature dimension.
2.2 Knowledge Graph Entity Typing
Entity types are high-level abstractions of the entities. Typing information is crucial for organizing entities
into ontologies and taxonomies that maintain the structure of knowledge graphs. The availability of entity
types can also improve the quality of the learned entity and relation embeddings. Thus, they are important
in knowledge graphs. However, similar to links, entity types might also be missed during the
information extraction process. As a result, it is important to predict the missing entity types based on the
observed ones. Knowledge graph entity typing (KGET) is the task to achieve such an objective.
Below, two types of KGET methods are introduced.
2.2.1 Embedding-based Methods
KGE methods solve the KGET task by introducing an auxiliary relation hasType to form typing triples
(entity, hasType, type). Most KGE methods, such as TransE [16] and RotatE [172], can be applied to the
KGET task. However, it was pointed out in [140] that, due to the limited diversity of patterns between
entities and types that a single relation can represent, KGE methods cannot perform well on the KGET task.
Therefore, many other embedding-based methods have been proposed to address the KGET task. Embedding-based methods first obtain entity embeddings from KGE models, then fix the entity embeddings and train
type embeddings. The embedding spaces for entities and types are connected with some linear projections.
ETE [130] minimizes the ℓ1-distance between the entity and type spaces. ConnectE [240] adopts a linear projection matrix to connect the entity and type spaces and proposes two training objectives, E2T and TRT,
to minimize the distance between the two spaces. JOIE [69] also proposes two training objectives,
cross-view grouping and cross-view transformation, to make sure entities with similar types will be embedded closely. TransC [121] embeds entities and types in the same embedding space as high-dimensional
balls. Several constraints are imposed to maintain the hierarchy among entity types. CORE [59] learns
complex KGE [172, 184] in the entity and type space individually and formulates the transformation from
the entity to type space as a linear regression problem for better expressiveness.
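The conversion from typing tuples to triples via the auxiliary relation hasType, mentioned at the start of this subsection, can be sketched in a line (entity and type names are illustrative):

```python
# Convert entity-typing tuples (e, t) into triples with the auxiliary
# relation hasType, so that standard KGE methods can train on them.
typing_tuples = [("mona_lisa", "painting"), ("da_vinci", "person")]
typing_triples = [(e, "hasType", t) for e, t in typing_tuples]
print(typing_triples)
# [('mona_lisa', 'hasType', 'painting'), ('da_vinci', 'hasType', 'person')]
```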
2.2.2 Deep Neural Network Methods
Embedding-based methods fail to encode neighborhood information and known types since the entity
embeddings are fixed when training type embeddings. Neighborhood information is important in the
KGET task since the entity types can be largely determined by the neighbors, including triples and typing
tuples. Thus, recent work adopts R-GCN [162] to solve the KGET task. As a result, entity embeddings
are aggregated from the neighboring entities and types. Then, multi-layer perceptrons (MLPs) are used
to predict missing entity types by solving a multi-label classification problem. More advanced GCN-based
methods, such as WGCN [242] and CompGCN [187], are also adopted.
However, not all neighbors contribute equally during entity type prediction. Thus, attention mechanisms
are combined with GCNs in recent work to achieve better performance. ConnectE-MRGAT [241]
adopts graph attention networks (GATs) [132, 189] to solve the KGET task. CET [140]
proposes two different attention mechanisms, N2T and Agg2T, to aggregate neighborhood information
from neighboring entities and types. AttET [245] devises a type-specific attention mechanism to learn
entity embeddings. TET [77] adopts the self-attention layers in transformers to aggregate the neighboring information. Although GCN-based methods can achieve superior performance, the time and space
complexity for GCN-based methods is much higher than embedding-based methods.
2.3 Negative Sampling in Knowledge Graphs
Negative sampling is important for KG applications since only parts of the facts are given in KGs. The
closed-world assumption (CWA) suggests that any unobserved information should be treated as negative
samples. As a result, KGE models can be formulated as convex optimization problems and have closed-form
solutions. However, such an assumption suffers from the overfitting problem on the observed triples and
the inefficiency of using all unseen data as negative training samples. Therefore, the open-world assumption
(OWA), which suggests that unobserved triples might be missing rather than negative samples, is more
widely used since it better matches real-world scenarios. Thus, an accurate choice of a negative sampling
subset can contribute to the performance of KGC methods. Below, we will introduce previous work on
drawing negative samples for the link prediction and entity type prediction tasks.
2.3.1 Link Prediction
Negative samples are often obtained by corrupting a true triple, (h, r, t), with a random head, (h′, r, t), or
tail, (h, r, t′), such that the corrupted triple is not in the KGs. It was proposed in [206] to model the probability of corrupting heads or tails as a Bernoulli distribution to avoid false negatives. Adversarial learning
with generative adversarial networks (GANs) for the generation of negative samples was examined in [22,
197]. That is, effective negative samples could be obtained by training another generator, f_G(h′, r, t′),
simultaneously. However, the GAN-based negative sample generator makes the original model more complex and difficult to train. To reduce the complexity of negative sampling, self-adversarial training was
adopted in [172] to generate negative samples based on the original scoring function f(h′, r, t′).
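A minimal sketch of the basic corruption strategy, with a filter against observed triples to avoid obvious false negatives, is shown below (the toy KG and entity names are illustrative; GAN-based and self-adversarial sampling replace the uniform random choice with a learned or score-weighted one):

```python
import random

random.seed(0)  # for reproducibility of the illustration
observed = {
    ("lebron", "playsFor", "lakers"),
    ("curry", "playsFor", "warriors"),
}
entities = ["lebron", "curry", "lakers", "warriors", "celtics"]

def corrupt(triple, corrupt_head=False):
    """Corrupt the head or tail with a uniformly random entity, rejecting
    candidates that already appear among the observed triples."""
    h, r, t = triple
    while True:
        e = random.choice(entities)
        cand = (e, r, t) if corrupt_head else (h, r, e)
        if cand not in observed:
            return cand

neg = corrupt(("lebron", "playsFor", "lakers"))
print(neg)  # e.g. ('lebron', 'playsFor', 'celtics')
```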
2.3.2 Entity Type Prediction
Given an entity typing tuple (e, t), where e is an entity and t is an entity type, a negative sample can be
generated by corrupting the tuple with a random entity type t′, such that the tuple (e, t′) does not belong to
the KGs. Entity-type prediction is often formulated as a multi-label classification problem in GCN-based
methods and binary cross-entropy (BCE) loss is often adopted as the training loss [162, 187, 242]. In such
cases, we can treat all unobserved tuples as negative samples. However, since the severity of missing type
information is high, it is likely to encounter false negatives in such a setting. Thus, [140] proposes to use a
false-negative aware (FNA) loss instead of BCE loss, where the sampling distribution for negative samples
p(e, t′) is defined as:

p(e, t′) = σ(f(e, t′)) − σ(f(e, t′))², (2.1)
where f(e, t′) is the plausibility of an unseen tuple and σ(·) is the sigmoid function. It can be seen that negative samples that are very likely to be valid (i.e., false negatives) or very unlikely to be valid (i.e., easy negatives) have lower probabilities of being drawn. [77] further refines the FNA loss by designing the sampling distribution p(e, t′) as:

p(e, t′) = 3σ(f(e, t′)) − 2σ(f(e, t′))²,  if σ(f(e, t′)) ≤ 0.5,
p(e, t′) = σ(f(e, t′)) − 2σ(f(e, t′))² + 1,  otherwise, (2.2)

which is similar to Eq. (2.1) but with larger gradients at σ(f(e, t′)) = 0.5.
2.4 Scalable Generative Content Delivery
2.4.1 Generative AI (GenAI) Models
The development of GenAI can be roughly divided into three eras: 1) the Variational Autoencoder (VAE)
and Generative Adversarial Network (GAN) era (2014-2017), 2) the Transformer era (2018-2019), and 3) the
large model era (2020-present) [54]. Three popular architectures for GenAI models are shown in Fig. 2.1.
a) Variational Autoencoder (VAE). The Variational Autoencoder (VAE) was first proposed in [94].
It has several variations [6, 60, 209, 227] to improve the quality of the generated content [102, 181], adjust
to different levels of supervision [56], and improve the inference efficiency [171].
(a) Variational Auto-encoder (VAE). (b) Generative Adversarial Network (GAN). (c) Transformers.
Figure 2.1: Architectures of three popular GenAI model categories: VAE, GAN, and Transformers.
VAEs are probabilistic generative models. Their encoder and decoder correspond to two neural networks. The encoder maps
an input to a vector in a latent space, while the decoder maps a latent vector back to the input space to
generate an output. In the training stage, the network parameters are optimized so that the output is as
close as possible to the input. Adding noise to latent vectors makes the decoder produce multiple output
samples that have the same distribution as input samples.
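As a minimal numerical sketch of this encode/sample/decode cycle (a toy linear model, not the thesis's implementation, with all weight names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_mu, W_logvar):
    """Toy linear encoder: maps input x to a latent mean and log-variance."""
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps; the added noise is what lets the decoder
    produce multiple outputs that follow the learned latent distribution."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decoder(z, W_dec):
    """Toy linear decoder: maps a latent vector back to the input space."""
    return W_dec @ z

x = rng.standard_normal(4)                      # input sample
W_mu, W_logvar = rng.standard_normal((2, 4)), rng.standard_normal((2, 4))
W_dec = rng.standard_normal((4, 2))
mu, logvar = encoder(x, W_mu, W_logvar)
x_hat = decoder(reparameterize(mu, logvar), W_dec)
assert x_hat.shape == x.shape                   # output lives in the input space
```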
b) Generative Adversarial Network (GAN). Similar to VAE, Generative Adversarial Networks (GANs)
[61] need two networks in the training stage but keep only one in the inference stage [38, 75, 141, 207].
The two networks are a generator and a discriminator. Through the training process [66, 83], the generator produces fake data that get closer and closer to the real data distribution so as to fool the discriminator. The discriminator, on the other hand, tries to differentiate real data from fake data as well as possible. The generator and discriminator are trained by solving a min-max optimization problem:

min_G max_D V(G, D) = E_{x∼real}[log D(x)] + E_{G(z)∼fake}[log(1 − D(G(z)))],

where G(·) and D(·) denote the generator and the discriminator, respectively. Both networks improve along the training process. Gradually, they reach an equilibrium status where fake and real data are so close that they cannot be easily differentiated. Then, the training stage is completed.
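The value function V(G, D) can be estimated numerically from discriminator outputs; the sketch below is illustrative and assumes Monte-Carlo averaging over mini-batches:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of V(G, D) = E[log D(x)] + E[log(1 - D(G(z)))].
    d_real: discriminator outputs on real samples; d_fake: on fake samples."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A strong discriminator (D(x) ~ 1, D(G(z)) ~ 0) drives V toward 0, while at
# the equilibrium D(x) = D(G(z)) = 0.5, V = log(1/2) + log(1/2) ~ -1.386.
v_eq = gan_value(np.full(8, 0.5), np.full(8, 0.5))
```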
c) Transformers. Natural language generation (NLG) models aim to generate human-like textual
responses. There are several common applications, such as neural machine translation [33, 175], question answering [36, 224], and document summarization [133, 191]. Such models are also called language
models (LMs) [208]. In recent years, transformers [188] with self-attention mechanisms have made major
breakthroughs in establishing powerful LMs [68, 91, 119, 163, 179]. Transformers have replaced the long
short-term memory (LSTM) [73] as the preferred LM architecture and set off a new wave of large language
models (LLMs) [47, 87, 112, 230]. They often adopt an encoder-decoder architecture, as shown in Fig. 2.1
(c). While the encoder adopts a bi-directional information propagation process to understand the input
text, the decoder in most transformer architectures generates words one by one. Such a decoder is also
called the autoregressive decoder. With the advent of transformers, generative models are getting larger
and larger. Over the past two years, attempts have been made to combine a wide variety of models to
create larger and more powerful models. They offer impressive performance in various fields [63]. Due
to the large model sizes of GenAI models, they are deployed on the cloud nowadays. That is, models are trained at the training stage and run at the inference stage in cloud servers. Users send requests to the cloud server for content generation. Then, the generated content is sent back to users.
Figure 2.2: Three basic computing paradigms in support of large-scale computing systems.
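The autoregressive decoding loop described above can be sketched as follows, with a toy next-token function standing in for a trained transformer (all names are illustrative):

```python
import numpy as np

def decode_greedy(next_token_logits, bos_id, eos_id, max_len=10):
    """Generate tokens one by one: each step conditions on the prefix
    generated so far, which is what makes the decoder autoregressive."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = next_token_logits(tokens)      # model call on the prefix
        next_id = int(np.argmax(logits))        # greedy choice
        tokens.append(next_id)
        if next_id == eos_id:                   # stop at end-of-sequence
            break
    return tokens

# Toy model: always predicts token (last + 1); token 3 acts as EOS.
toy = lambda prefix: np.eye(4)[min(prefix[-1] + 1, 3)]
print(decode_greedy(toy, bos_id=0, eos_id=3))   # [0, 1, 2, 3]
```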
2.4.2 Scalable Computing Systems
There are three basic paradigms for implementing scalable computing systems. They are: 1) cloud computing, 2) multi-access edge computing (MEC), or previously mobile-edge computing, and 3) edge-cloud
computing, as shown in Fig. 2.2. Among the three, cloud computing carries out computationally demanding projects using a large number of online servers to serve many remote users. The cloud has much
larger computing resources than a local site. Moving compute-intensive tasks to the cloud has been an efficient way of data processing. The concept of cloud computing was introduced in the early 60s [57, 174].
It has made rapid progress in the last several decades and has become a mature business service model.
Examples include: Amazon Web Services (AWS)∗, Microsoft Azure†, Google Cloud Platform (GCP)‡, IBM Cloud§, Salesforce¶, etc.
As the computational power of mobile devices increases and wireless networks become accessible at
almost any place, multi-access edge computing (MEC) provides computing, storage, and bandwidth closer
to users. MEC tends to allocate more computing tasks to the edge than the cloud. Computation can be
performed near data sources on edge devices. Edge computing has become more important nowadays, as
pointed out in a few studies, e.g., [23, 48, 49, 124, 167]. The MEC framework primarily relies on edge devices, which have limited resources. In addition, the MEC framework relies heavily on caching to reduce latency. Thus, its performance is not good for computationally demanding tasks.
As the demand for real-time processing, low-latency communication, and efficient data management
increases, the edge-cloud computing paradigm emerges as a new and attractive solution. By combining the
power of cloud computing with the proximity and responsiveness of edge devices, edge-cloud computing
aims to bridge the gap between latency and scalability. Since it has lower latency, it is suitable for real-time
applications such as AR/VR/MR [51, 234], object tracking and detection [152, 185], etc. Since it can utilize
computational resources at both the cloud and edges, it has more flexibility in load balancing to yield a
more scalable solution. Moreover, user data and privacy can be better preserved by edge-cloud computing
[143].
The hardware and performance specifications of three computational resources (namely, cloud servers,
edge servers, and user devices) are compared in Table 2.3. As shown in the table, cloud servers have the
highest resources in terms of computational memory and data storage capacity. At the same time, they
have the highest power consumption and the largest number of concurrent connections. Their latency is
also the highest since they are far from users.
∗ http://aws.amazon.com/ec2
† http://www.microsoft.com/azure
‡ https://cloud.google.com/
§ https://www.ibm.com/cloud
¶ https://www.salesforce.com/

Resources               Cloud Servers   Edge Servers   User Devices
Memory                  >24 TB          ∼500 GB        <64 GB
Disk Storage            >25 PB          <1 PB          <10 TB
Latency (RTT)           30∼50 ms        <10 ms         -
Power (per year)        >2,000 TWh      ∼7,500 KWh     ∼600 KWh
Concurrent Connections  >500,000        ∼1,000         1
Table 2.3: Comparison of hardware and performance specifications of three computational resources, namely cloud servers, edge servers, and user devices.
Figure 2.3: Roles and suitable applications for edge nodes and cloud nodes in edge-cloud computing.
It is beneficial to shift some computation loads from cloud servers to edge servers and user devices to balance the computational load and reduce latency in various
applications. The load-balancing idea is also called offloading. Computation offloading [53, 81] and data
offloading [70, 239] are two key concepts in edge-cloud computing.
The AI tasks suitable for edge servers and cloud servers are shown in Fig. 2.3. Due to rich computation
resources, cloud servers can store and run large models to process high-level tasks. In contrast, edge
devices are mainly responsible for low-level pre-processing tasks. Due to the emergence of 5G/IoT, AIGC
enters a new era. That is, it is no longer sufficient to conduct all computations and store all data in a
centralized cloud server or data center. Similarly, AI computation with edge servers and user devices is
also not practical in building a scalable system as AIGC data grows fast.
Some large deep-learning AI models are difficult to deploy at the edges. Recently, a green learning
methodology [100] has been proposed as an alternative to deep learning. Green learning AI models have
much smaller model sizes, significantly lower computational complexity in terms of FLOPs (Floating Point
Operations), faster inference time, and less power consumption demand. As a result, green-learning AI
models open a new door for edge servers and even user devices in offloading cloud servers. Hybrid deep- and green-learning solutions match the edge-cloud computing paradigm well. That is, GenAI has a unique
mission to process low-level data and aggregate high-level abstractions to generate creative content. GenAI
can benefit the most from the collaboration of edge and cloud servers.
Recently, Meta announced a supercomputing cluster with very rich computational resources∥. It can perform five exaflops (five billion billion calculations per second) using a total of 16,000 NVIDIA A100 GPUs to train state-of-the-art GenAI models. Servers are connected by an NVIDIA Quantum InfiniBand fabric network with a bandwidth of 16 Tb/s to ensure low latency in data synchronization. However, this
computational scale is not affordable for most companies and academic institutions. Thus, how to design
scalable GenAI systems using a reasonable computing cluster to perform similar tasks is of great interest.
We put hope in edge-cloud computing since it can leverage expandable computation resources that are
under-utilized and closer to users.
The deployment of GenAI systems on the edge-cloud computing platform is shown in Fig. 2.4. Since
the training of GenAI models is most computationally heavy, it is still conducted in cloud servers. The
training is usually done offline and asynchronously. The deployment of trained GenAI models for the AIGC
tasks can be placed as close to users as possible to lower latency. Edge servers can be used to fine-tune
GenAI models, train personalized models, preserve user privacy, and serve as an interface between edges
and cloud servers. It is ideal to have several edge servers to handle individual tasks separately.
It is worthwhile to emphasize that user feedback is important for model fine-tuning. In other words,
there are interactions between the cloud and the edges. Besides, collaboration among users is important
for training a more robust and diverse system. Edge-cloud computing provides a natural solution to build
∥ https://ai.facebook.com/blog/ai-rsc/
Figure 2.4: Implementation of GenAI systems with the edge-cloud computing paradigm.
GenAI systems at scale. Yet, to the best of our knowledge, there is no research addressing how a distributed
system should be designed to accommodate the computation, transmission, and exchange of a huge amount
of AIGC data. This motivates us to explore this topic and conduct this research.
Chapter 3
A Classification Framework for Knowledge Graph Completion with
Hard Negatives Mining
3.1 Introduction
The great majority of research on knowledge base completion focuses on learning effective embeddings for
entities and relations through end-to-end optimization on a pre-defined scoring function [84]. In different
knowledge graph embedding models, relation embeddings possess different meanings, so the roles of relation embeddings are not easy to explain. For example, TransE [16] models relations as translation
vectors from head to tail entities. Such a scoring function is, by design, unable to model symmetric and one-to-many relations. Therefore, it is difficult for TransE to model a KG that contains diverse relation patterns. RotatE [172] extends a similar idea to the complex space and models relations as rotations between head and tail entities. However, the modeling capability of fixed compositional relation patterns, as in TransE and RotatE, has been challenged for its generalizability [233], and a single knowledge graph embedding can only model some specific relation patterns. Since relation patterns may vary significantly in KGs,
it is challenging to model relations using a single scoring function. Although KG embedding methods
offer state-of-the-art performance, they are limited in several aspects: inadequacy in relation modeling, sensitivity to embedding dimensions, and limited performance improvement with the provision of hard negative samples. To increase the expressiveness of embedding models, one either requires high-dimensional embedding spaces [42] or hyper-complex embedding spaces [177, 233].
Figure 3.1: An illustration of modeling link prediction as a binary classification problem.
Instead of investigating high-dimensional embedding, we aim to search for a solution that models
relations in a more generic and explainable way. Given a relation, there are two possible outcomes between
a head entity and a tail entity; namely, the relation either exists or does not exist. Thus, a relation can be
potentially modeled by a binary classifier. For example, as in Fig. 3.1, “Is Hulk a superhero movie?” is indeed
a binary classification problem.
These observations motivate our work. That is, to better address different relation patterns and make
the role of relations more explainable, we propose a new method, called KGBoost, to model link prediction
as a generic binary classification problem. It is applied to the knowledge graph completion task, which
tries to predict the missing links based on the existing KGs. In addition, we attempt to find hard negative
samples so as to train a powerful classifier for missing link prediction. Previous work tried to use generative
adversarial networks (GANs) [22, 197] to draw hard negative samples by training an embedding model and
a discrete negative sample generator jointly. Such a framework is computationally expensive and difficult
to optimize. Here, we propose a simple yet effective negative sampling strategy based on the constraint of
each relation. Furthermore, harder negative samples are generated gradually in the training process using
self-adversarial negative sampling.
We conduct experiments on multiple benchmarking datasets and demonstrate that KGBoost outperforms state-of-the-art methods across most datasets. Furthermore, as compared with models trained by
end-to-end optimization, KGBoost works well under the low-dimensional setting so as to allow a smaller
model size. The classification performance is not sensitive to the feature number if discriminant features
already exist. Furthermore, the modularized design allows KGBoost to have a performance improvement
with the provision of hard negative samples.
3.2 Methodology
We use E and R to denote sets of entities and relations, respectively. A KG, which is represented by
F = {(h, r, t)} ⊆ E × R × E, is a collection of factual triples, where h, t ∈ E and r ∈ R. Furthermore, we divide F into several subsets of entity pairs,

G(r) = {(h, t) | (h, r, t) ∈ F},

where G(r) only contains entity pairs connected by a specific relation r. The proposed KGBoost method consists of three main steps: (1) constructing training data, (2) encoding entities, and (3) training relation classifiers, as shown in Fig. 3.2. They are elaborated below.
3.2.1 Constructing Training Data
Constructing training data is important in KGBoost since the quality of the training set directly affects
the performance of the classifiers. There are two criteria adopted to construct training data: 1) sufficient
positive samples and 2) effective negative samples. Along this line, we propose to incorporate inference
Figure 3.2: An overview of the KGBoost pipeline.
patterns between relations to augment positive samples and generate effective negative samples based on
relation constraints.
Relation Inference. Since the classifier for each relation is trained independently, we consider inference patterns between relations to capture first-order relation dependencies, i.e., sub-relations. A relation, r2, is said to be a sub-relation of relation r1 if and only if (h, t) ∈ G(r2) ⇒ (h, t) ∈ G(r1). However, due to the incompleteness of KGs, some (h, t) pairs in G(r2) might be missing in G(r1). Therefore, we define an inference index to decide whether r2 is a sub-relation of r1 as

infer(r1 | r2) = |G(r1) ∩ G(r2)| / |G(r2)|, (3.1)

and r2 is defined as a sub-relation of r1 if and only if infer(r1 | r2) > δsub, where δsub is a threshold. We augment G(r1) to become G(r1) ∪ G(r2) if r2 is a sub-relation of r1. There are two possible scenarios where r1 can borrow positive samples from r2, as shown in Fig. 3.3.
In the first scenario, r2 is a sub-relation of r1, but r1 is not a sub-relation of r2. An example is ‘award_nominee’ and ‘award_winner’. If a person is the ‘award_winner’ of some award, that person is also likely to be the ‘award_nominee’ for the same award. In this scenario, G(r1) is augmented to be G(r1) ∪ G(r2), but G(r2) stays the same. In the second scenario, r2 is a sub-relation of r1, and r1 is also a sub-relation of r2. Then, r1 and r2 are either duplicates, or they form a symmetric reciprocal pair, such as relation ‘friend_of’ and its inverse. In this case, r1 and r2 will share the same positive training set G(r1) ∪ G(r2).
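The inference index of Eq. (3.1) and the resulting augmentation can be sketched directly on sets of (h, t) pairs (a minimal illustration under the definitions above, not the released code):

```python
def infer_index(G_r1, G_r2):
    """Eq. (3.1): fraction of r2's entity pairs also observed under r1."""
    return len(G_r1 & G_r2) / len(G_r2)

def augment(G_r1, G_r2, delta_sub=0.8):
    """Augment G(r1) with G(r2) when r2 qualifies as a sub-relation of r1."""
    if infer_index(G_r1, G_r2) > delta_sub:
        return G_r1 | G_r2
    return G_r1

# Toy example with hypothetical award relations: only half of the 'winner'
# pairs are also observed as 'nominee' pairs, so no augmentation occurs at
# delta_sub = 0.8.
nominee = {("A", "Oscar"), ("B", "Oscar"), ("C", "Emmy")}
winner = {("A", "Oscar"), ("D", "Emmy")}
```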
(a) r2 is the sub-relation of r1. (b) r1 and r2 form a symmetric reciprocal relation pair.
Figure 3.3: Illustrations of sub-relation scenarios.
Negative Sampling based on Relation Constraints. Generating negative samples is challenging
for KGs since there are only observed positive triples. Naive negative sampling [16] generates negative
samples by corrupting the tail entity (or the head entity) of an observed sample (h, t) with a random entity.
Naive negative sampling is defined as

N(r)_naive(h, t) = {(h, t′) | t′ ∈ E, (h, t′) ∉ G(r)}. (3.2)

Yet, negative samples in N(r)_naive do not carry much semantics. For example, in Fig. 3.1, generating a naive negative sample for Hulk in relation film_genre might yield (Hulk, Tobey Maguire), which is trivial. It does not contribute much for the models to predict movie genres. Instead, we look for a negative sample like (Hulk, Romance Film), which is more informative than the previous negative sample in movie genre prediction. Based on this observation, corrupted tail entities can be drawn from different subsets of entities for different relations based on the relation constraints. More specifically, only entities that have been observed in G(r), i.e., the range of relation r, are considered:

range(r) = {t | ∃h ∈ E, (h, t) ∈ G(r)}. (3.3)
A similar idea was mentioned in [98], where it was called the local-closed world assumption (LCWA).
LCWA assumes that head and tail entities of a specific relation are constrained by entity types. The type
information in KGs is often missing [80], so they use the set of existing head and tail entities, e.g., relation
range, as the constraint to generate negative samples. An obvious drawback of negative sampling based
on LCWA is that it is likely to generate false negative samples. To mitigate this problem, we define the
co-occurrence between two tail entities as the number of common heads in G(r). Formally, the co-occurrence between two tail entities, t and t′, is defined as:

co-occur(t, t′) = |{h | (h, t) ∈ G(r) ∧ (h, t′) ∈ G(r)}|. (3.4)

When the two tail entities t and t′ frequently co-occur, it is likely that (h, t′) is a false negative given (h, t) ∈ G(r). Therefore, to generate a negative sample based on a positive sample, (h, t), we exclude corrupted entities t′ with co-occur(t, t′) larger than a threshold, δrcwc. As a result, the range-constrained with co-occurrence (rcwc) negative sampling can be formulated as

N(r)_rcwc(h, t) = {(h, t′) | t′ ∈ range(r), (h, t′) ∉ G(r)} \ {(h, t′) | co-occur(t, t′) > δrcwc}. (3.5)
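Putting Eqs. (3.3)-(3.5) together, rcwc negative sampling can be sketched as follows (an illustrative re-implementation under the stated definitions; the helper names are hypothetical):

```python
from collections import defaultdict

def rcwc_negatives(h, t, G_r, delta_rcwc=1):
    """Eq. (3.5): corrupt the tail within range(r), excluding observed pairs
    and tails that co-occur with t more than delta_rcwc times (Eq. (3.4))."""
    range_r = {t2 for _, t2 in G_r}                     # range(r), Eq. (3.3)
    heads_of = defaultdict(set)                         # tail -> set of heads
    for h2, t2 in G_r:
        heads_of[t2].add(h2)
    return {
        (h, t2) for t2 in range_r
        if (h, t2) not in G_r                           # unobserved pair
        and len(heads_of[t] & heads_of[t2]) <= delta_rcwc  # low co-occurrence
    }

# Toy film_genre facts: candidate tails come from the relation range only,
# never from unrelated entities such as actors.
G = {("Hulk", "Superhero"), ("Hulk", "Action"),
     ("Titanic", "Romance"), ("Up", "Animation")}
negs = rcwc_negatives("Hulk", "Superhero", G)
```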
3.2.2 Encoding Entities
Semantic representations for entities often carry rich information and thus, they are suitable to be the input
features for classifiers. In distance-based KG embedding models, similar entities are likely to be clustered
together in the vector space due to the design of the scoring functions. The clustering effect provides clear
decision boundaries for classifiers. Therefore, we select two distance-based KG embedding models, TransE
[16] in the real space and RotatE [172] in the complex space, to encode entities. The distance functions are
shown below:
• TransE [16]:
dr(h, t) = ∥h + r − t∥, (3.6)
• RotatE [172]:
dr(h, t) = ∥h ◦ r − t∥². (3.7)
The entity embeddings are then trained with the negative log-likelihood loss as in [172]:

L_emb = − log σ(γ − dr(h, t)) − Σ_{i=1}^{N} p(h′_i, r, t′_i) log σ(dr(h′_i, t′_i) − γ), (3.8)

where σ(·) is the sigmoid function, γ is a margin, (h′_i, r, t′_i) is the i-th negative sample, and p(h′_i, r, t′_i) is the coefficient in self-adversarial negative sampling [172].
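Under these definitions, the TransE distance and the loss of Eq. (3.8) can be sketched in NumPy; for simplicity, uniform weights replace the self-adversarial coefficients p(h′_i, r, t′_i):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def transe_dist(h, r, t):
    """Eq. (3.6): d_r(h, t) = ||h + r - t||."""
    return np.linalg.norm(h + r - t)

def embedding_loss(pos, negs, r, gamma=6.0):
    """Eq. (3.8) with uniform weights p = 1/N over negative samples."""
    h, t = pos
    loss = -np.log(sigmoid(gamma - transe_dist(h, r, t)))
    for h_n, t_n in negs:
        loss -= np.log(sigmoid(transe_dist(h_n, r, t_n) - gamma)) / len(negs)
    return loss

rng = np.random.default_rng(0)
h, r, t = (rng.standard_normal(8) for _ in range(3))
negs = [(rng.standard_normal(8), rng.standard_normal(8)) for _ in range(4)]
loss = embedding_loss((h, t), negs, r)
assert loss > 0  # both terms are negative log-probabilities
```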
3.2.3 Training Relational Classifiers
Given a relation, there are two possible outcomes between a head entity and a tail entity; namely, the
relation exists or does not exist. Thus, a binary classifier can be used to predict the probability for a link
to exist in an entity pair. Typically, the number of negative samples is much higher than the number of
positive samples in KGs, causing the problem of imbalanced training data.
Ensemble and tree-based classifiers can be used to solve pairwise matching problems and handle imbalanced data [37, 55]. A powerful tree-boosting classifier, XGBoost [31], is chosen as the relation classifier
in this work. XGBoost is a scalable gradient tree-boosting system that attempts to optimize the k-th tree
estimator fk(.) so as to minimize
L(k) = Σ_{i=1}^{N} l(y_i, ŷ_i^(k−1) + f_k([h_i; t_i])) + Ω(f_k), (3.9)

where Ω(f_k) is a regularization term and

ŷ_i^(k) = ŷ_i^(k−1) + f_k([h_i; t_i])

is the prediction at the k-th iteration. We adopt the binary cross-entropy loss commonly used in logistic regression:

l(y_i, ŷ_i^(k)) = −y_i log(σ(ŷ_i^(k))) − (1 − y_i) log(1 − σ(ŷ_i^(k))). (3.10)

The final prediction ŷ_i becomes

ŷ_i = σ(Σ_{k=1}^{K} f_k([h_i; t_i])). (3.11)
Self-Adversarial Negative Sampling. We investigate the strategy of providing the classifiers with easy negative samples in the early training stage and hard negative samples in the later training stage via self-adversarial negative sampling, denoted as Nadv. For the initial estimators in XGBoost, we provide the classifier
with Nnaive or Nrcwc to build up its basic knowledge. As the classifier acquires some basic information, we
collect negative samples in Nnaive or Nrcwc that are mis-classified by the initial estimators to form Nadv.
Self-adversarial negative samples are used to train estimators in the later iterations to correct the mistakes
made by the initial estimators. In general, Nadv ⊆ Nrcwc ⊆ Nnaive. Thus, harder negative samples are
gradually given in the training process under self-adversarial negative sampling.
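The staged procedure above amounts to filtering the initial negative pool by the current classifier's mistakes; a schematic version, with a plain scoring callable in place of the partially trained XGBoost model, is:

```python
def self_adversarial_negatives(classifier, negatives, threshold=0.5):
    """Collect negatives that the initial estimators mis-classify as
    positive; these form N_adv for the later boosting iterations."""
    return {pair for pair in negatives if classifier(pair) > threshold}

# Toy scorer: the initial model wrongly assigns a high score to one 'hard'
# negative pair, so only that pair survives the filtering step.
scores = {("Hulk", "Action Film"): 0.9, ("Hulk", "Tobey Maguire"): 0.1}
n_adv = self_adversarial_negatives(scores.get, scores.keys())
```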
LCWA-based Prediction. We find that LCWA [98] is beneficial for link prediction. When predicting
missing links for relation r in the inference stage, we only consider entities in range(r). For example,
in Fig. 3.1, we do not consider Tobey Maguire when predicting tails for relation film_genre because Tobey
Maguire has never appeared as a movie genre. It is worthwhile to point out that not all relations satisfy
LCWA. To address it, we further define a local-closed world (lcw) index to estimate whether a relation
satisfies LCWA. To calculate lcw(r), we split G(r) into K stratified folds, iterate through each fold, and accumulate the number of samples whose tail entities do not appear in the other folds. lcw(r) is equal to one minus the accumulated error rate. In general, a larger K will result in a more accurate estimation.
For a triple (h_i, r, t_i), the final scoring function of link prediction can be written as

f_r(h_i, t_i) = 1{t_i ∈ range(r)} · ŷ_i,  if lcw(r) > δlcw,
f_r(h_i, t_i) = ŷ_i,  otherwise, (3.12)

where δlcw is a threshold and 1{·} is an indicator function.
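The K-fold lcw estimate and the gated prediction of Eq. (3.12) can be sketched as follows (an illustrative reading of the procedure; the fold split here is a simple slicing rather than a stratified one):

```python
def lcw_index(tails, K=5):
    """Split the tail list of G(r) into K folds; lcw(r) is one minus the
    rate of tails that never appear outside their own fold."""
    folds = [tails[i::K] for i in range(K)]
    errors = sum(
        1 for i, fold in enumerate(folds) for t in fold
        if t not in {t2 for j, f in enumerate(folds) if j != i for t2 in f}
    )
    return 1.0 - errors / len(tails)

def final_score(y_hat, tail, rel_range, lcw, delta_lcw=0.8):
    """Eq. (3.12): zero out candidates outside range(r) when lcw(r) is high."""
    if lcw > delta_lcw:
        return y_hat if tail in rel_range else 0.0
    return y_hat

# A relation whose tails repeat across folds (e.g., a small genre set) gets
# a high lcw index, enabling range-constrained prediction.
```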
3.3 Experiments
3.3.1 Experimental Setup
Datasets. We evaluate the proposed KGBoost method on four widely used link prediction datasets, WN18
[16], WN18RR [42], FB15K [16], and FB15k-237 [182]. The statistics of the datasets are summarized in Table 3.1. WN18 and WN18RR are extracted from WordNet [127], a lexical database with conceptual entities and relations. Inverse relations are removed from WN18 to form WN18RR. FB15K and FB15k-237 are extracted from Freebase [14], an instance-level knowledge base. Near-duplicate and inverse relations are removed from FB15K to form FB15k-237.
Table 3.1: Statistics of link prediction datasets.
Dataset #ent #rel #triples (train / valid / test)
WN18 40,943 18 141,442 / 5,000 / 5,000
WN18RR 40,943 11 86,835 / 3,034 / 3,134
FB15K 14,691 1,345 483,142 / 50,000 / 59,071
FB15k-237 14,541 237 272,115 / 17,535 / 20,466
Training Details. We select the optimal hyper-parameters for the XGBoost classifiers from the following search values:
• number of estimators: 300, 500, 1000, 1500.
• max depth: 3, 5, 7, 10.
• learning rate: 0.01, 0.05, 0.1.
The other hyper-parameters are selected from the following search values:
• negative sampling size: 8, 16, 32, 64.
• δsub: 0.6, 0.7, 0.8, 0.9.
• δlcw: 0.6, 0.7, 0.8, 0.9.
All hyper-parameters are determined based on the MRR in the validation set. Hyper-parameters being
used in FB15K and FB15k-237 are marked in boldface, and in WN18 and WN18RR are marked with an
underline. The hyper-parameter, K, used to estimate the local-closed world index is set to 5 for all datasets.
We follow the embedding dimension settings in [172], which is 500 for WN18 and WN18RR and 1,000 for
FB15K and FB15k-237. When training the XGBoost classifiers, the first 500 estimators are trained with
Nrcwc. Mis-classified negative samples in Nrcwc are collected and used to train the rest of the estimators.
KGBoost using TransE as the entity embedding is denoted as KGBoost-T and is denoted as KGBoost-R
using RotatE as the entity embedding. We assign an individual classifier to each relation as well as its
inverse relation to have maximum flexibility.
Evaluation Metrics. We evaluate the results using MR (Mean Rank), MRR (Mean Reciprocal Rank),
and Hits@k (k=1, 3, and 10) under the filtered setting [16]. That is, testing triples are ranked against
all candidate triples that are not in the training, validation, or testing set. Candidate triples are created by
corrupting the tail entities in the testing triples. Since we assign classifiers for inverse relations, corrupting
head entities is not required.
3.3.2 Experimental Results
Link prediction results for FB15K and WN18 are shown in Table 3.2, and results for FB15k-237 and WN18RR
are shown in Table 3.3. KGBoost outperforms all state-of-the-art models on both datasets extracted from
Freebase, FB15K and FB15k-237, since most of the relations in instance-level knowledge graphs have a
fixed subset of tail entities, i.e., range. For example, the relation film_genre has a fixed set of tail entities
that contain only 123 movie genres, while there are 14,541 entities in the knowledge base. Therefore,
LCWA-based link prediction can help the model to rule out most of the irrelevant candidate triples. As
the irrelevant candidate triples are ruled out, rcwc negative sampling N(r)_rcwc is used to produce effective
negative samples that can help the classifiers to separate the true triples from other candidate triples.
In WN18 and WN18RR, the conceptual relations do not have a fixed tail entity subset. Therefore,
Nrcwc ≈ Nnaive and is not as effective as in FB15K and FB15k-237. However, self-adversarial negative
samples Nadv are still able to iteratively provide negative samples that have been previously misclassified
by the classifiers. KGBoost achieves state-of-the-art performance on WN18RR and comparable results to
the state-of-the-art models on WN18.
In addition, TransE is known to have difficulties modeling symmetric relations [113, 172, 206] because
embeddings for symmetric relations tend to be zero vectors to minimize the scoring function. This is clearly
a shortcoming of modeling all triples with a single scoring function regardless of the relation patterns. On
the contrary, each relation is modeled by a binary classifier in KGBoost-T, so the symmetric relation pattern
could possibly be modeled. As a result, KGBoost-T has significant improvements over TransE on FB15K
and WN18, which contain many symmetric and inverse relations.
FB15K WN18
MR MRR H@1 H@3 H@10 MR MRR H@1 H@3 H@10
TransE [16] - 0.463 0.297 0.578 0.749 - 0.495 0.113 0.888 0.943
DistMult [222] 42 0.798 - - 0.893 655 0.797 - - 0.946
ComplEx [184] - 0.692 0.599 0.759 0.840 - 0.941 0.936 0.945 0.947
ConvE [42] 51 0.657 0.558 0.723 0.831 374 0.943 0.935 0.946 0.956
RotatE [172] 40 0.797 0.746 0.830 0.884 309 0.949 0.944 0.952 0.959
KGBoost-T (Ours) 15 0.811 0.739 0.867 0.915 189 0.820 0.703 0.936 0.951
KGBoost-R (Ours) 16 0.817 0.751 0.868 0.914 131 0.939 0.929 0.946 0.955
Table 3.2: Link prediction results on FB15K and WN18.
FB15k-237 WN18RR
MR MRR H@1 H@3 H@10 MR MRR H@1 H@3 H@10
TransE [16] 357 0.294 - - 0.465 3384 0.226 0.014 0.402 0.501
DistMult [222] 254 0.241 0.155 0.263 0.419 5110 0.43 0.39 0.44 0.49
ComplEx [184] 339 0.247 0.158 0.275 0.428 5261 0.44 0.41 0.46 0.51
ConvE [42] 244 0.325 0.237 0.356 0.501 4187 0.43 0.40 0.44 0.52
RotatE [172] 177 0.338 0.241 0.375 0.533 3340 0.476 0.428 0.492 0.571
SACN [165] - 0.35 0.26 0.39 0.54 - 0.47 0.43 0.48 0.54
InteractE [186] 172 0.354 0.263 - 0.535 5202 0.463 0.43 - 0.528
KGBoost-T (Ours) 78 0.426 0.335 0.462 0.608 2405 0.265 0.062 0.445 0.544
KGBoost-R (Ours) 77 0.425 0.336 0.460 0.606 2476 0.478 0.436 0.493 0.560
Table 3.3: Link prediction results on FB15k-237 and WN18RR.
However, the performance of KGBoost is dependent on the quality of input entity features, i.e., entity
embeddings. Since it is difficult for TransE to predict the exact match (H@1) in WN18RR, it is also challenging for KGBoost-T when the TransE feature is used. As a result, KGBoost-T has a low H@1 in WN18RR
as compared with others. However, KGBoost-T is able to show consistent performance improvement over
TransE on all metrics and improves H@1 from 0.014 to 0.062.
Embedding Dimension and Performance. As pointed out in [42], distance-based KG embedding
models, such as TransE and RotatE, require a high embedding dimension for model expressiveness. However,
with the modularized design, KGBoost can reach similar performance under low- and high-dimensional
settings. We evaluate the performance of different models under different dimension settings in Fig. 3.4.
Figure 3.4: The performance curves as a function of the embedding dimension for (a) WN18 and (b) FB15k-237 in MRR.
Since RotatE is already able to achieve above 0.9 in H@1 for WN18, there is little room for KGBoost to improve. Consequently, KGBoost-R achieves performance similar to RotatE for this dataset. On the contrary, there is more room for KGBoost to enhance the performance of TransE, as verified by experiments. The performance improvement of KGBoost-T is even more obvious when the embedding dimension, d, is less than 200, where the performance of TransE starts to drop significantly. In FB15k-237, the performance of both TransE and RotatE drops significantly under the low-dimensional setting (d = 100). However, KGBoost-T and KGBoost-R maintain nearly the same performance as under high-dimensional settings.
For ConvE, the end-to-end training of embedding learning and link prediction leads to inconsistent
performance across different dimensional settings. For WN18, ConvE requires a large model size to be
expressive so that the performance under d = 50 is much worse than that under a higher dimensional
setting. For FB15k-237, an increasing embedding dimension yields fewer interactions between features in
the 2D convolutional kernels, so it performs worse when the embedding dimension is high. These results
show an advantage of adopting a modularized design to separate embedding learning and link prediction
to achieve consistent performance under different dimensional settings. In addition, KGBoost works better
under low-dimensional settings than distance-based knowledge graph embedding models.
                  RotatE            KGBoost-R
                  MRR     H@10      MRR     H@10
N_naive           0.295   0.480     0.307   0.479
N_naive + N_adv   0.338   0.533     0.354   0.532
N_rcwc            0.248   0.419     0.425   0.606
N_rcwc + N_adv    0.218   0.380     0.424   0.606
Table 3.4: Model performance for FB15k-237 under different negative sampling settings.
Negative Sampling and Self-Adversarial Training. We investigate how different negative sampling strategies affect the performance of RotatE and KGBoost-R in Table 3.4. N_adv indicates that a self-adversarial setting is adopted. Self-adversarial settings have different definitions in RotatE and KGBoost. In RotatE, self-adversarial training [172] assigns higher weights to the negative samples that have smaller margins than the positive samples in the loss function. In KGBoost, by contrast, self-adversarial negative sampling identifies previously misclassified negative samples and corrects them when training the boosting trees in later iterations. Despite the different definitions, both aim to provide hard negative samples based on the mistakes made in previous iterations. When the models are trained with naive negative samples, self-adversarial settings correct previous mistakes by placing more emphasis on borderline cases and boosting the performance of both models.
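The KGBoost-style self-adversarial loop described above can be sketched as follows, assuming a generic gradient-boosting classifier and toy random features in place of the actual triple features; all variable names and data are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=(200, 8))         # observed triples -> label 1
neg_pool = rng.normal(-1.0, 1.0, size=(1000, 8))  # candidate negatives

def train_round(negatives):
    X = np.vstack([pos, negatives])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(negatives))])
    return GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Round 1: naive uniformly sampled negatives.
neg = neg_pool[rng.choice(len(neg_pool), size=200, replace=False)]
clf = train_round(neg)

# Self-adversarial round: keep the negatives the current model scores highest
# (i.e., borderline/misclassified cases) as hard negatives for the next round.
scores = clf.predict_proba(neg_pool)[:, 1]
hard = neg_pool[np.argsort(-scores)[:200]]
clf = train_round(hard)
```

This mirrors the idea of re-emphasizing previously misclassified negatives between training rounds, not the exact XGBoost-internal procedure used by KGBoost.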
When training RotatE with rcwc negative samples, which are considered to carry more semantics than naive negative samples, the performance is even worse than with naive negative samples. This is because easy negative samples are important in the embedding learning process for grouping similar entities and separating irrelevant entities. In contrast, the modularized design of KGBoost takes pre-trained entity embeddings as input. Trivial negative samples are mostly separated from positive samples in the embedding space and are no longer required. Thus, hard negative samples result in a larger performance improvement.
As shown in Table 3.4, the performance is improved from 0.338 to 0.425 in MRR.
                              WN18RR   FB15k-237
Complete KGBoost               0.478     0.425
w.o. relation inference        0.469     0.327
w.o. rcwc negative sampling    0.475     0.307
w.o. LCWA-based prediction     0.476     0.219
Table 3.5: Ablation study evaluated in MRR.
However, when combining rcwc negative samples with self-adversarial negative sampling, the performance is nearly the same as when using rcwc negative samples alone. This is because both rcwc and self-adversarial negative sampling attempt to yield hard negative samples. For FB15k-237, rcwc negative sampling is already effective, as it limits the negative sample candidates to a small subset of entities. As a result, self-adversarial negative sampling is not effective in combination with rcwc negative sampling. In contrast, as shown in Table 3.4, self-adversarial negative sampling is effective in improving the quality of negative samples when naive negative sampling candidates are given.
Ablation Study. We evaluate how each module in KGBoost affects the performance in Table 3.5.
Relation inference is incorporated to facilitate first-order dependencies between relations and is able to
boost the performance from 0.327 to 0.425 for FB15k-237 and 0.469 to 0.478 for WN18RR in MRR. The
rcwc negative sampling incorporates relation constraints to generate negative samples with semantics. It
is able to boost the performance from 0.307 to 0.425 for FB15k-237 and 0.475 to 0.478 for WN18RR in MRR.
LCWA-based prediction filters out irrelevant candidate triples during testing. It boosts the performance
from 0.219 to 0.425 for FB15k-237 and 0.476 to 0.478 for WN18RR in MRR.
In general, KGBoost performs better on instance-level knowledge graphs, such as Freebase, than on knowledge bases with conceptual entities and relations, such as WordNet, because different relations in instance-level knowledge graphs have different constraints, e.g., relation ranges. KGBoost is able to make specific predictions for each relation based on relation constraints.
3.4 Conclusion and Future Work
In this work, we propose KGBoost, a knowledge base completion method with a modularized design to model the unique pattern of each relation. Different from previous KG embedding models, which use a single scoring function for all relations, we formulate link prediction in each relation as a binary classification problem and leverage XGBoost to predict missing links. Besides, range-constrained with co-occurrence (rcwc) negative sampling and self-adversarial negative sampling are proposed to generate effective negative samples.
Experimental results show that KGBoost not only outperforms state-of-the-art methods in link prediction
but also works well under low-dimensional settings.
In the future, we aim to extend KGBoost to predict missing links for emerging entities and relations.
Since KGs are constantly evolving, new entities and relations are frequently introduced to the knowledge base. When a new entity or relation is added, existing KG embedding models need to be re-trained on the entire KG. In KGBoost, each relation classifier is trained separately, and entity embeddings can
be pre-trained. As a result, KGBoost has the potential to handle emerging entities and relations and can
be extended to an inductive setting.
Chapter 4
A Lightweight Knowledge Graph Completion Method in Low
Dimensions
4.1 Introduction
Knowledge graph embedding (KGE) methods have been widely used to solve the incompleteness problem.
Embeddings for entities and relations are stored as model parameters and updated by maximizing triple
scores among observed triples while minimizing those among negative triples. The number of free parameters in a KGE model is linear in the embedding dimension and in the number of entities and relations in the KG, i.e., O((|E| + |R|)d), where |E| is the number of entities, |R| is the number of relations, and d is the embedding dimension. Since KGE models usually require a higher-dimensional embedding space for better reasoning capability, they need large model sizes (i.e., parameter counts) to achieve satisfactory performance, as demonstrated in Fig. 4.1. It is therefore challenging for them to handle large-scale KGs with many entities and relations on resource-constrained platforms such as mobile/edge computing. A KGC method with good reasoning capability in low dimensions is desired [100].
The requirement of high-dimensional embeddings for popular KGE methods comes from the oversimplified scoring functions [215]. Thus, classification-based KGC methods, such as ConvE [42], aim to
Figure 4.1: MRR versus the number of free parameters in KGE methods on FB15k-237 (left) and YAGO3-10 (right). When a model has fewer parameters, its performance is poorer. Also, the larger dataset, YAGO3-10, demands more parameters than the smaller FB15k-237 to achieve satisfactory results.
increase the reasoning capability in low dimensions by adopting neural networks (NNs) as powerful decoders. As a result, they scale parameters more efficiently than KGE models [42]. However, NNs demand longer inference time and more computation power due to their deep architectures. The long inference time of classification-based methods also limits their applicability to tasks that require real-time inference. Recently, DualDE [244] applied Knowledge Distillation (KD) [72] to train powerful low-dimensional embeddings. Yet, it demands three stages of embedding training: 1) training a high-dimensional KGE, 2) training a low-dimensional KGE with the guidance of the high-dimensional KGE, and 3) multiple rounds of student-teacher interactions. Its training process is time-consuming and may fail to converge when the embeddings are not well initialized.
Here, we propose a new KGC method that works well under low dimensions and name it GreenKGC.
GreenKGC consists of three modules: 1) representation learning, 2) feature pruning, and 3) decision learning. Each of them is trained independently. In Module 1, we leverage a KGE method, called the baseline
method, to learn high-dimensional entity and relation representations. In Module 2, a feature pruning
process is applied to the high-dimensional entity and relation representations to yield discriminant low-dimensional features for triples. In addition, we observe that some feature dimensions are more powerful than others for different relations. Thus, we group relations with similar discriminant feature dimensions for parameter savings and better performance. In Module 3, we train a binary classifier for each relation group so that it can predict a triple's score at inference time. The score is a soft prediction between 0 and 1, indicating the probability that a certain triple exists. Finally, we propose two novel negative sampling schemes, embedding-based and ontology-based, for classifier training in this chapter. They are used for hard negative mining, targeting hard negatives that cannot be correctly predicted by the baseline KGE methods.
We conduct extensive experiments and compare the performance and model sizes of GreenKGC with several representative KGC methods on link prediction datasets. Experimental results show that GreenKGC achieves good performance in low dimensions, i.e., 8, 16, and 32 dimensions, compared with SOTA low-dimensional methods. In addition, GreenKGC shows competitive or better performance compared to high-dimensional KGE methods with a much smaller model size. We also conduct experiments on a large-scale link prediction dataset with over 2.5M entities and show that GreenKGC performs well with much fewer model parameters. Ablation studies are also conducted to show the effectiveness of each module in GreenKGC.
4.2 Methodology
GreenKGC is presented in this section. It consists of three modules: representation learning, feature pruning, and decision learning, to obtain discriminant low-dimensional triple features and predict triple scores
accurately. An overview of GreenKGC is given in Fig. 4.2. Details of each module will be elaborated below.
Figure 4.2: An overview of GreenKGC, which consists of three modules: (a) representation learning, (b)
feature pruning, and (c) decision learning.
4.2.1 Representation Learning
We leverage existing KGE models, such as TransE [16] and RotatE [172], to obtain good initial embeddings
for entities and relations, where their embedding dimensions can be high to be expressive. Yet, the initial
embedding dimension will be largely reduced in the feature pruning module. In general, GreenKGC can
build upon any existing KGE models. We refer to the KGE models used in GreenKGC as our baseline
models.
To train the baseline KGE model for the initial entity and relation representations, we adopt the self-adversarial learning process in [172] and use this codebase∗. That is, given an observed triple (h, r, t) and the KGE model fr(h, t), we minimize the following loss function
\[ L = -\log\sigma(f_r(h, t)) - \sum_{i=1}^{n} p(h'_i, r, t'_i)\,\log\sigma(-f_r(h'_i, t'_i)), \tag{4.1} \]
∗ https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding
Model      n_e  n_r  n_v  f_r(h, t)
TransE      1    1    3   −∥h + r − t∥
DistMult    1    1    3   ⟨h, r, t⟩
ComplEx     2    2    6   Re(⟨h, r, t⟩)
RotatE      2    1    5   −∥h ◦ r − t∥²
Table 4.1: Popular KGE methods and their scoring functions, where h, r, and t denote the embeddings for a given triple (h, r, t) and d is the embedding dimension. ◦ denotes the Hadamard product, and ⟨·, ·, ·⟩ is the generalized dot product. n_e is the number of entity variables per dimension, n_r is the number of relation variables per dimension, and n_v is the number of triple variables per dimension; n_v = 2n_e + n_r.
where (h'_i, r, t'_i) is a negative sample and
\[ p(h'_j, r, t'_j) = \frac{\exp(\alpha f_r(h'_j, t'_j))}{\sum_{i=1}^{n} \exp(\alpha f_r(h'_i, t'_i))}, \tag{4.2} \]
where α is the temperature to control the self-adversarial negative sampling.
We summarize the scoring functions for some common KGE models and their corresponding number
of variables per dimension in Table 4.1. In general, GreenKGC can build upon any existing KGE models.
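As an illustration, the loss in Eq. (4.1) with the self-adversarial weights of Eq. (4.2) can be sketched in NumPy for the TransE scoring function of Table 4.1. The embeddings below are toy random vectors, not trained representations, and this is not the actual training code used in this thesis:

```python
import numpy as np

def transe_score(h, r, t):
    # f_r(h, t) = -||h + r - t|| from Table 4.1.
    return -np.linalg.norm(h + r - t, axis=-1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_adversarial_loss(h, r, t, neg_h, neg_t, alpha=1.0):
    """Eq. (4.1) with the negative-sample weights of Eq. (4.2)."""
    pos = transe_score(h, r, t)
    neg = transe_score(neg_h, r, neg_t)       # scores of the n negatives
    w = np.exp(alpha * neg)
    w = w / w.sum()                           # p(h'_i, r, t'_i), Eq. (4.2)
    return -np.log(sigmoid(pos)) - np.sum(w * np.log(sigmoid(-neg)))

rng = np.random.default_rng(0)
d = 16
h, r = rng.normal(size=d), rng.normal(size=d)
t = h + r                                     # a triple TransE scores perfectly
neg_h, neg_t = rng.normal(size=(8, d)), rng.normal(size=(8, d))
loss = self_adversarial_loss(h, r, t, neg_h, neg_t)
```

Higher-scoring (harder) negatives receive larger weights w, which is the self-adversarial effect described in [172].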
4.2.2 Feature Pruning
In this module, a small subset of the feature dimensions in the high-dimensional KG representations from Module 1 is preserved, while the others are pruned, to form low-dimensional discriminant KG features.
Discriminant Feature Test (DFT). DFT is a supervised feature selection method recently proposed
in [223]. All training samples have a high-dimensional feature set as well as the corresponding labels. DFT
scans through each dimension in the feature set and computes its discriminability based on sample labels.
DFT can be used to reduce the dimensions of entity and relation embeddings while preserving their power
in downstream tasks such as KGC.
Here, we extend DFT to the multivariate setting since there are multiple variables in each triple. For example, TransE [16] has 3 variables (i.e., h, r, and t) in each feature dimension. First, for each dimension i, we learn a linear transformation w_i to map the multiple variables [h_i, r_i, t_i] to a single variable x_i in each triple, where h_i, r_i, and t_i denote the i-th dimension of the head, relation, and tail representations, respectively. Such a linear transformation can be learned through principal component analysis (PCA) using singular value decomposition (SVD), in which case w_i is the first principal component. However, linear transformations learned from PCA are unsupervised and cannot separate observed triples from negatives well. Alternatively, we learn the linear transformation through logistic regression by minimizing the binary cross-entropy loss
\[ L = -\,y \log\big(\sigma(w_i [h_i, r_i, t_i]^{T})\big) - (1-y)\log\big(1 - \sigma(w_i [h_i, r_i, t_i]^{T})\big), \tag{4.3} \]
where y = 1 for observed triples (h, r, t) and y = 0 for corrupted triples (h', r, t'). Afterward, we can apply the standard DFT to each dimension.
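The per-dimension projection can be sketched as follows, assuming toy random embeddings and scikit-learn's LogisticRegression in place of the actual implementation; the random labels are stand-ins for observed vs. corrupted triples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d = 500, 32
# Toy stand-ins for head/relation/tail representations and triple labels.
H, R, T = (rng.normal(size=(n, d)) for _ in range(3))
y = rng.integers(0, 2, size=n)  # 1 = observed triple, 0 = corrupted triple

# For each dimension i, learn w_i mapping [h_i, r_i, t_i] to one variable x_i
# by minimizing the binary cross-entropy of Eq. (4.3).
X1d = np.empty((n, d))
for i in range(d):
    Xi = np.stack([H[:, i], R[:, i], T[:, i]], axis=1)  # shape (n, 3)
    w_i = LogisticRegression().fit(Xi, y).coef_[0]
    X1d[:, i] = Xi @ w_i  # projected 1D triple variable for dimension i
```

Each column of X1d is then scored by the standard DFT procedure described next.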
DFT adopts cross-entropy (CE) to evaluate the discriminant power of each dimension as CE is a typical
loss for binary classification. Dimensions with lower CE imply higher discriminant power. We preserve
the feature dimensions with the lowest CE and prune the remaining to obtain low-dimensional features.
To calculate the discriminant power, we iterate through each dimension of the high-dimensional feature set and evaluate it based on the sample labels. More specifically, we model KGC as a binary classification task. We assign label y_i = 1 to the i-th sample if it is an observed triple and y_i = 0 if it is a negative sample. For the d-th dimension, we split the 1D feature space into left and right subspaces and calculate the cross-entropy in the form of
\[ H^{(d)} = \frac{N_L H_L^{(d)} + N_R H_R^{(d)}}{N_L + N_R}, \tag{4.4} \]
where N_L and N_R are the numbers of samples in the left and right intervals, respectively,
\[ H_L^{(d)} = -P_{L,1} \log(P_{L,1}) - P_{L,0} \log(P_{L,0}), \tag{4.5} \]
\[ H_R^{(d)} = -P_{R,1} \log(P_{R,1}) - P_{R,0} \log(P_{R,0}), \tag{4.6} \]
and where P_{L,1} = \frac{1}{N_L} \sum_{i=1}^{N_L} y_i, P_{L,0} = 1 - P_{L,1}, and similarly for P_{R,1} and P_{R,0}. A lower cross-entropy value implies higher discriminant power.
(a) Cross-entropy = 0.7348 (b) Cross-entropy = 0.9910
Figure 4.3: Histograms of PCA-transformed 1D triple variables in two feature dimensions with (a) low and
(b) high cross-entropy.
Fig. 4.3 shows histograms of linearly transformed 1D triple variables in two different feature dimensions. As seen in the figure, the samples in Fig. 4.3(a), i.e., the feature dimension with the lower cross-entropy, are more separable than those in Fig. 4.3(b), i.e., the feature dimension with the higher cross-entropy. Therefore, a lower cross-entropy implies a more discriminant feature dimension.
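The interval-based cross-entropy of Eqs. (4.4)-(4.6) can be sketched as follows; the threshold grid, bin count, base-2 logarithm, and toy 1D features are illustrative assumptions rather than the exact implementation of [223]:

```python
import numpy as np

def binary_entropy(y):
    """Entropy (base 2) of binary labels in one interval; 0 for empty/pure sets."""
    if len(y) == 0:
        return 0.0
    p1 = float(np.mean(y))
    return -sum(p * np.log2(p) for p in (p1, 1.0 - p1) if p > 0)

def dft_cross_entropy(x, y, n_bins=16):
    """Minimum weighted entropy (Eq. 4.4) over candidate split points
    of a 1D feature x with binary labels y."""
    thresholds = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]
    best = np.inf
    for thr in thresholds:
        left, right = y[x <= thr], y[x > thr]
        h = (len(left) * binary_entropy(left)
             + len(right) * binary_entropy(right)) / len(y)
        best = min(best, h)
    return best

rng = np.random.default_rng(0)
y = np.r_[np.ones(300), np.zeros(300)]
x_good = y + 0.3 * rng.normal(size=600)  # separable -> low cross-entropy
x_bad = rng.normal(size=600)             # uninformative -> high cross-entropy
ce_good, ce_bad = dft_cross_entropy(x_good, y), dft_cross_entropy(x_bad, y)
```

The separable dimension receives a lower cross-entropy, matching the behavior illustrated in Fig. 4.3.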
KG partitioning. Given that relations in KGs can differ in nature (e.g., symmetric vs. asymmetric, or films vs. sports), a small subset of feature dimensions might not be discriminant for all relations. Thus, we first partition relations into disjoint relation groups, where the relations in each group have similar properties. Then, we perform feature pruning within each relation group and select the powerful feature dimensions accordingly.
We hypothesize that relations with similar properties are close in the embedding space. Therefore, we use k-Means to cluster the relation embeddings into relation groups. To verify our hypothesis, we show the grouping results on WN18RR in Table 4.2. Without explicitly categorizing relations into different logical patterns, relations of similar patterns are clustered together in the embedding space. For example, most relations in cluster #0 are symmetric. All relations in cluster #1 are N-to-1. The two remaining relations in cluster #2 are 1-to-N with the highest tail-per-head ratios. While we observe cardinality-based grouping for relations in WN18RR, which mostly contains abstract concepts, for FB15k-237 and YAGO3-10 relations with similar semantic meanings are often grouped together after KG partitioning.
Cluster #  Relations
0          _derivationally_related_form, _also_see, _member_meronym,
           _has_part, _verb_group, _similar_to
1          _hypernym, _instance_hypernym, _synset_domain_topic_of
2          _member_of_domain_usage, _member_of_domain_region
Table 4.2: Relation grouping results on WN18RR when applying k-Means on relation embeddings with k = 3.
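The grouping step can be sketched with scikit-learn's k-Means on toy relation embeddings (the real embeddings come from Module 1; the random vectors below are stand-ins):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for the 11 WN18RR relation embeddings (d = 32).
rel_emb = rng.normal(size=(11, 32))

k = 3  # chosen per dataset; k = 3 for WN18RR in this chapter
groups = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(rel_emb)
# groups[i] is the relation group of relation i; feature pruning is then
# performed separately within each group.
```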
Figure 4.4: t-SNE visualization of the KG partitioning result in FB15k-237.
Figure 4.5: Average cross-entropy for different numbers of KG partitions in FB15k-237.
To further verify the idea of relation clusters in the embedding space for KG partitioning, we show the t-SNE visualization of relation embeddings in FB15k-237 in Fig. 4.4. Relations within the same cluster are assigned the same color. We indeed observe a clustering structure in the t-SNE plot.
Furthermore, we evaluate how different numbers of relation groups, k, affect the feature pruning process. In Fig. 4.5, as a lower CE reflects more discriminant features, we obtain more powerful features when k becomes larger, i.e., when partitioning the KG into more relation groups. Thus, for each dataset, we select the optimal k where the average CE starts to converge. We elaborate on the high-level intuition behind why combining feature pruning with KG partitioning works for KGE models. First, KGE models are isotropic, meaning each dimension can be handled by DFT independently. Second, some feature dimensions are more powerful than others for different relations. Thus, we group relations with the same discriminant feature dimensions for parameter savings.
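The selection of k from the average-CE curve can be sketched as a simple convergence test; the curve values below are illustrative stand-ins for a curve shaped like Fig. 4.5, not measured numbers:

```python
def select_k(avg_ce_by_k, tol=0.01):
    """Pick the smallest k whose improvement in average cross-entropy over
    the previous candidate falls below tol (the curve starts to converge)."""
    ks = sorted(avg_ce_by_k)
    for prev, cur in zip(ks, ks[1:]):
        if avg_ce_by_k[prev] - avg_ce_by_k[cur] < tol:
            return prev
    return ks[-1]

# Hypothetical curve shaped like Fig. 4.5; the values are illustrative only.
curve = {1: 0.80, 2: 0.72, 3: 0.66, 5: 0.62, 10: 0.615, 20: 0.613}
chosen_k = select_k(curve)
```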
4.2.3 Decision Learning
We formulate KGC as a binary classification problem in each relation group. We adopt binary classifiers as
decoders since they are more powerful than simple scoring functions. The binary classifiers take pruned
triple features as inputs and predict soft probabilities (between 0 and 1) of triples as outputs. We also
conduct classifier training with hard negative mining so as to train a powerful classifier.
Binary classification. The binary classifiers, g(·), take a low-dimensional triple feature x and predict a soft label ŷ = g(x) ∈ [0, 1]. The label y = 1 for the observed triples and y = 0 for the sampled negatives. We train a binary classifier by minimizing the following negative log-likelihood loss:
\[ l(y, \hat{y}) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y}). \tag{4.7} \]
In general, we select a nonlinear classifier to accommodate nonlinearity in sample distributions.
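A minimal sketch of decision learning with a gradient-boosting classifier on toy pruned features follows; the hyperparameters here are placeholders, not the tuned values reported in Sec. 4.3.1:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
d = 32  # pruned triple-feature dimension
X = np.vstack([rng.normal(0.5, 1.0, size=(300, d)),    # observed triples
               rng.normal(-0.5, 1.0, size=(300, d))])  # sampled negatives
y = np.r_[np.ones(300), np.zeros(300)]

# Nonlinear decoder g trained with the log-likelihood loss of Eq. (4.7).
clf = GradientBoostingClassifier(max_depth=3, n_estimators=100,
                                 learning_rate=0.1, random_state=0).fit(X, y)
soft_labels = clf.predict_proba(X[:5])[:, 1]  # soft predictions in [0, 1]
```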
Negative sampling. Combining KGE with classifiers is non-trivial because it is challenging to obtain high-quality negative samples for classifier training, given that negative samples are not explicitly labeled in KGs. Therefore, it is desirable to mine hard negative cases for the baseline KGE models so as to train a powerful classifier. We propose two negative sampling schemes for classifier training. First, most KGE models can only capture coarse entity-type information. For example, they may predict a location given the query (Mary, born_in, ?) yet without the exact answer. Thus, we draw negative samples within the entity types constrained by relations [98] to enhance the capability to predict the exact answer. Such a negative sampling scheme is called ontology-based negative sampling. We also investigate sampling hard negatives that cannot be trivially obtained from the original KGE methods. Negatives with higher embedding scores f_r(h_i, t_i) tend to be predicted wrongly by the baseline methods. To handle this, we rank all randomly sampled negative triples and select the ones with higher embedding scores as hard negatives for classifier training. Such a negative sampling strategy is called embedding-based negative sampling.
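Embedding-based negative sampling can be sketched as follows, assuming TransE as the baseline scorer and toy entity embeddings; the pool sizes are illustrative:

```python
import numpy as np

def transe_score(h, r, t):
    # Baseline scorer f_r(h, t) = -||h + r - t|| (Table 4.1).
    return -np.linalg.norm(h + r - t, axis=-1)

rng = np.random.default_rng(0)
n_ent, d = 1000, 16
ent = rng.normal(size=(n_ent, d))  # toy entity embeddings
h, r = ent[0], rng.normal(size=d)

# Score randomly corrupted tails with the baseline KGE and keep the
# top-scoring (hardest) ones as negatives for classifier training.
cand = rng.choice(n_ent, size=200, replace=False)
scores = transe_score(h, r, ent[cand])
hard_tails = cand[np.argsort(-scores)[:20]]  # the 20 hardest negative tails
```

Ontology-based sampling would instead restrict `cand` to entities of the type admitted by the relation's range before scoring.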
Dataset # ent. # rel. # triples (train / valid / test)
WN18RR 40,943 11 86,835 / 3,034 / 3,134
FB15k-237 14,541 237 272,115 / 17,535 / 20,466
YAGO3-10 123,143 37 1,079,040 / 4,978 / 4,982
ogbl-wikikg2 2,500,604 535 16,109,182 / 429,456 / 598,543
Table 4.3: Dataset statistics.
4.3 Experiments
4.3.1 Experimental Setup
Datasets. We consider four link prediction datasets for performance benchmarking: FB15k-237 [16, 182],
WN18RR [16, 42], YAGO3-10 [42], and ogbl-wikikg2 [76]. Their statistics are summarized in Table 4.3.
FB15k-237 is a subset of Freebase [14] that contains real-world relationships. WN18RR is a subset of
WordNet [127] containing lexical relationships between word senses. YAGO3-10 is a subset of YAGO3
[122] that describes the attributes of persons. ogbl-wikikg2 is extracted from Wikidata [190], capturing the different types of relations between entities in the world. Among the four, ogbl-wikikg2 is a large-scale dataset with more than 2.5M entities.
Implementation details. We adopt TransE [16] and RotatE [172] as the baseline models and learn 500-dimensional initial representations for entities and relations. The feature dimensions are then reduced in the feature pruning process. We use GreenKGC with RotatE as the baseline in all ablation studies. To partition the KG, we determine the number of groups k for each dataset as the point where the average cross-entropy of all feature dimensions converges. As a result, k = 3 for WN18RR, k = 5 for FB15k-237 and YAGO3-10, and k = 20 for ogbl-wikikg2.
For decision learning, we consider several tree-based binary classifiers, including Decision Trees [19],
Random Forest [18], and Gradient Boosting Machines [31], as they match the intuition of the feature
pruning process and can accommodate non-linearity in the sample distribution. The hyperparameters are
searched among: tree depth l ∈ {3, 5, 7}, number of estimators n ∈ {400, 800, 1,200, 1,600, 2,000}, and
FB15k-237 WN18RR YAGO3-10
Model MRR H@1 H@3 H@10 MRR H@1 H@3 H@10 MRR H@1 H@3 H@10
KGE Methods
TransE [16] 0.270 0.177 0.303 0.457 0.150 0.009 0.251 0.387 0.324 0.221 0.374 0.524
RotatE [172] 0.290 0.208 0.316 0.458 0.387 0.330 0.417 0.491 0.419 0.321 0.475 0.607
Classification-based Methods
ConvKB [134] 0.232 0.157 0.255 0.377 0.346 0.300 0.374 0.422 0.311 0.194 0.368 0.526
ConvE [42] 0.282 0.201 0.309 0.440 0.405 0.377 0.412 0.453 0.361 0.260 0.396 0.559
Low-dimensional Methods
MuRP [11] 0.323 0.235 0.353 0.501 0.465 0.420 0.484 0.544 0.230 0.150 0.247 0.392
AttH [26] 0.324 0.236 0.354 0.501 0.466 0.419 0.484 0.551 0.397 0.310 0.437 0.566
DualDE [244] 0.306 0.216 0.338 0.489 0.468 0.419 0.486 0.560 - - - -
TransE + GreenKGC (Ours) 0.331 0.251 0.356 0.493 0.342 0.300 0.365 0.413 0.362 0.265 0.408 0.537
RotatE + GreenKGC (Ours) 0.345 0.265 0.369 0.507 0.411 0.367 0.430 0.491 0.453 0.361 0.509 0.629
Table 4.4: Results of link prediction in low dimensions (d = 32), where the best and the second best
numbers are in bold and with an underbar, respectively.
learning rate lr ∈ {0.05, 0.1, 0.2}. The best settings are chosen based on MRR in the validation set. As a
result, we adopt Gradient Boosting Machine for all datasets. l = 5, n = 1200, lr = 0.2 for FB15k-237 and
YAGO3-10, l = 3, n = 1600, lr = 0.1 for WN18RR, and l = 7, n = 2000, lr = 0.05 for ogbl-wikikg2. We
adopt ontology-based negative sampling to train classifiers for FB15k-237, YAGO3-10, and ogbl-wikikg2,
and embedding-based negative sampling for WN18RR. Baseline KGEs are trained on NVIDIA Tesla P100
GPUs and binary classifiers are trained on AMD EPYC 7542 CPUs.
Evaluation metrics. For the link prediction task, the goal is to predict the missing entity given a
query triple, i.e. (h, r, ?) or (?, r, t). The correct entity should be ranked higher than other candidates.
Here, several common ranking metrics are used, such as MRR (Mean Reciprocal Rank) and Hits@k (k=1,
3, 10). Following the convention in [16], we adopt the filtered setting, where all entities serve as candidates except those that already form observed triples with the query in the training, validation, or testing sets.
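The filtered ranking protocol can be sketched as follows; the scores are toy numbers, and `known_true` stands for the entities forming observed triples with the query:

```python
import numpy as np

def filtered_rank(scores, answer, known_true):
    """Rank of `answer` among all candidate entities after filtering out the
    other entities known to form true triples with the query."""
    s = scores.astype(float).copy()
    filt = [e for e in known_true if e != answer]
    s[np.array(filt, dtype=int)] = -np.inf  # removed from the candidate list
    return 1 + int((s > s[answer]).sum())

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
rank = filtered_rank(scores, answer=2, known_true=[0, 1])
mrr = 1.0 / rank          # reciprocal rank of this single query
hits_at_1 = float(rank <= 1)
```

MRR and Hits@k for a test set are the averages of these per-query quantities.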
4.3.2 Main Results
Results in low dimensions. In Table 4.4, we compare GreenKGC with KGE, classification-based, and low-dimensional KGE methods in low dimensions, i.e., d = 32. Results for the other methods in Table 4.4 are either taken directly from [26, 244] or, if not reported there, obtained by training the models ourselves using publicly available implementations with the hyperparameters suggested in the original papers. KGE methods cannot achieve good performance in low dimensions due to their over-simplified scoring functions. Classification-based methods perform better than KGE methods as they adopt NNs as complex decoders. Low-dimensional KGE methods provide state-of-the-art KGC solutions in low dimensions. Yet, GreenKGC outperforms them on FB15k-237 and YAGO3-10 in all metrics. For WN18RR, the baseline KGE methods perform poorly in low dimensions; since GreenKGC is built upon KGEs, this affects its performance on WN18RR. Thus, GreenKGC is more suitable for instance-based KGs, such as Freebase and YAGO, while hyperbolic KGEs, such as MuRP and AttH, model concept-based KGs, such as WordNet, well.
We show the performance curves of various methods as a function of the embedding dimension in Fig. 4.6. We see that the performance of the KGE methods (i.e., TransE and RotatE) drops significantly as the embedding dimension decreases. For ConvKB, although its performance is less influenced by the dimension due to its complex decoder, it performs poorly compared to the other methods in general. For ConvE, although it is claimed to be more efficient in parameter scaling [42], its performance actually degrades significantly at dimensions lower than 64. In addition, it also does not perform well when the dimension is large. Thus, the performance of ConvE is sensitive to the embedding dimension. MuRP, AttH, and GreenKGC are the only methods that offer reasonable performance as the dimension goes as low as 8.
Comparison with baseline KGE. One unique characteristic of GreenKGC is that it prunes a high-dimensional KGE into low-dimensional triple features and makes predictions with a binary classifier as a powerful decoder. We evaluate the capability of GreenKGC to save parameters while maintaining performance by pruning the original 500-dimensional KGE to 100-dimensional triple features in Table 4.5. As shown in the table, GreenKGC achieves competitive or even better performance with an around 5-times smaller model size. In particular, Hits@1 is retained the most and is even improved compared to the
                           FB15k-237                WN18RR                   YAGO3-10
Baseline  Dim          MRR    H@1    #P (M)     MRR    H@1    #P (M)     MRR    H@1    #P (M)
TransE    500         0.325  0.228    7.40     0.223  0.013   20.50     0.416  0.319   61.60
          100         0.274  0.186    1.48     0.200  0.009    4.10     0.377  0.269   12.32
                     ↓15.7% ↓18.5%  (0.20x)   ↓10.3% ↓30.8%  (0.20x)   ↓9.4%  ↓16.7% (0.20x)
          100 (Ours)  0.338  0.253    1.76     0.407  0.361    4.38     0.455  0.358   12.60
                     ↑4.0%  ↑9.6%   (0.24x)   ↑82.5% ↑176.9% (0.21x)   ↑9.4%  ↑12.2% (0.20x)
RotatE    500         0.333  0.237   14.66     0.475  0.427   40.95     0.478  0.388  123.20
          100         0.296  0.207    2.93     0.437  0.385    8.19     0.432  0.340   24.64
                     ↓11.1% ↓12.7%  (0.20x)   ↓8.0%  ↓9.8%   (0.20x)   ↓9.6%  ↓12.4% (0.20x)
          100 (Ours)  0.348  0.266    3.21     0.458  0.424    8.47     0.467  0.378   24.92
                     ↑4.5%  ↑12.2%  (0.22x)   ↓3.6%  ↓0.7%   (0.21x)   ↓2.3%  ↓3.6%  (0.20x)
Table 4.5: Results on the link prediction task, where we show the performance gain (or loss) in percentages with an up (or down) arrow and the ratio of the model size within the parentheses, against those of the respective 500-dimensional models.
Method #P (M) Val. MRR Test MRR
TransE (d = 500) 1,250 (5×) 0.427 0.426
RotatE (d = 250) 1,250 (5×) 0.435 0.433
TransE (d = 100) 250 (1×) 0.247 0.262
TransE + GreenKGC (d = 100) 250 (1×) 0.339 0.331
RotatE (d = 50) 250 (1×) 0.225 0.253
RotatE + GreenKGC (d = 50) 250 (1×) 0.341 0.336
Table 4.6: Link prediction performance on the ogbl-wikikg2 dataset.
high-dimensional baselines. In addition, GreenKGC using TransE as the baseline outperforms high-dimensional TransE on all datasets. Since the TransE scoring function is simple and fails to model some relation patterns, such as symmetric relations, incorporating TransE with a powerful decoder, i.e., a binary classifier, in GreenKGC successfully overcomes the deficiencies of adopting an over-simplified scoring function. For all datasets, 100-dimensional GreenKGC generates better results than the 100-dimensional baseline models.
We further compare GreenKGC and its baseline KGEs on a large-scale dataset, ogbl-wikikg2, in Table 4.6. We reduce the feature dimensions from 500 to 100 for RotatE and from 250 to 50 for TransE, achieving a 5x smaller model size while retaining around 80% of the performance. Compared with the baseline KGEs at the same feature dimension, GreenKGC improves MRR by 51.6% for RotatE and by 37.2% for TransE. Therefore, the results demonstrate the performance advantage of applying GreenKGC to large-scale KGs under constrained resources.
Figure 4.6: Embedding dimension d to MRR curves in log-scale for various methods on FB15k-237. d = 8, 16, 32, 64, 128, 256.
                              Predicting Heads                   Predicting Tails
Model                     1-to-1  1-to-N  N-to-1  N-to-N     1-to-1  1-to-N  N-to-1  N-to-N
TransE [16]               0.374   0.417   0.037   0.217      0.372   0.023   0.680   0.322
RotatE [172]              0.468   0.431   0.066   0.229      0.463   0.057   0.725   0.336
AttH [26]                 0.473   0.432   0.071   0.236      0.472   0.057   0.728   0.343
TransE + GreenKGC (Ours)  0.478   0.442   0.088   0.243      0.477   0.096   0.754   0.351
RotatE + GreenKGC (Ours)  0.483   0.455   0.134   0.245      0.486   0.112   0.765   0.353
Table 4.7: Performance on different relation categories in FB15k-237 under 32 dimensions.
Relation Categories. We further evaluate GreenKGC on different relation categories. Following the convention in [206], we divide the relations into four categories: 1-to-1, 1-to-N, N-to-1, and N-to-N. They are characterized by two statistics of the datasets, head-per-tail (hpt) and tail-per-head (tph). If tph < 1.5 and hpt < 1.5, the relation is treated as 1-to-1; if tph ≥ 1.5 and hpt < 1.5, the relation is treated as 1-to-N; if tph < 1.5 and hpt ≥ 1.5, the relation is treated as N-to-1; if tph ≥ 1.5 and hpt ≥ 1.5, the relation is treated as N-to-N.
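The categorization rule can be sketched directly; the threshold of 1.5 is from the text, and the branch assignments follow the convention of [206]:

```python
def relation_category(tph, hpt, thr=1.5):
    """1-to-1 / 1-to-N / N-to-1 / N-to-N split from tails-per-head (tph)
    and heads-per-tail (hpt)."""
    if tph < thr and hpt < thr:
        return "1-to-1"
    if tph >= thr and hpt < thr:
        return "1-to-N"  # one head linked to many tails
    if tph < thr and hpt >= thr:
        return "N-to-1"  # many heads linked to one tail
    return "N-to-N"
```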
FB15k-237 WN18RR
MRR H@1 H@10 MRR H@1 H@10
w/o pruning 0.318 0.243 0.462 0.379 0.346 0.448
random 0.313 0.239 0.460 0.375 0.346 0.420
variance 0.315 0.239 0.465 0.381 0.348 0.455
feature importance 0.323 0.241 0.478 0.385 0.355 0.464
prune low CE 0.312 0.236 0.460 0.373 0.343 0.419
prune high CE (Ours) 0.345 0.265 0.507 0.411 0.367 0.491
Table 4.8: Performance for RotatE + GreenKGC in 32 dimensions with different feature pruning schemes.
Table 4.7 summarizes the results for different relation categories in FB15k-237 under 32 dimensions.
In the low-dimensional setting, GreenKGC is able to outperform other methods in all relation categories.
Specifically, GreenKGC performs especially well for many-to-1 predictions (i.e., predicting heads for 1-to-N relations and predicting tails for N-to-1 relations). Such results demonstrate the advantage of using classifiers to make accurate predictions when there is only one valid target.
4.3.3 Ablation Study
Feature pruning. We evaluate the effectiveness of the feature pruning scheme in GreenKGC in Table 4.8. We use “w/o pruning” to denote the baseline 32-dimensional KGE directly followed by the decision-learning module. We compare the following feature pruning schemes: 1) random pruning, 2) pruning based on variance, 3) pruning based on feature importance from a Random Forest classifier, 4) pruning dimensions with low CE (i.e., the most discriminant ones) in DFT, and 5) pruning dimensions with high CE (i.e., the least discriminant ones) in DFT. As shown in the table, our method of pruning the least discriminant features in DFT achieves the best performance on both datasets. In contrast, pruning the most discriminant features in DFT performs the worst. Thus, the DFT module can effectively differentiate the discriminability among different features. Pruning based on variance achieves similar results as “w/o pruning” and random pruning. Pruning based on feature importance shows better results than “w/o pruning”, random pruning, and pruning based on variance, but performs worse than DFT. In addition, feature importance needs to consider
Figure 4.7: Sorted discriminability for each feature dimension in different feature pruning schemes: (a) cross-entropy in DFT, (b) 1/variance, and (c) feature importance. For cross-entropy and 1/variance, a lower value indicates a more discriminant feature; for feature importance, a higher value indicates a more discriminant feature.
(a) FB15k-237 (b) WN18RR
Figure 4.8: Ablation study on number of relation groups k to MRR.
all feature dimensions at once, while in DFT, each feature dimension is processed individually. Thus, DFT
is also more memory-efficient than calculating feature importance.
Fig. 4.7 plots the sorted discriminability of features in different pruning schemes. From the figure, the
high-variance region is flat, so it is difficult to identify the most discriminant features using their variances. For feature importance, some of the feature dimensions have zero scores. Therefore, pruning based on feature importance might ignore some discriminant features. In the DFT curve, there is a “shoulder point” indicating that only around 100 feature dimensions are more discriminant than the others. In general, we can obtain good performance in low dimensions as long as we preserve the dimensions before the shoulder point and prune all the others.
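A minimal sketch of this pruning rule, assuming per-dimension cross-entropy (CE) values have already been computed by DFT (the shoulder-detection heuristic below, based on the largest jump of the sorted CE curve, is our own illustration, not the exact DFT implementation):

```python
import numpy as np

def prune_high_ce(ce_scores, keep_dim=None):
    """Keep the most discriminant feature dimensions (lowest cross-entropy).

    If keep_dim is None, estimate the shoulder point of the sorted CE curve
    as the position with the largest jump and keep all dimensions before it."""
    ce = np.asarray(ce_scores, dtype=float)
    order = np.argsort(ce)            # ascending CE: most discriminant first
    if keep_dim is None:
        gaps = np.diff(ce[order])     # jumps along the sorted curve
        keep_dim = int(np.argmax(gaps)) + 1
    return np.sort(order[:keep_dim])  # indices of the retained dimensions

# Dimensions 1 and 3 have clearly lower CE than the rest and are kept.
kept = prune_high_ce([0.9, 0.1, 0.8, 0.2, 0.95])
```

Passing an explicit `keep_dim` corresponds to targeting a fixed low dimension such as 32 or 100.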
FB15k-237 WN18RR
Neg. sampling MRR H@1 H@10 MRR H@1 H@10
Random 0.283 0.197 0.452 0.407 0.361 0.481
Ontology 0.345 0.265 0.507 0.403 0.350 0.487
Embedding 0.316 0.232 0.471 0.411 0.367 0.491
Table 4.9: Ablation study on different negative sampling methods for classifier training in 32 dimensions.
KG partitioning. Figure 4.8 shows GreenKGC performance with different numbers of relation groups k, where k = 1 means no KG partitioning. A larger k yields better performance on both FB15k-237 and WN18RR, and not partitioning the KG performs much worse than partitioning it. Note that with a larger k, GreenKGC has more model parameters since more classifiers are needed. The model complexity is O(|E|d + kΘ), where Θ is the model complexity of a classifier. Thus, we can adjust k based on the tradeoff between performance and memory efficiency.
Negative sampling. We evaluate the effectiveness of the two proposed negative sampling methods (i.e., ontology- and embedding-based) in Table 4.9. In FB15k-237, both are more effective than randomly drawn negative samples, and the ontology-based method gives better results than the embedding-based one. In WN18RR, the embedding-based method achieves the best results; since there is no clear entity typing in WordNet, the ontology-based method performs worse than the randomly drawn one. We conclude that, to correct failure cases of the baseline KGE, ontology-based negative sampling is effective for KGs consisting of real-world instances, such as FB15k-237, while embedding-based negative sampling is powerful for concept KGs such as WN18RR.
4.3.4 Time Analysis on Feature Pruning
FB15k-237 WN18RR YAGO3-10
DualDE 03:30:50 01:50:00 09:28:20
GreenKGC (Ours) 00:10:50 00:06:02 00:23:35
Table 4.10: Comparison of required training time (Hour : Minute : Second) to reduce the feature dimensions
from 512 to 100 for TransE between DualDE, a knowledge-distillation method, and GreenKGC.
Table 4.10 shows the required training time for DualDE [244], a knowledge distillation method, and GreenKGC, to reduce 512-dimensional TransE embeddings to 100 dimensions. As shown in the table,
GreenKGC achieves around 20x faster training time compared to DualDE, especially in YAGO3-10, which
is a larger-scale dataset. Besides, in knowledge distillation methods, low-dimensional embeddings are randomly initialized and trained with the guidance of high-dimensional embeddings. Thus, the quality of the
low-dimensional embeddings highly depends on good initialization. On the contrary, the feature pruning
process in GreenKGC selects a subset of powerful feature dimensions without learning new features from
scratch. In addition, it is also memory-efficient since it processes only one feature dimension at a time.
4.3.5 Prediction distribution.
It was reported in [173] that the predicted scores for all candidates on FB15k-237 converge to 1 with ConvKB [134]. This is unlikely to be true, given that KGs are often highly sparse. The issue is resolved after ConvKB is re-implemented in PyTorch†, but the performance on FB15k-237 is still not as good as originally reported in the ConvKB paper. This issue reveals a problem of end-to-end optimization: it is difficult to control and monitor every component in the model. It urges us to examine whether GreenKGC has the same issue.
whether GreenKGC has the same issue. Fig. 4.9 shows the sorted predicted scores of a query (38th Grammy
Awards, award_winner, ?) in FB15k-237. We see from the figure that only very few candidates receive scores close to 1, while all other candidates receive scores close to 0. The former correspond to valid triples. The score distribution is consistent with the sparse nature of KGs.
4.3.6 Comparison with NN-based Methods
Inference time analysis. We compare GreenKGC with two other NN-based methods in Table 4.11 in
terms of performance, number of free parameters, and inference time. They are ConvKB [134] and ConvE
[42]. We adopt TransE as the baseline in GreenKGC to match the number of parameters in the embedding
† https://github.com/daiquocnguyen/ConvKB/issues/5
Figure 4.9: Prediction distribution of a query (38th Grammy Awards, award_winner, ?) in FB15k-237. A
higher predicted score implies a higher chance of being a valid triple.
FB15k-237 WN18RR
Model MRR H@1 H@3 H@10 #P (M) T (s) MRR H@1 H@3 H@10 #P (M) T (s)
ConvKB [134] 0.258 0.179 0.283 0.416 1.91 548.67 0.369 0.317 0.399 0.468 5.26 225.12
ConvE [42] 0.317 0.230 0.347 0.493 2.74 235.73 0.427 0.394 0.437 0.495 6.09 46.08
TransE + GreenKGC (Ours) 0.339 0.253 0.364 0.503 2.42 205.12 0.435 0.391 0.461 0.510 5.84 40.01
Table 4.11: Comparison on performance, number of model parameters, and total inference time (batch size
= 8) with other classification-based methods in 128 dimensions. We adopt TransE as the baseline for fair
comparison in the number of model parameters. The best numbers are in bold.
layer for a fair comparison. As compared with ConvKB, GreenKGC achieves significantly better performance with slightly more parameters. As compared with ConvE, GreenKGC uses fewer parameters and
demands a shorter inference time since ConvE adopts a multi-layer architecture. GreenKGC also offers
better performance compared to ConvE.
4.3.7 Performance as Training Progresses
We plot the AUC-PR and MRR curves for training/validation and testing in Fig. 4.10a and Fig. 4.10b,
respectively. We use AUC-PR to monitor the training of the classifiers. AUC-PR starts to converge for
both training and validation sets after 200 iterations. We record the link prediction results on the testing
(a) Training/evaluation AUC-PR. (b) Testing MRR.
Figure 4.10: Training/evaluation AUC-PR and testing MRR to the number of training iterations.
Dataset # entities # relations # triples (train / valid / test) # negatives (valid / test)
CoDEx-S 2,034 42 32,888 / 1,827 / 1,828 1,827 / 1,828
CoDEx-M 17,050 51 185,584 / 10,310 / 10,311 10,310 / 10,311
Table 4.12: Statistics for triple classification datasets.
set every 100 iterations. Though the AUC-PR improves slightly after 200 iterations, the MRR starts to
converge after 600 iterations.
4.3.8 Triple Classification
We evaluate GreenKGC on CoDEx [159], which includes two triple classification datasets, to demonstrate
that the pipeline can be easily generalized to another KGC task. The dataset statistics are summarized in
Table 4.12.
For the triple classification task, the goal is to predict the plausibility (i.e., 0 or 1) of a query triple (h, r, t). Following prior work, we find the optimal score threshold for each relation using the validation set, apply it to the testing set, and use accuracy and the F1 score to evaluate the results. We adopt TransE as the GreenKGC baseline in the triple classification task.
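The per-relation threshold search can be sketched as follows (function names and data layout are hypothetical, not from the released code):

```python
import numpy as np

def fit_thresholds(scores, labels, relations):
    """For each relation, pick the score threshold that maximizes accuracy
    on the validation set; triples scoring >= threshold are predicted valid."""
    thresholds = {}
    for r in set(relations):
        idx = [i for i, rel in enumerate(relations) if rel == r]
        s = np.array([scores[i] for i in idx])
        y = np.array([labels[i] for i in idx])
        best_acc, best_thr = -1.0, 0.0
        for thr in np.unique(s):                  # candidate cut points
            acc = np.mean((s >= thr).astype(int) == y)
            if acc > best_acc:
                best_acc, best_thr = acc, thr
        thresholds[r] = best_thr
    return thresholds

def classify(scores, relations, thresholds):
    """Apply the per-relation thresholds to a new set of triples."""
    return [int(s >= thresholds[r]) for s, r in zip(scores, relations)]
```

In practice, the thresholds are fitted on the validation split and then applied unchanged to the test split.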
Main results. Results on triple classification are shown in Table 4.13. We adopt TransE as the baseline KGE model and reduce it from 512 to 128 dimensions in GreenKGC. Performance for other
CoDEx-S CoDEx-M
Models Acc. F1 #P (M) Acc. F1 #P (M)
RESCAL 0.843 0.852 12.06 0.818 0.815 22.09
TransE 0.829 0.837 1.04 0.797 0.803 8.73
ComplEx 0.836 0.846 2.08 0.824 0.818 17.46
ConvE 0.841 0.846 1.27 0.826 0.829 19.92
TuckER 0.840 0.846 135.26 0.823 0.816 142.95
GreenKGC 0.838 0.846 0.58 0.828 0.831 2.25
Table 4.13: Triple classification results. GreenKGC adopts TransE as the baseline.
Figure 4.11: Scatter plot of predictions from GreenKGC (the y-axis) versus KGE (the x-axis).
methods is taken from [159], and the number of model parameters is calculated according to their settings
in the paper. Again, we see that GreenKGC achieves comparable or even better performance with far fewer parameters. It is worth emphasizing that, since the number of parameters in the classifier
is invariant to the size of the dataset, GreenKGC will have more savings in parameters in larger datasets
(e.g., CoDEx-M) than smaller datasets (e.g., CoDEx-S). In addition, GreenKGC is able to outperform other
methods in CoDEx-M, where composition and symmetry are the two most prevalent relation patterns
[159], with a smaller model size.
Qualitative analysis. We compare predictions from GreenKGC and KGE methods on individual relations through scatter plots of the predicted scores from two models in Fig. 4.11, where the vertical axis
shows the scores predicted by GreenKGC and the horizontal axis shows the scores from KGE. As shown in the figure, many samples lie between 0.2 and 0.6 in the KGE predictions. The overlapping
of positive and negative samples in that interval makes the binary classification task more challenging.
In contrast, predictions from GreenKGC are closer to either 0 or 1. Thus, it is easier for GreenKGC to
differentiate positive samples from negative samples. This is especially true for symmetric relations such
as spouse and sibling. These observations support our classification-based formulation of link prediction, in which Hits@1 can be improved significantly.
4.4 Limitations
In this chapter, we focus on efficiently and accurately predicting missing links in KGs using low-dimensional features and binary classifiers. GreenKGC achieves impressive efficiency during the inference stage and can be deployed on platforms with memory constraints because of its superior performance in low-dimensional spaces. However, the training process of GreenKGC still requires high-dimensional pre-trained embeddings as initial features, which may hinder GreenKGC from being trained on resource-constrained platforms from scratch. In addition, the current GreenKGC model is proposed under a transductive setting, where the entity and relation sets are fixed. Its generalizability to few-shot settings and unseen entities is yet to be explored.
The above-mentioned two limitations can be addressed by leveraging textual information in KGs. In recent years, text-based KGC models [192, 195, 200], which take advantage of entities’ names and descriptions to obtain features, have become increasingly popular. We may extend GreenKGC with word embeddings from pre-trained language models as initial features to overcome the current limitations. In addition, continual learning for classifiers [123], which aims to learn from new training samples without forgetting old ones (i.e., avoiding catastrophic forgetting), is an active research topic. GreenKGC can incorporate such techniques to improve its generalizability to new data.
4.5 Conclusion and Future Work
A lightweight KGC method, called GreenKGC, was proposed in this chapter to make accurate link predictions in low dimensions. It consists of three modules that can be trained individually: 1) representation learning, 2) feature pruning, and 3) decision learning. Experimental results in low dimensions demonstrate that GreenKGC can achieve satisfactory performance in as few as 8 dimensions. In addition, experiments on ogbl-wikikg2 show that GreenKGC obtains competitive results with far fewer model parameters. Furthermore, the ablation study demonstrates the effectiveness of KG partitioning and feature pruning.
Modularized GreenKGC allows several future extensions. First, GreenKGC can be combined with new
embedding models as initial features. In general, using a more expressive KGE model can lead to better
final performance. Second, individual modules can be fine-tuned for different applications. For example,
since the feature pruning module and the decision-learning module are supervised, they can be applied
to various applications. Finally, different negative sampling strategies can be investigated in different
applications.
Chapter 5
Improving Knowledge Graph Embeddings with Entity Types and
Auxiliary Relations
5.1 Introduction
In addition to relation types, each entity also comes with multiple types to describe the high-level abstractions of an entity∗. Fig. 5.1 shows an example KG containing entity type information. As shown in the
figure, each entity can be labeled with multiple types. For example, the entity “Mark Twain” has types
“writer” and “lecturer” at the same time. Entity types are crucial in several artificial intelligence (AI) and
natural language processing (NLP) applications, such as drug discovery [115], entity alignment [78, 214],
and entity linking [67]. In real-world applications, entity types could be missing, e.g., having type “writer”
without type “person”, due to prediction errors from information extraction models [220, 221]. Such missing types can be inferred from the existing information in the KG. For example, in Fig. 5.1, we can infer
that “Mark Twain” has a missing type “person” given that there is a known type “writer” and the relation
“born in”. Thus, knowledge graph entity typing (KGET) is a task to predict the missing types based on
the observed types and triples in the KGs [202]. KGET methods can serve as refinement mechanisms for
real-world knowledge bases.
∗We refer to “entity type” when using the term “type” in the remainder of this chapter.
Figure 5.1: An example KG with missing entity types.
Knowledge graph embedding (KGE) methods achieve great success in predicting missing triples in KGs
[84], and they are extended to solve the KGET task in [130]. Since the type labels are stored in the format
of tuples (entity, type), an auxiliary relation, hasType, is first introduced to convert the typing tuples into
triples (entity, hasType, type), and, then, a KGE method [129] is adopted to predict missing types. Although
such a method is time- and parameter-efficient, it does not perform well since the relationship between
entities and types is too diverse to be modeled by a single relation. In addition, such a method does not
consider the neighborhood information. This affects the performance of entity type prediction as well.
Other methods are proposed to improve the model’s expressiveness. Embedding-based methods, such as
ConnectE [240], first learn embeddings for entities and types separately using KGE methods. Then, a linear projection matrix is learned to minimize the distance between the entity and the type spaces. Another
work leverages multi-relational graph convolution networks (R-GCN) [162] to encode the neighborhood
information into entity embeddings. The attention mechanism is also explored in [241] to control the
contributions of neighbors when predicting an entity type. Afterward, multi-layer perceptrons (MLPs) are
Figure 5.2: Illustration of using multiple auxiliary relations to model the relationship between entities and
entity types.
cascaded to predict the entity types based on the learned entity embeddings. The KGET task is, therefore,
formulated as a multi-label classification problem. Although GCN-based methods offer superior performance, they are not applicable in a resource-constrained environment, such as mobile/edge devices and
real-time prediction [100], due to their high inference time complexity and large model sizes. In addition,
training GCNs could be time- and memory-consuming, which makes their application to large-scale KGs challenging. It is thus desirable to develop a KGET method with low inference time complexity, a small model size, and good performance. This is the objective of this work.
As argued above, a single auxiliary relation is not sufficient to model the relationship between entities
and types. Here, we introduce more auxiliary relations to improve the expressiveness of KGE models
on the entity-type prediction task. This idea is illustrated in Fig. 5.2, where we show an example of using
multiple auxiliary relations to model the entity-type relationship. It is intuitive that we should use different
auxiliary relations to model the typing relationships for type “administrative district” and type “person” since
they describe two different concepts of entities. Similarly, type “writer” and type “soccer player” should
adopt different auxiliary relations since they are semantically different. They should not be close to each
other in the embedding space. On the other hand, for other types, such as “writer” and “lecturer”, they
co-occur with each other more frequently. Thus, they can adopt the same auxiliary relation for model
simplicity.
Along this direction, we introduce multiple auxiliary relations based on the “context” of types. The
context of a type is defined as a collection of attributes of its entities. It can be viewed as a discrete
representation of the type. As such, the local KG structure is implicitly encoded when auxiliary relations
are used. Next, we propose a method adopting an Asynchronous learning scheme for Entity Typing,
named AsyncET, to obtain better embeddings for entities and types for the entity type prediction task. The
training process consists of two stages: 1) link prediction and 2) type prediction. Entity embeddings are
first optimized by training with only factual triples to predict missing links. Then, the typing information is
used to train type embeddings and fine-tune entity embeddings by predicting missing types. Two training
stages alternate as the training progresses. The asynchronous training schema keeps the learned entity
embedding up-to-date and informative for entity type prediction. Experiments conducted on two KGET
datasets demonstrate that the proposed multiple auxiliary relations and asynchronous training framework
can substantially improve the performance of the KGET task. Furthermore, AsyncET has a significant
advantage over existing KGET methods in model sizes and time complexity.
5.2 Methodology
5.2.1 Notations
We use G to denote a KG containing a collection of factual triples; namely,
G = {(e^head, r, e^tail) | e^head, e^tail ∈ E, r ∈ R}, (5.1)
where E and R represent sets of entities and relations in the KG, respectively. The typing information is
denoted as
I = {(e, t) | e ∈ E, t ∈ T }, (5.2)
where T is a set of entity types in the KG. In order to group similar types based on the attributes of their
associated entities, we define the context of type t as a collection of relations that co-occur with entities
of type t
Ct = {r | (e^head, r, e^tail) ∈ G, (e^head, t) ∈ I}. (5.3)
For example, the context of type person is {Born in, Lives in, Plays for, ... }. It contains all possible properties
of the type person. The context of type t can be seen as a discrete representation for t that encodes the
local structure of the KG.
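A small sketch of building type contexts as in Eq. (5.3), with hypothetical entity, relation, and type names:

```python
from collections import defaultdict

def type_contexts(triples, typing_tuples):
    """Build C_t: the set of relations that co-occur (as head) with
    entities labeled with type t, following Eq. (5.3)."""
    head_relations = defaultdict(set)
    for head, r, _tail in triples:
        head_relations[head].add(r)
    contexts = defaultdict(set)
    for e, t in typing_tuples:
        contexts[t] |= head_relations[e]
    return dict(contexts)
```

Each context is a discrete set of relations, so set operations (intersection, union) suffice for the similarity computations used later.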
5.2.2 Auxiliary Relations
Previous KGE models have only one auxiliary relation, hasType. They convert a typing tuple, (e, t), into
a typing triple, (e, hasType, t). However, a single relation is not sufficient to model diverse entity-type
patterns. Here, we aim to find a mapping such that, given entity type t, an auxiliary relation p = Aux(t)
is assigned to form new typing triples (e, p, t), where t ∈ T , p ∈ P, and P denotes a set of auxiliary
relations in the KG. The objective is to maximize the capabilities of auxiliary relations to model every
entity-type relationship. We compare three methods for the design of auxiliary relations below.
Bijective assignment. A straightforward solution to enhance the expressiveness of auxiliary relations
is to assign a unique auxiliary relation to each type, called the bijective assignment. It can model diverse
typing patterns well by exhaustively modeling every possible typing pattern in the KG. However, when
the KG contains a large number of types, this assignment has several shortcomings. First, the model
optimization is less stable since the number of model parameters increases significantly. Second, it is too
Algorithm 1 Find anchors for efficient auxiliary relation assignment
Initialization:
    The set of uncovered relations in the KG: U = R
    The set of existing anchor types: A = ∅
Iteration:
    while U ̸= ∅ do
        t = argmax_{t∈T} |Ct ∩ U|
        U ← U \ Ct
        A ← A ∪ {t}
    end while
fine-grained to perform well on the test dataset. Third, its inference time is much longer. Therefore, it is
essential to group similar types and assign auxiliary relations to each group of types.
Taxonomy-based assignment. A taxonomy is a hierarchical organization of concepts in KGs. For example, type “/film/producer” in Freebase [14] falls under category “film” with subcategory “producer”. We can group types based on the first layer of the taxonomy. For instance, “/film/producer” will belong to the “film” group, and “/sports/sports_team” will belong to the “sports” group. However, such a taxonomy might not be available for some KGs, say, YAGO subsets [122]. Furthermore, the first-layer-taxonomy-based assignment may not have enough granularity for some types. To address these issues, we propose a new assignment method below.
Efficient assignment. To strike a balance between a small number of auxiliary relations, |P|, and high
expressiveness of entity-type modeling, we maximize similarities among the types in the same group and
minimize similarities among different groups. Mathematically, we adopt the Jaccard similarity between
the type contexts to define similarities between types. It can be written in the form of
Sim(t, t′) = |Ct ∩ Ct′| / |Ct ∪ Ct′|. (5.4)
Then, based on the well-defined similarity function between types, grouping similar types with the
minimum number of groups can be formulated as a min-max optimization problem. Such an optimization
Auxiliary Relation Entity Types
# 2
/film/producer
/TV/tv_director
/film/writer
/film/film_story_contributor
# 20
/sports/sports_team
/soccer/football_team
/baseball/baseball_team
/sports/school_sports_team
# 27
/location/administrative_division
/film/film_location
/fictional_universe/fictional_setting
/location/citytown
Table 5.1: Examples of auxiliary relations and the corresponding entity types using the proposed efficient
assignment, where anchor types are marked in boldface.
problem is NP-hard. Here, we develop a greedy algorithm to find an approximately optimal assignment, as elaborated in the following. First, we identify several anchor types to serve as the centroid of each group. Initially, the anchor type set is empty and all relations are uncovered. The algorithm iteratively selects the not-yet-selected type with the largest intersection between its context and the uncovered relations, |Ct ∩ U|, as a new anchor. The iteration ends when the union of all anchors’ contexts equals R. The process of finding anchor types is depicted in Algorithm 1. Then, we assign a unique auxiliary relation to each anchor type. Each non-anchor type finds its most similar anchor type and shares the same auxiliary relation with that anchor.
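Algorithm 1 and the subsequent anchor assignment can be sketched in Python as follows (the Jaccard similarity follows Eq. (5.4); the data structures are our own illustration):

```python
def jaccard(a, b):
    """Sim(t, t') = |C_t ∩ C_t'| / |C_t ∪ C_t'|, as in Eq. (5.4)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def assign_auxiliary_relations(contexts, all_relations):
    """Greedily pick anchor types whose contexts cover all relations
    (Algorithm 1), then map each type to the auxiliary relation id of
    its most similar anchor."""
    uncovered = set(all_relations)
    anchors = []
    while uncovered:
        # type with the largest intersection of context and uncovered relations
        t = max(contexts, key=lambda ty: len(contexts[ty] & uncovered))
        if not contexts[t] & uncovered:    # remaining relations are uncoverable
            break
        uncovered -= contexts[t]
        anchors.append(t)
    aux = {}                               # type -> auxiliary relation id
    for t in contexts:
        nearest = max(anchors, key=lambda a: jaccard(contexts[t], contexts[a]))
        aux[t] = anchors.index(nearest)
    return anchors, aux
```

Ties in the greedy step are broken by the dictionary's insertion order, which is one of several reasonable conventions.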
To verify whether the proposed algorithm can generate reasonable results, we show examples of the
grouped entity types in Table 5.1. We see from the table that the auxiliary relation # 2 is assigned to types
of persons who work in the entertainment industry, auxiliary relation # 20 is assigned to types that are
mostly sports teams, and auxiliary relation # 27 is assigned to geographical locations. Similar types are
successfully grouped together while types of distant semantic meanings are separated.
Figure 5.3: A diagram of the training process in AsyncET.
5.2.3 Asynchronous Embedding Learning
After auxiliary relations are defined, typing tuples (e, t) can be converted into typing triples (e, p, t). Such
typing triples form a typing graph
T G = {(e, p, t) | (e, t) ∈ I, p = Aux(t)}. (5.5)
Instead of mixing the original triples and the newly constructed typing triples together in embedding learning, we optimize the entity and type embeddings on the original KG G and the typing graph T G alternately. That is, the embedding learning process is divided into two stages. In stage 1, the entity embeddings are trained on G using a link prediction task. The learned entity embeddings serve as an initialization for embedding learning in stage 2, where we use the typing graph T G to learn type embeddings and fine-tune entity embeddings with typing triples. The two training stages alternate as training progresses. Fig. 5.3 illustrates the training process of asynchronous embedding learning. Details of each training stage are elaborated below.
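The alternating schedule itself can be sketched as a simple driver loop, where the two stage functions (hypothetical callbacks here) update the shared embeddings in place:

```python
def train_asyncet(train_lp_step, train_tp_step, rounds, lp_steps, tp_steps):
    """Alternate Stage 1 (link prediction on G) and Stage 2 (type
    prediction on TG). Each stage callback updates the shared embedding
    state in place and returns its loss."""
    history = []
    for _ in range(rounds):
        for _ in range(lp_steps):
            history.append(("lp", train_lp_step()))   # Stage 1
        for _ in range(tp_steps):
            history.append(("tp", train_tp_step()))   # Stage 2
    return history
```

The ratio of `lp_steps` to `tp_steps` controls how often entity embeddings are refreshed relative to type-embedding updates.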
Stage 1: Link prediction. The goal of this stage is to obtain a good initialization of entity embeddings
that can be used to predict the missing types. We follow the training loss in [172] with the self-adversarial
negative sampling. The link prediction loss is defined as
Llp = − log σ(f(e^head, r, e^tail)) − Σ_{i=1}^{n} p(e′_i, r, e″_i) log σ(−f(e′_i, r, e″_i)), (5.6)

where (e′_i, r, e″_i) are the negative samples generated by corrupting the head and tail entities, f(e^head, r, e^tail) is the scoring function in the KGE model, and e^head, r, e^tail are the embeddings for the head entity, relation,
and tail entity, respectively. The self-adversarial negative sampling distribution is defined as
p(e′_j, r, e″_j) = exp(α f(e′_j, r, e″_j)) / Σ_{i=1}^{n} exp(α f(e′_i, r, e″_i)), (5.7)
where α is the temperature to control the smoothness of the softmax function. As a result, negative samples
with lower scores are assigned smaller weights for optimization as they are well-trained already, and the
model can focus on optimizing the hard cases.
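The self-adversarial weighting in Eq. (5.7) amounts to a temperature-scaled softmax over the negative-sample scores; a minimal sketch:

```python
import numpy as np

def self_adversarial_weights(neg_scores, alpha=1.0):
    """Softmax with temperature alpha over negative-sample scores:
    higher-scoring (harder) negatives get larger optimization weights."""
    s = alpha * np.asarray(neg_scores, dtype=float)
    s -= s.max()                 # shift for numerical stability
    w = np.exp(s)
    return w / w.sum()
```

With alpha = 0 the weights degenerate to uniform sampling, recovering plain negative sampling.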
Stage 2: Entity type prediction. In this stage, we fine-tune the entity embeddings and train the type
embeddings using only typing triples. We adopt a loss similar to (5.6) to predict the missing entity types.
Ltp = − log σ(f(e, p, t)) − Σ_{i=1}^{n} p(e, p′_i, t′_i) log σ(−f(e, p′_i, t′_i)), (5.8)

where (e, p′_i, t′_i) is a negative sample for entity type prediction, generated by replacing the valid type with a random type t′_i and the corresponding auxiliary relation p′_i. The auxiliary relations are assigned based on the mappings p = Aux(t) and p′_i = Aux(t′_i). Since the number of entity types is much smaller than the number of entities (i.e., |T| ≪ |E|), false negatives (i.e., (e, p′_i, t′_i) ∈ T G) are more prevalent for
FB15k-ET YAGO43k-ET
G
# entities 14,951 42,334
# relations 1,345 37
# triples 483,142 331,686
T G
# types 3,584 45,182
# train 136,618 375,853
# valid 15,848 43,111
# test 15,847 43,119
# p-bijective 3,584 45,182
# p-taxonomy 89 -
# p-efficient 54 10
Table 5.2: Dataset statistics.
entity-type prediction. To address this issue, we adopt the false-negative-aware negative sampling distribution introduced in [140]. It can be written as
p(e, p′_j, t′_j) = x − x², (5.9)

where

x = σ(−f(e, p′_j, t′_j)). (5.10)
Then, negative samples with the highest scores are assigned lower weights as they are possibly false negatives. Similar to the self-adversarial loss, negative samples with the lowest scores are already well-trained,
so they are assigned smaller negative sampling weights.
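The false-negative-aware weighting in Eqs. (5.9)-(5.10) can be sketched as follows; note that the weight x − x² peaks at x = 0.5 and vanishes at both extremes:

```python
import numpy as np

def false_negative_aware_weights(neg_scores):
    """Weights from Eqs. (5.9)-(5.10): x = sigmoid(-f), weight = x - x^2.
    High-scoring negatives (likely false negatives) and low-scoring ones
    (already well-trained) both receive small weights."""
    x = 1.0 / (1.0 + np.exp(np.asarray(neg_scores, dtype=float)))  # sigmoid(-f)
    return x - x ** 2
```

Mid-range negatives therefore dominate the gradient, which is the intended behavior when false negatives are common.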
5.3 Experiments
5.3.1 Experimental Setup
Datasets. We adopt two KGET datasets, FB15k-ET and YAGO43k-ET [130], for evaluation. FB15k-ET is
derived from a link prediction dataset, FB15K [16], extracted from Freebase [14] by adding typing tuples.
Models FB15kET YAGO43kET
MRR H@1 H@3 H@10 MRR H@1 H@3 H@10
TransE [16] 0.618 0.504 0.686 0.835 0.427 0.304 0.497 0.663
RotatE [172] 0.632 0.523 0.699 0.840 0.462 0.339 0.537 0.695
CompoundE [58] 0.640 0.525 0.719 0.859 0.480 0.364 0.558 0.703
ETE [130] 0.500 0.385 0.553 0.719 0.230 0.137 0.263 0.422
ConnectE [240] 0.590 0.496 0.643 0.799 0.280 0.160 0.309 0.479
CET [140] 0.697 0.613 0.745 0.856 0.503 0.398 0.567 0.696
ConnectE-MRGAT [241] 0.630 0.562 0.662 0.804 0.320 0.243 0.343 0.482
AttET [245] 0.620 0.517 0.677 0.821 0.350 0.244 0.413 0.565
AsyncET-TransE (Ours) 0.659 0.552 0.729 0.859 0.452 0.341 0.518 0.684
AsyncET-RotatE (Ours) 0.668 0.564 0.735 0.864 0.471 0.359 0.556 0.717
AsyncET-CompoundE (Ours) 0.688 0.581 0.755 0.885 0.492 0.380 0.574 0.721
Table 5.3: Results on KGET datasets, where the best performance in each column is shown in boldface, and
the second-best performance is underlined.
Freebase contains general relations between real-world entities. YAGO43k-ET is derived from another link
prediction dataset, YAGO43k [129], extracted from YAGO [122] by adding typing tuples. YAGO mostly contains attributes and relations of persons. The dataset statistics are summarized in Table 5.2, where p-bijective, p-taxonomy, and p-efficient denote the auxiliary relations obtained from the bijective assignment,
the taxonomy-based assignment, and the efficient assignment described in Sec. 5.2.2, respectively. Only
FB15k-ET contains taxonomy labels for the types.
Implementation details. We select three representative KGE methods as the scoring functions to
evaluate the effectiveness of AsyncET. Specifically, we select TransE [16], RotatE [172], and CompoundE
[58]. TransE models relations as translation in the vector space. It has several limitations in expressiveness
since it does not model symmetric relations well. RotatE models relations as rotation in the complex embedding space. CompoundE is a recently proposed KGE method that generalizes the majority of distance-based methods by compounding geometric transformations such as translation, rotation, and
scaling. It also operates in the complex embedding space. The scoring functions of three KGE methods for
AsyncET are written below.
• TransE [16]:

f(e_head, r, e_tail) = γ − ∥e_head + r − e_tail∥,

where γ is the margin, a tunable hyperparameter.

• RotatE [172]:

f(e_head, r, e_tail) = γ − ∥e_head ◦ r − e_tail∥,

where ◦ denotes the rotation operation in the complex embedding space.

• CompoundE [58]:

f(e_head, r, e_tail) = γ − ∥T_r · R(θ_r) · S_r · e_head − e_tail∥,

where T_r, R(θ_r), and S_r denote the translation, rotation, and scaling operations in CompoundE, respectively.
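The three scoring functions can be sketched in NumPy as follows, assuming real-valued embeddings for TransE and complex-valued embeddings for RotatE and CompoundE; the CompoundE relation embedding is unpacked into its three component operators, simplified here to element-wise scaling, planar rotation, and translation.

```python
import numpy as np

GAMMA = 12.0  # margin hyper-parameter gamma (one of the tuned values)

def transe_score(e_head, r, e_tail):
    """TransE: relations as translations, f = gamma - ||h + r - t||."""
    return GAMMA - np.linalg.norm(e_head + r - e_tail)

def rotate_score(e_head, r, e_tail):
    """RotatE: relations as rotations in the complex embedding space.

    Embeddings are complex vectors with |r_i| = 1, so the rotation
    h ∘ r is an element-wise complex product."""
    return GAMMA - np.linalg.norm(e_head * r - e_tail)

def compounde_score(e_head, t_r, theta_r, s_r, e_tail):
    """CompoundE (simplified sketch): scale by S_r, rotate by R(theta_r),
    translate by T_r, then measure the distance to the tail embedding."""
    transformed = s_r * e_head * np.exp(1j * theta_r) + t_r
    return GAMMA - np.linalg.norm(transformed - e_tail)
```

When h + r exactly equals t (a perfectly modeled triple), TransE returns the full margin γ; the same holds for the other two scores when the transformed head matches the tail.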
For both datasets, we select the best hyper-parameters from a certain search space under the embedding
dimension d = 500 based on the performance of the validation set for entity type prediction. The search
space is given below:
• Number of negative samples n_neg ∈ {128, 256⋆⋄, 512};
• Learning rate lr ∈ {0.01⋄, 0.001⋆, 0.0001};
• Softmax temperature α ∈ {0.5, 1.0⋆⋄};
• Margin γ ∈ {8.0⋆, 12.0, 16.0, 20.0⋄}.
The hyper-parameter settings adopted for FB15k-ET and YAGO43k-ET are marked with ⋆ and ⋄, respectively. We also conduct an ablation study on the performance against the number of alternate steps between
Training | Aux. Rel. | FB15kET: TransE / RotatE / CompoundE | YAGO43kET: TransE / RotatE / CompoundE
Syn. | hasType | 0.618 / 0.632 / 0.640 | 0.427 / 0.462 / 0.480
Syn. | p-bijective | 0.532 / 0.534 / 0.581 | 0.362 / 0.388 / 0.407
Syn. | p-taxonomy | 0.545 / 0.550 / 0.603 | - / - / -
Syn. | p-efficient | 0.565 / 0.564 / 0.625 | 0.418 / 0.438 / 0.455
Asyn. | hasType | 0.621 / 0.624 / 0.638 | 0.443 / 0.458 / 0.474
Asyn. | p-bijective | 0.659 / 0.668 / 0.688 | 0.391 / 0.418 / 0.442
Asyn. | p-taxonomy | 0.633 / 0.641 / 0.664 | - / - / -
Asyn. | p-efficient | 0.654 / 0.661 / 0.682 | 0.452 / 0.471 / 0.492
Table 5.4: Ablation study on asynchronous representation learning and different auxiliary relations. The
MRR performance is reported. The best performance in each column is shown in boldface.
two training stages in asynchronous embedding learning in Sec. 5.3.4. Based on the study, we alternate
two training stages every 16 steps for both datasets. All experiments are conducted using one NVIDIA
Tesla P100 GPU.
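The hyper-parameter selection described above amounts to a plain grid search over the listed search space; the helper below simply enumerates all candidate configurations, with the winning setting then chosen by validation performance (the ⋆/⋄ markers in the list above).

```python
from itertools import product

# Search space from the implementation details (Sec. 5.3.1),
# with the embedding dimension fixed at d = 500.
search_space = {
    "n_neg": [128, 256, 512],
    "lr": [0.01, 0.001, 0.0001],
    "alpha": [0.5, 1.0],
    "gamma": [8.0, 12.0, 16.0, 20.0],
}

def grid(space):
    """Yield every hyper-parameter combination as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid(search_space))
# 3 * 3 * 2 * 4 = 72 candidate configurations to evaluate on the validation set
```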
Evaluation metrics. For the KGET task, the goal is to predict the missing types given an entity, i.e.,
(e, ?). However, in AsyncET, we convert the tuples into triples so the test queries become (e, ?, ?). Since
each entity type is only modeled by one auxiliary relation, we evaluate the joint plausibility f(e, p′, t′), ∀t′ ∈ T, where p′ = Aux(t′), for a given query. The valid entity types should be ranked as high as possible
compared to all other candidates. Following the convention in [16], we adopt the filtered setting, where all
entity types in the KG serve as candidates except for those observed ones. Several commonly used ranking
metrics are adopted, including the Mean Reciprocal Rank (MRR) and Hits@k (k=1, 3, and 10).
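A minimal sketch of these metrics, assuming scores are given as a NumPy array over all candidate types and `known_idx` holds the other observed types of the queried entity (the ones removed under the filtered setting):

```python
import numpy as np

def filtered_rank(scores, target_idx, known_idx):
    """Rank of the target type among candidates, filtering out the other
    observed types of the same entity, per the convention in [16]."""
    mask = np.ones(len(scores), dtype=bool)
    mask[list(known_idx)] = False
    mask[target_idx] = True  # keep the target itself
    filtered = scores[mask]
    return int((filtered > scores[target_idx]).sum()) + 1

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Mean Reciprocal Rank and Hits@k over a list of ranks."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MRR": float((1.0 / ranks).mean())}
    for k in ks:
        metrics[f"H@{k}"] = float((ranks <= k).mean())
    return metrics

m = mrr_and_hits([1, 3, 7, 20])
# MRR = (1 + 1/3 + 1/7 + 1/20) / 4 ≈ 0.382; H@1 = 0.25; H@3 = 0.5; H@10 = 0.75
```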
5.3.2 Main Results
The experimental results on two KGET datasets are given in Table 5.3. Models are clustered into three
groups: 1) KGE methods trained using a single auxiliary relation (i.e., hasType), 2) other type-embedding
methods and models using graph neural networks, 3) proposed AsyncET with TransE, RotatE, and CompoundE scoring functions. For group 3, we report the best performance using p-bijective, p-taxonomy, or
Figure 5.4: The MRR performance as a function of the number of alternating rounds between two stages
in asynchronous representation learning.
p-efficient. Detailed comparison of the effects of different auxiliary relations will be discussed in Sec. 5.3.4.
We have the following observations from the table. AsyncET is significantly better than KGE methods
trained with only one auxiliary relation. CET performs better in exact matches (i.e., H@1) than AsyncET since it adopts a graph neural network to learn entity embeddings from neighbors. However, the performance of some type-embedding methods, such as ETE and ConnectE, is much worse than that of AsyncET on both datasets since they do not encode the neighboring information effectively. Recent methods, such as CET, ConnectE-MRGAT, and AttET, adopt attention mechanisms to obtain context-aware entity embeddings to predict the missing types. AsyncET instead encodes entities’ attributes through auxiliary relations and asynchronous training. With the TransE, RotatE, and CompoundE scoring functions, AsyncET outperforms all attention-based methods except CET in every metric. Furthermore, AsyncET-CompoundE can even outperform CET in H@3 and H@10.
5.3.3 Visualization of the Embedding Space
We further visualize the entity embeddings using t-SNE into a 2D space in Fig. 5.5 to see if the entity embeddings obtained from our methods have better quality. We compare the entities embeddings
in TransE and TransE-iter under the entity types “/people/person", “/film/film", “sports/sport_team", and
“/award/award_category". We can observe that TransE already demonstrates some clustering effects in the
entity space without the typing information. However, for entities with types “/film/film" and “/award/award_category", there are some outliers to the clusters. In contrast, fewer outliers are observed in the t-SNE plot for the entity embeddings generated by our method.
(a) TransE. (b) TransE-iter (Ours).
Figure 5.5: Visualization of the entity embeddings.
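The visualization pipeline can be sketched with scikit-learn's t-SNE; the embeddings below are synthetic stand-ins (random clusters) since the actual 500-dimensional entity embeddings come from training.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-in for learned 500-d entity embeddings, grouped into
# four clusters mimicking the four entity types shown in Fig. 5.5.
n_per_type, dim = 30, 500
centers = rng.normal(scale=5.0, size=(4, dim))
emb = np.vstack([c + rng.normal(size=(n_per_type, dim)) for c in centers])

# Project to 2D for plotting; well-separated clusters in the embedding
# space should remain visible (with few outliers) after t-SNE.
xy = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(emb)
```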
5.3.4 Ablation Study
To analyze the effectiveness of asynchronous training and different auxiliary relation assignments, we
conduct an ablation study in Table 5.4, where the MRR performance of several methods for two datasets
is reported. We compare the following:
• Synchronous vs. asynchronous training;
• Single auxiliary relation hasType vs. multiple auxiliary relations with p-bijective, p-taxonomy, and
p-efficient designs;
• TransE, RotatE, or CompoundE scoring functions.
As shown in the table, asynchronous training consistently outperforms synchronous training. The
performance improvement of asynchronous training when using only one auxiliary relation, hasType, is
(a) # of alternating rounds = 256 (b) # of alternating rounds = 16 (c) # of alternating rounds = 1
Figure 5.6: Training loss curves with respect to different numbers of alternating rounds.
not significant since the second stage of the embedding learning is trained on a single-relational graph.
When there are multiple auxiliary relations, mixing typing triples with original factual triples in embedding
training using the synchronous framework still yields poor performance. This could be attributed to the
fact that KGE methods are difficult to train when there are too many relations. The performance improves
significantly when we decompose the training process into two stages under the asynchronous framework.
When using multiple auxiliary relations to model the typing relationship, the bijective assignment
works well on the dataset with fewer entity types, e.g., FB15k-ET. However, it performs worse than the efficient assignment on the dataset with more entity types, e.g., YAGO43kET. In datasets with many entity types, the bijective assignment introduces a larger number of new parameters, making embedding training harder to converge. The efficient assignment achieves the best performance on YAGO43kET,
showing that such an assignment can capture the context in the KG and similarities among entity types
well. Surprisingly, the taxonomy-based assignment does not perform well on either dataset. One possible
reason is that grouping entity types based on only the first layer of the taxonomy ignores the granularities of different entity-type patterns. Such a problem might be solved by considering more layers in the
taxonomy.
Number of Alternating Rounds
Model | Inference time complexity | Memory complexity
ConnectE [240] | O(d_e d_t + T d_t) | O((E + R) d_e + T d_t + d_e d_t)
CompGCN [187] | O(L E R d^2 + T d) | O((E + R + T) d + L d^2)
CET [140] | O(E R T d) | O((E + R + T) d + T d)
AsyncET (Ours) | O(T d) | O((E + R + T) d + P d)
Table 5.5: Inference time and memory complexity of KGET methods.
In asynchronous learning, we alternate between the link prediction stage (Stage 1) and the entity type
prediction stage (Stage 2) after a few stochastic gradient descent steps. The link prediction loss in Eq. (5.6)
and the entity type prediction loss in Eq. (5.8) are minimized in Stage 1 and Stage 2, respectively. The
training process begins with Stage 1 and switches to Stage 2 after Ns,1 stochastic gradient descent steps.
Similarly, it conducts Ns,2 stochastic gradient descent steps and then switches back to Stage 1. We call
one entire cycle of performing Stage 1 and Stage 2 once an alternating round. In our experiments, we set
Ns,1 = Ns,2 = Ns and use Nr to denote the number of alternating rounds.
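The alternating schedule above can be sketched as follows, where `link_step` and `type_step` are placeholders for one stochastic gradient descent step on Eq. (5.6) and Eq. (5.8), respectively.

```python
def asynchronous_training(link_step, type_step, n_s=16, n_rounds=10):
    """Alternate between Stage 1 (link prediction, Eq. 5.6) and Stage 2
    (entity type prediction, Eq. 5.8), switching every n_s SGD steps.
    One full Stage 1 + Stage 2 cycle is one alternating round."""
    for _ in range(n_rounds):
        for _ in range(n_s):
            link_step()   # Stage 1: minimize the link prediction loss
        for _ in range(n_s):
            type_step()   # Stage 2: minimize the type prediction loss

# Trace the schedule with dummy steps to show the alternation pattern.
calls = []
asynchronous_training(lambda: calls.append("L"), lambda: calls.append("T"),
                      n_s=2, n_rounds=2)
# calls == ["L", "L", "T", "T", "L", "L", "T", "T"]
```

A smaller `n_s` (with the total step budget fixed) gives more alternating rounds, i.e., more frequent interaction between the two embedding spaces.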
We plot the MRR performance on YAGO43kET, using TransE as the scoring function, as a function of Ns and Nr with the green line in Fig. 5.4. We conduct experiments with Ns = 1, 16, 256, 4096, and 65,536 steps in one round and set Ns × Nr = 65,536; the relation between Ns and Nr is shown by the red line. A smaller Ns means the stage alternation is more frequent. We see that the performance is better when alternating between the two stages more frequently. Clearly, asynchronous training contributes
better entity and type embedding quality when more frequent interactions exist between the entity and
type embedding spaces.
We also plot the training loss curves as a function of alternating rounds in Fig. 5.6. It shows that
the training loss can be lower with more alternating rounds. In addition, both the link and entity type
prediction loss are successfully reduced during the training. In other words, the two training stages are
mutually beneficial. When alternating between the two stages only once, as shown in Fig. 5.6 (c), the loss
curves go down slower than the other two cases. Thus, more alternating rounds help convergence. We
Entity | Top 3 Type Predictions
Mihail Majearu | Romanian footballers (groundtruth); Alkmaar players; People from Glasgow
David Cross | Jewish actors (groundtruth); 21st-century American actors; American humorists
Zhejiang | Cities in Zhejiang; Provincial capitals in China; Administrative divisions of China (groundtruth)
Table 5.6: Top 3 predicted entity types by AsyncET for entities in YAGO43kET. The groundtruth is annotated in parentheses.
conclude that alternating between two stages can fine-tune entity and type embeddings to approach the
global optimal in optimization.
5.3.5 Complexity Analysis
We conduct complexity analysis on the inference time and the number of model parameters for several
representative methods in Table 5.5, where d, E, R, T, and P denote the embedding dimension, numbers
of entities, relations, entity types, and auxiliary relations, respectively. For ConnectE, the embedding
dimensions for entities and types are denoted as d_e and d_t, respectively. For GCN-based methods, L is
the number of GCN layers. ConnectE is an embedding-based method connecting two subspaces. It tries
to learn a matrix to connect the entity and type spaces. Thus, its time complexity and number of model parameters are proportional to d_e d_t, which is much larger than those of our proposed method. The complexity of
GCN methods is correlated with the number of layers and the number of nodes and edges in the graph. As
a result, the inference time complexity is proportional to ER, which is highly inefficient in testing. The
inference time and memory complexity of AsyncET are the same as those of KGE methods, except that it needs additional model parameters to store embeddings for auxiliary relations. However, the number of auxiliary relations is minimized and is a constant compared to the number of entity types in the KGs.
5.3.6 Qualitative Analysis
We show some examples of predicted types in Table 5.6. In all three examples, the groundtruth ranks
among the top three. In the first example, for entity Mihail Majearu, the model can successfully rank the
groundtruth at the top one. In addition, the top three predicted types are all persons, and the 2nd prediction
is also related to football. In the second example, for entity David Cross, who is a comedian, the model can
rank the groundtruth at the top one once again. The other two entity types are also relevant to the entity.
They are valid types. In the last example, for entity Zhejiang, although the groundtruth only ranks third, the first two predicted types are both relevant to the entity. Note that Zhejiang is a province instead
of a city in China. The granularity of the entity is not predicted correctly in the top two choices.
5.4 Conclusion and Future Work
Multiple auxiliary relations were proposed to solve the KGET task in this work. Three methods for the
design of auxiliary relations were compared. The efficient assignment is recommended among the three
since it is scalable to datasets containing many entity types. In addition, asynchronous embedding learning
was proposed to achieve better entity and type embeddings by alternately predicting missing links and types. Experimental results showed that AsyncET outperforms SOTA in H@3 and H@10 with
much lower inference complexity and fewer model parameters. As future extensions, we will investigate
how the sparsity of the typing information affects the performance of KGET methods. We aim to not only
develop a time- and parameter-efficient model, but also achieve less performance degradation when trained
with fewer labeled entity types.
Chapter 6
Scalable Generative Content Delivery on Demand
6.1 Introduction
Generative AI (GenAI) has emerged as a groundbreaking field to realize artificial general intelligence (AGI)
by integrating machine learning and creative content generation. It is a specific category of AI that aims to
autonomously generate new content that imitates the content created by humans in different modalities,
including images [64, 147], audio [153, 154], text [25, 52], and even 3D objects [125, 137]. With the rapid
development of GenAI, various applications, such as text-to-image generation [103, 176], text-to-speech
(TTS) synthesis [99, 229], chatbot [1, 120], and AI-empowered mixed reality (MR) [151, 219], have been
widely used by consumers. Recent GenAI models rely on deep neural networks, such as generative adversarial networks (GANs) [61] and large language models (LLMs) [21], because of the higher complexity of the generative tasks. As a result, such GenAI models have huge model sizes and are computationally demanding, so a powerful centralized computation infrastructure (i.e., a cloud server) is required to process requests from users. Thus, users may experience high latency if the cloud experiences a high volume of
traffic. Such limitations hinder the applicability of GenAI to applications with low latency requirements.
Besides, the heavy computation in a cloud consumes a significant amount of energy. The overly centralized
computing framework is eco-unfriendly, unsustainable, and cost-inefficient.
Figure 6.1: The significant amount of data generated in the AIGC era poses an unprecedented challenge
in computer networks.
In recent years, the proliferation of mobile devices and the exponential growth of data-intensive applications have spurred the development of edge-cloud computing solutions. Edge-cloud computing takes
advantage of powerful computation resources in cloud servers and efficient data management and communication in edge servers. It has emerged as a promising solution for consumer-based AI applications
and edge intelligence. For example, several large AI models are deployed with the edge-cloud computing
system [146, 216]. Compared to traditional cloud computing and multi-access edge computing (MEC),
edge-cloud computing can exploit more computation resources and achieve lower latency through the
collaboration between clouds and edges.
GenAI poses unprecedented challenges to scalable computing systems and the need for edge-cloud
computing because of three main reasons: 1) a significant amount of data generated, 2) consumer-centric
applications, and 3) high cost to maintain centralized GenAI services. First, compared to discriminative
AI, GenAI produces a significant amount of multimedia content, or so-called AI-generated content (AIGC),
in different modalities, such as audio, images, text, etc. Fig. 6.1 shows the evolution of different phases
in content creation. Compared to professionally-generated content (PGC) and user-generated content
(UGC), GenAI creates much more data on the Internet. As a result, transmission latency becomes a serious
challenge in GenAI services. Although latency is a common challenge of deploying models at the edge, it
is even more so in the context of GenAI due to a much larger data amount.
The second challenge is the unique application domain of GenAI. Currently, most GenAI services target
consumer-centric applications. In addition, many applications require real-time interactions, such as chatbots. It makes more sense to place the computation system closer to users instead of relying on a
centralized computation infrastructure to process all user requests. In addition, edge-cloud computing can
preserve more privacy for users by storing their data only on local servers or user devices. Deploying
GenAI services closer to the users by adopting an edge-cloud computing paradigm can improve efficiency
and data privacy.
Third, the required resources to run GenAI services are huge. For example, ChatGPT by OpenAI∗ is one
of the most popular GenAI services recently. It is a chatbot used to interactively answer users’ questions
in human-like responses. It processed more than 13 million daily requests in January 2023 [211]. Although
the exact computing infrastructure used by the ChatGPT service is not publicly available, we can estimate
the cost to run the service each day based on the model architecture of GPT-3 [21], the generative model
to support the ChatGPT service. GPT-3 is an LLM containing 175 billion parameters, which requires more
than 350 GB of RAM and VRAM to run the model. To deploy such a large model with minimum latency, a
distributed parallel computing system with at least 2,048 GPUs is required [21, 211] to handle user inputs.
Relying solely on the computation power in the cloud would lead to high latency when the request volume
is high. In addition, its daily electricity charge is estimated to be around $600,000 using NVIDIA A100
GPUs; not to mention the training of GPT-3, which requires 10^8 times the computation and more than 10^5
iterations. It is neither cost-efficient nor feasible to deploy such a service entirely on the cloud servers.
Due to the above-mentioned three emerging challenges, the collaboration of edge and cloud computing resources will mitigate the burden of cloud servers, especially under the high volume of requests,
or “at scale”. In this chapter, we examine four important aspects of deploying GenAI under edge-cloud
computing: 1) computation and data offloading, 2) low latency, 3) personalization, and 4) privacy.
∗ https://openai.com/blog/chatgpt
6.2 Bottlenecks for Scalable GenAI Systems
As the load on a service increases, the service should maintain a constant response time in the face of the increased workload, because new nodes can be added to the cluster and new server instances can be run. Such a service is called scalable. When services fail to meet these requirements in a centralized
cloud cluster, the edge-cloud computing paradigm can provide significant benefits in time, computation,
and power efficiency. To understand how GenAI services work at scale in the edge-cloud paradigm, we
must examine the available computation power, network speed, number of concurrent connections or
users, and latency requirements, as discussed below.
6.2.1 Memory
A modern cloud computing infrastructure often contains thousands of GPUs as computing power. Each
GPU’s video RAM (VRAM) can range from 32 to 80 GB. Thus, the total amount of memory available is at the scale of 100 TB. An LLM such as LLaMA, containing 7B ∼ 65B model parameters, requires 28 GB ∼ 260 GB of VRAM to process an inference request. In other words, the cloud server can only handle ∼4,000 requests at once, even using the most lightweight model. However, a cloud server usually faces as many as 500,000 concurrent requests, which is far more than it can process in real time. Thus, any
GenAI models that require more than 200MB of memory during inference demand distributed processing
at scale.
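The capacity argument above is simple arithmetic; the sketch below reproduces it with the rough figures from the text (100 TB of aggregate VRAM, 28 GB for the lightest LLaMA variant).

```python
# Back-of-the-envelope capacity check from Sec. 6.2.1.  All figures are
# the rough estimates used in the text, not measurements.
total_vram_gb = 100_000   # ~100 TB of aggregate GPU memory in the cloud
model_vram_gb = 28        # lightest LLaMA variant (7B parameters)

max_concurrent = total_vram_gb // model_vram_gb
# ~3,571 requests served at once, versus ~500,000 concurrent requests
assert max_concurrent < 500_000  # far below peak demand: must distribute
```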
6.2.2 Network Bandwidth & Concurrent Connections
One unique characteristic of GenAI services is their output dimensions, which are much larger than those of discriminative AI services since the latter only output low-dimensional decision vectors. For example, the output dimension can be up to 1920×1080, equivalent to 6 MB of data, in image generation tasks, while the output bitrate is between 96 kbps and 160 kbps in audio synthesis tasks. The transmitted
AI-generated content can easily exceed the network bandwidth of a centralized cluster. Any GenAI model whose transmitted content exceeds the network bandwidth, i.e.,

output bitrate × # of connections ≥ bandwidth,

demands distributed processing at scale.
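The scalability condition can be written as a one-line check; the 40 Gbps uplink in the example is a hypothetical figure for illustration, not a value from the text.

```python
def needs_distribution(output_bitrate_bps, n_connections, bandwidth_bps):
    """Scalability test from Sec. 6.2.2: distributed processing is needed
    once output bitrate x number of connections reaches the bandwidth."""
    return output_bitrate_bps * n_connections >= bandwidth_bps

# E.g., audio synthesis at 160 kbps serving 500,000 users needs 80 Gbps,
# which exceeds a hypothetical 40 Gbps uplink of a centralized cluster.
assert needs_distribution(160_000, 500_000, 40_000_000_000)
```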
6.2.3 Computation & Latency
For real-time applications, such as dialogue agents and Metaverse with AI-generated scenes, latency is
one of the top priorities. Latency is also closely related to the computation power and the number of
FLOPs. For example, a 90 fps frame rate is required to avoid dizziness in the Metaverse, meaning that the computation resources should be powerful enough to generate the content in 1/90 second. Consider using A100 GPUs, whose computation power is 312 teraFLOPs per second. To meet the 90 fps requirement,
the model needs to have a number of FLOPs lower than 3.5 teraFLOPs to achieve real-time interactions.
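The per-frame FLOPs budget follows directly from these two numbers:

```python
# FLOPs budget for real-time generation, using the figures in Sec. 6.2.3.
gpu_flops_per_sec = 312e12  # NVIDIA A100 peak throughput
target_fps = 90             # frame rate needed to avoid dizziness in VR

per_frame_budget = gpu_flops_per_sec / target_fps
# ≈ 3.47e12 FLOPs per frame, matching the ~3.5 teraFLOPs bound in the text
assert per_frame_budget < 3.5e12
```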
These requirements should be jointly considered, and one will affect the other. For example, the number of concurrent connections will affect the required network bandwidth and latency; the number of
model parameters to fit in the computation infrastructure will affect the number of concurrent connections.
6.3 Technical Challenges
There are technical challenges in training and deploying GenAI services at scale. The major ones include:
1) increased output dimensions, 2) growth in model sizes, 3) power consumption, 4) latency, and 5) infrastructure security. They are summarized below to demonstrate the need for good resource coordination
between edges and the cloud with edge-cloud computing.
6.3.1 Increased Output Dimensions
GenAI is a specific category of AI that creates new content in multimedia formats, such as audio, images,
or texts. Compared to discriminative AI, the output dimensions of GenAI are much larger, posing a new
challenge in transmitting a high volume of data. For example, for discriminative AI, the outputs are usually
a low-dimensional vector, say, the decision vector. They can be easily transmitted even with a large number
of requests. In contrast, it is challenging to transmit a high volume of multimedia data from the cloud
center to users for GenAI services. Data compression techniques [201] are needed in GenAI. In addition,
GenAI usually has a larger model size than discriminative AI. The former often demands Transformers
[188], while the latter may adopt convolutional neural networks. Consequently, GenAI demands more
computational resources, including hardware costs and power consumption. Thus, designing an efficient
edge-cloud GenAI system at scale is a unique challenge.
Figure 6.2: The development of generative LLMs and their model sizes as a function of time. The vertical
axis is in log scale. Models in the figure include GPT-2 [149], T5 [150], Turing-NLG [169], GPT-3 [21],
LaMDA [180], MT-NLG [169], and PaLM [34].
6.3.2 Growth in Model Sizes
In order to achieve better performance in various applications, GenAI systems adopt larger models with
more model parameters and computation over time. The growth rate of their model sizes is an exponential
Model | Modality | Hardware | Power (watts) | Hours | Energy Consumption (kWh) | CO2e (lbs)
WaveGAN [44] | Audio | P100 GPU ×1 | 250 | 96 | 24 | 19.63
GANSynth [50] | Audio | V100 GPU ×1 | 300 | 108 | 32.4 | 26.5
FloWaveNet [92] | Audio | V100 GPU ×1 | 300 | 272 | 81.6 | 66.74
BigGAN [20] | Image | V100 GPU ×1 | 300 | 3,072 | 921.3 | 753.54
Stable Diffusion [155] | Image | V100 GPU ×1 | 300 | 2,184 | 655 | 535.72
GPT-2 [149] | Text | TPUv3 ×32 | - | 168 | 2.8 × 10^4 | 2.39 × 10^4
GPT-3 [21] | Text | V100 GPU ×10,000 | - | 355 | 1.29 × 10^6 | 1.1 × 10^6
GLaM [144] | Text | TPUv4s | - | - | 4.56 × 10^5 | 8 × 10^4
Table 6.1: Comparison of power consumption, carbon emission, and cloud computational cost in the training of large GenAI models in different modalities.
function of time [88] as shown in Fig. 6.2. Specifically, the model sizes of neural GenAI models double
every 6 months as reported in [24]. This is called “Moore’s Law for GenAI”. In contrast, the computation
power of CPUs and GPUs only doubles every two years in the semiconductor manufacturing industry. If
the trend continues, the demand for computation will surpass its supply in the near future. Unless there is
a major breakthrough in supply, its limitation will hinder the future growth of GenAI systems. Thus, how
to train and run GenAI systems through collaboration between the cloud and edges efficiently has become
an urgent issue for the entire community to tackle.
6.3.3 Power Consumption
Power consumption is a major concern in cloud computing [128, 217]. The centralized computation infrastructure consumes a significant amount of electricity in running user requests as well as training large
models. Table 6.1 compares power consumption, carbon emission, and cloud computational cost in training large GenAI models for different modalities. The power consumption of serving GenAI is even greater than that of training GenAI models, since services need to process millions of requests per day from the users.
Power consumption and carbon emission are closely related to the number of floating point operations
(FLOPs). More FLOPs imply higher carbon emissions and electricity bills. For example, the GPT-3 model,
the backbone of ChatGPT, demands 10^23 FLOPs in one training iteration and 10^15 FLOPs in inference. Since the power efficiency of CPUs/GPUs in modern computation facilities is around 10^10 FLOPs/sec-watt, it will
(a) Cloud Computing.
(b) Multi-access Edge Computing (MEC).
(c) Edge-cloud Computing.
Figure 6.3: Illustration of latency in different computation frameworks.
demand 27.78 Wh (10^5 Joules) to process a single request. Apparently, GenAI services are not scalable.
Furthermore, they are eco-unfriendly, unsustainable, and cost-inefficient. To achieve sustainability with
large-scale GenAI services, alternative Green solutions under the edge-cloud computing paradigm are
essential.
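The per-request energy estimate is a one-line division of the figures above:

```python
# Energy estimate for one GPT-3 inference request (Sec. 6.3.3 figures).
inference_flops = 1e15  # FLOPs per inference
efficiency = 1e10       # FLOPs per second-watt of modern CPUs/GPUs

energy_joules = inference_flops / efficiency  # 1e5 J per request
energy_wh = energy_joules / 3600              # ≈ 27.78 Wh
```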
6.3.4 Latency
For real-time GenAI applications such as VR and gaming, it is of utmost importance to reduce latency.
The latency calculation in three different computing frameworks is illustrated in Fig. 6.3. It is the time
between a request sent and its response received at the user end. It is determined by uplink transmission
time, inference time, and downlink transmission time; namely,
latency = t_UL + t_inference + t_DL,

where t_UL, t_inference, and t_DL denote the uplink transmission time, inference time, and downlink transmission time, respectively. In the cloud computing framework, the latency comes from the long uplink transmission time t_UL and downlink transmission time t_DL, since the computation resources are placed far from the
users. In MEC, the transmission delay is reduced since the processing units are placed closer to the users.
However, the computation resources in edge servers are not as powerful as the ones in the cloud servers.
Thus, the inference time t_inference will be much longer, especially for computation-intensive applications,
such as GenAI services. In edge-cloud computing, tasks are divided efficiently between the edge and cloud
servers. Thus, the overall inference delay can be reduced by leveraging both computation resources in
edge and cloud servers. In addition, the transmission delay is also reduced since the connection between
edge servers and the cloud is much faster than from the user end.
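The latency decomposition can be sketched as follows; the per-framework timings are made-up illustrative values, not measurements, chosen only to reflect the qualitative trade-offs described above.

```python
def total_latency(t_ul, t_inference, t_dl):
    """latency = t_UL + t_inference + t_DL, as defined in the text."""
    return t_ul + t_inference + t_dl

# Illustrative (hypothetical) timings in seconds for the three frameworks:
cloud = total_latency(0.08, 0.05, 0.12)       # long links, fast inference
mec = total_latency(0.01, 0.30, 0.02)         # short links, slow inference
edge_cloud = total_latency(0.01, 0.08, 0.03)  # work split across both
# edge-cloud achieves the lowest end-to-end latency of the three
```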
For GenAI applications, their inference time can be longer than that of other applications due to larger
model sizes and more computations required by GenAI models. Furthermore, the output of GenAI services
can be multimedia AIGC. Transmission of multimedia data such as video will demand a longer downlink
transmission time t_DL than text data. We can reduce t_DL by allocating multimedia generation tasks to edge servers. Again, the development of green-learning-based GenAI models is urgently needed.
6.3.5 Infrastructure Security
Cloud servers need a large number of GPUs to handle user requests at scale. As mentioned before, Meta
has just started a supercomputing center with 16,000 NVIDIA A100 GPUs to support their GenAI services.
It is unrealistic to set up such powerful but costly infrastructures in many sites globally. Furthermore,
Figure 6.4: Illustration of exemplary service of the Metaverse system with GenAI under edge-cloud computing.
such a huge single-site infrastructure is vulnerable to physical and/or cyberspace attacks. Distributed
computing with multiple lightweight cloud servers and many more edge servers will offer a more robust
AI computational infrastructure in the future.
6.4 Application Scenarios
In this section, we present two exemplary GenAI services as concrete examples to demonstrate how to
deploy GenAI in the edge-cloud computing environment. In particular, these services demand low latency
and will have a large number of users when the technologies become mature and the markets are ready.
Scalability-based edge-cloud computing is critical to their successful deployment. They are: a) the Metaverse
system, which is a performance- and latency-centric application, and b) artificial intelligence of things
(AIoT), which is a personalization- and privacy-centric application. Details of GenAI model deployment
in the cloud and edges are given separately below.
6.4.1 Metaverse System
Metaverse is one of the most important applications in GenAI. With the development of GenAI, most of the
generated scenes rely on machine learning models. Metaverse requires an extremely low latency to make
the transition smoother in order to avoid dizziness. However, high-quality rendering is time-consuming,
and virtual reality (VR) goggles are resource-constrained. Generating satisfactory scenes and meeting the
low latency requirement with resource-constrained edge devices is the key to the success of the Metaverse
system. Apparently, its solutions at scale demand the close collaboration of the computation resources at
the edges and the cloud.
In the Metaverse system, every user should be placed in a single shared virtual environment. As a result, a huge map needs to be generated. Edge-cloud computing can be a latency-efficient solution for
Metaverse applications. For example, as illustrated in Fig. 6.4, the entire map is stored in the centralized
data center that can be shared among all users. Then, the locations, angles, and other parameters can be
collected by user devices and transmitted through a wireless network [85]. The cloud computing clusters
are also responsible for generating the scenes and rendering the results. The compressed scenes will be
sent back to the users. At the user end, a lightweight decoder and renderer are deployed to display the
scenes based on the corresponding viewpoints of the users. As a result, such a system design can reduce
the latency significantly since the computation-heavy parts are taken care of using powerful computation
infrastructure. In addition, the amount of data transmitted in the communication systems is minimized.
The users will send the request to the GenAI models in the cloud, and the compressed scenes will be
transmitted back to the users.
Edge servers are a fundamental component in the Metaverse system [110]. They serve a similar role
as in the content delivery network (CDN) to distribute content based on geographical locations and share
the computation load in the cloud server. Users in the same locations will be connected to the same edge
server. Once a user sends a request to the Metaverse system to generate the local scene, it is transmitted
Figure 6.5: Illustration of exemplary service of the AIoT system with GenAI in the edge-cloud computing
environment.
through the edge servers and cached. Other users in the same location can access the cached scenes in
the edge servers to further reduce the latency. Computation resources in the edge servers should also be
leveraged. For example, they can be helpful in compressing and decompressing the scenes generated in
the cloud server. As a result, not only can the latency be reduced, but the quality of the generated scenes can also be improved.
6.4.2 Artificial Intelligence of Things
Artificial Intelligence of Things (AIoT) is an emerging field that combines artificial intelligence (AI) technologies with Internet of Things (IoT) systems [231]. Through the integration of AI and ubiquitous wireless networking infrastructure, one can build AIoT systems whose end devices have a certain degree of intelligence in data processing and analytics. GenAI can be further exploited to facilitate a broader range of applications. For example, a voice assistant can interact with users in applications such as autonomous driving, smart cities, and smart homes, where fluent human speech has to be generated automatically from multiple information sources, which are often in the form of text data.
To implement AIoT with edge-cloud computing (say, voice assistant applications), we need to consider privacy, personalization, and data synchronization [32]. Users may collect data to train more relevant, personalized GenAI models. Training a simple GenAI model with acceptable performance on user devices is desired. Then, the model parameters of multiple users can be sent to cloud servers and integrated into a more advanced GenAI model through federated learning [90, 136] or split learning [160, 238]. In addition, data can be constantly collected from the end devices to ensure that the information in GenAI models is up-to-date. Online optimization [108, 210] supports GenAI model training with streams of data on the fly. User devices can be synchronized with advanced GenAI models through firmware updates. As a result, the whole system benefits from a larger pool of training data via federated or split learning while user data privacy is well protected.
The hierarchy in edge-cloud computing can be utilized for more efficient GenAI model deployment. For example, large, middle-size, and lightweight models can be placed in cloud servers, edge servers, and user devices, respectively. Different resolutions of the models can be achieved through knowledge distillation [62, 196] and model parameter pruning [85, 157]. Grouping users under the same computation facility can further reduce the computation. Different edge and cloud servers can be specialized to process different applications efficiently. Personalization can be used to optimize end devices according to user behavior. Personalization fine-tuning on user devices is generally efficient due to the deployment of lightweight models.
6.5 System Design Considerations
Design considerations for providing GenAI services at scale using edge-cloud computing are examined in this section. Training and deployment of GenAI services should be considered separately. Training GenAI models requires a large amount of computational resources and training data. Key considerations include: 1) computation offloading, 2) personalization, 3) privacy, and 4) information recency. After
Figure 6.6: The roadmap of designing GenAI services at scale. Computation offloading, latency, privacy,
and data offloading are the major considerations.
models are trained, it is desirable to deploy them on user devices for lower latency and power consumption. There are three main considerations: 1) lightweight models, 2) minimizing latency through edge-cloud collaboration, and 3) multi-modality content generation and interface. First, lightweight models are essential because of the limited resources on edge servers and user devices. Second, by properly dividing inference tasks between edges and the cloud, inference latency can be greatly reduced through edge-cloud collaboration. Third, multimedia content is becoming the main medium for humans to acquire information, as evidenced by the popularity of videos on the Internet nowadays. Multi-modality content generation and interfaces at edges should be considered carefully. Fig. 6.6 summarizes the design considerations for providing GenAI services at scale.
6.5.1 Training
Since the training of large-scale GenAI models is costly, we need to consider the following issues.
(a) Model parallelism. (b) Data parallelism. (c) Hybrid parallelism.
Figure 6.7: Three parallelism strategies for computation and data offloading in DNN model training [116].
6.5.1.1 Computation offloading
This is an important concept in edge-cloud computing and collaboration: computation resources in the cloud and at edges should be fully utilized. Traditional cloud computing puts all computational loads in a centralized cluster, so users might experience long latency if the cloud's resources cannot meet sudden heavy service requests. Furthermore, the computational cost of training large GenAI models is extremely high; it may take days or weeks to train large models. Thus, computation offloading has to be considered when training GenAI systems under the edge-cloud computing paradigm.
Most GenAI services adopt deep neural networks (DNNs) as models. DNNs consist of multiple layers, so the training procedure can be decoupled to balance computation loads. [116] illustrates how DNNs can be trained by different workers in parallel, as shown in Fig. 6.7. This idea can be leveraged in edge-cloud computing, where the user devices, edge servers, and the cloud serve as different workers. Data then does not need to be entirely transmitted to the cloud server, and training does not need to take place entirely in the cloud. Instead, different layers can be trained by different computational facilities (e.g., user devices, edge servers, and the cloud server). For example, as shown in Fig. 6.7 (a), the deepest layers are farthest from users and can be trained in the cloud. Gradients are propagated to edge servers to train the middle layers and then propagated again to user devices, where the shallow layers, closest to users, are trained. As a result, system optimization can be carried out through the collaboration of user devices, edge servers, and
Figure 6.8: Personalization of GenAI services.
the cloud server. Only the gradient information has to be transmitted in such a design. Another idea is to decouple the training data, as shown in Fig. 6.7 (b): smaller DNNs can be trained in parallel by leveraging data parallelism, and the multiple smaller models can then be integrated through federated learning. Finally, a hybrid solution exploiting both model parallelism and data parallelism can be explored as well, as shown in Fig. 6.7 (c). In the GenAI context, such parallelism and collaboration between edges and the cloud are even more important, so computation and data offloading should be carefully designed in large-scale GenAI services.
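The layer-wise split in Fig. 6.7 (a) can be sketched as follows. This is a minimal illustration under our own simplifying assumptions, not the implementation in [116]: a toy one-dimensional linear model is divided into three segments, with the shallow segment on the user device, the middle segment on the edge server, and the deep segment in the cloud; only activations (forward) and gradients (backward) cross the boundaries between workers.

```python
# Hypothetical sketch of model parallelism across device, edge, and cloud.
# Each "worker" owns a segment of layers; only activations and gradients
# cross worker boundaries, never raw user data or parameters.

def linear_layer(weight, bias):
    """Return a 1-D linear layer y = w*x + b with its gradient rule."""
    def forward(x):
        return weight * x + bias
    def backward(grad_out):        # dL/dx = dL/dy * w
        return grad_out * weight
    return forward, backward

# Shallow layers on the user device, middle on the edge, deep in the cloud.
device_fwd, device_bwd = linear_layer(weight=2.0, bias=0.0)
edge_fwd, edge_bwd = linear_layer(weight=0.5, bias=1.0)
cloud_fwd, cloud_bwd = linear_layer(weight=3.0, bias=0.0)

def pipeline_forward(x):
    a1 = device_fwd(x)      # transmitted: activation a1 (device -> edge)
    a2 = edge_fwd(a1)       # transmitted: activation a2 (edge -> cloud)
    return cloud_fwd(a2)

def pipeline_backward(grad_loss):
    g2 = cloud_bwd(grad_loss)   # transmitted: gradient g2 (cloud -> edge)
    g1 = edge_bwd(g2)           # transmitted: gradient g1 (edge -> device)
    return device_bwd(g1)       # gradient w.r.t. the input, on the device

y = pipeline_forward(4.0)       # 3.0 * (0.5 * (2.0 * 4.0) + 1.0) = 15.0
g = pipeline_backward(1.0)      # 2.0 * 0.5 * 3.0 = 3.0
```

Note how each tier only sees the values produced by its neighbor, which is what keeps transmission limited to activations and gradients in this design.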
6.5.1.2 Personalization
Edge-cloud computing can provide personalized GenAI models. While training a GenAI model requires
a large amount of data, personalization can be achieved by fine-tuning the trained model with a small
amount of user data. The collaboration between edges and the cloud for personalized services is depicted
in Fig. 6.8. First, an advanced GenAI model, called the foundation model, should be trained in the cloud
with common data. In this step, the trained foundation model can handle general requests. To achieve
Figure 6.9: Privacy preservation through federated learning.
personalization, personal data, such as user logs and metadata, are collected from user devices and sent to
edge servers. The foundation model is also placed in edge servers for personalization. Then, a fine-tuning
technique can be developed to shift the model domain from a generic one to a user-specific one using
personal data. Typically, fine-tuning requires far fewer computation resources, so it can be conducted entirely in edge servers.
6.5.1.3 Privacy
Privacy is a major concern in GenAI services to prevent personal information from being disclosed to
other users and companies. It is particularly important in the context of GenAI services since generated
content is difficult to control. One solution to privacy is the use of federated learning, as shown in Fig. 6.9.
Figure 6.10: Online optimization in edge-cloud computing.
The core concept is to share model parameters among users instead of sharing personal data. Each user has their own model stored in user devices or edge servers, depending on the application. The models are trained on user data. Information exchange among users happens by aggregating user models in the cloud: all trained user models are transmitted from edges to the cloud, where the small user models are combined to train an advanced large model. Finally, the parameters of the advanced model are synchronized back to the user models for the next round of training. By sharing model parameters in federated learning, GenAI services can preserve user privacy while collecting relevant information from users.
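The aggregation step can be sketched with a simple federated-averaging rule. This is an illustrative sketch, not the exact algorithm of the cited works: we assume each user model is a flat list of parameters, and the cloud combines the uploads with an average weighted by each user's local data size, in the style of FedAvg.

```python
# Hypothetical sketch of federated parameter aggregation (FedAvg-style).
# Users upload model parameters, never raw data; the cloud averages them,
# weighting each user by the amount of local training data.

def federated_average(user_params, user_data_sizes):
    """Weighted average of per-user parameter vectors (flat lists)."""
    total = sum(user_data_sizes)
    n_params = len(user_params[0])
    averaged = [0.0] * n_params
    for params, size in zip(user_params, user_data_sizes):
        weight = size / total
        for i, p in enumerate(params):
            averaged[i] += weight * p
    return averaged

# Three users with locally trained parameters and different data sizes.
users = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]
global_params = federated_average(users, sizes)   # [3.5, 4.5]
```

The averaged parameters would then be synchronized back to the user models for the next training round, as described above.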
Besides federated learning, split learning [8, 160, 238] offers a powerful solution to data privacy preservation when training GenAI models in a distributed setting. Instead of passing model parameters as in federated learning, split learning shares gradients among different sections of a model that are trained independently by different clients. Thus, no client can access another's original raw data. In this way, models can be optimized as new data samples arrive while data privacy is preserved, in an edge-cloud collaborative fashion.
Figure 6.11: Existing technologies to obtain lightweight GenAI models.
6.5.1.4 Incremental Update
Keeping information updated is one of the main challenges for GenAI services. For example, chatbots need the most up-to-date information to offer a better user experience. On the other hand, training GenAI models is time-consuming and inefficient, so incremental learning is needed; however, it is not easy to implement in neural network models. Online optimization with edge-cloud computing is an alternative way to keep services updated, as illustrated in Fig. 6.10. It usually involves two models: an online model and an offline model. The online model is stored in the cloud server and kept current through online optimization. At the same time, a smaller offline model is placed in the edge servers for low-latency inference and to offload the cloud online model. The online and offline models are synchronized periodically to ensure that edge intelligence is also up-to-date.
6.5.2 Deployment
Three design considerations in deploying GenAI services are elaborated below.
6.5.2.1 Lightweight Models
Deploying GenAI models on edge servers and user devices can lower latency in user-centric applications. Large GenAI models cannot be deployed on user devices due to their large model sizes and high power consumption. Lightweight GenAI models, as summarized in Fig. 6.11, are more suitable. For example, knowledge distillation fits edge-cloud computing well: the knowledge learned by a huge teacher model is transferred to a smaller student model, so the teacher model can be trained and stored in the cloud server while the student model is distilled from the teacher in the edge servers and then stored in user devices. Model pruning adopts a similar concept of deriving a smaller model from a large one, which can take place in edge servers. Other techniques include quantization and model compression; they reduce model sizes effectively without requiring collaboration between the cloud and edges.
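A minimal sketch of the pruning idea is given below. This is an illustrative magnitude-based pruning rule under our own simplifying assumption of a model as a flat list of weights, not a specific method from the cited works: the weights with the smallest magnitudes are zeroed out so the resulting model is cheaper to store and transmit to edges.

```python
# Hypothetical sketch of magnitude-based model parameter pruning.
# Weights with the smallest absolute values are set to zero; the sparse
# result is cheaper to transmit and store on edge servers and devices.

def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with smallest |w|."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
pruned = prune_by_magnitude(weights, sparsity=0.5)
# The three smallest-magnitude weights (0.01, 0.02, -0.05) are zeroed:
# [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

In practice, pruning is usually followed by a short fine-tuning pass to recover accuracy, which could itself run on an edge server as described above.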
Recently, an increasing amount of research has focused on developing lightweight GenAI models. LLaMA [183] reduces the number of model parameters in LLMs to as small as 7 billion, and Alpaca [178] fine-tunes LLaMA with a self-instruct training technique. Lightweight GenAI models encourage the development of mobile- or web-based applications on user devices, such as WebLLM†. The small model sizes also alleviate the burden on caching-based communication networks, and latency is largely reduced due to lower computation and transmission delay. This line of research demonstrates the urgency of shrinking extremely large models while retaining comparable performance.
6.5.2.2 Minimizing latency through edge-cloud collaboration
When deploying GenAI models, it is desirable to minimize the latency through the collaboration between
edge and cloud servers as illustrated in Fig. 6.3. First, transmission latency is largely reduced due to
the introduction of edge servers [28]. In general, applications are sped up 20 times while reducing energy
†
https://mlc.ai/web-llm/
consumption by 5% [35]. Furthermore, an optimized strategy to divide computation tasks between edges and the cloud can reduce inference latency, which is critical to scalable and efficient model deployment. We analyze the inference latency of GenAI models in the edge-cloud computing system and propose clear guidelines below on how tasks should be divided between edges and the cloud. To
estimate the inference latency, edge and cloud servers can be modeled as M/M/c queues since most servers
adopt parallel computing using GPUs. Each inference request can be modeled as a customer in the queue.
There are two important parameters to specify for each inference job:
• FLOPs (F) governs the service rate of the server. A higher FLOP count indicates a longer service time under the same computation resources.
• Memory Usage (U) governs the number of parallel jobs that can run on the server. Higher memory usage leads to fewer concurrent jobs.
At the server end, there are three important parameters:
• GPU Memory (G) controls how many jobs can run simultaneously.
• Computation Power (P) controls the service rate.
• Concurrent Connections (N) controls the arrival rate.
We can specify an M/M/c queue as

c = G/U,  λ = N,  µ = P/F,
where λ is the arrival rate and µ is the service rate. For example, LLaMA is a powerful text-generation model. Its smallest model contains 7B parameters, consumes about 28GB of memory during inference, and requires around 13.1 GFLOPs per inference. Suppose the cloud server is equipped with 100 NVIDIA A100 GPUs, and the edge server is equipped with 8 NVIDIA V100 GPUs. Each A100 GPU has 80GB of memory and can process 312T FLOPs per second; each V100 GPU has 32GB of memory and can process 120T FLOPs per second. The cloud server has an arrival rate of 1,000, while the edge server has an arrival rate of 20. Then, the cloud server can be modeled as an M/M/285 queue with λ = 1000 and µ = 23816, and the edge server can be modeled as an M/M/9 queue with λ = 20 and µ = 9160. As a result, the cloud server has higher server utilization and can handle multiple concurrent jobs efficiently. On the other hand, the edge server has a lower arrival rate and workload, so it is efficient at processing jobs sequentially.
Furthermore, the average service time W in the M/M/c queue can be written as

W = 1/µ + C(c, λ/µ) / (cµ − λ),

where

C(c, λ/µ) = 1 / [ 1 + (1 − λ/(cµ)) · (µ^c · c! / λ^c) · Σ_{k=0}^{c−1} λ^k / (µ^k · k!) ]

is referred to as Erlang's C formula. The inference latency of an edge server is bottlenecked by its service rate 1/µ. Consequently, to minimize the overall inference latency, tasks with lower FLOPs but higher memory usage, such as preprocessing tasks, should be distributed to edges, and tasks with higher FLOPs but lower memory usage, such as deep neural networks, should be distributed to the cloud.
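The analysis above can be sketched numerically. This is an illustrative calculation under the stated assumptions (per-GPU throughput as the per-channel service rate, a LLaMA-7B job of 28GB memory and 13.1 GFLOPs); Erlang's C is computed here via the standard Erlang-B recursion, which is equivalent to the closed form above but numerically stable for large c.

```python
# Sketch of the M/M/c latency estimate for the cloud and edge servers.

def mmc_params(total_gpu_mem, job_mem, n_connections, gpu_flops, job_flops):
    """Map server and job specs to M/M/c parameters (c, lambda, mu)."""
    c = total_gpu_mem // job_mem     # parallel jobs limited by GPU memory
    lam = n_connections              # arrival rate = concurrent connections
    mu = gpu_flops / job_flops       # per-channel service rate
    return c, lam, mu

def erlang_c(c, lam, mu):
    """Probability an arriving job waits, via the Erlang-B recursion."""
    rho = lam / mu                   # offered load
    b = 1.0
    for k in range(1, c + 1):        # B_k = rho*B_{k-1} / (k + rho*B_{k-1})
        b = rho * b / (k + rho * b)
    return b / (1.0 - (rho / c) * (1.0 - b))

def avg_service_time(c, lam, mu):
    """Average time in system: W = 1/mu + C(c, lam/mu) / (c*mu - lam)."""
    return 1.0 / mu + erlang_c(c, lam, mu) / (c * mu - lam)

# Cloud: 100 A100s (80GB, 312 TFLOPS each); edge: 8 V100s (32GB, 120 TFLOPS).
c_cloud, lam_cloud, mu_cloud = mmc_params(100 * 80, 28, 1000, 312e12, 13.1e9)
c_edge, lam_edge, mu_edge = mmc_params(8 * 32, 28, 20, 120e12, 13.1e9)
# c_cloud = 285, mu_cloud ~ 23816; c_edge = 9, mu_edge ~ 9160, as in the text.
w_cloud = avg_service_time(c_cloud, lam_cloud, mu_cloud)
w_edge = avg_service_time(c_edge, lam_edge, mu_edge)
```

At these utilizations the waiting term is negligible, so W is dominated by the service time 1/µ on both tiers, which is why the service rate is the bottleneck for edge servers.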
6.5.2.3 Multi-modality Content Generation and Interface
Image captioning and text-to-image generation are two examples of multi-modality content generation and interface. To implement multi-modality content generation, we need a joint embedding space to connect the two modalities. CLIP [148] is a well-known multi-modality GenAI model that learns a joint multimodal latent space for language and vision through contrastive pre-training. We elaborate on how such a framework can be efficiently deployed under the edge-cloud computing paradigm. Multi-modality models usually consist of three modules: 1) the input module, 2) the generation model, and 3) the output module. The first and third modules are closer to users and do not require as many computational resources as the second. Thus, we can place the input/output modules in edge servers or user devices to avoid transmitting generated content, while the main generation module is deployed in the cloud server since it requires more computation resources.
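The three-module split can be sketched as a placement map. This is a toy illustration under our own naming (the module and tier names are hypothetical, and the "embeddings" are stand-in strings), showing how only the computation-heavy generation step runs in the cloud while the user-facing steps run at the edge.

```python
# Hypothetical sketch of placing multi-modality modules across tiers.
# Input/output modules sit at the edge (close to users); the heavy
# generation model sits in the cloud; compact embeddings cross tiers.

PLACEMENT = {
    "input_module": "edge",       # encodes a user request into an embedding
    "generation_model": "cloud",  # computation-heavy generative backbone
    "output_module": "edge",      # decodes/renders content for the user
}

def handle_request(text):
    """Trace a request through the pipeline, recording where each step runs."""
    trace = []
    embedding = f"emb({text})"               # input module at the edge
    trace.append(("input_module", PLACEMENT["input_module"]))
    latent = f"gen({embedding})"             # generation in the cloud
    trace.append(("generation_model", PLACEMENT["generation_model"]))
    content = f"render({latent})"            # output module at the edge
    trace.append(("output_module", PLACEMENT["output_module"]))
    return content, trace

content, trace = handle_request("a cat on a sofa")
# Only the generation step runs in the cloud; both user-facing steps run at the edge.
```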
6.6 Research Outlooks
It is important to think beyond the current GenAI service framework in proposing future research directions. Some promising topics are given in this section.
6.6.1 Generic versus Domain-specific GenAI Models
As one of the most famous GenAI services nowadays, ChatGPT provides a generic GenAI model at the expense of a large model size and a high running cost. It may be advantageous to trade breadth for depth of generated content to lower the service cost and enhance the quality of service. That is, instead of handling general questions, it is more efficient to train GenAI models for a specific domain. In addition to parameter efficiency in training, domain-specific applications imply more homogeneous data and users. As a result, under the edge-cloud computing paradigm, caching [97] is more likely to be adopted to further improve efficiency. Examples of domain-specific applications include healthcare and financial advice, where the accuracy of generated content is the top priority.
6.6.2 Decomposition of Large Language Models
ChatGPT is a large language model (LLM) built upon large pre-trained transformers for generative tasks.
It does not leverage the tool of knowledge graphs (KGs), where knowledge is stored in a graph-structured
format. It is appealing to decompose a large language model into smaller ones that have an interface with
domain-specific KGs. This decomposition is expected to lower the complexity of the GenAI system for
cost reduction. The resulting AIGC services can be more transparent and scalable. Furthermore, personalization is easier to offer with the help of KGs [158]. That is, generic KGs are stored in the cloud, while
personalized KGs are stored in local servers or user devices. In addition, edge and cloud servers can collaborate in a way that the reasoning tasks using LLMs are processed in the cloud with more computational
resources, while the edge servers are responsible for natural language understanding (NLU) and natural
language generation (NLG) with constrained resources.
6.6.3 Quality Assurance for AIGC
The quality assessment of generated content, i.e., how similar it is to human-generated content, is an important future research topic. Quality assurance modules can be easily deployed on user devices as filters for content generated in the cloud. The quality assurance module can be trained collaboratively with the GenAI models in the cloud to improve performance [111]. Different considerations apply to different AIGC modalities. Two examples are given below.
a) Visual Content. One may use common sense to evaluate the quality of generated visual content.
For example, a picture with a person riding a horse is more natural than the opposite. Generated content
that contradicts common sense tends to look strange to users. Sensitive content, copyright content, and
trademarks should also be avoided in the generated content [45, 46, 237]. Automatic detection [65, 96] of
strange and/or forbidden AIGC is still an open problem. Furthermore, deepfake images can be a security
concern for some applications. A lightweight deepfake detection solution [30] has been developed to address this concern.
b) Textual Content. The quality of generated texts can be evaluated at three levels: grammatical
correctness, readability, and factual correctness. Coherency and conciseness are criteria for readability.
They are more difficult to evaluate than grammatical errors. Mis/disinformation is already common over
the Internet. It will be even easier to generate a large amount of fake news for malicious purposes with
the GenAI service.
6.6.4 Green GenAI Models
To address the high carbon footprint incurred by huge deep learning networks, green learning [100] has
been proposed as an alternative learning paradigm in recent years. A green learning model is characterized
by its low carbon footprint, lightweight model, low computational complexity, and logical transparency. In
addition, unlike deep neural networks, which require end-to-end optimization, green learning models are
modularized and can be optimized separately. Such a characteristic is particularly appealing under edge-cloud collaboration as individual modules can be optimized at the user devices with minimum memory
requirement and carbon footprint. Green GenAI models have been explored in the last several years, e.g.,
NITES [105], TGHop [106], Pager [9], GENHOP [104]. These models are very attractive at the edges. They
can also be implemented in cloud servers to reduce carbon footprints and save electricity bills. More efforts
along this line are needed.
6.6.5 Attacks and Defense
Attacks and defenses are important in computer networks and AI models. From the communication perspective, since most of the computations are conducted in the cloud servers, user data will be transmitted
from user devices to edges, then finally to clouds. In addition, the generated content will be sent back to
the users. In such a process, the data will travel through many computers and networks, increasing the
risk of backdoor attacks on the models or the data. It is important to design a defense mechanism [13] for
the generated content, such as data encryption [114], to prevent any attack during transmission. From the
model perspective, the generated content can be manipulated to yield harmful outcomes [41, 166]. Such
an attack on the GenAI models is called an adversarial attack. Thus, detecting adversarial attacks and
improving the robustness and trustworthiness of GenAI models are essential.
6.6.6 Hierarchical Knowledge System
“Does GenAI have the intelligence to understand user requests?” There has been a heated debate for a while about how GenAI models understand user inputs and react to them. However, as with humans, no intelligent agent can be built without a knowledge system. In the world of computers, knowledge systems
are usually represented as knowledge graphs (KGs) [84], which store knowledge in a graph format. To
achieve artificial general intelligence (AGI), a mechanism for the models to communicate with KGs is
required and demands further investigation [193, 226]. KGs are usually stored as databases and can interact
with GenAI models efficiently. In addition, the agent in the cloud and the agent on the user devices may not need the same degree of knowledge due to their different hardware specifications. Cloud agents can serve as “teacher models” equipped with more universal knowledge, while edge agents usually only need to focus on a specific, customized task, so their knowledge systems can be efficiently distilled from the teacher models [158, 203]. Such a hierarchical knowledge system is important
to achieve AGI, especially under the edge-cloud computing paradigm.
6.6.7 Collaboration among Different Agencies
It is a unique characteristic of edge-cloud computing that user data privacy and data collaboration in model training can be achieved at the same time. This characteristic is especially crucial for application domains that require both. One practical example is the public sector [4]: data held by different bureaus are confidential and cannot be shared, yet collaboration between bureaus is often required to train a better GenAI system for the public sector. Other examples are GenAI for education [10] and GenAI for hospitals [93]. While the data of different institutions should not be shared with each other, common knowledge can be exchanged through the edge-cloud collaboration paradigm. These are practical examples of future application domains for GenAI under edge-cloud computing.
6.6.8 Bias and Fairness
Bias and fairness have long been important topics in AI research [142, 156]. They are even more important for GenAI since generated multimedia content may be affected by bias more easily than in discriminative AI. Bias factors include cultural differences, differences in application domains, etc. They may come from differences between the large training corpora collected and stored in the cloud and the distributed training data collected by user devices from different population groups. For example, a chatbot might be trained primarily on English data in the cloud and, as a result, perform more poorly on low-resource languages. Healthcare-oriented GenAI is particularly concerned with issues of bias and fairness since the corresponding professional services carry high liability and demand high accuracy. Through edge-cloud collaboration [225], it is possible to mitigate bias and fairness issues in GenAI since information is shared among cloud and edge servers, which allows a broader range of data sources.
6.7 Conclusions
The training and deployment of GenAI services at scale pose a new challenge to the design of modern edge-cloud computational systems due to extremely large model sizes, increased output dimensions, heavy power consumption, and potential latency caused by a lack of computational and network resources. Two illustrative GenAI services were envisioned in this work to show the importance of developing GenAI systems at scale on the one hand and to validate the challenges claimed on the other. Afterward, an in-depth discussion of various design considerations for GenAI services over current communication systems was given. It was concluded that a desired design has to balance computational resources between edge and cloud servers and consider latency, data privacy, and personalization. Specifically, federated and split learning, where small GenAI models are trained at edges while large GenAI models are trained in the cloud by combining a large number of small models, are expected to play important roles. As a result, most inference tasks can be distributed to edges. Finally, we point out several future research directions, such as domain-specific GenAI models, decomposition of large language models, green GenAI models, AIGC quality assurance, attacks and defense in edge-cloud computing, hierarchical knowledge systems, collaboration among different agencies, and bias and fairness of GenAI.
Chapter 7
Conclusion and Future Work
7.1 Summary of the Research
In this thesis, we develop explainable and lightweight approaches for knowledge graph completion and identify key bottlenecks for scalable generative content delivery. Three novel models for KGC, targeting three different tasks, namely, link prediction, triple classification, and entity type prediction, are proposed. All methods are designed to be explainable and lightweight. Below, we summarize the three novel methods.
KGBoost: A Classification-based Knowledge Graph Completion Method with Negative Sampling. In this work, we propose KGBoost, a knowledge graph completion method with a modularized design that models the unique pattern of each relation. Different from previous KG embedding models that use a single score function for all relations, we formulate link prediction for each relation as a binary classification problem and leverage XGBoost to predict missing links. In addition, range-constrained with co-occurrence (rcwc) negative sampling and self-adversarial negative sampling are proposed to generate effective negative samples. Experimental results show that KGBoost not only outperforms state-of-the-art methods in link prediction but also works well under a low-dimensional setting.
GreenKGC: A Lightweight Knowledge Graph Completion Method. In this work, a lightweight KGC method, called GreenKGC, was proposed to make accurate link predictions in low dimensions. It consists of three modules that can be trained individually: 1) representation learning, 2) feature pruning, and 3) decision learning. Experimental results demonstrate that GreenKGC can achieve satisfactory performance in as few as 8 dimensions. In addition, experiments on ogbl-wikikg2 show that GreenKGC can obtain competitive results with far fewer model parameters. Furthermore, an ablation study shows the effectiveness of KG partitioning and feature pruning.
AsyncET: Asynchronous Learning for Knowledge Graph Entity Typing with Auxiliary Relations. In this work, we observe that the single auxiliary relation hasType, previously used by KGE methods to solve the KGET task, is not adequate to model the diverse relations between entities and types. Thus, we add more auxiliary relations based on the types' context, defined as a collection of attributes of the entities that each type is associated with. As such, neighborhood information is implicitly encoded when the auxiliary relations are added. We then propose an iterative training scheme, named AsyncET, for the KGET task. The entity embeddings are first initialized by training with factual triples; typing information is then encoded using typing triples. We conduct experiments on two KGET datasets. The results demonstrate that AsyncET outperforms all other methods while retaining much better asymptotic time and space complexity.
In Chapter 6, we bridge KGs with applications in GenAI models for broader impact. GenAI models are still in their early stages, and there is potential to improve their explainability and scalability by integrating explainable and lightweight KGC methods. Thus, we identify the key bottlenecks for scaling up GenAI models and provide design considerations for training and deploying GenAI models at scale.
Scalable Generative Content Delivery on Demand. Scaling up Generative AI (GenAI) models poses unique challenges, including increased output dimensions, heavy power consumption, and potential latency caused by a lack of computational and network resources. Two application scenarios were
116
Figure 7.1: Medical KG construction.
envisioned to show the importance of developing GenAI systems at scale. Design considerations focusing on balancing computational resources between edges and cloud servers, latency, data privacy, and
personalization are proposed. Specifically, federated and split learning, where small GenAI models are
trained individually, while large GenAI models are integrated from a large number of small models later,
are expected to play important roles. As a result, most inference tasks can be distributed.
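The integration step can be illustrated with a federated-averaging sketch. This is a generic illustration of the aggregation idea, not the thesis system: the two-parameter models, gradients, and client data sizes are hypothetical.

```python
# Sketch of federated averaging: edge clients take local SGD steps on their
# private data; the server integrates a global model by size-weighted
# averaging of the client weights. Weights are plain lists of floats.

def local_update(weights, gradient, lr=0.1):
    """One local SGD step on an edge device."""
    return [w - lr * g for w, g in zip(weights, gradient)]

def federated_average(client_weights, client_sizes):
    """Server aggregation: average client models weighted by local data size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]

global_model = [0.0, 0.0]
# Two edge clients compute local updates on their private data.
clients = [local_update(global_model, g) for g in ([1.0, -1.0], [3.0, 1.0])]
global_model = federated_average(clients, client_sizes=[10, 30])
```

Only model weights cross the network, which is what makes the scheme attractive for the data-privacy considerations discussed above.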
7.2 Future Research Directions
The integration of KGs and large language models (LLMs) for better controllability, explainability, domain specialization, and scalability is important for green generative content delivery. In this section, two future research directions are listed to extend the proposed green KG completion models to solve real-world problems.
7.2.1 Domain-Specific Knowledge Graph Construction and Applications
KGs are developed from traditional expert systems, which are highly knowledge-intensive and require experts in the loop. As machine learning on unstructured data matures, KGs can be constructed and extracted directly from large textual corpora. However, KGs still inherit the characteristics of expert systems in that they generally perform better in a given application domain than in serving a general purpose. Thus, it is desirable to build KGs for domains that require explainability and scalability in their predictions and that possess abundant unprocessed resources. One example of such a domain is medical informatics. As shown in Fig. 7.1, many resources, such as research papers, medical records, and glossaries, can be used as raw data to build a medical KG. First, entities and relationships are identified. Then, through a regularization and matching process, the initial medical KG can be augmented with more knowledge and curated in a highly structured format. Applications of a medical KG include the prediction of treatment outcomes [126], phenotype identification, and interactive medical AI systems.

Figure 7.2: General pipeline of retrieval augmented generation with knowledge graphs.
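The identification and matching steps of such a construction pipeline can be sketched as follows. The pattern-based extractor, the synonym table, and the two-sentence corpus are hypothetical stand-ins; a real system would use trained NER/relation-extraction models and medical ontologies.

```python
# Sketch of a medical KG construction pipeline: extract (head, relation, tail)
# triples from raw text, then normalize entity mentions in a matching step so
# that synonymous mentions collapse into one KG entity.
import re

SYNONYMS = {"acetylsalicylic acid": "aspirin"}   # toy matching/normalization table

def normalize(mention):
    m = mention.lower().strip()
    return SYNONYMS.get(m, m)

def extract_triples(sentence):
    """Identify (head, relation, tail) with a toy 'X treats Y' pattern."""
    triples = []
    for head, tail in re.findall(r"(\w[\w ]*?) treats (\w[\w ]*)", sentence):
        triples.append((normalize(head), "treats", normalize(tail)))
    return triples

corpus = ["Aspirin treats headache", "Acetylsalicylic acid treats fever"]
kg = set()
for sent in corpus:
    kg.update(extract_triples(sent))   # curate into a structured KG
```

The matching step is what merges "Acetylsalicylic acid" and "Aspirin" into a single entity, so both extracted facts attach to the same node.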
7.2.2 Retrieval-Augmented Generation (RAG) with Knowledge Graphs
Large language models (LLMs) have demonstrated the power of text generation. However, they still have limitations. For example, no sources can be attributed to the generated responses. In addition, the models are only equipped with the knowledge available at the time they were trained. The poor explainability of LLMs and their limited capability for incremental updates inspire research on retrieval-augmented generation (RAG). RAG aims to provide LLMs with additional resources, retrieved from a database in response to user prompts. The retrieval stage can be realized with KGs, for which many explainable and lightweight reasoning models are proposed in this thesis. As shown in Fig. 7.2, we aim to handle the reasoning task with KGs, while the LLMs serve as the user interface. Such a design can largely improve the efficiency and explainability of LLMs on tasks such as question answering [86, 101].
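The retrieval-then-generate flow can be sketched as below. The toy KG, the substring-based entity linking, and the llm() stub are hypothetical placeholders for a real KG, an entity linker, and the LLM interface.

```python
# Sketch of KG-grounded RAG: link entities mentioned in the user prompt to
# the KG, retrieve matching triples, and prepend the verbalized facts to the
# prompt before it reaches the LLM. The retrieved triples double as sources.

KG = [("Aspirin", "treats", "headache"), ("Aspirin", "discoveredBy", "Hoffmann")]

def retrieve(prompt, kg):
    """Return triples whose head or tail entity is mentioned in the prompt."""
    words = prompt.lower()
    return [t for t in kg if t[0].lower() in words or t[2].lower() in words]

def build_prompt(question, kg):
    facts = ". ".join(f"{h} {r} {t}" for h, r, t in retrieve(question, kg))
    return f"Context: {facts}.\nQuestion: {question}"

def llm(prompt):                      # stand-in for the actual LLM interface
    return prompt.splitlines()[0]     # echoes the retrieved context

augmented = build_prompt("What does aspirin treat?", KG)
answer = llm(augmented)
```

Because the retrieved triples are explicit, every generated answer can be traced back to the supporting KG facts, which is the explainability benefit argued for above.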
Bibliography
[1] Eleni Adamopoulou and Lefteris Moussiades. “An overview of chatbot technology”. In: IFIP
international conference on artificial intelligence applications and innovations. Springer. 2020,
pp. 373–383.
[2] Addi Ait-Mlouk and Lili Jiang. “KBot: a Knowledge graph based chatBot for natural language
understanding over linked data”. In: IEEE Access 8 (2020), pp. 149220–149230.
[3] KM Annervaz, Somnath Basu Roy Chowdhury, and Ambedkar Dukkipati. “Learning beyond
datasets: Knowledge graph augmented neural networks for natural language processing”. In:
arXiv preprint arXiv:1802.05930 (2018).
[4] Naomi Aoki. “An experimental study of public trust in AI chatbots in the public sector”. In:
Government Information Quarterly 37.4 (2020), p. 101490.
[5] Suket Arora, Kamaljeet Batra, and Sarabjit Singh. “Dialogue system: A brief review”. In: arXiv
preprint arXiv:1306.4134 (2013).
[6] Andrea Asperti, Davide Evangelista, and Elena Loli Piccolomini. “A survey on variational
autoencoders from a green AI perspective”. In: SN Computer Science 2.4 (2021), p. 301.
[7] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and
Zachary Ives. “Dbpedia: A nucleus for a web of open data”. In: The semantic web. Springer, 2007,
pp. 722–735.
[8] Ahmad Ayad, Melvin Renner, and Anke Schmeink. “Improving the communication and
computation efficiency of split learning for iot applications”. In: 2021 IEEE Global Communications
Conference (GLOBECOM). IEEE. 2021, pp. 1–6.
[9] Zohreh Azizi and C-C Jay Kuo. “PAGER: Progressive Attribute-Guided Extendable Robust Image
Generation”. In: arXiv preprint arXiv:2206.00162 (2022).
[10] David Baidoo-Anu and Leticia Owusu Ansah. “Education in the era of generative artificial
intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and
learning”. In: Available at SSRN 4337484 (2023).
[11] Ivana Balažević, Carl Allen, and Timothy Hospedales. “Multi-relational poincaré graph
embeddings”. In: Advances in Neural Information Processing Systems 32 (2019).
[12] Ivana Balažević, Carl Allen, and Timothy M Hospedales. “TuckER: Tensor factorization
for knowledge graph completion”. In: arXiv preprint arXiv:1901.09590 (2019).
[13] Clark Barrett, Brad Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi,
Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, et al. “Identifying
and Mitigating the Security Risks of Generative AI”. In: arXiv preprint arXiv:2308.14840 (2023).
[14] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. “Freebase: a
collaboratively created graph database for structuring human knowledge”. In: Proceedings of the
2008 ACM SIGMOD international conference on Management of data. 2008, pp. 1247–1250.
[15] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. “A semantic matching energy
function for learning with multi-relational data”. In: Machine Learning 94.2 (2014), pp. 233–259.
[16] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko.
“Translating embeddings for modeling multi-relational data”. In: Advances in neural information
processing systems 26 (2013).
[17] Ibrahim Bounhas, Nadia Soudani, and Yahya Slimani. “Building a morpho-semantic knowledge
graph for Arabic information retrieval”. In: Information Processing & Management 57.6 (2020),
p. 102124.
[18] Leo Breiman. “Random forests”. In: Machine learning 45.1 (2001), pp. 5–32.
[19] Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and
regression trees. Routledge, 2017.
[20] Andrew Brock, Jeff Donahue, and Karen Simonyan. “Large scale GAN training for high fidelity
natural image synthesis”. In: arXiv preprint arXiv:1809.11096 (2018).
[21] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are
few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901.
[22] Liwei Cai and William Yang Wang. “KBGAN: Adversarial Learning for Knowledge Graph
Embeddings”. In: Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
2018, pp. 1470–1480.
[23] Keyan Cao, Yefan Liu, Gongjie Meng, and Qimeng Sun. “An overview on edge computing
research”. In: IEEE access 8 (2020), pp. 85714–85728.
[24] Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Dai, Philip S Yu, and Lichao Sun. “A
comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to
chatgpt”. In: arXiv preprint arXiv:2303.04226 (2023).
[25] Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. “Evaluation of text generation: A survey”. In:
arXiv preprint arXiv:2006.14799 (2020).
[26] Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, and Christopher Ré.
“Low-dimensional hyperbolic knowledge graph embeddings”. In: Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics. July 2020, pp. 6901–6914. doi: 10.18653/v1/2020.acl-main.617.
[27] Linlin Chao, Jianshan He, Taifeng Wang, and Wei Chu. “PairRE: Knowledge graph embeddings
via paired relation vectors”. In: arXiv preprint arXiv:2011.03798 (2020).
[28] Batyr Charyyev, Engin Arslan, and Mehmet Hadi Gunes. “Latency comparison of cloud
datacenters and edge servers”. In: GLOBECOM 2020-2020 IEEE Global Communications Conference.
IEEE. 2020, pp. 1–6.
[29] Chandramani Chaudhary, Poonam Goyal, Dhanashree Nellayi Prasad, and Yi-Ping Phoebe Chen.
“Enhancing the quality of image tagging using a visio-textual knowledge base”. In: IEEE
Transactions on Multimedia 22.4 (2019), pp. 897–911.
[30] Hong-Shuo Chen, Mozhdeh Rouhsedaghat, Hamza Ghani, Shuowen Hu, Suya You, and
C-C Jay Kuo. “Defakehop: A light-weight high-performance deepfake detector”. In: 2021 IEEE
International Conference on Multimedia and Expo (ICME). IEEE. 2021, pp. 1–6.
[31] Tianqi Chen and Carlos Guestrin. “XGBoost: A scalable tree boosting system”. In: Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining. 2016,
pp. 785–794.
[32] Peng Cheng and Utz Roedig. “Personal voice assistant security and privacy—a survey”. In:
Proceedings of the IEEE 110.4 (2022), pp. 476–507.
[33] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. “Learning phrase representations using RNN
encoder-decoder for statistical machine translation”. In: arXiv preprint arXiv:1406.1078
(2014).
[34] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra,
Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al.
“Palm: Scaling language modeling with pathways”. In: arXiv preprint arXiv:2204.02311 (2022).
[35] Byung-Gon Chun, Sunghwan Ihm, Petros Maniatis, Mayur Naik, and Ashwin Patti. “Clonecloud:
elastic execution between mobile device and cloud”. In: Proceedings of the sixth conference on
Computer systems. 2011, pp. 301–314.
[36] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. “Electra: Pre-training
text encoders as discriminators rather than generators”. In: arXiv preprint arXiv:2003.10555 (2020).
[37] Munir Cochinwala, Verghese Kurien, Gail Lalk, and Dennis Shasha. “Efficient data
reconciliation”. In: Information Sciences 137.1-4 (2001), pp. 1–15.
[38] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and
Anil A Bharath. “Generative adversarial networks: An overview”. In: IEEE signal processing
magazine 35.1 (2018), pp. 53–65.
[39] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N Mendes. “Improving efficiency and
accuracy in multilingual entity extraction”. In: Proceedings of the 9th international conference on
semantic systems. 2013, pp. 121–124.
[40] Rajarshi Das, Ameya Godbole, Ankita Naik, Elliot Tower, Manzil Zaheer, Hannaneh Hajishirzi,
Robin Jia, and Andrew McCallum. “Knowledge base question answering by case-based reasoning
over subgraphs”. In: International conference on machine learning. PMLR. 2022, pp. 4777–4793.
[41] Jieren Deng, Yijue Wang, Ji Li, Chao Shang, Hang Liu, Sanguthevar Rajasekaran, and
Caiwen Ding. “Tag: Gradient attack on transformer-based language models”. In: arXiv preprint
arXiv:2103.06819 (2021).
[42] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. “Convolutional 2d
knowledge graph embeddings”. In: Thirty-second AAAI conference on artificial intelligence. 2018.
[43] Laura Dietz, Alexander Kotov, and Edgar Meij. “Utilizing knowledge graphs for text-centric
information retrieval”. In: The 41st international ACM SIGIR conference on research & development
in information retrieval. 2018, pp. 1387–1390.
[44] Chris Donahue, Julian McAuley, and Miller Puckette. “Adversarial audio synthesis”. In: arXiv
preprint arXiv:1802.04208 (2018).
[45] Hongyang Du, Zonghang Li, Dusit Niyato, Jiawen Kang, Zehui Xiong, Huawei Huang, and
Shiwen Mao. “Generative AI-aided optimization for AI-generated content (AIGC) services in edge
networks”. In: arXiv preprint arXiv:2303.13052 (2023).
[46] Hongyang Du, Zonghang Li, Dusit Niyato, Jiawen Kang, Zehui Xiong, Dong In Kim, et al.
“Enabling AI-Generated Content (AIGC) Services in Wireless Edge Networks”. In: arXiv preprint
arXiv:2301.03220 (2023).
[47] Yifan Du, Zikang Liu, Junyi Li, and Wayne Xin Zhao. “A survey of vision-language pre-trained
models”. In: arXiv preprint arXiv:2202.10936 (2022).
[48] Sijing Duan, Dan Wang, Ju Ren, Feng Lyu, Ye Zhang, Huaqing Wu, and Xuemin Shen.
“Distributed Artificial Intelligence Empowered by End-Edge-Cloud Computing: A Survey”. In:
IEEE Communications Surveys & Tutorials (2022).
[49] Ibrahim A Elgendy and Rahul Yadav. “Survey on mobile edge-cloud computing: A taxonomy on
computation offloading approaches”. In: Security and Privacy Preserving for IoT and 5G Networks:
Techniques, Challenges, and New Directions (2022), pp. 117–158.
[50] Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and
Adam Roberts. “Gansynth: Adversarial neural audio synthesis”. In: arXiv preprint
arXiv:1902.08710 (2019).
[51] Melike Erol-Kantarci and Sukhmani Sukhmani. “Caching and computing at the edge for mobile
augmented reality and virtual reality (AR/VR) in 5G”. In: Ad Hoc Networks: 9th International
Conference, AdHocNets 2017, Niagara Falls, ON, Canada, September 28–29, 2017, Proceedings.
Springer. 2018, pp. 169–177.
[52] William Fedus, Ian Goodfellow, and Andrew M Dai. “Maskgan: better text generation via filling
in the_”. In: arXiv preprint arXiv:1801.07736 (2018).
[53] Chuan Feng, Pengchao Han, Xu Zhang, Bowen Yang, Yejun Liu, and Lei Guo. “Computation
offloading in mobile edge computing networks: A survey”. In: Journal of Network and Computer
Applications (2022), p. 103366.
[54] David Foster. Generative deep learning: teaching machines to paint, write, compose, and play.
O’Reilly Media, 2019.
[55] Johannes Fürnkranz. “Pairwise classification as an ensemble technique”. In: European Conference
on Machine Learning. Springer. 2002, pp. 97–110.
[56] Rui Gao, Xingsong Hou, Jie Qin, Jiaxin Chen, Li Liu, Fan Zhu, Zhao Zhang, and Ling Shao.
“Zero-VAE-GAN: Generating unseen features for generalized and transductive zero-shot
learning”. In: IEEE Transactions on Image Processing 29 (2020), pp. 3665–3680.
[57] Simson Garfinkel. Architects of the information society: 35 years of the Laboratory for Computer
Science at MIT. MIT press, 1999.
[58] Xiou Ge, Yun-Cheng Wang, Bin Wang, and C-C Jay Kuo. “CompoundE: Knowledge Graph
Embedding with Translation, Rotation and Scaling Compound Operations”. In: arXiv preprint
arXiv:2207.05324 (2022).
[59] Xiou Ge, Yun-Cheng Wang, Bin Wang, and C-C Jay Kuo. “CORE: A Knowledge Graph Entity
Type Prediction Method via Complex Space Regression and Embedding”. In: arXiv preprint
arXiv:2112.10067 (2021).
[60] Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, and
Xavier Alameda-Pineda. “Dynamical variational autoencoders: A comprehensive review”. In:
arXiv preprint arXiv:2008.12595 (2020).
[61] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. “Generative adversarial networks”. In: Communications of
the ACM 63.11 (2020), pp. 139–144.
[62] Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. “Knowledge distillation: A
survey”. In: International Journal of Computer Vision 129 (2021), pp. 1789–1819.
[63] Roberto Gozalo-Brizuela and Eduardo C Garrido-Merchan. “ChatGPT is not all you need. A State
of the Art Review of large Generative AI models”. In: arXiv preprint arXiv:2301.04655 (2023).
[64] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. “Draw: A
recurrent neural network for image generation”. In: International conference on machine learning.
PMLR. 2015, pp. 1462–1471.
[65] Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. “Giqa: Generated image quality
assessment”. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
23–28, 2020, Proceedings, Part XI 16. Springer. 2020, pp. 369–385.
[66] Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Ye. “A review on generative
adversarial networks: Algorithms, theory, and applications”. In: IEEE transactions on knowledge
and data engineering (2021).
[67] Nitish Gupta, Sameer Singh, and Dan Roth. “Entity linking via joint encoding of types,
descriptions, and context”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural
Language Processing. 2017, pp. 2681–2690.
[68] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang,
Y. Zhang, and D. Tao. “A Survey on Vision Transformer”. In: IEEE Transactions on Pattern Analysis
and Machine Intelligence 45.01 (Jan. 2023), pp. 87–110. issn: 1939-3539. doi:
10.1109/TPAMI.2022.3152247.
[69] Junheng Hao, Muhao Chen, Wenchao Yu, Yizhou Sun, and Wei Wang. “Universal representation
learning of knowledge bases by jointly embedding instances and ontological concepts”. In:
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining. 2019, pp. 1709–1719.
[70] Abhishek Hazra, Mainak Adhikari, Tarachand Amgoth, and Satish Narayana Srirama. “Fog
computing for energy-efficient data offloading of IoT applications in industrial sensor networks”.
In: IEEE Sensors Journal 22.9 (2022), pp. 8663–8671.
[71] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. “Distilling the knowledge in a neural network”.
In: arXiv preprint arXiv:1503.02531 2.7 (2015).
[72] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. “Distilling the Knowledge in a Neural
Network”. In: NIPS Deep Learning and Representation Learning Workshop. 2015. url:
http://arxiv.org/abs/1503.02531.
[73] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9.8
(1997), pp. 1735–1780.
[74] Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld.
“Knowledge-based weak supervision for information extraction of overlapping relations”. In:
Proceedings of the 49th annual meeting of the association for computational linguistics: human
language technologies. 2011, pp. 541–550.
[75] Yongjun Hong, Uiwon Hwang, Jaeyoon Yoo, and Sungroh Yoon. “How generative adversarial
networks and their variants work: An overview”. In: ACM Computing Surveys (CSUR) 52.1 (2019),
pp. 1–43.
[76] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu,
Michele Catasta, and Jure Leskovec. “Open graph benchmark: Datasets for machine learning on
graphs”. In: Advances in neural information processing systems 33 (2020), pp. 22118–22133.
[77] Zhiwei Hu, Víctor Gutiérrez-Basulto, Zhiliang Xiang, Ru Li, and Jeff Z Pan. “Transformer-based
entity typing in knowledge graphs”. In: arXiv preprint arXiv:2210.11151 (2022).
[78] Hongren Huang, Chen Li, Xutan Peng, Lifang He, Shu Guo, Hao Peng, Lihong Wang, and
Jianxin Li. “Cross-knowledge-graph entity alignment via relation prediction”. In:
Knowledge-Based Systems 240 (2022), p. 107813.
[79] Xiao Huang, Jingyuan Zhang, Dingcheng Li, and Ping Li. “Knowledge graph embedding based
question answering”. In: Proceedings of the twelfth ACM international conference on web search and
data mining. 2019, pp. 105–113.
[80] Zichao Huang, Bo Li, and Jian Yin. “Knowledge Graph Embedding by Learning to Connect Entity
with Relation”. In: Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint
International Conference on Web and Big Data. Springer. 2018, pp. 400–414.
[81] SM Asiful Huda and Sangman Moh. “Survey on computation offloading in UAV-Enabled mobile
edge computing”. In: Journal of Network and Computer Applications (2022), p. 103341.
[82] Victor Hung, Avelino Gonzalez, and Ronald Demara. “Towards a context-based dialog
management layer for expert systems”. In: 2009 International Conference on Information, Process,
and Knowledge Management. IEEE. 2009, pp. 60–65.
[83] Abdul Jabbar, Xi Li, and Bourahla Omar. “A survey on generative adversarial networks: Variants,
applications, and training”. In: ACM Computing Surveys (CSUR) 54.8 (2021), pp. 1–49.
[84] Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and S Yu Philip. “A survey on
knowledge graphs: Representation, acquisition, and applications”. In: IEEE Transactions on Neural
Networks and Learning Systems 33.2 (2021), pp. 494–514.
[85] Yuang Jiang, Shiqiang Wang, Victor Valls, Bong Jun Ko, Wei-Han Lee, Kin K Leung, and
Leandros Tassiulas. “Model pruning enables efficient federated learning on edge devices”. In: IEEE
Transactions on Neural Networks and Learning Systems (2022).
[86] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. “Triviaqa: A large scale
distantly supervised challenge dataset for reading comprehension”. In: arXiv preprint
arXiv:1705.03551 (2017).
[87] Katikapalli Subramanyam Kalyan, Ajit Rajasekharan, and Sivanesan Sangeetha. “Ammus: A
survey of transformer-based pretrained models in natural language processing”. In: arXiv preprint
arXiv:2108.05542 (2021).
[88] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. “Scaling laws for neural language
models”. In: arXiv preprint arXiv:2001.08361 (2020).
[89] Seyed Mehran Kazemi and David Poole. “Simple embedding for link prediction in knowledge
graphs”. In: Advances in neural information processing systems 31 (2018).
[90] Latif U Khan, Walid Saad, Zhu Han, Ekram Hossain, and Choong Seon Hong. “Federated learning
for internet of things: Recent advances, taxonomy, and open challenges”. In: IEEE
Communications Surveys & Tutorials 23.3 (2021), pp. 1759–1799.
[91] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and
Mubarak Shah. “Transformers in vision: A survey”. In: ACM computing surveys (CSUR) 54.10s
(2022), pp. 1–41.
[92] Sungwon Kim, Sang-gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. “FloWaveNet: A
generative flow for raw audio”. In: arXiv preprint arXiv:1811.02155 (2018).
[93] Michael R King. “The future of AI in medicine: a perspective from a Chatbot”. In: Annals of
Biomedical Engineering 51.2 (2023), pp. 291–295.
[94] Diederik P Kingma and Max Welling. “Auto-encoding variational bayes”. In: arXiv preprint
arXiv:1312.6114 (2013).
[95] Thomas N Kipf and Max Welling. “Semi-supervised classification with graph convolutional
networks”. In: arXiv preprint arXiv:1609.02907 (2016).
[96] Hyunsuk Ko, Dae Yeol Lee, Seunghyun Cho, and Alan C Bovik. “Quality prediction on deep
generative images”. In: IEEE Transactions on Image Processing 29 (2020), pp. 5964–5979.
[97] Seung-Woo Ko, Seong Jin Kim, Haejoon Jung, and Sang Won Choi. “Computation offloading and
service caching for mobile edge computing under personalized service preference”. In: IEEE
Transactions on Wireless Communications 21.8 (2022), pp. 6568–6583.
[98] Denis Krompaß, Stephan Baier, and Volker Tresp. “Type-constrained representation learning in
knowledge graphs”. In: International semantic web conference. Springer. 2015, pp. 640–655.
[99] Yogesh Kumar, Apeksha Koul, and Chamkaur Singh. “A deep learning approaches in
text-to-speech system: A systematic review and recent research perspective”. In: Multimedia Tools
and Applications 82.10 (2023), pp. 15171–15197.
[100] C-C Jay Kuo and Azad M Madni. “Green learning: Introduction, examples and outlook”. In:
Journal of Visual Communication and Image Representation (2022), p. 103685.
[101] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh,
Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. “Natural
questions: a benchmark for question answering research”. In: Transactions of the Association for
Computational Linguistics 7 (2019), pp. 453–466.
[102] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther.
“Autoencoding beyond pixels using a learned similarity metric”. In: International conference on
machine learning. PMLR. 2016, pp. 1558–1566.
[103] Hyeonjin Lee, Ubaid Ullah, Jeong-Sik Lee, Bomi Jeong, and Hyun-Chul Choi. “A Brief Survey of
text driven image generation and manipulation”. In: 2021 IEEE International Conference on
Consumer Electronics-Asia (ICCE-Asia). IEEE. 2021, pp. 1–4.
[104] Xuejing Lei, Wei Wang, and C-C Jay Kuo. “GENHOP: An Image Generation Method Based on
Successive Subspace Learning”. In: 2022 IEEE International Symposium on Circuits and Systems
(ISCAS). IEEE. 2022, pp. 3314–3318.
[105] Xuejing Lei, Ganning Zhao, and C.-C. Jay Kuo. “NITES: A Non-Parametric Interpretable Texture
Synthesis Method”. In: 2020 Asia-Pacific Signal and Information Processing Association Annual
Summit and Conference (APSIPA ASC). IEEE. 2020, pp. 1698–1706.
[106] Xuejing Lei, Ganning Zhao, Kaitai Zhang, and C-C Jay Kuo. “TGHop: an explainable, efficient,
and lightweight method for texture generation”. In: APSIPA Transactions on Signal and
Information Processing 10 (2021).
[107] DB Lenat and RV Guha. “Building large knowledge-based systems: Representation and inference
in the CYC project”. In: Artificial Intelligence 61.1 (1993), pp. 41–52.
[108] Fangyuan Li, Jiahu Qin, and Wei Xing Zheng. “Distributed Q-Learning-Based Online
Optimization Algorithm for Unit Commitment and Dispatch in Smart Grid”. In: IEEE transactions
on cybernetics 50.9 (2019), pp. 4146–4156.
[109] Rui Li, Jianan Zhao, Chaozhuo Li, Di He, Yiqi Wang, Yuming Liu, Hao Sun, Senzhang Wang,
Weiwei Deng, Yanming Shen, et al. “HousE: Knowledge Graph Embedding with Householder
Parameterization”. In: arXiv preprint arXiv:2202.07919 (2022).
[110] Wei Yang Bryan Lim, Zehui Xiong, Dusit Niyato, Xianbin Cao, Chunyan Miao, Sumei Sun, and
Qiang Yang. “Realizing the metaverse with edge intelligence: A match made in heaven”. In: IEEE
Wireless Communications (2022).
[111] Kwan-Yee Lin and Guanxiang Wang. “Hallucinated-IQA: No-reference image quality assessment
via adversarial learning”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2018, pp. 732–741.
[112] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. “A survey of transformers”. In: AI
Open (2022).
[113] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. “Learning entity and relation
embeddings for knowledge graph completion”. In: Twenty-ninth AAAI conference on artificial
intelligence. 2015.
[114] Yijing Lin, Hongyang Du, Dusit Niyato, Jiangtian Nie, Jiayi Zhang, Yanyu Cheng, and
Zhaohui Yang. “Blockchain-aided secure semantic communication for ai-generated content in
metaverse”. In: IEEE Open Journal of the Computer Society 4 (2023), pp. 72–83.
[115] Yu Lin, Saurabh Mehta, Hande Küçük-McGinty, John Paul Turner, Dusica Vidovic, Michele Forlin,
Amar Koleti, Dac-Trung Nguyen, Lars Juhl Jensen, Rajarshi Guha, et al. “Drug target ontology to
classify and integrate drug discovery data”. In: Journal of biomedical semantics 8.1 (2017), pp. 1–16.
[116] Deyin Liu, Xu Chen, Zhi Zhou, and Qing Ling. “HierTrain: Fast hierarchical edge AI learning with
hybrid parallelism in mobile-edge-cloud computing”. In: IEEE Open Journal of the
Communications Society 1 (2020), pp. 634–645.
[117] Hanxiao Liu, Yuexin Wu, and Yiming Yang. “Analogical inference for multi-relational
embeddings”. In: International conference on machine learning. PMLR. 2017, pp. 2168–2178.
[118] Hugo Liu and Push Singh. “ConceptNet—a practical commonsense reasoning tool-kit”. In: BT
technology journal 22.4 (2004), pp. 211–226.
[119] Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi,
Jianping Fan, and Zhiqiang He. “A survey of visual transformers”. In: IEEE Transactions on Neural
Networks and Learning Systems (2023).
[120] Bei Luo, Raymond YK Lau, Chunping Li, and Yain-Whar Si. “A critical review of state-of-the-art
chatbot designs and applications”. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery 12.1 (2022), e1434.
[121] Xin Lv, Lei Hou, Juanzi Li, and Zhiyuan Liu. “Differentiating concepts and instances for
knowledge graph embedding”. In: arXiv preprint arXiv:1811.04588 (2018).
[122] Farzaneh Mahdisoltani, Joanna Biega, and Fabian Suchanek. “Yago3: A knowledge base from
multilingual wikipedias”. In: 7th biennial conference on innovative data systems research. CIDR
Conference. 2014.
[123] Zheda Mai, Ruiwen Li, Hyunwoo Kim, and Scott Sanner. “Supervised contrastive replay:
Revisiting the nearest class mean classifier in online class-incremental continual learning”. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021,
pp. 3589–3599.
[124] Yuyi Mao, Changsheng You, Jun Zhang, Kaibin Huang, and Khaled B Letaief. “A survey on mobile
edge computing: The communication perspective”. In: IEEE communications surveys & tutorials
19.4 (2017), pp. 2322–2358.
[125] Daniel Martin, Ana Serrano, Alexander W Bergman, Gordon Wetzstein, and Belen Masia.
“Scangan360: A generative model of realistic scanpaths for 360 images”. In: IEEE Transactions on
Visualization and Computer Graphics 28.5 (2022), pp. 2003–2013.
[126] Francis J McMahon. “Prediction of treatment outcomes in psychiatry—where do we stand?” In:
Dialogues in clinical neuroscience (2022).
[127] George A Miller. “WordNet: a lexical database for English”. In: Communications of the ACM 38.11
(1995), pp. 39–41.
[128] Maxime Mirka, Maxime France-Pillois, Gilles Sassatelli, and Abdoulaye Gamatié. “A generative ai
for heterogeneous network-on-chip design space pruning”. In: 2022 Design, Automation & Test in
Europe Conference & Exhibition (DATE). IEEE. 2022, pp. 1135–1138.
[129] Changsung Moon, Steve Harenberg, John Slankas, and Nagiza F Samatova. “Learning contextual
embeddings for knowledge graph completion”. In: (2017).
[130] Changsung Moon, Paul Jones, and Nagiza F Samatova. “Learning entity type embeddings for
knowledge graph completion”. In: Proceedings of the 2017 ACM on conference on information and
knowledge management. 2017, pp. 2215–2218.
[131] MG Sarwar Murshed, Christopher Murphy, Daqing Hou, Nazar Khan,
Ganesh Ananthanarayanan, and Faraz Hussain. “Machine learning at the network edge: A
survey”. In: ACM Computing Surveys (CSUR) 54.8 (2021), pp. 1–37.
[132] Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. “Learning attention-based
embeddings for relation prediction in knowledge graphs”. In: arXiv preprint arXiv:1906.01195
(2019).
[133] Preksha Nema, Mitesh Khapra, Anirban Laha, and Balaraman Ravindran. “Diversity driven
attention model for query-based abstractive summarization”. In: arXiv preprint arXiv:1704.08300
(2017).
[134] Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Phung. “A Novel Embedding
Model for Knowledge Base Completion Based on Convolutional Neural Network”. In: Proceedings
of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana:
Association for Computational Linguistics, June 2018, pp. 327–333. doi: 10.18653/v1/N18-2053.
[135] Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Phung. “A novel embedding
model for knowledge base completion based on convolutional neural network”. In: arXiv preprint
arXiv:1712.02121 (2017).
[136] Dinh C Nguyen, Ming Ding, Pubudu N Pathirana, Aruna Seneviratne, Jun Li, and H Vincent Poor.
“Federated learning for internet of things: A comprehensive survey”. In: IEEE Communications
Surveys & Tutorials 23.3 (2021), pp. 1622–1658.
[137] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. “Point-E: A
System for Generating 3D Point Clouds from Complex Prompts”. In: arXiv preprint
arXiv:2212.08751 (2022).
[138] Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. “Holographic embeddings of
knowledge graphs”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 30. 1.
2016.
[139] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. “A three-way model for collective
learning on multi-relational data”. In: ICML. 2011.
[140] Weiran Pan, Wei Wei, and Xian-Ling Mao. “Context-aware entity typing in knowledge graphs”.
In: arXiv preprint arXiv:2109.07990 (2021).
[141] Zhaoqing Pan, Weijie Yu, Xiaokai Yi, Asifullah Khan, Feng Yuan, and Yuhui Zheng. “Recent
progress on generative adversarial networks (GANs): A survey”. In: IEEE Access 7 (2019),
pp. 36322–36333.
[142] Ravi B Parikh, Stephanie Teeple, and Amol S Navathe. “Addressing bias in artificial intelligence in
health care”. In: JAMA 322.24 (2019), pp. 2377–2378.
[143] Shalin Parikh, Dharmin Dave, Reema Patel, and Nishant Doshi. “Security and privacy issues in
cloud, fog and edge computing”. In: Procedia Computer Science 160 (2019), pp. 734–739.
[144] David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia,
Daniel Rothchild, David So, Maud Texier, and Jeff Dean. “Carbon emissions and large neural
network training”. In: arXiv preprint arXiv:2104.10350 (2021).
[145] Jay Pujara, Eriq Augustine, and Lise Getoor. “Sparsity and noise: Where knowledge graph
embeddings fall short”. In: Proceedings of the 2017 conference on empirical methods in natural
language processing. 2017, pp. 1751–1756.
[146] Xuan Qi and Chen Liu. “Enabling deep learning on iot edge: Approaches and evaluation”. In: 2018
IEEE/ACM Symposium on Edge Computing (SEC). IEEE. 2018, pp. 367–372.
[147] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. “Mirrorgan: Learning text-to-image
generation by redescription”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2019, pp. 1505–1514.
[148] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. “Learning transferable visual
models from natural language supervision”. In: International conference on machine learning.
PMLR. 2021, pp. 8748–8763.
[149] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
“Language models are unsupervised multitask learners”. In: OpenAI blog 1.8 (2019), p. 9.
[150] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J Liu. “Exploring the limits of transfer learning with a unified
text-to-text transformer”. In: The Journal of Machine Learning Research 21.1 (2020), pp. 5485–5551.
[151] Jeremiah Ratican, James Hutson, and Andrew Wright. “A Proposed Meta-Reality Immersive
Development Pipeline: Generative AI Models and Extended Reality (XR) Content for the
Metaverse”. In: Journal of Intelligent Learning Systems and Applications 15 (2023).
[152] Ju Ren, Yundi Guo, Deyu Zhang, Qingqing Liu, and Yaoxue Zhang. “Distributed and efficient
object detection in edge computing: Challenges and solutions”. In: IEEE Network 32.6 (2018),
pp. 137–143.
[153] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. “Fastspeech 2:
Fast and high-quality end-to-end text to speech”. In: arXiv preprint arXiv:2006.04558 (2020).
[154] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. “Fastspeech:
Fast, robust and controllable text to speech”. In: Advances in neural information processing systems
32 (2019).
[155] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.
“High-resolution image synthesis with latent diffusion models”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2022, pp. 10684–10695.
[156] Drew Roselli, Jeanna Matthews, and Nisha Talagala. “Managing bias in AI”. In: Companion
Proceedings of The 2019 World Wide Web Conference. 2019, pp. 539–544.
[157] Lanlan Rui, Siqi Yang, Shiyou Chen, Yang Yang, and Zhipeng Gao. “Smart Network Maintenance
in an Edge Cloud Computing Environment: An Adaptive Model Compression Algorithm Based
on Model Pruning and Model Clustering”. In: IEEE Transactions on Network and Service
Management (2022).
[158] Tara Safavi, Caleb Belth, Lukas Faber, Davide Mottin, Emmanuel Müller, and Danai Koutra.
“Personalized knowledge graph summarization: From the cloud to your pocket”. In: 2019 IEEE
International Conference on Data Mining (ICDM). IEEE. 2019, pp. 528–537.
[159] Tara Safavi and Danai Koutra. “CoDEx: A comprehensive knowledge graph completion
benchmark”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP). Nov. 2020, pp. 8328–8350. doi: 10.18653/v1/2020.emnlp-main.669.
[160] Eric Samikwa, Antonio Di Maio, and Torsten Braun. “Ares: Adaptive resource-aware split
learning for internet of things”. In: Computer Networks 218 (2022), p. 109380.
[161] Apoorv Saxena, Aditay Tripathi, and Partha Talukdar. “Improving multi-hop question answering
over knowledge graphs using knowledge base embeddings”. In: Proceedings of the 58th annual
meeting of the association for computational linguistics. 2020, pp. 4498–4507.
[162] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and
Max Welling. “Modeling relational data with graph convolutional networks”. In: European
semantic web conference. Springer. 2018, pp. 593–607.
[163] Javier Selva, Anders S Johansen, Sergio Escalera, Kamal Nasrollahi, Thomas B Moeslund, and
Albert Clapés. “Video transformers: A survey”. In: IEEE Transactions on Pattern Analysis and
Machine Intelligence (2023).
[164] Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. “Kvqa:
Knowledge-aware visual question answering”. In: Proceedings of the AAAI conference on artificial
intelligence. Vol. 33. 01. 2019, pp. 8876–8884.
[165] Chao Shang, Yun Tang, Jing Huang, Jinbo Bi, Xiaodong He, and Bowen Zhou. “End-to-end
structure-aware convolutional networks for knowledge base completion”. In: Proceedings of the
AAAI Conference on Artificial Intelligence. Vol. 33. 01. 2019, pp. 3060–3067.
[166] Jiawen Shi, Yixin Liu, Pan Zhou, and Lichao Sun. “BadGPT: Exploring Security Vulnerabilities of
ChatGPT via Backdoor Attacks to InstructGPT”. In: arXiv preprint arXiv:2304.12298 (2023).
[167] Weisong Shi and Schahram Dustdar. “The promise of edge computing”. In: Computer 49.5 (2016),
pp. 78–81.
[168] Yuanming Shi, Kai Yang, Tao Jiang, Jun Zhang, and Khaled B Letaief. “Communication-efficient
edge AI: Algorithms and systems”. In: IEEE Communications Surveys & Tutorials 22.4 (2020),
pp. 2167–2191.
[169] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari,
Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. “Using
deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language
model”. In: arXiv preprint arXiv:2201.11990 (2022).
[170] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. “Reasoning with neural
tensor networks for knowledge base completion”. In: Advances in neural information processing
systems 26 (2013).
[171] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. “Learning structured output representation using
deep conditional generative models”. In: Advances in neural information processing systems 28
(2015).
[172] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. “RotatE: Knowledge Graph
Embedding by Relational Rotation in Complex Space”. In: International Conference on Learning
Representations. 2019. url: https://openreview.net/forum?id=HkgEQnRqYQ.
[173] Zhiqing Sun, Shikhar Vashishth, Soumya Sanyal, Partha Talukdar, and Yiming Yang. “A
Re-evaluation of Knowledge Graph Completion Methods”. In: Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics. Online: Association for Computational
Linguistics, July 2020, pp. 5516–5522. doi: 10.18653/v1/2020.acl-main.489.
[174] Jayachander Surbiryala and Chunming Rong. “Cloud computing: History and overview”. In: 2019
IEEE Cloud Summit. IEEE. 2019, pp. 1–7.
[175] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural
Networks. 2014. arXiv: 1409.3215 [cs.CL].
[176] Yong Xuan Tan, Chin Poo Lee, Mai Neo, Kian Ming Lim, Jit Yan Lim, and Ali Alqahtani. “Recent
Advances in Text-to-Image Synthesis: Approaches, Datasets and Future Research Prospects”. In:
IEEE Access (2023).
[177] Yun Tang, Jing Huang, Guangtao Wang, Xiaodong He, and Bowen Zhou. “Orthogonal Relation
Transforms with Graph Context Modeling for Knowledge Graph Embedding”. In: Proceedings of
the 58th Annual Meeting of the Association for Computational Linguistics. July 2020, pp. 2713–2722.
[178] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin,
Percy Liang, and Tatsunori B Hashimoto. “Alpaca: A strong, replicable instruction-following
model”. In: Stanford Center for Research on Foundation Models (2023). url:
https://crfm.stanford.edu/2023/03/13/alpaca.html.
[179] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. “Efficient transformers: A survey”. In:
ACM Computing Surveys 55.6 (2022), pp. 1–28.
[180] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha,
Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. “Lamda: Language models for
dialog applications”. In: arXiv preprint arXiv:2201.08239 (2022).
[181] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. “Wasserstein
auto-encoders”. In: arXiv preprint arXiv:1711.01558 (2017).
[182] Kristina Toutanova and Danqi Chen. “Observed versus latent features for knowledge base and
text inference”. In: Proceedings of the 3rd workshop on continuous vector space models and their
compositionality. 2015, pp. 57–66.
[183] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. “Llama: Open
and efficient foundation language models”. In: arXiv preprint arXiv:2302.13971 (2023).
[184] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard.
“Complex embeddings for simple link prediction”. In: International conference on machine
learning. PMLR. 2016, pp. 2071–2080.
[185] Shreshth Tuli, Nipam Basumatary, and Rajkumar Buyya. “Edgelens: Deep learning based object
detection in integrated iot, fog and cloud computing environments”. In: 2019 4th International
Conference on Information Systems and Computer Networks (ISCON). IEEE. 2019, pp. 496–502.
[186] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, Nilesh Agrawal, and Partha Talukdar.
“Interacte: Improving convolution-based knowledge graph embeddings by increasing feature
interactions”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 03. 2020,
pp. 3009–3016.
[187] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha Talukdar. “Composition-based
multi-relational graph convolutional networks”. In: arXiv preprint arXiv:1911.03082 (2019).
[188] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural information
processing systems 30 (2017).
[189] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and
Yoshua Bengio. “Graph attention networks”. In: arXiv preprint arXiv:1710.10903 (2017).
[190] Denny Vrandečić and Markus Krötzsch. “Wikidata: a free collaborative knowledgebase”. In:
Communications of the ACM 57.10 (2014), pp. 78–85.
[191] Bin Wang, Chen Zhang, Chengwei Wei, and Haizhou Li. “A Focused Study on Sequence Length
for Dialogue Summarization”. In: arXiv preprint arXiv:2209.11910 (2022).
[192] Bo Wang, Tao Shen, Guodong Long, Tianyi Zhou, Ying Wang, and Yi Chang.
“Structure-augmented text representation learning for efficient knowledge graph completion”. In:
Proceedings of the Web Conference 2021. 2021, pp. 1737–1748.
[193] Chenguang Wang, Xiao Liu, and Dawn Song. “Language models are open knowledge graphs”. In:
arXiv preprint arXiv:2010.11967 (2020).
[194] Kai Wang, Yu Liu, Qian Ma, and Quan Z Sheng. “Mulde: Multi-teacher knowledge distillation for
low-dimensional knowledge graph embeddings”. In: Proceedings of the Web Conference 2021. 2021,
pp. 1716–1726.
[195] Liang Wang, Wei Zhao, Zhuoyu Wei, and Jingming Liu. “SimKGC: Simple Contrastive Knowledge
Graph Completion with Pre-trained Language Models”. In: Proceedings of the 60th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland:
Association for Computational Linguistics, May 2022, pp. 4281–4294. doi:
10.18653/v1/2022.acl-long.295.
[196] Lin Wang and Kuk-Jin Yoon. “Knowledge distillation and student-teacher learning for visual
intelligence: A review and new outlooks”. In: IEEE Transactions on Pattern Analysis and Machine
Intelligence (2021).
[197] Peifeng Wang, Shuangyin Li, and Rong Pan. “Incorporating gan for negative sampling in
knowledge representation learning”. In: Proceedings of the AAAI Conference on Artificial
Intelligence. Vol. 32. 1. 2018.
[198] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. “Knowledge graph embedding: A survey of
approaches and applications”. In: IEEE Transactions on Knowledge and Data Engineering 29.12
(2017), pp. 2724–2743.
[199] Xiang Wang, Dingxian Wang, Canran Xu, Xiangnan He, Yixin Cao, and Tat-Seng Chua.
“Explainable reasoning over knowledge graphs for recommendation”. In: Proceedings of the AAAI
conference on artificial intelligence. Vol. 33. 2019, pp. 5329–5336.
[200] Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and
Jian Tang. “KEPLER: A unified model for knowledge embedding and pre-trained language
representation”. In: Transactions of the Association for Computational Linguistics 9 (2021),
pp. 176–194.
[201] Yifan Wang, Zhanxuan Mei, Qingyang Zhou, Ioannis Katsavounidis, and C-C Jay Kuo. “Green
image codec: a lightweight learning-based image coding method”. In: Applications of Digital
Image Processing XLV. Vol. 12226. SPIE. 2022, pp. 70–75.
[202] Yun-Cheng Wang, Xiou Ge, Bin Wang, and C-C Jay Kuo. “AsyncET: Asynchronous Learning for
Knowledge Graph Entity Typing with Auxiliary Relations”. In: arXiv preprint arXiv:2308.16055
(2023).
[203] Yun-Cheng Wang, Xiou Ge, Bin Wang, and C-C Jay Kuo. “Greenkgc: A lightweight knowledge
graph completion method”. In: arXiv preprint arXiv:2208.09137 (2022).
[204] Yun-Cheng Wang, Xiou Ge, Bin Wang, and C-C Jay Kuo. “KGBoost: A classification-based
knowledge base completion method with negative sampling”. In: Pattern Recognition Letters 157
(2022), pp. 104–111.
[205] Yun-Cheng Wang, Jintang Xue, Chengwei Wei, and C.-C. Jay Kuo. “An Overview on Generative
AI at Scale With Edge-Cloud Computing”. In: IEEE Open Journal of the Communications Society
(2023), pp. 1–1. doi: 10.1109/OJCOMS.2023.3320646.
[206] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. “Knowledge graph embedding by
translating on hyperplanes”. In: Proceedings of the AAAI Conference on Artificial Intelligence.
Vol. 28. 1. 2014.
[207] Zhengwei Wang, Qi She, and Tomas E Ward. “Generative adversarial networks in computer
vision: A survey and taxonomy”. In: ACM Computing Surveys (CSUR) 54.2 (2021), pp. 1–38.
[208] Chengwei Wei, Yun-Cheng Wang, Bin Wang, and C-C Jay Kuo. “An overview on language
models: Recent developments and outlook”. In: arXiv preprint arXiv:2303.05759 (2023).
[209] Ruoqi Wei, Cesar Garcia, Ahmed El-Sayed, Viyaleta Peterson, and Ausif Mahmood. “Variations in
variational autoencoders-a comparative evaluation”. In: IEEE Access 8 (2020), pp. 153651–153670.
[210] Hao Wu, Xinchen Lyu, and Hui Tian. “Online optimization of wireless powered mobile-edge
computing for heterogeneous industrial internet of things”. In: IEEE Internet of Things Journal 6.6
(2019), pp. 9880–9892.
[211] Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Hong Lin. “Ai-generated content
(aigc): A survey”. In: arXiv preprint arXiv:2304.06632 (2023).
[212] Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton Van Den Hengel. “Image
captioning and visual question answering based on attributes and external knowledge”. In: IEEE
transactions on pattern analysis and machine intelligence 40.6 (2017), pp. 1367–1381.
[213] Yikun Xian, Zuohui Fu, Shan Muthukrishnan, Gerard De Melo, and Yongfeng Zhang.
“Reinforcement knowledge graph reasoning for explainable recommendation”. In: Proceedings of
the 42nd international ACM SIGIR conference on research and development in information retrieval.
2019, pp. 285–294.
[214] Yuejia Xiang, Ziheng Zhang, Jiaoyan Chen, Xi Chen, Zhenxi Lin, and Yefeng Zheng. “OntoEA:
ontology-guided entity alignment via joint knowledge graph embedding”. In: arXiv preprint
arXiv:2105.07688 (2021).
[215] Han Xiao, Minlie Huang, Yu Hao, and Xiaoyan Zhu. “TransA: An adaptive approach for
knowledge graph embedding”. In: arXiv preprint arXiv:1509.05490 (2015). url:
http://arxiv.org/abs/1509.05490.
[216] Zhujun Xiao, Zhengxu Xia, Haitao Zheng, Ben Y Zhao, and Junchen Jiang. “Towards performance
clarity of edge video analytics”. In: 2021 IEEE/ACM Symposium on Edge Computing (SEC). IEEE.
2021, pp. 148–164.
[217] Haobo Xu, Ying Wang, Yujie Wang, Jiajun Li, Bosheng Liu, and Yinhe Han. “ACG-engine: An
inference accelerator for content generative neural networks”. In: 2019 IEEE/ACM International
Conference on Computer-Aided Design (ICCAD). IEEE. 2019, pp. 1–7.
[218] Minrui Xu, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong, Shiwen Mao, Zhu Han,
Abbas Jamalipour, Dong In Kim, Victor Leung, et al. “Unleashing the power of edge-cloud
generative ai in mobile networks: A survey of aigc services”. In: arXiv preprint arXiv:2303.16129
(2023).
[219] Minrui Xu, Dusit Niyato, Junlong Chen, Hongliang Zhang, Jiawen Kang, Zehui Xiong,
Shiwen Mao, and Zhu Han. “Generative AI-empowered simulation for autonomous driving in
vehicular mixed reality metaverses”. In: arXiv preprint arXiv:2302.08418 (2023).
[220] Yadollah Yaghoobzadeh, Heike Adel, and Hinrich Schütze. “Noise mitigation for neural entity
typing and relation extraction”. In: arXiv preprint arXiv:1612.07495 (2016).
[221] Yadollah Yaghoobzadeh and Hinrich Schütze. “Corpus-level fine-grained entity typing using
contextual information”. In: arXiv preprint arXiv:1606.07901 (2016).
[222] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. “Embedding entities and
relations for learning and inference in knowledge bases”. In: arXiv preprint arXiv:1412.6575 (2014).
[223] Yijing Yang, Wei Wang, Hongyu Fu, and C-C Jay Kuo. “On Supervised Feature Selection from
High Dimensional Feature Spaces”. In: arXiv preprint arXiv:2203.11924 (2022).
[224] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le.
“Xlnet: Generalized autoregressive pretraining for language understanding”. In: Advances in
neural information processing systems 32 (2019).
[225] Jiangchao Yao, Shengyu Zhang, Yang Yao, Feng Wang, Jianxin Ma, Jianwei Zhang, Yunfei Chu,
Luo Ji, Kunyang Jia, Tao Shen, et al. “Edge-cloud polarization and collaboration: A comprehensive
survey for ai”. In: IEEE Transactions on Knowledge and Data Engineering (2022).
[226] Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. “QA-GNN:
Reasoning with language models and knowledge graphs for question answering”. In: arXiv
preprint arXiv:2104.06378 (2021).
[227] Junhai Zhai, Sufang Zhang, Junfen Chen, and Qiang He. “Autoencoder and its various variants”.
In: 2018 IEEE international conference on systems, man, and cybernetics (SMC). IEEE. 2018,
pp. 415–419.
[228] Chaoning Zhang, Chenshuang Zhang, Sheng Zheng, Yu Qiao, Chenghao Li, Mengchun Zhang,
Sumit Kumar Dam, Chu Myaet Thwal, Ye Lin Tun, Le Luang Huy, et al. “A Complete Survey on
Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need?” In: arXiv preprint
arXiv:2303.11717 (2023).
[229] Chenshuang Zhang, Chaoning Zhang, Sheng Zheng, Mengchun Zhang, Maryam Qamar,
Sung-Ho Bae, and In So Kweon. “A survey on audio diffusion models: Text to speech synthesis
and enhancement in generative ai”. In: arXiv preprint arXiv:2303.13336 2 (2023).
[230] Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, and Dawei Song. “A survey of controllable
text generation using transformer-based pre-trained language models”. In: arXiv preprint
arXiv:2201.05337 (2022).
[231] Jing Zhang and Dacheng Tao. “Empowering things with intelligence: a survey of the progress,
challenges, and opportunities in artificial intelligence of things”. In: IEEE Internet of Things
Journal 8.10 (2020), pp. 7789–7817.
[232] Qianjin Zhang, Ronggui Wang, Juan Yang, and Lixia Xue. “Knowledge graph embedding by
reflection transformation”. In: Knowledge-Based Systems 238 (2022), p. 107861.
[233] Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. “Quaternion knowledge graph embeddings”. In:
Advances in neural information processing systems 32 (2019).
[234] Wuyang Zhang, Jiachen Chen, Yanyong Zhang, and Dipankar Raychaudhuri. “Towards efficient
edge cloud augmentation for virtual reality mmogs”. In: Proceedings of the Second ACM/IEEE
Symposium on Edge Computing. 2017, pp. 1–14.
[235] Yongqi Zhang, Quanming Yao, Wenyuan Dai, and Lei Chen. “AutoSF: Searching scoring functions
for knowledge graph embedding”. In: 2020 IEEE 36th International Conference on Data Engineering
(ICDE). IEEE. 2020, pp. 433–444.
[236] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. “ERNIE: Enhanced
language representation with informative entities”. In: arXiv preprint arXiv:1905.07129 (2019).
[237] Zicheng Zhang, Chunyi Li, Wei Sun, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. “A
Perceptual Quality Assessment Exploration for AIGC Images”. In: arXiv preprint arXiv:2303.12618
(2023).
[238] Zongshun Zhang, Andrea Pinto, Valeria Turina, Flavio Esposito, and Ibrahim Matta. “Privacy and
efficiency of communications in federated split learning”. In: IEEE Transactions on Big Data (2023).
[239] Ming Zhao, Jiahua Li, Fengxiao Tang, Sohaib Asif, and Yusen Zhu. “Learning based massive data
offloading in the iov: Routing based on pre-rlga”. In: IEEE Transactions on Network Science and
Engineering 9.4 (2022), pp. 2330–2340.
[240] Yu Zhao, Anxiang Zhang, Ruobing Xie, Kang Liu, and Xiaojie Wang. “Connecting embeddings for
knowledge graph entity typing”. In: arXiv preprint arXiv:2007.10873 (2020).
[241] Yu Zhao, Han Zhou, Anxiang Zhang, Ruobing Xie, Qing Li, and Fuzhen Zhuang. “Connecting
Embeddings Based on Multiplex Relational Graph Attention Networks for Knowledge Graph
Entity Typing”. In: IEEE Transactions on Knowledge and Data Engineering (2022).
[242] Yunxiang Zhao, Jianzhong Qi, Qingwei Liu, and Rui Zhang. “Wgcn: graph convolutional
networks with weighted structural features”. In: Proceedings of the 44th International ACM SIGIR
Conference on Research and Development in Information Retrieval. 2021, pp. 624–633.
[243] Wang Zhu, Jesse Thomason, and Robin Jia. “Generalization Differences between End-to-End and
Neuro-Symbolic Vision-Language Reasoning Systems”. In: arXiv preprint arXiv:2210.15037 (2022).
[244] Yushan Zhu, Wen Zhang, Mingyang Chen, Hui Chen, Xu Cheng, Wei Zhang, and Huajun Chen.
“DualDE: Dually Distilling Knowledge Graph Embedding for Faster and Cheaper Reasoning”. In:
Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 2022,
pp. 1516–1524.
[245] Jianhuan Zhuo, Qiannan Zhu, Yinliang Yue, Yuhong Zhao, and Weisi Han. “A
Neighborhood-Attention Fine-grained Entity Typing for Knowledge Graph Completion”. In:
Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 2022,
pp. 1525–1533.
Abstract
To build an advanced artificial intelligence (AI) system, it is important to incorporate explainable and lightweight reasoning modules for better trustworthiness and scalability. Knowledge graphs (KGs) and Generative AI (GenAI) models are promising categories for developing a reasoning module in AI systems. In this thesis, we focus on solving two fundamental problems: 1) developing explainable and scalable approaches for KGs, and 2) identifying and quantifying the key bottlenecks for scalable generative content delivery on demand and proposing design considerations in training and deployment. More specifically, we aim to address four fundamental research problems: 1) designing a novel and explainable KGC model; 2) improving the proposed explainable KGC model such that it is lightweight and efficient; 3) improving the embeddings of KGs through incorporating entity typing information; and 4) quantifying the bottlenecks for scalable generative content delivery. In addition, we envision future research directions on how to incorporate KGs to achieve better controllability, explainability, and efficiency of generative models.
Asset Metadata
Creator: Wang, Yun-Cheng (author)
Core Title: Green knowledge graph completion and scalable generative content delivery
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Degree Conferral Date: 2023-12
Publication Date: 10/30/2023
Defense Date: 10/20/2023
Publisher: Los Angeles, California (original); University of Southern California (original); University of Southern California. Libraries (digital)
Tags: edge-cloud computing, explainable AI, generative models, knowledge graph, knowledge graph completion, lightweight models, OAI-PMH Harvest, representation learning
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Kuo, C.-C. Jay (committee chair), Jia, Robin (committee member), Ortega, Antonio (committee member)
Creator Email: joewang622@gmail.com, yunchenw@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113763047
Unique Identifier: UC113763047
Identifier: etd-WangYunChe-12443.pdf (filename)
Legacy Identifier: etd-WangYunChe-12443
Document Type: Dissertation
Rights: Wang, Yun-Cheng
Internet Media Type: application/pdf
Type: texts
Source: 20231103-usctheses-batch-1104 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu