REVISITING FASTMAP: NEW APPLICATIONS
by
Ang Li
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2024
Copyright 2024 Ang Li
Acknowledgements
First, I would like to thank my adviser, T. K. Satish Kumar, for his valuable guidance on my research and his
unwavering support for me at every stage of my doctoral education at the University of Southern California.
I would also like to thank him for helping me revise this document.
Second, I would like to thank the rest of my PhD dissertation committee, including John Carlsson,
Emilio Ferrara, Sven Koenig, and Aiichiro Nakano, for their valuable time, guidance, support, and feedback.
I would especially like to thank Sven for his feedback on many of my published research papers. I would
also like to thank Peter Stuckey from Monash University, Australia, for his valuable discussions with me on
my research projects.
Third, I would like to thank my other collaborators and other members of my adviser’s research group,
including Nori Nakata, Kushal Sharma, Omkar Thakoor, Malcolm White, Han Zhang, and Kexin Zheng,
for productive and friendly discussions.
Finally, I would like to thank my family members and friends for their understanding and constant
support during the intense period of my doctoral education.
My doctoral education at the University of Southern California has shaped me as a researcher and as a
person in many ways. I will cherish the memories of it for a long time to come, thanks to all the people
mentioned above for creating and engaging me in a prolific research environment.
The research presented in this dissertation is the culmination of multiple published research papers. My
research was partially supported by DARPA under grant number HR001120C0157 and by NSF under grant
numbers 1409987, 1724392, 1817189, 1837779, 1935712, and 2112533.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter 1: Introduction to FastMap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 The Original FastMap Algorithm in Data Mining . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 An Algorithmic Description of FastMap . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 FastMap for Embedding Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 Complementing FastMap with Locality Sensitive Hashing . . . . . . . . . . . . . . 10
1.4 Overview and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Chapter 2: FastMap for Facility Location Problems . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Solving Facility Location Problems via FastMap and Locality Sensitive Hashing . . . . . . . 18
2.3.1 Solving Facility Location Problems in Euclidean Space without Obstacles . . . . . . 19
2.3.2 FastMap with Anya . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Competing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.A Table of Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Chapter 3: FastMap for Efficiently Computing Top-K Projected Centrality . . . . . . . . . . . 33
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Measures of Projected Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 FastMap for Top-K Centrality and Projected Centrality . . . . . . . . . . . . . . . . . . . . 38
3.3.1 FastMap for Closeness Centrality on Explicit Graphs . . . . . . . . . . . . . . . . . 39
3.3.2 FastMap for Harmonic Centrality on Explicit Graphs . . . . . . . . . . . . . . . . . 40
3.3.3 FastMap for Current-Flow Closeness Centrality on Explicit Graphs . . . . . . . . . 41
3.3.4 FastMap for Normalized Eigenvector Centrality on Explicit Graphs . . . . . . . . . 42
3.3.5 Generalization to Projected Centrality . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.A Table of Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Chapter 4: FastMap for Community Detection and Block Modeling . . . . . . . . . . . . . . . 52
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Preliminaries of Block Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 The FastMap-Based Block Modeling Algorithm (FMBM) . . . . . . . . . . . . . . . . . . . 56
4.3.1 Probabilistically-Amplified Shortest-Path Distances . . . . . . . . . . . . . . . . . . 57
4.3.2 Main Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.1 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.A Table of Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Chapter 5: FastMap for Graph Convex Hull Computations . . . . . . . . . . . . . . . . . . . . 68
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 FastMap-Based Algorithms for Graph Convex Hull Computations . . . . . . . . . . . . . . 71
5.3 An Efficient Implementation of the Exact Brute-Force Algorithm . . . . . . . . . . . . . . . 79
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.A Table of Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Chapter 6: FastMapSVM: Combining FastMap and Support Vector Machines . . . . . . . . . 87
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 FastMapSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.3 Case Study: FastMapSVM for Classifying Seismograms . . . . . . . . . . . . . . . . . . . 92
6.3.1 Distance Function on Seismograms . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3.2 Earthquake Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3.2.1 The Stanford Earthquake Dataset . . . . . . . . . . . . . . . . . . . . . . 95
6.3.2.2 Ridgecrest Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.3.1 Results on the Stanford Earthquake Dataset . . . . . . . . . . . . . . . . 97
6.3.3.2 Results on the Ridgecrest Dataset . . . . . . . . . . . . . . . . . . . . . . 99
6.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Chapter 7: FastMapSVM in the Constraint Satisfaction Problem Domain . . . . . . . . . . . . 103
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.2 Preliminaries and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.3 Distance Function on Constraint Satisfaction Problems . . . . . . . . . . . . . . . . . . . . 107
7.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.4.2 Instance Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.A Table of Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Chapter 8: Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.2.1 Downsampling Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.2.2 FastMap Enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8.2.3 Mirroring and Solving Computational Geometry Problems on Graphs . . . . . . . . 130
8.2.4 FastMapSVM: Further Applications and Enhancements . . . . . . . . . . . . . . . 131
8.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
A Table of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
List of Tables
2.1 Results for the Multi-Agent Meeting problem on various graph instances. . . . . . . . . . . 24
2.2 Results for the Multi-Agent Meeting problem on various grid-world instances. . . . . . . . . 25
2.3 Results for the Vertex K-Median problem on various graph instances. . . . . . . . . . . . . 25
2.4 Results for the Vertex K-Median problem on various grid-world instances. . . . . . . . . . . 25
2.5 Results for the Weighted Vertex K-Median problem on various graph instances. . . . . . . . 26
2.6 Results for the Weighted Vertex K-Median problem on various grid-world instances. . . . . 26
2.7 Results for the Capacitated Vertex K-Median problem on various graph instances. . . . . . . 26
2.8 Results for the Capacitated Vertex K-Median problem on various grid-world instances. . . . 27
2.A.1Notations used in Chapter 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1 Results of competing algorithms for top-K centrality computations. . . . . . . . . . . . . . . 46
3.2 Results of competing algorithms for top-K projected centrality computations. . . . . . . . . 47
3.A.1Notations used in Chapter 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1 Results of competing algorithms for block modeling on real-world single-view undirected
graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Results of competing algorithms for block modeling on the complement graphs of the
graphs in Table 4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 Results of competing algorithms for block modeling on sparse single-view undirected
graphs using Generative Model 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Results of competing algorithms for block modeling on dense single-view undirected
graphs using Generative Model 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Results of competing algorithms for block modeling on single-view undirected graphs
using Generative Model 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.A.1Notations used in Chapter 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Performance results of FMGCH with respect to graph convex hull computations. . . . . . . 82
5.A.1Notations used in Chapter 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1 The accuracy, recall, precision, and the F1 score of FastMapSVM and its competing
methods on three datasets of Constraint Satisfaction Problems. . . . . . . . . . . . . . . . . 118
7.A.1Notations used in Chapter 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
A.1 Abbreviations used in the dissertation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
List of Figures
1.1 An illustration of the edit distance between two DNA strings. . . . . . . . . . . . . . . . . . 2
1.2 A clustering task on a domain with images of animals. . . . . . . . . . . . . . . . . . . . . 2
1.3 A clustering task on a domain with text documents. . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Iterative computation of coordinates in FastMap. . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 An illustration of the two shortest-path trees rooted at the pivots in each iteration of FastMap
on graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 An illustration of the Euclidean embedding produced by the graph version of FastMap for
shortest-path computations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.7 An illustration of how the FastMap embedding can be used to answer nearest-neighbor
queries efficiently. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8 An illustration of the coupling of FastMap and Locality Sensitive Hashing. . . . . . . . . . . 11
2.1 An illustration of the Multi-Agent Meeting problem, Vertex K-Median problem, Weighted
Vertex K-Median problem, and the Capacitated Vertex K-Median problem on the same
input graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 The FastMap pipeline for the fast approximation of the all-pairs shortest-path distances
required by existing graph algorithms for facility location. . . . . . . . . . . . . . . . . . . 29
2.3 A comparison of three methods for solving the Vertex K-Median problem on a representative
movingAI instance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 A communication network in which user terminals represent the pertinent vertices and
routers and switches represent the auxiliary vertices. . . . . . . . . . . . . . . . . . . . . . 34
4.1 A core-periphery graph in an airport domain with its FastMap embedding. . . . . . . . . . . 53
4.2 Two simple graphs that guide the design of a proper FastMap distance function for block
modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 A comparison of the visualization produced by FMBM against that of a competing method. . 65
5.1 An illustration of a graph convex hull. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 An important property of graph convex hulls analogous to geometric convex hulls. . . . . . 71
5.3 An illustration of the iterative FastMap-based algorithm for computing graph convex hulls. . 74
6.1 An illustration of the overall architecture of FastMapSVM. . . . . . . . . . . . . . . . . . . 89
6.2 An illustration of a distance function on seismograms. . . . . . . . . . . . . . . . . . . . . . 93
6.3 The performance statistics of FastMapSVM, EQTransformer, and CRED on the Stanford
Earthquake Dataset with varying training data size. . . . . . . . . . . . . . . . . . . . . . . 98
6.4 A comparison of the automatic scanning results produced by EQTransformer, CRED, and
FastMapSVM on the Ridgecrest dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.1 The (0,1)-matrix representation of a constraint and a Constraint Satisfaction Problem. . . . 106
7.2 An illustration of our novel distance function between two binary Constraint Satisfaction
Problems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.3 The graphical representation of a binary Constraint Satisfaction Problem obtained from its
(0,1)-matrix representation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.4 The 2-dimensional and 3-dimensional Euclidean embeddings produced by FastMapSVM
for classifying Constraint Satisfaction Problems. . . . . . . . . . . . . . . . . . . . . . . . . 116
7.5 The behavior of FastMapSVM in the Constraint Satisfaction Problem domain with respect
to the number of dimensions and with respect to the size of the training data. . . . . . . . . . 117
8.1 An illustration of downsampling graphs via FastMap. . . . . . . . . . . . . . . . . . . . . . 127
8.2 A possible FastMapSVM enhancement that not only maps the input space to a Euclidean
space but also maps the output space to a Euclidean space. . . . . . . . . . . . . . . . . . . 132
Abstract
Many recent breakthroughs in Artificial Intelligence (AI) have stemmed from our ability to create embeddings of complex objects. A “good” embedding of the objects in a given domain assigns numerical coordinates to every object so as to capture the properties of individual objects as well as the relationships between
them. Hence, embeddings open up the possibility of using powerful analytical and geometric techniques to
reason about complex objects. For example, recent breakthroughs in Natural Language Processing (NLP)
utilize word and sentence embeddings. This dissertation is also based on the same paradigm but with a
strong emphasis on the efficiency of the embedding procedure. Towards this end, we build on an algorithm
called FastMap that was originally introduced in the Data Mining community.
FastMap is a clever embedding algorithm that circumvents the complexity of representing individual
objects by leveraging a distance (similarity) function on pairs of them. On the one hand, objects in many
real-world domains often require complex representations and cannot always be visualized as points in
geometric space. Examples of such complex objects include long deoxyribonucleic acid (DNA) strings,
multi-media datasets like voice excerpts or images, and medical datasets like electrocardiograms (ECGs)
or magnetic resonance images (MRIs). On the other hand, many of the same real-world domains naturally
lend themselves to well-defined distance functions on pairs of objects. Examples include the edit distance
between two DNA strings, the Minkowski distance between two images, and the cosine similarity between
two text documents. FastMap leverages such a distance function efficiently: It embeds a collection of N
complex objects in an artificially created Euclidean space in only O(κN) time, where κ is the user-specified
number of dimensions. While FastMap attempts to preserve all of the quadratic number of pairwise distances
between the objects in the Euclidean embedding, it remarkably does so in only linear time. Its efficiency has
already led to many applications in Data Mining, particularly with regard to fast indexing, searching, and
enabling clustering algorithms on complex objects that otherwise require a collection of points in geometric
space as input.
In the first part of this dissertation, we present a generalization of FastMap to graphs: This graph version
of FastMap, also called FastMap for convenience, embeds the vertices of an undirected edge-weighted graph
as points in a Euclidean space in near-linear time (linear time after ignoring logarithmic factors). The
pairwise Euclidean distances between the points approximate a desired graph-based distance function on
the corresponding vertices. With a proper choice of the distance function, FastMap allows us to efficiently
interpret a graph-theoretic problem in geometric space. This leads to an important upshot: In the modern
era, graphs are used to represent social networks, communication networks, and transportation networks,
among similar structures in many other domains with entities and relationships between them. These graphs
can be very large with millions of vertices and hundreds of millions of edges. Therefore, algorithms with a
running time that is quadratic or more in the size of the input are undesirable. In fact, algorithms with any
super-linear running times, discounting logarithmic factors, are also largely undesirable. Hence, a desired
algorithm should have a near-linear running time close to that of merely reading the input. FastMap tries
to address these requirements by first creating a geometric interpretation of a given graph-theoretic problem
in only near-linear time and consequently enabling analytical and geometric techniques that are better at
absorbing large input sizes compared to discrete algorithms that work directly on the input graph. We
apply FastMap to solve various new graph-theoretic problems of significant interest in AI, including facility
location, top-K centrality computations, community detection and block modeling, and graph convex hull
computations. Through comprehensive experimental results, we show that our FastMap-based approaches
outperform many state-of-the-art competing methods for these problems, in terms of both the efficiency and
the effectiveness.
In the second part of this dissertation, we propose a novel Machine Learning (ML) framework, called
FastMapSVM, which combines FastMap and Support Vector Machines (SVMs) to classify complex objects.
While Neural Networks (NNs) and related Deep Learning (DL) methods are popularly used for classifying
complex objects, they are generally based on the paradigm of characterizing individual objects. In contrast,
FastMapSVM is generally based on the paradigm of comparing pairs of objects via a distance function. One
benefit of FastMapSVM is that the distance function can encapsulate and invoke the intelligence of other
powerful algorithms such as the A* search procedure and maxflow computations, among many other optimization methods. The distance function can also incorporate domain-specific knowledge that otherwise
may be hard for ML algorithms to automatically extract from the data. FastMapSVM serves as a lightweight
alternative to NNs for classifying complex objects, particularly when training data or time is limited. It
also extends the applicability of SVMs to domains with complex objects by combining the complementary
strengths of FastMap and SVMs. Furthermore, FastMapSVM provides a perspicuous visualization of the
objects and the classification boundaries between them. This aids human interpretation of the data and results. It also enables a human-in-the-loop framework for refining the processes of learning and decision
making. We apply FastMapSVM to solve various classification problems of significant interest in AI, including the problem of identifying and classifying seismograms in Earthquake Science and the problem
of predicting the satisfiability of Constraint Satisfaction Problems (CSPs). Through comprehensive experimental results, we show that FastMapSVM outperforms many state-of-the-art competing methods for these
problems, particularly in terms of the training data and time required for producing high-quality results.
In summary, FastMap plays a key role in representing the vertices of a graph, or complex objects in
other domains, as points in Euclidean spaces. The ability of FastMap to efficiently generate these “simplified” representations of the vertices, or the complex objects, enables many powerful downstream algorithms
developed in diverse research communities such as AI, ML, Computational Geometry, Mathematics, Operations Research, and Theoretical Computer Science. Hence, we envision that FastMap can facilitate and
harness the confluence of these algorithms and find future applications in many other problem domains that
are not necessarily discussed here.
Chapter 1
Introduction to FastMap
In this chapter, we first review the FastMap algorithm as it was originally presented in the Data Mining
community. We then review the graph version of the FastMap algorithm. We finally review some of the
previous applications of the FastMap algorithm and its graph version. Hence, this chapter is intended to
provide the necessary background material for the remaining chapters in this dissertation. It is also intended
to provide some related work, although each following chapter provides more context and related work on
the subject matter it broaches.
1.1 The Original FastMap Algorithm in Data Mining
Many algorithms developed in Machine Learning (ML) and Computational Geometry require the input to be
a collection of points in Euclidean space. For example, routinely used clustering algorithms, such as the K-means algorithm, assume that the input is presented as a collection of points in Euclidean space. Other ML
algorithms that make the same assumption include Gaussian Mixture Model (GMM) clustering, Principal
Component Analysis (PCA), and Support Vector Machines (SVMs), among many others. Many important
algorithms developed in Computational Geometry also work on a collection of points in Euclidean space.
For example, algorithms that construct Voronoi diagrams [6] to answer nearest-neighbor queries efficiently require such an assumption. Similarly, many analytical techniques developed in Mathematics generally require the conceptualization of the objects in the domain as points in Euclidean space.

Figure 1.1: Illustrates the edit distance between two DNA strings. The left half shows two snippets of DNA strings extracted from a collection of them. The right half shows the minimum number of edit operations required to convert one to the other.

Figure 1.2: Shows a domain where the complex objects are images of animals. A well-defined clustering task is to group the images that portray the same animal species.

Figure 1.3: Shows a domain where the complex objects are text documents (albeit in different formats). A well-defined clustering task is to group the text documents that have similar content.
Unfortunately, a plethora of such useful algorithms are rendered useless in domains with complex objects
that cannot be easily conceptualized as points in Euclidean space. Figures 1.1, 1.2, and 1.3 show three
such domains in which the complex objects are deoxyribonucleic acid (DNA) strings, images, and text
documents, respectively. In these domains, although it is unwieldy to conceptualize the objects as points
in Euclidean space, it is easy to observe that the clustering tasks are well defined. Hence, the quest for a
universal way of extending the applicability of algorithms that work on a collection of points in Euclidean
space to domains with complex objects is of immense significance.
Apropos this quest, FastMap [29] was originally introduced in the Data Mining community to circumvent the complexity of representing individual objects by leveraging a distance (similarity) function on pairs
of them. On the one hand, many real-world domains have objects such as DNA strings, multi-media
datasets like voice excerpts or images, or medical datasets like electrocardiograms (ECGs) or magnetic
resonance images (MRIs) that require complex representations and cannot always be visualized as points
in geometric space. On the other hand, many of the same real-world domains naturally lend themselves to
well-defined distance functions on pairs of objects. Examples include the edit distance between two DNA
strings [101], the Minkowski distance between two images [66], and the cosine similarity between two text
documents [98]. Figure 1.1 illustrates the edit distance between two snippets of DNA strings: It is the
minimum number of insertions, deletions, or substitutions that are needed to transform one to the other.
FastMap embeds a collection of N complex objects in an artificially created Euclidean space in only
O(κN) time, where κ is the user-specified number of dimensions. While FastMap attempts to preserve all
of the O(N²) number of pairwise distances between the objects in the Euclidean embedding, it remarkably
does so in only linear time. Hence, it efficiently enables geometric interpretations, algebraic manipulations,
and downstream ML algorithms.
1.1.1 An Algorithmic Description of FastMap
Figure 1.4: Illustrates how coordinates are computed and recursion is carried out in FastMap, borrowed from [21]. (a) The “cosine law” projection in a triangle. (b) Projection onto a hyperplane that is perpendicular to OaOb.
FastMap gets as input a collection of N complex objects O, a domain-specific distance function D(·,·) defined on all pairs of objects, and a user-specified value of κ. It outputs a κ-dimensional Euclidean embedding of the complex objects. A Euclidean embedding assigns a κ-dimensional point pi ∈ R^κ to each object Oi. A good Euclidean embedding is one in which the Euclidean distance χij between any two points pi and pj closely approximates D(Oi, Oj). Here, for pi = ([pi]1, [pi]2 ... [pi]κ) and pj = ([pj]1, [pj]2 ... [pj]κ), χij = √(Σ_{r=1}^{κ} ([pj]r − [pi]r)²); and D(Oi, Oj) is the domain-specific distance between the objects Oi, Oj ∈ O.
In the very first iteration, FastMap heuristically identifies the farthest pair of objects Oa and Ob in linear time. Once Oa and Ob are determined, every other object Oi defines a triangle with sides of lengths dai = D(Oa, Oi), dab = D(Oa, Ob), and dib = D(Oi, Ob), as shown in Figure 1.4a. The sides of the triangle define its entire geometry, and the projection of Oi onto the line OaOb is given by:

xi = (dai² + dab² − dib²)/(2dab).    (1.1)
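To make Equation 1.1 concrete, consider made-up distances dab = 10, dai = 6, and dib = 8 (illustrative numbers only, not taken from any dataset in this dissertation):

xi = (dai² + dab² − dib²)/(2dab) = (36 + 100 − 64)/(2 · 10) = 72/20 = 3.6.

Note that xi can also be negative or exceed dab when the corresponding triangle is obtuse at one of the pivots; this is acceptable, since only the relative positions of the objects along the line OaOb matter.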
Algorithm 1: FASTMAP-DATAMINING: A linear-time algorithm for embedding complex objects in a Euclidean space.
Input: O, κ, and ε
Output: pi ∈ R^r for all Oi ∈ O
1:  for r = 1, 2 ... κ do
2:    Choose Oa ∈ O randomly and let Ob = Oa.
3:    for t = 1, 2 ... Q (a small constant) do
4:      {dai : Oi ∈ O} ← {D(Oa, Oi) : Oi ∈ O}.
5:      Oc ← argmax_{Oi} {dai² − Σ_{j=1}^{r−1} ([pi]j − [pa]j)²}.
6:      if Oc == Ob then
7:        Break.
8:      else
9:        Ob ← Oa; Oa ← Oc.
10:     end if
11:   end for
12:   d′ab ← D(Oa, Ob)² − Σ_{j=1}^{r−1} ([pb]j − [pa]j)².
13:   if d′ab < ε then
14:     r ← r − 1; Break.
15:   end if
16:   for each Oi ∈ O do
17:     d′ai ← D(Oa, Oi)² − Σ_{j=1}^{r−1} ([pi]j − [pa]j)².
18:     d′ib ← D(Oi, Ob)² − Σ_{j=1}^{r−1} ([pb]j − [pi]j)².
19:     [pi]r ← (d′ai + d′ab − d′ib)/(2√d′ab).
20:   end for
21: end for
22: return pi ∈ R^r for all Oi ∈ O.
FastMap sets the first coordinate of pi, the embedding of Oi, to xi. In the subsequent κ − 1 iterations,
the same procedure is followed for computing the remaining κ −1 coordinates of each object. However, the
distance function is adapted for different iterations. For example, for the first iteration, the coordinates of Oa
and Ob are 0 and dab, respectively. Because these coordinates fully explain the true domain-specific distance
between these two objects, from the second iteration onward, the rest of pa and pb’s coordinates should be
identical. Intuitively, this means that the second iteration should mimic the first one on a hyperplane that
is perpendicular to the line OaOb, as shown in Figure 1.4b. Although the hyperplane is never constructed
explicitly, its conceptualization implies that the distance function for the second iteration should be changed
for all i and j in the following way:
Dnew(O′i, O′j)² = D(Oi, Oj)² − (xj − xi)².    (1.2)

Here, O′i and O′j are the projections of Oi and Oj, respectively, onto this hyperplane, and Dnew(·,·) is the new
distance function. The same reasoning is used to derive the distance function for the third iteration from the
distance function for the second iteration, and so on.
In each of the κ iterations, FastMap heuristically finds the farthest pair of objects according to the
distance function defined for that iteration. These objects are called pivots and can be stored as reference
objects for future use. There are very few, that is, ≤ 2κ, reference objects. Technically, finding the farthest
pair of objects in any iteration takes O(N²) time. However, FastMap uses a linear-time “pivot changing”
heuristic [29] to efficiently and effectively identify a pair of objects Oa and Ob that is very often the farthest
pair. It does this by initially choosing a random object Ob and then choosing Oa to be the farthest object
away from Ob. It then reassigns Ob to be the farthest object away from Oa, reassigns Oa to be the farthest
object away from Ob, and so on, until convergence or a maximum of Q iterations, for a small constant
Q ≤ 10.
Algorithm 1 presents the pseudocode for the FastMap algorithm described above. It uses a threshold
parameter ε to detect large values of κ that have diminishing returns on the accuracy of approximating
the pairwise distances between the objects. Hence, it returns an embedding in R^r for a certain r ≤ κ.
On Lines 5, 12, 17, and 18, the algorithm invokes the arguments explained in Figure 1.4b and updates the
distance function as required by the current iteration. On Lines 2 to 11, it finds the farthest pair of objects via
the pivot changing heuristic. On Lines 12 to 20, Algorithm 1 invokes the arguments explained in Figure 1.4a
and computes the next coordinate of each object.
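For readers who prefer running code to pseudocode, the following is a minimal Python sketch of the procedure in Algorithm 1. It assumes only a list objects and a symmetric, non-negative, domain-specific function distance(a, b); the function name fastmap and all parameter names are illustrative and not taken from any released implementation.

import math
import random

def fastmap(objects, distance, kappa, epsilon=1e-9, Q=10):
    n = len(objects)
    coords = [[] for _ in range(n)]  # coords[i] is the partial embedding of objects[i]

    def residual_sq(i, j):
        # Squared distance with the contribution of the coordinates computed so far
        # subtracted out (repeated application of Equation 1.2).
        d2 = distance(objects[i], objects[j]) ** 2
        return d2 - sum((ci - cj) ** 2 for ci, cj in zip(coords[i], coords[j]))

    for _ in range(kappa):
        # Pivot-changing heuristic (Lines 2 to 11 of Algorithm 1).
        a = random.randrange(n)
        b = a
        for _ in range(Q):
            c = max(range(n), key=lambda i: residual_sq(a, i))
            if c == b:
                break
            a, b = c, a
        d2_ab = residual_sq(a, b)
        if d2_ab < epsilon:
            break  # diminishing returns: stop adding dimensions (Lines 13 to 15)
        # Cosine-law projection (Equation 1.1) for every object (Lines 16 to 20).
        for i in range(n):
            x = (residual_sq(a, i) + d2_ab - residual_sq(i, b)) / (2.0 * math.sqrt(d2_ab))
            coords[i].append(x)
    return coords

Each iteration makes O(QN) calls to the distance function, which matches the O(κN) behavior discussed above when each distance evaluation is treated as a constant-time operation.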
1.2 FastMap for Embedding Graphs
FastMap can also be used to embed the vertices of an undirected edge-weighted graph in a Euclidean space.
The idea is to view the vertices of a given undirected edge-weighted graph G = (V,E,w) as the objects to be
embedded. However, the distance function on the vertices can be defined in many ways. For illustration and
without much loss of generality, we assume that the distance function of interest is the one that returns the
shortest-path distances between vertices. Hence, the objective is to embed the vertices of G in a Euclidean
space so as to preserve the pairwise shortest-path distances between them as Euclidean distances.
As such, the original FastMap algorithm described in Algorithm 1 cannot be directly used for generating
the required Euclidean embedding of the vertices in linear time. This is because it assumes that the distance
dij between any two objects Oi and Oj can be computed in constant time, independent of the number
of objects in the problem domain. However, computing the shortest-path distance between two vertices
depends on the size of the graph.
The issue of having to retain (near-)linear time complexity can be addressed as follows: In each iteration,
after we heuristically identify the farthest pair of vertices Oa and Ob, the distances dai and dib need to be computed for all other vertices Oi. Computing dai and dib for any single vertex Oi can no longer be done in constant time but requires O(|E| + |V|log|V|) time instead [35]. However, since we need to compute these distances for all vertices, computing two shortest-path trees rooted at each of the vertices Oa and Ob yields all necessary shortest-path distances in one shot. Figure 1.5 illustrates the construction of these two shortest-path trees. The complexity of doing so is also O(|E| + |V|log|V|), which is only linear in the size of the graph¹. This simple yet critical observation revives the applicability of FastMap on graphs.
The foregoing observations are used in [68] to build a graph version of FastMap that embeds the vertices
of a given undirected edge-weighted graph in a Euclidean space in near-linear time: The Euclidean distances
approximate the pairwise shortest-path distances between vertices.
¹Unless |E| = O(|V|), in which case the complexity is near-linear in the size of the input because of the log|V| factor.
Figure 1.5: Illustrates the two shortest-path trees rooted at the pivots in each iteration of FastMap on graphs.
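Algorithm 2 below relies on a ShortestPathTree(·,·) primitive. The following is a minimal Python sketch of one way to realize it, assuming the graph is stored as an adjacency dictionary {u: [(v, w(u,v)), ...]} with non-negative edge weights; the helper name shortest_path_tree is illustrative.

import heapq

def shortest_path_tree(graph, root):
    # One Dijkstra run from `root`: returns the shortest-path distance to every
    # reachable vertex, i.e., the distances given by the shortest-path tree rooted there.
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

With a binary heap this runs in O((|V| + |E|) log|V|) time, which is within a logarithmic factor of the O(|E| + |V|log|V|) bound cited above.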
Algorithm 2: FASTMAP: A near-linear-time algorithm for embedding the vertices of a given undirected edge-weighted graph in a Euclidean space.
Input: G = (V, E, w), κ, and ε
Output: pi ∈ R^r for all vi ∈ V
1:  for r = 1, 2 ... κ do
2:    Choose va ∈ V randomly and let vb = va.
3:    for t = 1, 2 ... Q (a small constant) do
4:      {dai : vi ∈ V} ← ShortestPathTree(G, va).
5:      vc ← argmax_{vi} {dai² − Σ_{j=1}^{r−1} ([pi]j − [pa]j)²}.
6:      if vc == vb then
7:        Break.
8:      else
9:        vb ← va; va ← vc.
10:     end if
11:   end for
12:   {dai : vi ∈ V} ← ShortestPathTree(G, va).
13:   {dib : vi ∈ V} ← ShortestPathTree(G, vb).
14:   d′ab ← dab² − Σ_{j=1}^{r−1} ([pb]j − [pa]j)².
15:   if d′ab < ε then
16:     r ← r − 1; Break.
17:   end if
18:   for each vi ∈ V do
19:     d′ai ← dai² − Σ_{j=1}^{r−1} ([pi]j − [pa]j)².
20:     d′ib ← dib² − Σ_{j=1}^{r−1} ([pb]j − [pi]j)².
21:     [pi]r ← (d′ai + d′ab − d′ib)/(2√d′ab).
22:   end for
23: end for
24: return pi ∈ R^r for all vi ∈ V.
Figure 1.6: Illustrates the graph version of FastMap for shortest-path computations. The left panel shows an undirected edge-weighted graph. The right panel shows the 3-dimensional Euclidean embedding of it produced by FastMap. The Euclidean distances can be used for heuristic guidance while computing shortest paths (shown in red) via A* search.

Algorithm 2 presents the pseudocode for FastMap on graphs. It gets as input the undirected edge-weighted graph G = (V, E, w), where w(e) is the non-negative weight on edge e ∈ E, and the two parameters κ and ε. It outputs an embedding of the vertices in R^r for a certain r ≤ κ. As in Algorithm 1, r detects the point of diminishing returns on the accuracy of the embedding. Moreover, the overall control structure of Algorithm 2 mirrors that of Algorithm 1, with the difference that the distances from the pivots to all other objects in each iteration of Algorithm 2 are computed in one shot via the function ShortestPathTree(·,·) on Lines 4, 12, and 13. On Lines 5, 14, 19, and 20, Algorithm 2 invokes the arguments explained in Figure 1.4b and updates the distances as required by the current iteration. On Lines 2 to 11, it finds the farthest pair of vertices via the pivot-changing heuristic. On Lines 14 to 22, it invokes the arguments explained in Figure 1.4a and computes the next coordinate of each vertex.
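The following is a minimal Python sketch of Algorithm 2, reusing the shortest_path_tree helper sketched earlier and assuming a connected graph in the same adjacency-dictionary format; it illustrates the control flow above and is not the dissertation's own implementation.

import math
import random

def fastmap_graph(graph, kappa, epsilon=1e-9, Q=10):
    vertices = list(graph.keys())
    coords = {v: [] for v in vertices}

    def residual_sq(dist_from_pivot, pivot, v):
        # Squared shortest-path distance from the pivot, minus the part already
        # explained by the coordinates computed in earlier iterations.
        d2 = dist_from_pivot[v] ** 2
        return d2 - sum((cv - cp) ** 2 for cv, cp in zip(coords[v], coords[pivot]))

    for _ in range(kappa):
        a = random.choice(vertices)  # pivot-changing heuristic (Lines 2 to 11)
        b = a
        for _ in range(Q):
            d_a = shortest_path_tree(graph, a)
            c = max(vertices, key=lambda v: residual_sq(d_a, a, v))
            if c == b:
                break
            a, b = c, a
        d_a = shortest_path_tree(graph, a)  # the two trees of Lines 12 and 13
        d_b = shortest_path_tree(graph, b)
        d2_ab = residual_sq(d_a, a, b)
        if d2_ab < epsilon:
            break
        for v in vertices:  # one new coordinate per vertex (Lines 18 to 22)
            x = (residual_sq(d_a, a, v) + d2_ab - residual_sq(d_b, b, v)) / (2.0 * math.sqrt(d2_ab))
            coords[v].append(x)
    return coords

The Euclidean distances between the returned coordinate vectors then approximate the pairwise shortest-path distances, as described above.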
1.3 Applications
In this section, we briefly mention some of the applications of the original FastMap algorithm and the graph
version of it. The efficiency of the original FastMap algorithm has already led to many applications in Data Mining, particularly with regard to fast indexing, searching, and enabling clustering algorithms on complex objects that otherwise require a collection of points in geometric space as input [29]. In the same contexts, the embeddings produced by FastMap allow us to visualize the complex objects, their spread, and the outliers, thereby facilitating human-in-the-loop reasoning methods. As for the graph version of FastMap, a slight modification of it, presented in [21], preserves the admissibility and the consistency of the Euclidean distance approximation, required in an A* search framework, if used as a heuristic. This version of FastMap, used in the A* search framework for heuristic guidance, leads to one of the state-of-the-art algorithms for shortest-path computations. Figure 1.6 provides an illustration.

Figure 1.7: Illustrates how the FastMap embedding can be used to answer nearest-neighbor queries efficiently. The left panel shows snippets of DNA strings, on pairs of which the edit distance is well defined. The middle panel shows a 2-dimensional Euclidean embedding of these snippets of DNA strings produced by FastMap. The right panel shows a Voronoi diagram that is constructed on these point representations to answer nearest-neighbor queries efficiently at runtime.
With a runtime complexity close to that of merely reading the input, FastMap produces a Euclidean
embedding and a geometric interpretation of combinatorial problems on complex objects and graphs. In
doing so, it delegates the combinatorial heavy-lifting to analytical and geometric techniques that are often
better equipped for absorbing large input sizes. The properties of the Euclidean space can be leveraged in
many ways. First, it empowers analytical methods to set up and solve equations or establish other conditions
of optimality. Second, it empowers geometric methods to conceive of structures such as straight lines,
angles, and bisectors, which facilitate visual intuition and techniques from Computational Geometry.
1.3.1 Complementing FastMap with Locality Sensitive Hashing
While the input of FastMap and its graph version consists only of a finite number of objects, the Euclidean
space it generates is a continuous space and has an infinite number of points. Therefore, while every object
maps to a point in the Euclidean space, not every point in the Euclidean space maps to an object. In fact, a point in the Euclidean space deemed as belonging to a solution by a downstream algorithm—or any other point of interest in the Euclidean space—may not map to an object in the original problem domain.

Figure 1.8: Illustrates how FastMap and LSH are coupled with each other. FastMap allows us to efficiently interpret a graph (a collection of complex objects) in geometric space. LSH allows us to efficiently map a point of interest in the geometric space to the nearest vertex (object).
To address the foregoing concern, we assign any point of interest in the Euclidean space to its nearest neighbor that maps to an object. This requires us to answer nearest-neighbor queries very efficiently.
Fortunately, this problem is well studied in Computational Geometry. For example, in a 2-dimensional
Euclidean space with straight-line distances, nearest-neighbor queries can be answered in logarithmic time
using Voronoi diagrams [6]. Figure 1.7 illustrates this.
Although Voronoi diagrams are well defined in higher dimensions as well, constructing and using them
in higher-dimensional spaces may become computationally expensive. Therefore, in higher dimensions, we
use the idea of Locality Sensitive Hashing (LSH) [22]. LSH is a hashing technique that maps similar input
items to the same hash buckets with high probability. It answers nearest-neighbor queries very efficiently,
practically matching a near-logarithmic time complexity.
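As a rough illustration of the idea (and only that; production LSH indexes such as those in [22] are considerably more careful), the following Python sketch hashes Euclidean points by quantized random projections, so that nearby points tend to share buckets and a nearest-neighbor query only has to examine the colliding candidates. All class and method names are made up for this sketch.

import math
import random

class EuclideanLSH:
    def __init__(self, dim, num_tables=8, bucket_width=1.0):
        self.w = bucket_width
        self.points = {}
        # One random direction and offset per hash table.
        self.tables = [([random.gauss(0.0, 1.0) for _ in range(dim)],
                        random.uniform(0.0, bucket_width),
                        {}) for _ in range(num_tables)]

    def _key(self, direction, offset, p):
        return math.floor((sum(d * x for d, x in zip(direction, p)) + offset) / self.w)

    def index(self, points):
        # points: dict mapping a vertex/object identifier to its embedded coordinates.
        self.points = dict(points)
        for direction, offset, buckets in self.tables:
            for obj_id, p in self.points.items():
                buckets.setdefault(self._key(direction, offset, p), []).append(obj_id)

    def nearest(self, q):
        candidates = set()
        for direction, offset, buckets in self.tables:
            candidates.update(buckets.get(self._key(direction, offset, q), []))
        if not candidates:
            candidates = self.points.keys()  # rare fallback: plain linear scan
        return min(candidates,
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(self.points[i], q)))

In this coupling, the FastMap embedding is indexed once, and any point of interest produced by a downstream geometric algorithm can then be snapped back to an actual vertex or object via nearest(·).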
The efficiency and effectiveness of both FastMap and LSH allow us to combine them and rapidly switch
between the original problem domain and its geometric interpretation. Hence, solutions produced in one
space can be quickly interpreted in the other. Figure 1.8 illustrates this coupling, henceforth referred to as
the FastMap+LSH framework.
1.4 Overview and Contributions
In this dissertation, we present many new applications of FastMap, its graph version, and the FastMap+LSH
framework described above.
In the first part of the dissertation, we apply the graph version of FastMap to solve various new graph-theoretic problems of significant interest in AI. Chapters 2, 3, 4, and 5 describe such applications for facility location, top-K centrality computations, community detection and block modeling, and graph convex hull computations, respectively. In each case, we choose a proper distance function to interpret the graph-theoretic problem in geometric space, after which the appropriate analytical and geometric techniques are
invoked to better absorb large input sizes compared to discrete algorithms that work directly on the input graph. In each case, we also conduct comprehensive experiments and show that our FastMap-based
approach outperforms state-of-the-art competing methods, in terms of both the efficiency and the effectiveness. Overall, our approach leads to an important upshot: In the modern era, graphs are used to represent
social networks, communication networks, and transportation networks, among similar structures in many
other domains with entities and relationships between them. These graphs can be very large with millions
of vertices and hundreds of millions of edges. Therefore, algorithms with a running time that is quadratic
or more in the size of the input are undesirable. In fact, algorithms with any super-linear running times,
discounting logarithmic factors, are also largely undesirable. Hence, a desired algorithm should have a near-linear running time close to that of merely reading the input. Our approach addresses these requirements
by first creating a geometric interpretation of a given graph-theoretic problem in only near-linear time and
consequently enabling powerful downstream algorithms.
In the second part of the dissertation, we propose a novel ML framework, called FastMapSVM, which
combines FastMap and SVMs to classify complex objects. Chapter 6 presents FastMapSVM and its underlying concepts. It also presents an application of FastMapSVM for identifying and classifying seismograms
in Earthquake Science. Chapter 7 presents another important application of FastMapSVM for predicting the
satisfiability of Constraint Satisfaction Problems (CSPs). In both cases, we show that FastMapSVM outperforms many state-of-the-art competing methods, particularly in terms of the training data and time required
for producing high-quality results. Overall, our approach leads to an important upshot: While Neural Networks (NNs) and related Deep Learning (DL) methods are popularly used for classifying complex objects,
they are generally based on the paradigm of characterizing individual objects. In contrast, FastMapSVM
is generally based on the paradigm of comparing pairs of objects via a distance function. One benefit of
FastMapSVM is that the distance function can encapsulate and invoke the intelligence of other powerful
algorithms such as the A* search procedure and maxflow computations, among many other optimization
methods. The distance function can also incorporate domain-specific knowledge that otherwise may be hard
for ML algorithms to automatically extract from the data. FastMapSVM serves as a lightweight alternative
to NNs for classifying complex objects, particularly when training data or time is limited. It also extends
the applicability of SVMs to domains with complex objects by combining the complementary strengths of
FastMap and SVMs. Furthermore, FastMapSVM provides a perspicuous visualization of the objects and
the classification boundaries between them. This aids human interpretation of the data and results. It also
enables a human-in-the-loop framework for refining the processes of learning and decision making.
In summary, our contributions show that FastMap plays a key role in representing the vertices of a
graph, or complex objects in other domains, as points in Euclidean spaces. The ability of FastMap to
efficiently generate these “simplified” representations of the vertices, or the complex objects, enables many
powerful downstream algorithms developed in diverse research communities such as AI, ML, Computational
Geometry, Mathematics, Operations Research, and Theoretical Computer Science. Hence, we envision that
FastMap can facilitate and harness the confluence of these algorithms and find future applications in many
other problem domains that are not necessarily discussed here.
Chapter 2
FastMap for Facility Location Problems
Facility Location Problems (FLPs) arise while serving multiple customers in a shared environment, minimizing transportation and other costs. Hence, they involve the optimal placement of facilities. They are
defined on graphs as well as in Euclidean spaces with or without obstacles; and they are typically NP-hard
to solve optimally. There are many heuristic algorithms tailored to different kinds of FLPs. However, FLPs
defined in Euclidean spaces without obstacles are the most amenable to efficient and effective heuristic algorithms. This motivates the idea of quickly reformulating FLPs on graphs and in Euclidean spaces with
obstacles to FLPs in Euclidean spaces without obstacles. In this chapter, we propose a new approach towards this end based on FastMap and LSH. Through extensive experiments, we show that our approach
significantly outperforms other state-of-the-art competing algorithms on a variety of FLPs: the Multi-Agent
Meeting (MAM) problem, Vertex K-Median (VKM) problem, Weighted VKM (WVKM) problem, and the
Capacitated VKM (CVKM) problem.
2.1 Introduction
FLPs are constrained optimization problems that seek the optimal placement of facilities for providing resources and services to multiple customers in a shared environment. That is, FLPs serve the purpose of
orchestrating shared resources between multiple customers. They are used to model decision problems related to transportation, warehousing, polling, and healthcare, among many other tasks, for maximizing efficiency, impact, and/or profit. FLPs can be defined on graphs or in geometric spaces, in continuous or discrete environments, and with a variety of distance metrics and objectives. A compendium of FLPs along with various algorithms and case studies can be found in [31].

Figure 2.1: Illustrates the MAM, VKM, WVKM, and the CVKM problems. The first, second, third, and the fourth panels show the optimal solution in red for the MAM, VKM, WVKM, and the CVKM problems, respectively, on the same input graph. In the MAM problem, there are 3 agents, each initially on a different vertex shown in blue. The VKM, WVKM, and the CVKM problems have different optimal solutions for the same value of K = 2. In the WVKM problem, all vertices have weight 1, except ‘B’, which has weight 10. In the CVKM problem, τ = 4.
FLPs defined on graphs as well as in Euclidean spaces with or without obstacles are NP-hard to solve
optimally [46, 89]. Nonetheless, there are many heuristic algorithms tailored to different kinds of FLPs.
FLPs defined on graphs are broadly applicable, since most environments can be represented as a graph
(even if discretization is required). Modulo discretization, they are more general compared to FLPs defined in Euclidean spaces with obstacles, which, in turn, are more general compared to FLPs defined in
Euclidean spaces without obstacles. However, FLPs defined in Euclidean spaces without obstacles are the
most amenable to efficient and effective heuristic algorithms. In fact, FLPs in Euclidean spaces without obstacles are definitionally very close to clustering problems, which, in turn, are amenable to popular clustering
algorithms such as the K-means and GMM clustering [83].
The foregoing summary motivates the idea of quickly reformulating FLPs on graphs and in Euclidean
spaces with obstacles to FLPs in Euclidean spaces without obstacles. In this chapter, we propose a new
approach towards this end based on FastMap and LSH. We use FastMap to efficiently reformulate an FLP
on a graph to an FLP in a Euclidean space without obstacles.1 We use LSH to efficiently interpret a solution
found in the FastMap embedding as a solution on the original graph.
We address four well-known FLPs in this chapter: the MAM problem, VKM problem, WVKM problem,
and the CVKM problem. Below, we briefly describe each of these problems on graphs. Their counterparts
in Euclidean spaces with or without obstacles have analogous definitions. Moreover, we assume that the
graphs are undirected for two reasons: for the ease of exposition and to preserve the analogy in Euclidean
spaces where distances are inherently symmetric.
In the MAM problem [5], the input is a graph and a set of agents, each initially on a different start
vertex. The task is to find a common vertex where all the agents should meet so as to minimize the sum
of the agents’ shortest-path distances to it.2 The VKM problem seeks K vertices on the input graph for the
placement of facilities so as to minimize the sum of the shortest-path distances over each vertex to its nearest
facility. The WVKM problem is similar to the VKM problem, except that the objective is to minimize the
sum of the weighted shortest-path distances over each vertex to its nearest facility. Here, each vertex is given
a weight that measures its importance. The CVKM problem is also similar to the VKM problem, except that
no facility is allowed to serve more than τ vertices.3
The MAM, VKM, WVKM, and the CVKM problems have many real-world applications. For example, in multi-agent coordination tasks [5], they can be used to choose a gathering point; in urban development [30], they can be used to optimally place various public service centers within a city; and in communication networks [79], they can be used to determine the optimal placement of computation sites for critical
multiplexing. Figure 2.1 shows examples of these four problems posed on the same input graph.
1As explained later, an FLP in Euclidean space with obstacles is also amenable to a similar reformulation.
2There are a few other variants of the MAM problem described in [5]. These differ in being conflict-tolerant or conflict-free and
having different objective functions.
3
In a more general version of the CVKM problem, there is a supply and a demand associated with each facility and vertex,
respectively. No facility is allowed to serve a total demand that exceeds its supply.
On each of the FLPs described above, including their Euclidean variants, we demonstrate the efficiency
and effectiveness of our approach through extensive experimentation: We show that our approach significantly outperforms other state-of-the-art competing algorithms. In discretized Euclidean spaces, we also
show that it is possible to combine FastMap with an any-angle path planner, such as Anya [51].
2.2 Preliminaries
In this section, we define the MAM, VKM, WVKM, and the CVKM problems. We first define the graph
variants of these problems. We then briefly describe their Euclidean variants.
The MAM problem is as follows: Given an undirected edge-weighted graph G = (V,E,w), where w(e)
is the non-negative weight on edge e ∈ E, and the start vertices s1,s2 ...sk ∈V of k agents, the task is to find
a vertex v* ∈ V such that v* = argmin_{v ∈ V} Σ_{i=1}^{k} dG(si, v). Here, dG(vi, vj), for vi, vj ∈ V, is the shortest-path distance between vi and vj in G with respect to the edge weights.
The VKM problem is as follows: Given an undirected edge-weighted graph G = (V,E,w), where w(e)
is the non-negative weight on edge e ∈ E, and a positive integer K, the task is to find a subset of vertices
U* ⊆ V of cardinality K such that U* = argmin_{U} Σ_{v ∈ V} min_{u ∈ U} dG(v, u).
The WVKM problem is as follows: Given an undirected vertex-weighted and edge-weighted graph
G = (V, E, w̃, w), where w̃(v) is the non-negative weight on vertex v ∈ V and w(e) is the non-negative weight on edge e ∈ E, and a positive integer K, the task is to find a subset of vertices U* ⊆ V of cardinality K such that U* = argmin_{U} Σ_{v ∈ V} min_{u ∈ U} w̃(v) dG(v, u).
The CVKM problem is as follows: Given an undirected edge-weighted graph G = (V,E,w), where w(e)
is the non-negative weight on edge e ∈ E, and positive integers K and τ, the task is to find a subset of vertices
U* ⊆ V of cardinality K and an assignment function f* : V → U* such that

(U*, f*) = argmin_{U, f} Σ_{v ∈ V} dG(v, f(v))
subject to ∀u ∈ U : |{v ∈ V : f(v) = u}| ≤ τ.    (2.1)
The Euclidean variants of the MAM, VKM, WVKM, and the CVKM problems are defined in Euclidean
spaces, which are continuous. In a Euclidean space without obstacles, a given set of N points corresponds
to V; and the straight-line distances between pairs of these points correspond to the shortest-path distances
between the pairs of vertices. However, the solution may be allowed to contain points in the Euclidean space
outside of the given N points. In a Euclidean space with obstacles, shortest-path distances via free space,
that is, avoiding obstacle regions, replace straight-line distances; and the solution can only include points in
free space.
2.3 Solving Facility Location Problems via FastMap and Locality Sensitive
Hashing
In this section, we present our approach for solving FLPs via FastMap and LSH. We illustrate it on the
MAM, VKM, WVKM, and the CVKM problems. The FastMap component of our approach allows us to
quickly render the FLP in Euclidean space without obstacles: This enables efficient and effective geometric
and analytical techniques for solving the problem. The LSH component of our approach allows us to quickly
interpret the solution obtained in Euclidean space as a viable solution on the original graph.
2.3.1 Solving Facility Location Problems in Euclidean Space without Obstacles
Most FLPs defined on graphs are also defined in Euclidean spaces without obstacles. As described before,
the MAM, VKM, WVKM, and the CVKM problems are defined in such a space using Euclidean distances
instead of graph-based distances. There are two benefits of using FastMap to convert an FLP specified on a
graph to an FLP specified in the Euclidean embedding of that graph. First, FLPs defined in Euclidean spaces
without obstacles are the most amenable to efficient and effective heuristic algorithms. Second, invoking
FastMap with an intelligently designed distance function can simplify the problem even more.
For illustration of the above arguments, we consider the MAM problem on an input graph with k agents.
Solving it optimally requires the computation of k shortest-path trees rooted at the individual start vertices of
the agents. The same problem in Euclidean space without obstacles is referred to as the Fermat-Weber problem [26]. This problem is also NP-hard to solve optimally but is amenable to very effective heuristics [33].
Moreover, if FastMap is invoked on the input graph to preserve the square-roots of the shortest-path distances—instead of the shortest-path distances—the problem in the resulting Euclidean space becomes one
of finding a point that minimizes the sum of the squared distances to k given points. This is a significantly
easier problem since the required point is the centroid of the k given points.
Algorithm 2 (from Chapter 1) can be easily modified to incorporate the square-root of the shortest-path distance function √d_G(·,·) between vertices. This is done by returning the square-roots of the shortest-path distances found by the procedure ShortestPathTree(·,·) on Lines 4, 12, and 13.
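As a minimal illustration of this modification, the sketch below (assuming NetworkX and NumPy; the function names are placeholders, not the dissertation's implementation) returns the square-roots of the single-source shortest-path distances that the modified ShortestPathTree(·,·) would produce, and then solves the MAM instance in a given FastMap embedding by taking the centroid of the agents' points.

```python
import math
import networkx as nx
import numpy as np

def sqrt_shortest_path_tree(G, root):
    # Single-source shortest-path distances from root, returned as their
    # square-roots, i.e., the distances FastMap is asked to preserve here.
    dists = nx.single_source_dijkstra_path_length(G, root, weight='weight')
    return {v: math.sqrt(d) for v, d in dists.items()}

def mam_point_in_embedding(embedding, start_vertices):
    # In the square-root embedding, minimizing the sum of squared Euclidean
    # distances to the agents' points is solved exactly by their centroid.
    pts = np.array([embedding[s] for s in start_vertices])  # k x kappa matrix
    return pts.mean(axis=0)  # this point is mapped back to a vertex via LSH
```

Here, embedding is assumed to be a dictionary from vertices to their κ-dimensional FastMap coordinates.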
The VKM, WVKM, and the CVKM problems can also utilize the FastMap embedding with the square-root of the shortest-path distance function. Doing so makes them very similar to clustering problems popularly studied in ML. For example, in a Euclidean space without obstacles, clustering algorithms such as the K-means algorithm aim to minimize the sum of the squared Euclidean distances from each data point to its nearest centroid. With the Euclidean distances representing the square-roots of the shortest-path distances in the input graph, this is equivalent to solving the VKM problem of minimizing the sum of the shortest-path distances from each vertex to its nearest facility. Similarly, the WVKM problem can also be solved by invoking the K-means algorithm in the FastMap embedding that preserves the square-roots of the shortest-path distances: The K-means algorithm is also given a weight associated with each data point that measures its importance. Finally, the CVKM problem can also be solved by invoking the constrained K-means algorithm [15] in the FastMap embedding that preserves the square-roots of the shortest-path distances: The constrained K-means algorithm restricts the size of each cluster to be no more than a user-specified parameter τ.
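The reduction of the VKM and WVKM problems to (weighted) K-means can be sketched as follows with scikit-learn; X and vertex_weights are placeholder names for the |V| × κ matrix of FastMap coordinates and the vertex weights w̃(v). For the CVKM problem, the 'k-means-constrained' library mentioned in Section 2.5 exposes a similar interface with an upper bound on the cluster sizes.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_facility_centers(X, K, vertex_weights=None):
    # X: |V| x kappa FastMap coordinates (square-roots of shortest-path
    # distances preserved); vertex_weights: w~(v) for WVKM, or None for VKM.
    km = KMeans(n_clusters=K, n_init=10)
    km.fit(X, sample_weight=vertex_weights)
    # Each center minimizes the (weighted) sum of squared Euclidean distances
    # of its cluster, i.e., the (weighted) sum of shortest-path distances in G.
    return km.cluster_centers_  # mapped back to K facility vertices via LSH
```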
Although we have described how to solve FLPs in the Euclidean space generated by FastMap on an
input graph, the solutions produced reside in the Euclidean space and are not yet interpretable on the original
graph. As described in Chapter 1, we use LSH towards this end.
2.3.2 FastMap with Anya
Compared to FLPs in Euclidean spaces without obstacles, FLPs in Euclidean spaces with obstacles are much
harder to solve. Primarily, this is because the shortest-path distances in the latter are no longer straight-line distances. In fact, even by itself, computing the shortest path between two points in a Euclidean space with
obstacles may be very hard. Even shortest-path algorithms that operate in a Euclidean space with obstacles
have to make various kinds of assumptions on the nature of the obstacles and the acceptable paths that
maneuver through them. For the same reason, we define FLPs in Euclidean spaces with obstacles only
when the environment also supports Anya [51], a popular any-angle path planner. We note that this is not a
restriction on the kinds of FLPs that can be discussed but is a standardization of the input environment that
is also applicable to the state-of-the-art shortest-path algorithms.
Anya [51] is a recent any-angle shortest-path algorithm for grid-worlds. Given any two discrete points on a 2-dimensional grid-world, Anya finds a shortest any-angle path between them, if one exists. It uses a variant of A* search over sets of states represented as intervals. Anya is very efficient since it does not require preprocessing or the introduction of additional memory overheads.
In our FastMap-based approach for solving FLPs, the FastMap component always generates a Euclidean space without obstacles. It can be used to transform an input Euclidean space with obstacles to an output Euclidean space without obstacles if the straight-line distances in the output space preserve the desired distances in the input space. Towards this end, we can use the any-angle shortest-path distance function on the discrete points of a 2-dimensional grid-world, as generated by Anya. However, using the any-angle shortest-path distance function with FastMap poses the same fundamental challenge as using the regular shortest-path distance function with FastMap: To retain the near-linear time complexity of FastMap, the distances should not be computed from a root vertex to all other vertices independently. The computations have to be amortized to yield all of them simultaneously, as shown on Lines 4, 12, and 13 of Algorithm 2 (from Chapter 1).
Thus, we modify Anya to compute the entire tree of any-angle shortest-path distances from a root vertex to all other vertices. We call this version Anya-Dijkstra. Hence, FastMap with Anya is similar to Algorithm 2, except that it replaces the procedure ShortestPathTree(·,·) with Anya-Dijkstra and returns the square-roots of the any-angle shortest-path distances found by Anya-Dijkstra on Lines 4, 12, and 13.
2.4 Competing Algorithms
The MAM, VKM, WVKM, and the CVKM problems can be formulated using Integer Linear Programming (ILP). We use the template for the CVKM problem presented below. In this template, d_ij is a shorthand for d_G(v_i, v_j); c_j is a Boolean variable that is '1' iff v_j is a facility; and b_ij is a Boolean variable that is '1' iff v_j is the facility assigned to v_i. For the VKM problem, τ = |V|. For the WVKM problem, d_ij = w̃(v_i) d_G(v_i, v_j) and τ = |V|. For the MAM problem, the outer summation of the objective function spans only the start vertices of the agents, τ = |V|, and K = 1.

min ∑_{v_i ∈ V} ∑_{v_j ∈ V} b_ij d_ij
subject to ∀v_i ∈ V : ∑_{v_j ∈ V} b_ij = 1
∀v_j ∈ V : ∑_{v_i ∈ V} b_ij ≤ τ
∀v_i, v_j ∈ V : b_ij ≤ c_j
∑_{v_j ∈ V} c_j = K.   (2.2)
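For reference, a direct transcription of this template into the Gurobi Python interface (the ILP solver used in Section 2.5) might look as follows; d is assumed to be a precomputed |V| × |V| matrix of shortest-path distances, and the variable names are illustrative.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_cvkm_ilp(d, K, tau):
    n = len(d)
    m = gp.Model('cvkm')
    b = m.addVars(n, n, vtype=GRB.BINARY, name='b')  # b[i,j]=1 iff v_j serves v_i
    c = m.addVars(n, vtype=GRB.BINARY, name='c')     # c[j]=1 iff v_j is a facility
    m.setObjective(gp.quicksum(b[i, j] * d[i][j] for i in range(n) for j in range(n)),
                   GRB.MINIMIZE)
    m.addConstrs(b.sum(i, '*') == 1 for i in range(n))    # each vertex assigned once
    m.addConstrs(b.sum('*', j) <= tau for j in range(n))  # capacity of each facility
    m.addConstrs(b[i, j] <= c[j] for i in range(n) for j in range(n))
    m.addConstr(c.sum() == K)                             # exactly K facilities
    m.optimize()
    return [j for j in range(n) if c[j].X > 0.5]
```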
The MAM problem can be solved optimally in polynomial time and does not require the ILP solver: The algorithm computes a shortest-path tree rooted at each of the k start vertices of the agents.4 When heuristic guidance is available, MM* [5] can also be used to solve the MAM problem optimally. For the VKM problem, the state-of-the-art algorithm is FasterPAM [107], the successor of the Partition Around Medoids (PAM) algorithm [60]. FasterPAM conducts local search by repeatedly swapping a vertex from its current solution S with a vertex in V \ S. It runs in O(K|V|^2) time. For the WVKM problem, the state-of-the-art implementation of the PAM algorithm is available in the procedure 'wcKMedoids' [73] in R 4.3. The CVKM problem is significantly harder: To the best of our knowledge, there are no good solvers for this problem that scale to the problem sizes discussed in this chapter.
The ILP solver and the PAM algorithms require the precomputation of the all-pairs shortest-path distances, which can be done via the Floyd-Warshall algorithm.
4 FastMap, as in Algorithm 2 (from Chapter 1), has been used for the MAM problem [68]; but there, it does not use the square-roots of the shortest-path distances and, consequently, uses a heuristically computed solution, instead of the centroid, in the Euclidean space.
2.5 Experimental Results
In this section, we provide experimental results that compare our approach to competing algorithms on the
MAM, VKM, WVKM, and the CVKM problems.
We implemented our approach in Python3. For the LSH module, we used the ‘FALCONN’ library [2]
that has many code-level optimizations. For the K-means procedure without weights, required for solving
the VKM problem, and for the K-means procedure with weights, required for solving the WVKM problem, we used the ‘scikit-learn’ library [92]. For the constrained K-means procedure, required for solving
the CVKM problem, we used the ‘k-means-constrained’ library. For the ILP solver, we used the Gurobi
Optimizer 10.0 [47]. We used the Anya procedure via a Python interface to its implementation in Java. For
FasterPAM [107], we used the ‘kmedoids’ library. However, for the PAM clustering of weighted data, we
used the procedure ‘wcKMedoids’ [73] implemented in R 4.3. For the Floyd-Warshall algorithm, we used
the ‘NetworkX’ library [49]. We conducted all experiments on a laptop with an Apple M2 Max chip and 96
GB memory. For evaluation purposes, we chose two categories of problem instances, both of which contain
realistic FLPs.
In the first category, we chose problem instances representative of FLPs that arise in warehousing, urban planning, and transportation domains, among others. In such cases, the environment is essentially a
2-dimensional map. Moreover, in such cases, both the regular FastMap (FM), that is, FastMap that implements the procedure ShortestPathTree(·,·) using Dijkstra’s algorithm (DJK), and FastMap with Anya
(FMA), that is, FastMap that implements the procedure ShortestPathTree(·,·) using Anya-Dijkstra (ADJK),
are defined. This enables a more direct comparison of the various algorithms. Such instances are available in
the movingAI dataset [118]: Each instance serves as both a graph instance and a 2-dimensional grid-world
instance. In it, each discrete point on a traversable cell5
is represented as a vertex. Adjacent vertices, corresponding to discrete points on the same traversable cell, are connected by an edge. A horizontal or vertical
5 As defined in [51] for the application of Anya.
Instance | Size (|V|, |E|) | FM_pre (s) | k = 50: DJK Time (s), FM Time (s), FM SO (%) | k = 100: DJK Time (s), FM Time (s), FM SO (%)
orz102d (738, 2632) 0.05 0.09 0.00 1.23 0.19 0.00 0.93
den407d (852, 3054) 0.07 0.09 0.00 0.28 0.18 0.00 0.98
lak526d (954, 3329) 0.07 0.09 0.00 0.56 0.19 0.00 1.17
den009d (1003, 3620) 0.07 0.10 0.00 1.49 0.21 0.00 1.55
AR0512SR (896, 3275) 0.06 0.10 0.00 6.35 0.19 0.00 2.31
AR0402SR (1075, 3796) 0.10 0.13 0.00 6.49 0.26 0.00 1.70
AR0517SR (1083, 4078) 0.09 0.12 0.00 4.47 0.24 0.00 1.17
AR0530SR (1092, 3885) 0.08 0.11 0.00 6.45 0.22 0.00 0.57
Shanghai_0_256 (48696, 190303) 4.81 6.67 0.01 1.15 13.26 0.00 2.26
blastedlands (131342, 505974) 12.95 18.58 0.02 5.87 37.18 0.02 0.90
maze512-32-5 (253856, 990715) 21.82 35.25 0.04 1.09 70.54 0.04 2.81
wm00800 (800, 9498) 1.06 0.25 0.00 7.08 0.50 0.00 5.81
wm01000 (1000, 14923) 1.78 0.38 0.00 9.37 0.76 0.00 5.80
wm05000 (5000, 374925) 88.64 13.43 0.00 3.80 26.76 0.00 3.56
wm10000 (10000, 1499713) 360.29 54.00 0.00 1.65 107.38 0.00 1.13
Table 2.1: Shows the results for the MAM problem on various graph instances. ‘FM’, ‘FM_pre’, ‘DJK’,
and ‘SO’ stand for ‘FastMap’, ‘FastMap preprocessing’, ‘Dijkstra’, and ‘suboptimality’, respectively.
edge has unit weight but a diagonal edge has weight √2. If the graph constructed this way has multiple
connected components, only the largest one is used to represent the instance. We used five representative
benchmark suites in this category: ‘Dragon Age: Origins’, ‘Warcraft III’, ‘Baldurs Gate II’, ‘City/Street
Maps’, and ‘Mazes’. The first three are from commercial game environments; the fourth is from the real
world; and the fifth is artificial. Both FastMap and FastMap with Anya use κ = 10 for these instances.
In the second category, we chose problem instances representative of FLPs that arise in communication
networks. In the field of Computer and Communication Networks, Waxman graphs [125] are used as realistic communication networks. Hence, we generated Waxman graph instances using NetworkX [49] with
commonly used parameter values α = 0.3 and β = 0.1, within a rectangular domain of 100×100, and with
the weight on each edge set to the Euclidean distance between its endpoints. FastMap uses κ = 100 for
these instances.6
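A sketch of this instance generation with NetworkX is shown below. Note that the correspondence between the α and β values quoted above and NetworkX's alpha and beta keyword arguments should be verified against the library's documentation, since different sources swap the roles of the two Waxman parameters.

```python
import math
import networkx as nx

def waxman_instance(n, alpha=0.3, beta=0.1, side=100.0, seed=0):
    # NetworkX places the nodes uniformly at random in the given domain and
    # stores the coordinates in the 'pos' node attribute.
    G = nx.waxman_graph(n, beta=beta, alpha=alpha,
                        domain=(0, 0, side, side), seed=seed)
    for u, v in G.edges():
        # Edge weight: Euclidean distance between the endpoints' positions.
        G[u][v]['weight'] = math.dist(G.nodes[u]['pos'], G.nodes[v]['pos'])
    return G
```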
Tables 2.1-2.8 show the performance results of various algorithms on the MAM, VKM, WVKM, and the
CVKM problems. Although our experiments are extensive and conclusive, we present only representative
results in these tables. In each table, representative results are shown in sets of rows: the first set on instances
from ‘Dragon Age: Origins’, the second set on instances from ‘Baldurs Gate II’, and the third set on the
6The Normalized Root Mean Square Deviation, as used in [68] to measure the accuracy of the FastMap embedding, is much
higher for Waxman graphs even with κ = 100 compared to movingAI instances with κ = 10.
Instance | Size (|V|, |E|) | FMA_pre (s) | k = 50: ADJK Time (s), FMA Time (s), FM SO (%), FMA SO (%) | k = 100: ADJK Time (s), FMA Time (s), FM SO (%), FMA SO (%)
orz102d (738, 2632) 3.29 5.42 0.00 0.96 0.30 10.86 0.00 1.45 0.46
den407d (852, 3054) 2.66 3.92 0.00 0.91 0.58 8.07 0.00 0.09 0.13
lak526d (954, 3329) 3.77 5.12 0.00 2.52 2.52 10.01 0.00 2.90 5.15
den009d (1003, 3620) 2.00 3.46 0.00 3.65 1.08 6.96 0.00 0.05 0.29
AR0512SR (896, 3275) 10.81 13.84 0.00 4.91 2.71 27.62 0.00 2.76 3.47
AR0402SR (1075, 3796) 8.54 10.54 0.00 1.07 2.53 21.33 0.00 0.03 3.53
AR0517SR (1083, 4078) 9.84 13.89 0.00 6.33 4.28 27.95 0.00 1.82 2.72
AR0530SR (1092, 3885) 12.09 15.37 0.00 0.64 0.21 30.87 0.00 1.33 0.43
Shanghai_0_256 (48696, 190303) 16.77 16.09 0.01 1.56 1.27 32.12 0.01 1.06 3.07
blastedlands (131342, 505974) 44.81 35.70 0.02 6.87 7.52 71.59 0.02 3.70 5.50
maze512-32-5 (253856, 990715) 34.69 55.79 0.04 2.68 3.03 111.51 0.04 1.43 1.48
Table 2.2: Shows the results for the MAM problem on various grid-world instances. ‘FMA’, ‘FMA_pre’,
and ‘ADJK’ stand for ‘FastMap with Anya’, ‘FastMap with Anya preprocessing’, and ‘Anya-Dijkstra’,
respectively.
Instance | Size (|V|, |E|) | FW (s) | FM_pre (s) | K = 10: ILP Time (s), PAM Time (s), FM Time (s), PAM SO (%), FM SO (%) | K = 20: ILP Time (s), PAM Time (s), FM Time (s), PAM SO (%), FM SO (%)
orz102d (738, 2632) 27.10 0.05 79.34 0.00 0.00 1.47 3.70 175.76 0.00 0.00 1.84 1.01
den407d (852, 3054) 41.24 0.07 48.70 0.00 0.00 1.34 3.44 132.63 0.00 0.01 1.95 1.78
lak526d (954, 3329) 56.36 0.07 170.70 0.01 0.02 0.23 1.09 71.53 0.00 0.01 7.43 2.72
den009d (1003, 3620) 64.38 0.07 83.20 0.12 0.06 0.21 1.23 96.65 0.08 0.06 4.23 1.60
AR0512SR (896, 3275) 48.20 0.06 52.95 0.00 0.00 54.19 2.52 485.62 0.00 0.01 0.99 2.98
AR0402SR (1075, 3796) 82.27 0.10 73.03 0.07 0.05 1.85 2.68 74.51 0.11 0.16 0.11 4.06
AR0517SR (1083, 4078) 85.25 0.09 1556.52 0.15 0.12 0.22 0.80 4467.56 0.19 0.06 0.45 2.66
AR0530SR (1092, 3885) 84.71 0.08 121.16 0.08 0.03 1.43 0.41 209.84 0.13 0.10 0.12 5.69
Shanghai_0_256 (48696, 190303) - 4.40 - - 0.15 - - - - 0.25 - -
blastedlands (131342, 505974) - 12.88 - - 0.49 - - - - 0.72 - -
maze512-32-5 (253856, 990715) - 22.61 - - 0.65 - - - - 1.00 - -
wm00800 (800, 9498) 35.94 1.06 1650.12 0.00 0.02 0.03 16.50 3302.97 0.00 0.02 0.04 13.16
wm01000 (1000, 14923) 70.30 1.78 9188.83 0.06 0.04 0.07 15.04 9970.25 0.09 0.07 0.19 16.66
wm05000 (5000, 374925) - 88.64 - - 0.12 - - - - 0.16 - -
wm10000 (10000, 1499713) - 360.29 - - 0.33 - - - - 0.49 - -
Table 2.3: Shows the results for the VKM problem on various graph instances. ‘FW’, ‘ILP’, and ‘PAM’
stand for ‘Floyd-Warshall’, ‘ILP solver’, and ‘PAM algorithm’, respectively. The ILP solver and the PAM
algorithm require the Floyd-Warshall algorithm in a preprocessing phase for the computation of the all-pairs
shortest-path distances.
Instance | Size (|V|, |E|) | FMA_pre (s) | K = 10: ILP Time (s), FMA Time (s), PAM SO (%), FM SO (%), FMA SO (%) | K = 20: ILP Time (s), FMA Time (s), PAM SO (%), FM SO (%), FMA SO (%)
orz102d (738, 2632) 3.29 77.38 0.28 6.44 1.69 1.35 214.98 0.28 38.30 1.57 2.87
den407d (852, 3054) 2.66 49.04 0.28 0.51 3.65 1.32 144.29 0.23 31.33 2.42 4.77
lak526d (954, 3329) 3.77 170.97 0.28 0.04 1.80 0.12 72.49 0.28 0.31 1.98 6.70
den009d (1003, 3620) 2.00 83.16 0.28 -0.02 2.41 0.89 95.91 0.28 0.28 3.83 4.14
AR0512SR (896, 3275) 10.81 55.76 0.27 0.00 1.29 3.26 519.66 0.28 0.99 2.20 4.25
AR0402SR (1075, 3796) 8.54 74.70 0.29 2.01 1.21 5.10 75.09 0.29 0.51 4.07 4.27
AR0517SR (1083, 4078) 9.84 1581.05 0.28 3.68 0.78 0.73 5905.02 0.28 0.56 1.50 0.72
AR0530SR (1092, 3885) 12.09 122.93 0.28 1.90 3.64 3.33 211.75 0.30 1.58 3.91 1.26
Shanghai_0_256 (48696, 190303) 14.07 - 0.10 - - - - 0.30 - - -
blastedlands (131342, 505974) 37.21 - 0.47 - - - - 0.67 - - -
maze512-32-5 (253856, 990715) 39.19 - 0.68 - - - - 1.11 - - -
Table 2.4: Shows the results for the VKM problem on various grid-world instances.
largest instances from ‘City/Street Maps’, ‘Warcraft III’, and ‘Mazes’, in that order. These three sets are
from the first category and serve as both graph and grid-world instances. The odd-numbered tables also have
a fourth set of rows from the second category that serve only as graph instances. A ‘-’ is associated with any
Instance | Size (|V|, |E|) | FW (s) | FM_pre (s) | K = 10: ILP Time (s), PAM Time (s), FM Time (s), PAM SO (%), FM SO (%) | K = 20: ILP Time (s), PAM Time (s), FM Time (s), PAM SO (%), FM SO (%)
orz102d (738, 2632) 27.10 0.05 44.49 0.06 0.00 2.61 4.26 95.91 0.14 0.00 4.12 5.50
den407d (852, 3054) 41.24 0.07 52.29 0.15 0.00 1.69 0.72 74.92 0.18 0.01 4.91 4.90
lak526d (954, 3329) 56.36 0.07 101.26 0.09 0.02 1.19 2.78 60.25 0.21 0.01 3.35 2.49
den009d (1003, 3620) 64.38 0.07 79.18 0.11 0.01 1.12 1.63 137.58 0.35 0.01 2.39 4.09
AR0512SR (896, 3275) 48.20 0.06 146.60 0.08 0.00 1.57 0.72 119.31 0.33 0.01 1.68 3.78
AR0402SR (1075, 3796) 82.27 0.10 80.94 0.11 0.02 3.92 3.18 92.47 0.25 0.03 4.92 6.26
AR0517SR (1083, 4078) 85.25 0.09 2907.62 0.19 0.27 0.93 1.09 571.17 0.45 0.02 2.85 1.93
AR0530SR (1092, 3885) 84.71 0.08 318.33 0.14 0.02 2.15 0.90 127.48 0.28 0.02 3.67 5.11
Shanghai_0_256 (48696, 190303) - 4.77 - - 0.17 - - - - 0.25 - -
blastedlands (131342, 505974) - 13.10 - - 0.38 - - - - 0.70 - -
maze512-32-5 (253856, 990715) - 21.89 - - 0.68 - - - - 1.07 - -
wm00800 (800, 9498) 35.94 1.06 1946.18 0.12 0.24 0.00 15.84 1388.69 0.21 0.01 0.00 17.19
wm01000 (1000, 14923) 70.30 1.78 5135.40 0.20 0.02 0.00 11.62 15112.05 0.21 0.02 0.01 12.68
wm05000 (5000, 374925) - 88.64 - - 0.07 - - - - 0.07 - -
wm10000 (10000, 1499713) - 360.29 - - 0.32 - - - - 0.37 - -
Table 2.5: Shows the results for the WVKM problem on various graph instances.
Instance | Size (|V|, |E|) | FMA_pre (s) | K = 10: ILP Time (s), FMA Time (s), PAM SO (%), FM SO (%), FMA SO (%) | K = 20: ILP Time (s), FMA Time (s), PAM SO (%), FM SO (%), FMA SO (%)
orz102d (738, 2632) 3.29 37.61 0.28 2.63 1.49 2.60 37.60 0.22 3.97 3.51 2.95
den407d (852, 3054) 2.66 50.66 0.28 1.69 5.39 0.14 61.92 0.22 3.97 2.41 2.18
lak526d (954, 3329) 3.77 105.26 0.28 1.66 1.61 7.51 72.85 0.28 3.90 3.63 2.45
den009d (1003, 3620) 2.00 128.13 0.28 2.18 0.81 0.54 91.62 0.28 2.60 4.08 3.88
AR0512SR (896, 3275) 10.81 53.07 0.28 1.21 0.13 2.82 129.15 0.28 2.70 2.98 5.86
AR0402SR (1075, 3796) 8.54 84.35 0.29 2.31 4.10 4.63 100.25 0.29 1.90 2.52 5.81
AR0517SR (1083, 4078) 9.84 92.75 0.28 0.97 0.51 1.11 1358.58 0.28 2.11 1.98 1.67
AR0530SR (1092, 3885) 12.09 204.61 0.27 1.29 4.39 2.22 139.88 0.28 4.71 3.88 2.83
Shanghai_0_256 (48696, 190303) 14.98 - 0.21 - - - - 0.40 - - -
blastedlands (131342, 505974) 24.34 - 0.22 - - - - 0.45 - - -
maze512-32-5 (253856, 990715) 37.99 - 0.67 - - - - 1.03 - - -
Table 2.6: Shows the results for the WVKM problem on various grid-world instances.
Instance | Size (|V|, |E|) | FW (s) | FM_pre (s) | K = 10: ILP Time (s), FM Time (s), FM SO (%) | K = 20: ILP Time (s), FM Time (s), FM SO (%)
orz102d (738, 2632) 27.10 0.05 101.07 0.09 1.80 162.59 0.13 3.57
den407d (852, 3054) 41.24 0.07 49.35 0.04 4.02 166.29 0.09 2.88
lak526d (954, 3329) 56.36 0.07 184.90 0.04 7.82 84.68 0.28 3.34
den009d (1003, 3620) 64.38 0.07 86.95 0.17 1.45 114.34 0.26 4.10
AR0512SR (896, 3275) 48.20 0.06 66.96 0.06 2.89 495.82 0.10 3.19
AR0402SR (1075, 3796) 82.27 0.10 90.72 0.06 3.99 85.61 0.28 5.63
AR0517SR (1083, 4078) 85.25 0.09 1942.77 0.24 3.17 3615.14 0.29 2.85
AR0530SR (1092, 3885) 84.71 0.08 175.05 0.30 0.54 242.01 0.23 3.94
Shanghai_0_256 (48696, 190303) - 4.85 - 6.19 - - 33.48 -
blastedlands (131342, 505974) - 15.18 - 59.19 - - 62.15 -
maze512-32-5 (253856, 990715) - 22.50 - 13.18 - - 20.08 -
wm00800 (800, 9498) 35.94 1.06 2721.67 0.13 48.18 4590.93 0.27 62.34
wm01000 (1000, 14923) 70.30 1.78 10531.79 0.13 35.18 23911.17 0.24 58.05
wm05000 (5000, 374925) - 88.64 - 1.08 - - 2.56 -
wm10000 (10000, 1499713) - 360.29 - 3.07 - - 6.04 -
Table 2.7: Shows the results for the CVKM problem on various graph instances.
instance whose preprocessing time exceeds 60 min. The suboptimality ‘SO’ columns report (cost - optimal
cost)/(optimal cost) as a percentage. In general, our approach is the only one that can scale to large input
sizes for the VKM, WVKM, and the CVKM problems.
Instance | Size (|V|, |E|) | FMA_pre (s) | K = 10: ILP Time (s), FMA Time (s), FM SO (%), FMA SO (%) | K = 20: ILP Time (s), FMA Time (s), FM SO (%), FMA SO (%)
orz102d (738, 2632) 3.29 90.55 0.05 1.52 2.44 192.79 0.19 2.40 3.64
den407d (852, 3054) 2.66 66.04 0.09 2.12 4.45 153.89 0.10 2.43 2.32
lak526d (954, 3329) 3.77 187.95 0.09 2.01 2.08 78.51 0.27 3.91 1.65
den009d (1003, 3620) 2.00 116.48 0.08 1.00 1.05 106.17 0.25 1.59 1.60
AR0512SR (896, 3275) 10.81 62.47 0.26 5.51 3.47 521.99 0.13 4.24 8.09
AR0402SR (1075, 3796) 8.54 84.41 0.21 2.02 4.80 90.01 0.14 5.27 7.96
AR0517SR (1083, 4078) 9.84 1237.97 0.42 3.20 3.22 4324.49 0.35 3.35 2.45
AR0530SR (1092, 3885) 12.09 138.18 0.11 1.66 3.64 254.33 0.21 5.71 1.68
Shanghai_0_256 (48696, 190303) 17.02 - 12.27 - - - 21.37 - -
blastedlands (131342, 505974) 42.48 - 33.20 - - - 34.67 - -
maze512-32-5 (253856, 990715) 39.58 - 14.39 - - - 24.48 - -
Table 2.8: Shows the results for the CVKM problem on various grid-world instances.
Table 2.1 compares FastMap (FM) and the brute-force algorithm (DJK) on the MAM problem. DJK
computes the ground truth by rooting a shortest-path tree at each of the k start vertices of the agents. Here,
each graph instance is designed by picking the k start vertices at random. The FastMap preprocessing time
(FM_pre) refers to the time taken by FastMap to generate the Euclidean embedding of the graph plus the time
taken by LSH for the initial indexing. This preprocessing time is required only once per graph, independent
of k and the start vertices. We observe that FastMap is significantly faster than the brute-force algorithm on
larger instances, up to 4-5 orders of magnitude.7
In fact, FastMap is very often more efficient even with the
preprocessing time included. It also produces solutions that are within just 7% suboptimality on instances
from the first category and within 10% suboptimality on instances from the second category. Table 2.2 shows
a similar dominance of FastMap and FastMap with Anya (FMA) over the brute-force algorithm (ADJK) that
uses Anya to compute an any-angle shortest-path tree rooted at each of the k start vertices for generating the
ground truth on grid-world instances. The FastMap times are excluded from Table 2.2 since they appear in
Table 2.1. The suboptimality of FastMap is different in Tables 2.1 and 2.2, since the quality of a solution is
measured using any-angle shortest-path distances in Table 2.2.
Table 2.3 compares FastMap, the ILP solver, and FasterPAM for solving graph instances of the VKM
problem. Both the ILP solver and FasterPAM use the Floyd-Warshall algorithm (FW) in a preprocessing
step to compute the all-pairs shortest-path distances. The preprocessing time of FastMap is significantly
7 Based on the results reported in [5], FastMap also seems significantly faster, and more scalable to large graphs with large values of k, compared to MM*.
smaller than that of the Floyd-Warshall algorithm: The latter is about 3 orders of magnitude slower and not
even viable for large graphs. Moreover, at query time, FastMap and FasterPAM are both significantly faster
than the ILP solver. FastMap produces solutions within just 6% suboptimality on instances from the first
category and within 17% suboptimality on instances from the second category. FasterPAM also produces
high-quality solutions but it does so with occasional outliers and does not scale to large instances. Table 2.4
on grid-world instances shows a similar dominance of FastMap and FastMap with Anya—within similar
suboptimality ranges—over the ILP solver and FasterPAM with respect to the preprocessing time and over
the ILP solver with respect to the query time. The FastMap times, the Floyd-Warshall preprocessing times,
and the FasterPAM query times are excluded from Table 2.4 since they appear in Table 2.3.8 However, the query times of the ILP solver vary across different runs on the same instance and, thus, are included in Table 2.4. Tables 2.5 and 2.6 show the same trends for the WVKM problem,9 instances of which are
generated by assigning to each vertex a weight chosen uniformly at random from the interval [0,1). Here,
the PAM algorithm is the ‘wcKMedoids’ procedure in R 4.3.
Table 2.7 compares FastMap and the ILP solver for solving graph instances of the CVKM problem.
For representative results, these instances use τ = ⌈2|V|/K⌉. Although the CVKM problem renders many
approaches ineffective due to its comparative hardness over the VKM and the WVKM problems, FastMap
can still solve all the instances, even those with nearly a quarter-million vertices and a million edges, in mere
seconds. On smaller instances, FastMap is also significantly faster than the ILP solver, which generates the
ground truth. Moreover, FastMap produces solutions within just 8% suboptimality on instances from the
first category and within 63% suboptimality10 on instances from the second category. Table 2.8 shows the
8The computation of the all-pairs any-angle shortest-path distances is very expensive. Therefore, the Floyd-Warshall algorithm
is invoked by treating the grid-world as a graph. This also creates the remote possibility of the ILP solver producing a suboptimal
solution, although it is practically still treated as the ground truth.
9The ILP solver can also be an anytime solver. However, its performance is nowhere comparable to that of FastMap. For
example, on the representative instance ‘wm01000’ with K = 20, compared to FM, ILP takes about 400× time to produce a mere
34% suboptimal solution for both the VKM and the WVKM problems.
10 63% suboptimality is still impressive, since the CVKM problem is combinatorially very hard and, to date, there is no known polynomial-time constant-factor approximation algorithm for it.
Figure 2.2: Illustrates the FastMap pipeline—as an alternative to the Floyd-Warshall pipeline—for the fast
approximation of the all-pairs shortest-path distances. Here, the input graph is for the VKM problem with
K = 3. The distortion in the pairwise shortest-path distances produced by FastMap is largely absorbed by
FasterPAM and does not affect the quality of the final solution (shown in green).
Figure 2.3: Compares three methods for solving the VKM problem on a representative movingAI instance.
These three methods are generally applicable for solving FLPs on graphs, as mentioned in this chapter. On
this instance, all three methods produce solutions within 1% suboptimality for K = 3. However, they do so
with different preprocessing and query times. ‘APSP Distances’ refer to the all-pairs shortest-path distances.
The outputs of the three methods are shown in the same color used for each of them.
same trends for solving the grid-world instances of the CVKM problem with the same τ, within just 9%
suboptimality using FastMap and FastMap with Anya.
2.6 Discussion
While we have shown how FastMap can be used to solve FLPs on graphs by quickly reformulating them
to FLPs in Euclidean spaces without obstacles, there is another way in which FastMap can be used for
solving FLPs on graphs. In this method, the idea is to merely empower the existing state-of-the-art heuristic
algorithms that work directly on the input graph G. Many such existing algorithms first compute the metric
closure of G in a preprocessing phase, which, in turn, requires the computation of the all-pairs shortest-path
distances. As observed in the foregoing sections, this preprocessing phase can easily become a bottleneck.
FastMap can alleviate this issue by computing the all-pairs shortest-path distances approximately but very
efficiently: They are approximated by the Euclidean distances in the FastMap embedding. Figure 2.2 shows
the suggested pipeline of this method.
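In this pipeline, the metric closure is populated with Euclidean distances read off the embedding instead of exact shortest-path distances. A minimal sketch of that step, assuming X is the |V| × κ matrix of FastMap coordinates, is shown below.

```python
import numpy as np

def approximate_apsp(X):
    # Approximate all-pairs shortest-path distances by pairwise Euclidean
    # distances in the FastMap embedding; this takes O(|V|^2 * kappa) time and
    # memory instead of running a shortest-path algorithm from every vertex.
    diffs = X[:, None, :] - X[None, :, :]          # |V| x |V| x kappa
    return np.sqrt((diffs ** 2).sum(axis=-1))      # |V| x |V| distance matrix
```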
Figure 2.3 compares the three methods discussed so far. The first method refers to the existing method
of computing the metric closure of the input graph in a preprocessing phase followed by the application
of a state-of-the-art algorithm that works directly on the graph. The second method refers to our method
discussed in the previous sections of this chapter: It uses FastMap to create a Euclidean embedding, solves
the FLP in this Euclidean space without obstacles by relating it to a clustering problem, and finally uses LSH
to interpret the solution back on the original graph. The third method refers to the suggested method from
the previous paragraph: It resembles the first method, except that it uses FastMap to merely approximate the
all-pairs shortest-path distances required for computing the metric closure of the input graph.
In general, the third method is not as efficient and effective as the second method. However, it can still
be useful when the FLP in the Euclidean space without obstacles is not readily relatable to a well-studied
clustering problem. The Vertex K-Center (VKC) problem [78] may be one such problem: It is like the
VKM problem, except that the objective is to minimize the maximum of the shortest-path distances over
each vertex to its nearest facility. A comprehensive study of this third method can be found in [120].
2.7 Conclusions
In this chapter, we studied four representative FLPs: the MAM, VKM, WVKM, and the CVKM problems.
Like most FLPs, these problems are well defined on graphs as well as in Euclidean spaces with or without
obstacles. While they are generally difficult to solve optimally, the ones defined in a Euclidean space without
obstacles are akin to clustering problems. We used the idea of FastMap to reformulate FLPs defined on a
graph to FLPs defined in a Euclidean space without obstacles. Subsequently, we used standard clustering
algorithms to solve the problems in the resulting Euclidean space and LSH to interpret the solutions back
on the original graph. End to end, our approach produces high-quality solutions with orders-of-magnitude
speedup over state-of-the-art competing algorithms.
Appendix
2.A Table of Notations
Notation | Description
K | The number of facilities in the VKM, WVKM, and the CVKM problems.
τ | The upper bound on the number of vertices that any facility can serve.
G = (V,E,w) | An undirected edge-weighted graph, where w(e) is the non-negative weight on edge e ∈ E.
k | The number of agents in the MAM problem.
d_G(v_i, v_j) | The shortest-path distance between v_i and v_j in G with respect to the edge weights.
G = (V,E,w̃,w) | An undirected vertex-weighted and edge-weighted graph, where w̃(v) is the non-negative weight on vertex v ∈ V and w(e) is the non-negative weight on edge e ∈ E.
κ | The user-specified number of dimensions of the FastMap embedding.
Table 2.A.1: Describes the notations used in Chapter 2.
Chapter 3
FastMap for Efficiently Computing Top-K Projected Centrality
In this chapter, we describe how to use FastMap for efficient top-K centrality computations. In graph theory
and network analysis, various measures of centrality are used to characterize the importance of vertices in
a graph. Although different measures of centrality have been invented to suit the nature and requirements
of different underlying problem domains, their application is restricted to explicit graphs. Here, we first
define implicit graphs that involve auxiliary vertices in addition to the pertinent vertices. We then generalize
the various measures of centrality on explicit graphs to corresponding measures of projected centrality on
implicit graphs. Finally, we propose a FastMap-based unifying framework for approximately, but very
efficiently computing the top-K pertinent vertices in explicit graphs for various measures of centrality and
in implicit graphs for the generalizations of these measures to projected centrality.
3.1 Introduction
Graphs are used to represent entities in a domain and important relationships between them: Often, vertices
represent the entities and edges represent the relationships. However, graphs can also be defined implicitly
by using two kinds of vertices and edges between the vertices: The pertinent vertices represent the main
entities, that is, the entities of interest; the auxiliary vertices represent the hidden entities; and the edges
Figure 3.1: Shows a communication network with user terminals, routers, and switches. The solid lines
indicate direct communication links. Depending on the application, the user terminals may be considered as
the pertinent vertices while the routers and the switches may be considered as the auxiliary vertices.
represent relationships between the vertices. For example, in an air transportation domain, the pertinent vertices could represent the international airports, the auxiliary vertices could represent the domestic airports,
and the edges could represent the flight connections between the airports. In a social network, the pertinent
vertices could represent individuals, the auxiliary vertices could represent communities, and the edges could
represent friendships or memberships. Figure 3.1 shows another example in the domain of communication
networks. Here, depending on the application, the pertinent vertices could represent the user terminals,
the auxiliary vertices could represent the routers and the switches, and the edges could represent the direct
communication links between them.
Explicit and implicit graphs can be used to model transportation networks, social networks, communication networks, and biological networks, among many others. In most of these domains, the ability to identify
the “important” pertinent vertices has many applications. For example, the important pertinent vertices in
an air transportation network could represent transportation hubs, such as Amsterdam and Los Angeles
for Delta Airlines. The important pertinent vertices in a social network could represent highly influential
individuals. Similarly, the important pertinent vertices in a communication network could represent the
admin-users, and the important pertinent vertices in a properly modeled biological network could represent
biochemicals critical for cellular operations.
The important pertinent vertices in a graph (network) as well as the task of identifying them depend
on the definition of “importance”. Such a definition is typically domain-specific. It has been studied in
explicit graphs and is referred to as a measure of centrality. For example, the page rank is a popular
measure of centrality used in Internet search engines [90]. In general, there are several other measures of
centrality defined on explicit graphs, such as the degree centrality, the closeness centrality [37], the harmonic
centrality [12], the current-flow closeness centrality [116, 18], the eigenvector centrality [13], the Katz
centrality [59], and the betweenness centrality [36, 16, 17].
The degree centrality of a vertex measures its immediate connectivity, that is, the number of its neighbors. The closeness centrality of a vertex is the reciprocal of the average shortest-path distance between that vertex
and all other vertices. The harmonic centrality resembles the closeness centrality but reverses the sum and
reciprocal operations in its mathematical definition to be able to handle disconnected vertices and infinite
distances. The current-flow closeness centrality also resembles the closeness centrality but uses an “effective
resistance” between two vertices instead of the shortest-path distance between them. The eigenvector centrality scores the vertices based on the eigenvector corresponding to the largest eigenvalue of the adjacency
matrix. The Katz centrality generalizes the degree centrality by incorporating a vertex’s k-hop neighbors
with a weight α^k, where α ∈ (0,1) is an attenuation factor. The betweenness centrality of a vertex measures
the number of shortest paths between any two vertices that utilize it.
While many measures of centrality are frequently used on explicit graphs, they are not frequently used on
implicit graphs. However, for any measure of centrality, a measure of “projected” centrality can be defined
on implicit graphs. The measure of projected centrality is equivalent to the regular measure of centrality
applied on a graph that “factors out” the auxiliary vertices from the implicit graph. Auxiliary vertices can be
factored out by conceptualizing a clique on the pertinent vertices, in which an edge connecting two pertinent
vertices is annotated with a graph-based distance between them that, in turn, is derived from the implicit
graph.1 The graph-based distance can be the shortest-path distance or any other domain-specific distance.
If there are no auxiliary vertices, the graph-based distance function is expected to be such that the measure
of projected centrality reduces to the regular measure of centrality.
For a given measure of projected centrality, the projected centrality of a pertinent vertex is referred to
as its projected centrality value. Identifying the important pertinent vertices in a network is equivalent to
identifying the top-K pertinent vertices with the highest projected centrality values. Graph theoretically,
the projected centrality values can be computed for all pertinent vertices of a network in polynomial time,
for most measures of projected centrality. However, in practical domains, the real challenge is to achieve
scalability to very large networks with millions of vertices and hundreds of millions of edges. Therefore,
algorithms with a running time that is quadratic or more in the size of the input are undesirable. In fact,
algorithms with any super-linear running times, discounting logarithmic factors, are also largely undesirable.
In other words, modulo logarithmic factors, a desired algorithm should have a near-linear running time close
to that of merely reading the input.
Although attempts to achieve such near-linear running times exist, they are applicable only for certain
measures of centrality on explicit graphs. For example, [28, 20] approximate the closeness centrality using
sampling-based procedures. A number of approaches [38, 133] approximate the betweenness centrality
using refined estimators and a near-linear-time hypergraph sketching procedure. [10, 11] approximate the
betweenness centrality on dynamic graphs. [9] maintains and updates a lower bound for each vertex, utilizing
the bound to skip the analysis of a vertex when appropriate. It supports fairly efficient approximation
algorithms for computing the top-K vertices for the closeness and the harmonic centrality measures. [14]
provides a survey on many of the algorithms mentioned above. However, algorithms of the aforementioned
kind are known only for a few measures of centrality on explicit graphs. Moreover, such algorithms do not
provide a general framework since they are tied to specific measures of centrality.
1The clique on the pertinent vertices is a mere conceptualization. Constructing it explicitly may be expensive and/or practically
prohibitive for large graphs since it requires the computation of the graph-based distance between every pair of the pertinent vertices.
In this chapter, we generalize the various measures of centrality on explicit graphs to corresponding
measures of projected centrality on implicit graphs. Importantly, we also propose a framework for computing the top-K pertinent vertices approximately, but very efficiently, using FastMap, for various measures
of centrality and projected centrality. The FastMap framework allows us to conceptualize the various measures of centrality and projected centrality in Euclidean space, thereby facilitating a variety of geometric
and analytical techniques for efficiently computing the top-K pertinent vertices. It is extremely valuable
because it implements this reformulation for different measures of centrality and projected centrality in only
near-linear time and delegates the combinatorial heavy-lifting to geometric and analytical techniques that
are better equipped for efficiently absorbing large input sizes.
Computing the top-K pertinent vertices in the FastMap framework for different measures of projected
centrality often requires interpreting analytical solutions found in the FastMap embedding back in the graphical space. We achieve this via nearest-neighbor queries and LSH. Through experimental results on a comprehensive set of benchmark and synthetic instances, we show that the FastMap+LSH framework is both
efficient and effective for many popular measures of centrality and their generalizations to projected centrality. For our experiments, we also implement generalizations of some competing algorithms on implicit
graphs. Overall, our approach demonstrates the benefits of drawing power from analytical techniques via
FastMap for efficiently computing the top-K projected centrality.
3.2 Measures of Projected Centrality
In this section, we generalize measures of centrality to corresponding measures of projected centrality.
Consider an implicit graph G = (V,E,w), where V^P ⊆ V and V^A ⊆ V, for V^P ∪ V^A = V and V^P ∩ V^A = ∅, are the pertinent vertices and the auxiliary vertices, respectively. We define a graph G^P = (V^P, E^P, w^P), where, for any two distinct vertices v^P_i, v^P_j ∈ V^P, the edge (v^P_i, v^P_j) ∈ E^P is annotated with the weight w((v^P_i, v^P_j)) = D_G(v^P_i, v^P_j). Here, D_G(·,·) is a distance function defined on pairs of vertices in G. For any measure of centrality M defined on explicit graphs, an equivalent measure of projected centrality M^P can be defined on implicit graphs as follows: M^P on G is equivalent to M on G^P.
The distance function D_G(·,·) can be the shortest-path distance function or any other domain-specific distance function. If it is a graph-based distance function, computing it would typically require the consideration of the entire graph G, including the auxiliary vertices V^A. For example, computing the shortest-path distance between v^P_i and v^P_j in V^P requires us to utilize the entire graph G. Other graph-based distance functions are the probabilistically-amplified distance function, introduced in Chapter 4, and the effective resistance between two vertices when interpreting the non-negative weights on edges as electrical resistance values.
3.3 FastMap for Top-K Centrality and Projected Centrality
In this section, we first show how to use the FastMap framework, coupled with LSH, for efficiently computing the top-K pertinent vertices with the highest centrality values in a given explicit graph (network),
for various measures of centrality. We then generalize our methodology to the corresponding measures of
projected centrality on implicit graphs. We note that the FastMap framework is applicable as a general
paradigm, independent of the measure of centrality or projected centrality: The measure of centrality or
projected centrality that is specific to the problem domain affects only the distance function used in the
FastMap embedding and the analytical techniques that work on it. In other words, the FastMap framework
allows us to interpret and reason about the various measures of centrality or projected centrality by invoking
the power of analytical techniques. This is in stark contrast to other approaches that are tailored to a specific
measure of centrality or its corresponding measure of projected centrality.
We recollect that, in the FastMap framework, any point of interest computed analytically in the Euclidean
embedding may not map to a vertex in the original graph. Therefore, we use LSH to find the point closest to
the point of interest that corresponds to any of the vertices. In fact, LSH not only answers nearest-neighbor
queries very efficiently but also finds the top-K nearest neighbors of a query point efficiently.
We assume that the input is an undirected edge-weighted graph G = (V,E,w), where V is the set of
vertices, E is the set of edges, and for any edge e ∈ E, w(e) is the non-negative weight on it. We also assume
that G is connected since several measures of centrality and projected centrality are not very meaningful
for disconnected graphs.2 For simplicity, we further assume that there are no self-loops or multiple edges
between any two vertices.
In the rest of this section, we first show how to use the FastMap framework for computing the top-K
vertices in explicit graphs, for some popular measures of centrality. We then show how to use the FastMap
framework more generally for computing the top-K pertinent vertices in implicit graphs, for the corresponding measures of projected centrality.
3.3.1 FastMap for Closeness Centrality on Explicit Graphs
Let dG(u, v) denote the shortest-path distance between two distinct vertices u, v ∈ V. The closeness centrality [37] of v is the reciprocal of the average shortest-path distance between v and all other vertices. It is
defined as follows:
C_clo(v) = (|V| − 1) / ∑_{u ∈ V, u ≠ v} d_G(u, v).   (3.1)
Computing the closeness centrality values of all vertices and identifying the top-K vertices with the highest such values require calculating the shortest-path distances between all pairs of vertices. All-pairs shortest-path computations generally require O(|V||E| + |V|^2 log|V|) time [34].
The FastMap framework allows us to avoid the above complexity and compute the top-K vertices using a geometric interpretation. We know that, given N points q_1, q_2, ..., q_N in Euclidean space R^κ, finding the point q that minimizes ∑_{i=1}^{N} (q − q_i)^2 is easy. In fact, it is the centroid given by q = (∑_{i=1}^{N} q_i)/N. Therefore, we can use the distance function √d_G(·,·) in Algorithm 2 (from Chapter 1) to embed the square-roots of the shortest-path distances between vertices. This is done by returning the square-roots of the shortest-path distances found by ShortestPathTree() on Lines 4, 12, and 13. Computing the centroid in the resulting embedding minimizes the sum of the shortest-path distances to all vertices. This centroid is mapped back to the original graphical space via LSH.
2 For disconnected graphs, we usually consider the measures of centrality and projected centrality on each connected component separately.
Overall, we use the following steps to find the top-K vertices: (1) Use FastMap with the square-root
of the shortest-path distance function between vertices to create a Euclidean embedding; (2) Compute the
centroid of all points corresponding to vertices in this embedding; and (3) Use LSH to return the top-K
nearest neighbors of the centroid.
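A compact sketch of these three steps is shown below, with a standard nearest-neighbor index standing in for the LSH module; X is assumed to be the |V| × κ matrix of FastMap coordinates produced with the square-root distance function, and vertices is the list of vertex identifiers aligned with its rows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def top_k_closeness(X, vertices, K):
    centroid = X.mean(axis=0, keepdims=True)        # step (2): centroid of all points
    nn = NearestNeighbors(n_neighbors=K).fit(X)     # stand-in for the LSH index
    _, idx = nn.kneighbors(centroid)                # step (3): K points nearest to it
    return [vertices[i] for i in idx[0]]
```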
3.3.2 FastMap for Harmonic Centrality on Explicit Graphs
The harmonic centrality [12] of a vertex v is the sum of the reciprocal of the shortest-path distances between
v and all other vertices. It is defined as follows:
C_har(v) = ∑_{u ∈ V, u ≠ v} 1 / d_G(u, v).   (3.2)
As in the case of closeness centrality, the time complexity of computing the top-K vertices, based on shortest-path algorithms, is O(|V||E| + |V|^2 log|V|). However, the FastMap framework once again allows us to avoid this complexity and compute the top-K vertices using analytical techniques. Given N points q_1, q_2, ..., q_N in Euclidean space R^κ, finding the point q that maximizes ∑_{i=1}^{N} 1/∥q − q_i∥ is not easy. However, the Euclidean space enables gradient ascent and the standard ingredients of local search to avoid local maxima and efficiently arrive at good solutions. In fact, the centroid obtained after running Algorithm 2 (from Chapter 1) is a good starting point for the local search.
Overall, we use the following steps to find the top-K vertices: (1) Use Algorithm 2 to create a Euclidean embedding; (2) Compute the centroid of all points corresponding to vertices in this embedding; (3) Perform gradient ascent starting from the centroid to maximize ∑_{i=1}^{N} 1/∥q − q_i∥; and (4) Use LSH to return the top-K nearest neighbors of the result of the previous step.
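Step (3) can be realized with a few lines of NumPy. The sketch below performs plain gradient ascent on f(q) = ∑_i 1/∥q − q_i∥ starting from the centroid; the step size and iteration count are illustrative, and a full implementation would add the local-search safeguards mentioned above.

```python
import numpy as np

def harmonic_ascent(X, step=0.05, iters=200, eps=1e-9):
    # Gradient ascent on f(q) = sum_i 1/||q - x_i|| over the embedded points X.
    q = X.mean(axis=0)  # start from the centroid
    for _ in range(iters):
        diffs = q - X                                    # N x kappa
        norms = np.linalg.norm(diffs, axis=1) + eps      # avoid division by zero
        grad = -(diffs / (norms ** 3)[:, None]).sum(axis=0)  # gradient of f at q
        q = q + step * grad
    return q  # mapped back to the top-K vertices via LSH
```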
3.3.3 FastMap for Current-Flow Closeness Centrality on Explicit Graphs
The current-flow closeness centrality [116, 18] is a variant of the closeness centrality based on “effective
resistance”, instead of the shortest-path distance, between vertices. It is also known as the information
centrality, under the assumption that information spreads like electrical current. The current-flow closeness
centrality of a vertex v is the reciprocal of the average effective resistance between v and all other vertices.
It is defined as follows:
C_cfc(v) = (|V| − 1) / ∑_{u ∈ V, u ≠ v} R_G(u, v).   (3.3)
The term RG(u, v) represents the effective resistance between u and v. A precise mathematical definition for
it can be found in [18].
Computing the current-flow closeness centrality values of all vertices and identifying the top-K vertices
with the highest such values are slightly more expensive than calculating the shortest-path distances between
all pairs of vertices. The best known time complexity is O(|V||E|log|V|) [18].
Once again, the FastMap framework allows us to avoid the above complexity and compute the top-K vertices by merely changing the distance function used in Algorithm 2 (from Chapter 1). We use the probabilistically-amplified shortest-path distance (PASPD) function presented in Algorithm 3 (from Chapter 4).3 The PASPD function computes the sum of the shortest-path distances between two vertices in a set of graphs Gset. Gset contains different lineages of graphs, each starting from the given graph. In each lineage, a fraction of probabilistically-chosen edges is progressively dropped to obtain nested subgraphs. The probabilistically-amplified shortest-path distance captures the effective resistance between two vertices for the following two reasons: (a) the larger d_G(u, v) is, the larger the probabilistically-amplified shortest-path distance between u and v, just as the effective resistance between them should be larger; and (b) the larger the number of paths between u and v in G, the smaller the probabilistically-amplified shortest-path distance between them, just as the effective resistance between them should be smaller.
3 We ignore edge-complement graphs by deleting Line 14 of this algorithm.
Overall, we use the same steps as in Section 3.3.1 to find the top-K vertices, except for the distance
function being modified to the PASPD function.
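The sketch below illustrates the idea behind the PASPD function as described above; it is a simplified stand-in rather than Algorithm 3 itself, with the number of lineages, the number of levels, and the drop fraction as illustrative parameters.

```python
import random
import networkx as nx

def paspd(G, u, v, lineages=3, levels=3, drop_frac=0.2, seed=0):
    # Sum of shortest-path distances over several lineages of nested subgraphs,
    # each obtained by repeatedly dropping a random fraction of the edges.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(lineages):
        H = G.copy()
        for _ in range(levels):
            try:
                total += nx.dijkstra_path_length(H, u, v, weight='weight')
            except nx.NetworkXNoPath:
                pass  # u and v became disconnected in this subgraph
            edges = list(H.edges())
            H.remove_edges_from(rng.sample(edges, int(drop_frac * len(edges))))
    return total
```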
3.3.4 FastMap for Normalized Eigenvector Centrality on Explicit Graphs
Suppose G is an undirected unweighted graph with the adjacency matrix A, where A_ij is set to 1 if (v_i, v_j) ∈ E and to 0 otherwise. The eigenvector centrality [13] of v_i is the corresponding component of the eigenvector for the largest eigenvalue of A. In essence, the eigenvector centrality of a vertex depends on those of its neighbors, imparting to it the semblance of the page rank. In fact, the normalized eigenvector centrality is a measure that is intimately related to the page rank [90]. It uses a matrix N that is more generally defined for directed unweighted graphs as follows:

N_ij = 1/od(v_i) if (v_i, v_j) ∈ E, and 0 otherwise.   (3.4)

Here, od(v_i) refers to the out-degree of v_i. If G is undirected, od(v_i) is just the degree of v_i, that is, the number of neighbors of v_i.
N uses a row-wise normalization of the entries of A and, therefore, is a row-stochastic matrix. When G is edge-weighted, the weight of an edge from v_i to v_j can be interpreted as a "resistance" to the transition from v_i to v_j. Therefore, N can be generalized as follows:

N_ij = (1/w((v_i, v_j))) / (∑_{(v_i, u) ∈ E} 1/w((v_i, u))) if (v_i, v_j) ∈ E, and 0 otherwise.   (3.5)

The normalized eigenvector centrality of v_i is the corresponding component of the eigenvector e for the largest eigenvalue λ of N^T. Since N is a row-stochastic matrix, λ = 1 and e satisfies the equation e = N^T e.
Moreover, by the Perron–Frobenius theorem [95], e has a unique solution with positive entries.
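For reference, the leading eigenvector of N^T can be obtained by power iteration. The sketch below builds the generalized row-stochastic matrix of Equation 3.5 from a weighted adjacency matrix W, with W[i][j] = w((v_i, v_j)) on edges and 0 elsewhere, and iterates e ← N^T e; it assumes a connected graph and, for simplicity, ignores the periodicity issues that damping would address.

```python
import numpy as np

def normalized_eigenvector_centrality(W, iters=1000, tol=1e-10):
    # W: |V| x |V| weighted adjacency matrix; W[i][j] = 0 means no edge.
    with np.errstate(divide='ignore'):
        inv = np.where(W > 0, 1.0 / W, 0.0)      # 1 / w((v_i, v_j)) on edges
    N = inv / inv.sum(axis=1, keepdims=True)     # row-stochastic matrix of Eq. 3.5
    e = np.full(len(W), 1.0 / len(W))
    for _ in range(iters):
        e_next = N.T @ e                          # one power-iteration step
        e_next /= e_next.sum()
        if np.abs(e_next - e).max() < tol:
            break
        e = e_next
    return e_next  # components are the normalized eigenvector centrality values
```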
In general, for a directed graph on |V| vertices, computing the eigenvector centrality values of all vertices
and identifying the top-K vertices with the highest such values are more expensive than calculating the
largest eigenvalue of a |V| × |V| matrix. Although these tasks are easier on undirected graphs, producing
accurate results for them via the FastMap framework is a significant step towards generalization of the
framework to directed graphs.4
For undirected graphs, the FastMap framework allows us to compute the top-K vertices using analytical techniques. It is well known that row-stochastic matrices relate to infinite-length random walks and
stationary distributions.5
In turn, random walks are related to Brownian motion, diffusion equations, and
heat equations [64]. Therefore, it is conceivable that setting up a proper heat equation in the FastMap embedding could efficiently generate high-quality solutions. However, this refined approach is left for future
work. Here, we use a simpler approach for a proof of concept. Since Gaussian distributions solve certain
kinds of heat equations, we fit a mixture of Gaussian distributions on the point representations of vertices in
4FastMap has already been generalized to directed graphs [43]. However, the subsequent analytical techniques have to be
generalized.
5 In such random walks, we start from an initial vertex and, in each iteration, we hop from the current vertex u to one of its neighbors v chosen with a probability proportional to 1/w((u, v)).
the FastMap embedding. The vertices close to the centers of the dominant Gaussian distributions are more
likely to have higher normalized eigenvector centrality values.
Towards this end, we invoke a GMM clustering procedure.6 GMM clustering generates k clusters, for a specified value of k. Each cluster c ∈ {1, 2, ..., k} is characterized by a Gaussian distribution π_c N(x; µ_c, Σ_c), where π_c is the amplitude representing its total probability mass, µ_c is its center, and Σ_c is its covariance matrix. While it is possible to choose k automatically via the Silhouettes measure [102], doing so is computationally expensive. Instead, we choose k = 3 and hierarchically decompose each resulting cluster into 3 smaller clusters. We set the total probability mass of each cluster c in the resulting 9 clusters to π_c π_pa(c), where pa(c) is the parent cluster of c. Using LSH, we return the top-K nearest neighbors of µ_c*, where c* = argmax_c π_c π_pa(c).
Overall, we use the following steps to find the top-K vertices: (1) Use Algorithm 2 (from Chapter 1) to
create a Euclidean embedding; (2) Use hierarchical GMM clustering and find the dominant cluster; and (3)
Use LSH to return the top-K nearest neighbors of the center of the Gaussian distribution representing this
cluster.
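The hierarchical GMM step can be sketched with scikit-learn's GaussianMixture as follows; the function name is illustrative, each top-level cluster is assumed to contain at least 3 points, and LSH replaces the final nearest-neighbor lookup in the actual pipeline.

```python
from sklearn.mixture import GaussianMixture

def dominant_gaussian_center(X):
    # Two-level GMM decomposition (3 clusters, each split into 3) over the
    # FastMap coordinates X; returns the center of the most massive sub-cluster.
    top = GaussianMixture(n_components=3).fit(X)
    labels = top.predict(X)
    best_mass, best_center = -1.0, None
    for c in range(3):
        Xc = X[labels == c]
        sub = GaussianMixture(n_components=3).fit(Xc)
        for s in range(3):
            mass = top.weights_[c] * sub.weights_[s]   # pi_c * pi_pa(c)
            if mass > best_mass:
                best_mass, best_center = mass, sub.means_[s]
    return best_center  # its top-K nearest vertices are then returned via LSH
```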
3.3.5 Generalization to Projected Centrality
We now generalize the FastMap framework to compute the top-K pertinent vertices in implicit graphs for
different measures of projected centrality. There are several methods to do this. The first method is to
create an explicit graph by factoring out the auxiliary vertices, that is, the explicit graph G^P = (V^P, E^P, w^P) has only the pertinent vertices V^P and is a complete graph on them, where, for any two distinct vertices v^P_i, v^P_j ∈ V^P, the edge (v^P_i, v^P_j) ∈ E^P is annotated with the weight D_G(v^P_i, v^P_j). This is referred to as the All-Pairs Distance (APD) method. For the closeness, harmonic, and the normalized eigenvector centrality measures, D_G(·,·) is the shortest-path distance function. For the current-flow closeness centrality measure,
6
say, the one available in Python3
44
DG(·,·) is the PASPD function. The second method also constructs the explicit graph G
P but computes
the weight on each edge only approximately using differential heuristics [119]. This is referred to as the
Differential Heuristic Distance (DHD) method. The third method is similar to the second, except that it uses
the FastMap heuristics [68] instead of the differential heuristics. This is referred to as the FastMap Distance
(FMD) method.
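For intuition, the APD method in the shortest-path case can be sketched with NetworkX as follows; the function and variable names are illustrative, and the construction is shown only to make the definition of G^P concrete.

```python
# A sketch of the APD method: construct the explicit complete graph G^P on the
# pertinent vertices, weighting each edge by the shortest-path distance in G.
# One single-source Dijkstra computation is run per pertinent vertex.
import networkx as nx

def build_explicit_graph(G, pertinent):
    G_P = nx.Graph()
    G_P.add_nodes_from(pertinent)
    for u in pertinent:
        dist = nx.single_source_dijkstra_path_length(G, u, weight="weight")
        for v in pertinent:
            if v != u and v in dist:
                G_P.add_edge(u, v, weight=dist[v])
    return G_P
```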
The foregoing three methods are inefficient because they construct G^P explicitly by computing the distances between all pairs of pertinent vertices. To avoid this inefficiency, we propose the fourth and the fifth
methods. The fourth method is to directly create the FastMap embedding for all vertices of G but apply the
analytical techniques only to the points corresponding to the pertinent vertices. This is referred to as the
FastMap All-Vertices (FMAV) method. The fifth method is to create the FastMap embedding only for the
pertinent vertices of G and apply the analytical techniques to their corresponding points. This is referred to
as the FastMap Pertinent-Vertices (FMPV) method.
3.4 Experimental Results
We used six datasets in our experiments: DIMACS, wDIMACS, TSP, movingAI, Tree, and SmallWorld. The
DIMACS dataset7
is a standard benchmark dataset of unweighted graphs. We obtained edge-weighted versions of these graphs, constituting our wDIMACS dataset, by assigning an integer weight chosen uniformly
at random from the interval [1,10] to each edge. We also obtained edge-weighted graphs from the TSP (Traveling Salesman Problem) dataset [100] and large unweighted graphs from the movingAI dataset [118].8 In
addition to these benchmark datasets, we synthesized Tree and SmallWorld graphs using the Python library
NetworkX [49]. For the trees, we assigned an integer weight chosen uniformly at random from the interval
[1,10] to each edge. We generated the small-world graphs using the Newman-Watts-Strogatz model [86].
7https://mat.tepper.cmu.edu/COLOR/instances.html
8The movingAI dataset contains grid-world maps with free cells and obstacle cells. Each such map is converted to a graph with
vertices representing the free cells and unweighted edges representing adjacent free cells with a common side. Only those resulting
graphs that are connected are retained for our experiments.
Instance | Size (|V|, |E|) | Closeness: GT, FM, nDCG | Harmonic: GT, FM, nDCG | Current-Flow: GT, FM, nDCG | Eigenvector: GT, FM, nDCG
myciel5 (47, 236) 0.01 0.01 0.8810 0.01 0.08 0.8660 0.00 0.06 0.7108 0.00 0.02 0.4507
games120 (120, 638) 0.06 0.03 0.9619 0.06 0.21 0.9664 0.02 0.12 0.9276 0.01 0.05 0.8335
miles1500 (128, 5198) 0.41 0.09 0.9453 0.42 0.29 0.8888 0.05 0.79 0.9818 0.01 0.12 0.9379
queen16_16 (256, 6320) 1.06 0.12 0.9871 1.07 0.49 0.9581 0.11 0.84 0.9381 0.04 0.14 0.8783
le450_5d (450, 9757) 3.13 0.23 0.9560 3.11 0.91 0.9603 0.30 1.59 0.8648 0.13 0.31 0.7077
myciel4 (23, 71) 0.00 0.01 0.9327 0.00 0.04 0.8299 0.00 0.02 0.7697 0.00 0.02 0.7180
games120 (120, 638) 0.07 0.03 0.8442 0.07 0.21 0.8004 0.02 0.14 0.9032 0.01 0.06 0.6868
miles1000 (128, 3216) 0.28 0.07 0.9427 0.28 0.25 0.8510 0.04 0.47 0.7983 0.03 0.15 0.5904
queen14_14 (196, 4186) 0.56 0.09 0.8866 0.55 0.38 0.8897 0.06 0.73 0.9188 0.03 0.15 0.7399
le450_5c (450, 9803) 3.32 0.24 0.8843 3.27 0.88 0.9203 0.29 2.09 0.8196 0.13 0.29 0.7285
kroA200 (200, 19900) 2.59 0.52 0.9625 2.54 0.50 0.7275 0.14 2.95 0.7589 0.04 0.30 0.7336
pr226 (226, 25425) 4.01 0.51 0.9996 4.06 0.63 0.6803 0.18 3.52 0.7978 0.06 0.36 0.5372
pr264 (264, 34716) 6.87 0.62 0.9911 6.95 0.84 0.6506 0.30 6.30 0.8440 0.07 0.61 0.9816
lin318 (318, 50403) 13.62 1.00 0.9909 13.16 1.19 0.9537 0.43 8.07 0.7243 0.10 0.89 0.7981
pcb442 (442, 97461) 39.97 1.91 0.9984 39.25 2.01 0.9757 0.84 17.77 0.7283 0.21 2.10 0.9445
orz203d (244, 442) 0.11 0.06 0.9975 0.11 0.41 0.9943 0.05 0.13 0.8482 0.04 0.06 1.0000
den404d (358, 632) 0.23 0.08 0.9969 0.23 0.58 0.8879 0.10 0.14 0.9471 0.12 0.06 1.0000
isound1 (2976, 5763) 18.19 0.63 0.9987 18.55 4.86 0.9815 6.18 1.95 0.9701 13.77 0.28 1.0000
lak307d (4706, 9172) 46.74 1.02 0.9996 48.56 7.66 0.9866 15.82 2.90 0.9845 50.07 0.50 1.0000
ht_chantry_n (7408, 13865) 131.30 1.51 0.9969 134.42 12.29 0.9144 37.92 3.64 0.9189 183.27 0.69 0.8694
n0100 (100, 99) 0.01 0.02 0.9171 0.01 0.17 0.8102 0.01 0.05 0.9124 0.01 0.04 0.4799
n0500 (500, 499) 0.34 0.10 0.8861 0.33 0.79 0.6478 0.18 0.16 0.9466 0.21 0.06 0.1748
n1000 (1000, 999) 1.33 0.19 0.9125 1.38 1.63 0.9477 0.70 0.34 0.7292 0.86 0.09 0.2075
n1500 (1500, 1499) 3.00 0.28 0.8856 3.15 2.39 0.6360 1.61 0.41 0.8118 2.25 0.11 0.2140
n2000 (2000, 1999) 5.25 0.37 0.9078 5.56 3.21 0.8925 2.71 0.75 0.9516 4.97 0.15 0.1330
n0100k4p0.3 (100, 262) 0.02 0.03 0.9523 0.03 0.17 0.9308 0.01 0.08 0.9326 0.03 0.05 0.7818
n0500k6p0.3 (500, 1913) 0.83 0.11 0.9340 0.85 0.83 0.8951 0.23 0.51 0.8411 0.16 0.11 0.6807
n1000k4p0.6 (1000, 3192) 3.00 0.24 0.9119 3.07 1.68 0.8975 1.13 0.83 0.8349 0.70 0.22 0.6386
n4000k6p0.6 (4000, 19121) 86.37 1.09 0.9095 85.18 6.70 0.9160 214.99 8.13 0.8054 32.05 0.86 0.6185
n8000k6p0.6 (8000, 38517) 387.25 2.86 0.9240 392.76 14.65 0.9368 474.30 11.08 0.7466 235.38 1.83 0.6376
Table 3.1: Shows results for various measures of centrality. Entries show running times in seconds and
nDCG values.
For the regular measures of centrality, the graphs in the six datasets were used as such. For the projected
measures of centrality, 50% of the vertices in each graph were randomly chosen to be the pertinent vertices. (The choice of the percentage of pertinent vertices need not be 50%. This value is chosen merely for
presenting illustrative results.)
We note that the largest graphs chosen in our experiments have about 18,500 vertices and 215,500
edges.9 Although FastMap itself runs in near-linear time and scales to much larger graphs, some of the
baseline methods used for comparison in Tables 3.1 and 3.2 are impeded by such large graphs. Nonetheless,
our choice of problem instances and the experimental results on them illustrate the important trends in the
effectiveness of our approach.
9Tables 3.1 and 3.2 show only representative instances that may not match these numbers.
Instance | Running Time (s): APD, DHD, FMD, ADT, FMAV, FMPV | nDCG: DHD, FMD, ADT, FMAV, FMPV
Closeness
queen16_16 0.55 0.32 0.12 0.05 0.12 0.10 0.9827 0.9674 0.9839 0.9707 0.9789
le450_5d 1.58 0.98 0.29 0.09 0.27 0.26 0.9576 0.9621 0.9872 0.9577 0.9521
queen14_14 0.29 0.19 0.08 0.03 0.08 0.09 0.9274 0.8996 0.9639 0.9377 0.8898
le450_5c 1.71 1.00 0.32 0.09 0.28 0.23 0.9169 0.9231 0.9700 0.8959 0.8798
lin318 7.01 0.68 0.50 0.45 0.97 1.05 0.9285 1.0000 0.9645 0.9867 1.0000
pcb442 18.03 1.31 0.93 0.84 2.26 1.97 0.9455 1.0000 0.9663 0.9969 0.9950
lak307d 24.78 116.69 25.85 1.98 0.36 0.35 0.8991 0.9928 0.9387 0.9994 0.9960
ht_chantry_n 67.59 266.07 58.18 4.88 0.47 0.45 0.8956 0.9952 0.9522 0.9879 0.9879
n1500 1.71 10.87 2.35 0.19 0.06 0.06 0.7990 0.9485 0.9830 0.7759 0.7758
n2000 2.96 19.45 4.32 0.35 0.08 0.09 0.8187 0.9691 0.9720 0.9284 0.9229
n4000k6p0.6 44.54 78.68 16.96 1.37 0.84 0.68 0.9403 0.9172 0.9582 0.9146 0.9299
n8000k6p0.6 207.24 319.56 67.23 5.39 1.37 1.58 0.9432 0.9376 0.9522 0.9274 0.9296
Harmonic
queen16_16 0.54 0.31 0.12 0.00 0.40 0.32 0.9730 0.9602 0.9729 0.9466 0.9730
le450_5d 1.59 0.95 0.31 0.00 0.65 0.52 0.9467 0.9499 0.9901 0.9447 0.9578
queen14_14 0.31 0.18 0.08 0.00 0.29 0.22 0.9235 0.8912 0.9903 0.9381 0.8767
le450_5c 1.74 0.98 0.30 0.00 0.65 0.51 0.8905 0.8694 0.9962 0.8485 0.8866
lin318 6.93 0.72 0.52 0.00 1.22 1.06 0.9922 1.0000 0.8974 0.8128 0.8289
pcb442 20.79 1.39 1.04 0.01 2.07 1.85 0.9924 1.0000 0.9859 0.9274 0.9755
lak307d 24.76 105.98 23.11 0.16 4.09 3.94 0.9265 0.9915 0.9908 0.9819 0.9762
ht_chantry_n 65.83 262.83 58.63 0.37 6.34 6.37 0.7549 0.6484 0.9947 0.8909 0.9925
n1500 1.74 10.83 2.41 0.01 1.33 1.22 0.6280 0.7079 0.9858 0.6575 0.7469
n2000 2.95 19.09 4.15 0.01 1.71 1.61 0.6695 0.6647 0.9354 0.9227 0.9252
n4000k6p0.6 44.53 78.08 17.17 0.10 3.74 3.76 0.9252 0.9083 0.9971 0.9163 0.9072
n8000k6p0.6 215.14 314.85 67.36 0.43 7.94 7.39 0.9351 0.9111 0.9999 0.9074 0.9439
Current-Flow
queen16_16 5.92 0.63 0.72 - 1.21 1.26 0.9638 0.9542 - 0.9690 0.9683
le450_5d 17.50 1.47 1.45 - 2.02 2.05 0.9463 0.9678 - 0.9529 0.9632
queen14_14 3.25 0.38 0.59 - 1.29 1.16 0.9243 0.8916 - 0.8867 0.8946
le450_5c 30.90 2.03 2.58 - 5.55 5.03 0.9275 0.9372 - 0.8832 0.8778
lin318 61.02 3.29 5.22 - 11.69 10.99 0.8995 1.0000 - 0.9993 0.9990
pcb442 180.43 6.96 12.75 - 24.14 23.77 0.9781 1.0000 - 0.9997 0.9997
lak307d 265.38 105.91 24.77 - 3.28 2.71 0.6864 0.6158 - 0.6800 0.6316
ht_chantry_n 499.27 260.40 58.07 - 1.88 3.54 0.4634 0.4568 - 0.4314 0.4997
n1500 5.70 10.62 2.45 - 0.55 0.30 0.6731 0.6820 - 0.6436 0.6487
n2000 10.01 19.05 4.21 - 0.58 0.63 0.6689 0.6847 - 0.6487 0.6468
n4000k6p0.6 448.20 76.73 19.42 - 7.74 7.65 0.9218 0.9130 - 0.9320 0.9126
n8000k6p0.6 2014.92 306.53 73.18 - 16.04 14.40 0.9325 0.9174 - 0.9189 0.9229
Eigenvector
queen16_16 0.56 0.33 0.13 - 0.12 0.13 0.9536 0.9541 - 0.9576 0.9383
le450_5d 1.61 1.01 0.32 - 0.29 0.21 0.9569 0.9469 - 0.9410 0.9474
queen14_14 0.30 0.19 0.08 - 0.13 0.14 0.9270 0.8715 - 0.8630 0.9077
le450_5c 1.71 1.03 0.35 - 0.27 0.25 0.8920 0.8864 - 0.9322 0.8841
lin318 6.94 0.71 0.50 - 0.78 0.88 0.9276 1.0000 - 0.8120 0.7834
pcb442 17.90 1.36 1.05 - 1.85 1.65 0.9823 1.0000 - 0.9331 0.9626
lak307d 31.23 112.31 30.54 - 0.39 0.57 0.8147 0.9783 - 0.9045 0.9728
ht_chantry_n 87.72 287.02 79.17 - 0.61 0.73 0.5739 0.6980 - 0.7092 0.7838
n1500 1.89 11.47 2.67 - 0.10 0.11 0.4143 0.5353 - 0.5712 0.5383
n2000 3.35 20.17 4.81 - 0.14 0.12 0.5533 0.6402 - 0.5640 0.5308
n4000k6p0.6 45.17 82.48 20.50 - 0.81 0.69 0.9272 0.8970 - 0.9129 0.9114
n8000k6p0.6 234.65 344.21 94.71 - 2.51 1.45 0.9473 0.9209 - 0.9294 0.9286
Table 3.2: Shows results for various measures of projected centrality. Entries show running times in seconds
and nDCG values.
We used two metrics for evaluation: the normalized Discounted Cumulative Gain (nDCG), and the
running time. The nDCG [124] is a standard measure of the effectiveness of a ranking system. Here, it
is used to compare the (projected) centrality values of the top-K vertices returned by an algorithm against
the (projected) centrality values of the top-K vertices in the ground truth (GT). The nDCG value is in the
interval [0,1], with higher values representing better results, that is, closer to the GT. We set K = 10. All
experiments were done on a laptop with a 3.1 GHz Quad-Core Intel Core i7 processor and 16 GB LPDDR3
memory. We implemented FastMap in Python3 and set κ = 4.
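For reference, the nDCG computation over the top-K vertices can be sketched as follows, assuming the standard logarithmic discount and using the ground-truth centrality values as gains; the names are illustrative.

```python
# A sketch of the nDCG metric over the top-K vertices. `centrality` maps each
# vertex to its ground-truth centrality value; `predicted_top_k` is the ranked
# list of vertices returned by an algorithm.
import math

def ndcg_at_k(predicted_top_k, centrality, K=10):
    dcg = sum(centrality[v] / math.log2(i + 2)
              for i, v in enumerate(predicted_top_k[:K]))
    ideal = sorted(centrality.values(), reverse=True)[:K]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```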
Table 3.1 shows the performance of our FastMap (FM) framework against standard baseline algorithms
that produce the GT for various measures of centrality. For the closeness, harmonic, and the current-flow
closeness measures, the standard baseline algorithms are available in NetworkX.10 For the normalized eigenvector measure, a standard baseline algorithm can be implemented using matrix computations. The rows of
the table are divided into six blocks corresponding to the six datasets in the order: DIMACS, wDIMACS,
TSP, movingAI, Tree, and SmallWorld. For illustration, only five representative instances are shown in each
block. For the closeness, harmonic, and the current-flow closeness measures, we observe that FM produces
high-quality solutions. For the normalized eigenvector measure, FM generally produces good-quality solutions, with only occasional poor results. For all measures of centrality, FM is significantly faster than the
standard baseline algorithms on large instances.
Table 3.2 shows the performances of APD, DHD, FMD, FMAV, and FMPV for various measures of
projected centrality. An additional column, called “Adapted” (ADT), is introduced for the closeness and
the harmonic measures of projected centrality. For the closeness and the harmonic measures, ADT refers
to our intelligent adaptations of state-of-the-art algorithms, presented in [20] and [9], respectively, to the
projected case. The rows of the table are divided into four blocks corresponding to the four measures of
projected centrality. For illustration, only twelve representative instances are shown in each block: the
largest two from each block of Table 3.1. The nDCG values for DHD, FMD, ADT, FMAV, and FMPV
are computed against the GT produced by APD. We observe that all our algorithms produce high-quality
solutions for the various measures of projected centrality. While the success of ADT is attributed to the
intelligent adaptations of two separate algorithms, the success of FMAV and FMPV is attributed to the
power of appropriate analytical techniques used in the same FastMap framework. The success of DHD and
10https://networkx.org/documentation/stable/reference/algorithms/centrality.html
FMD is attributed to their ability to closely approximate the all-pairs distances. We also observe that FMAV
and FMPV are significantly more efficient than APD, DHD, and FMD since they avoid the construction of
explicit graphs on the pertinent vertices. For the same reason, ADT is also efficient when applicable.
For all measures of centrality and projected centrality considered in this chapter, Tables 3.1 and 3.2
demonstrate that our FastMap approach is viable as a unified framework for leveraging the power of analytical techniques. This is in contrast to the nature of other existing algorithms that are tied to certain measures
of centrality and have to be generalized to the projected case separately.
3.5 Conclusions
In this chapter, we generalized various measures of centrality on explicit graphs to corresponding measures
of projected centrality on implicit graphs. Computing the top-K pertinent vertices with the highest projected
centrality values is not always easy for large graphs. To address this challenge, we proposed a unifying
framework based on FastMap, exploiting its ability to embed a given undirected graph into a Euclidean
space in near-linear time such that the pairwise Euclidean distances between vertices approximate a desired graph-based distance function between them. We designed different distance functions for different
measures of projected centrality and invoked various procedures for computing analytical solutions in the
resulting FastMap embedding. We also coupled FastMap with LSH to interpret analytical solutions found
in the FastMap embedding back in the graphical space. Overall, we experimentally demonstrated that the
FastMap+LSH framework is both efficient and effective for many popular measures of centrality and their
generalizations to projected centrality.
Unlike other methods, our FastMap framework is not tied to a specific measure of projected centrality.
This is because its power stems from its ability to transform a graph-theoretic problem into Euclidean space
in only near-linear time close to that of merely reading the input. Consequently, it delegates the combinatorics tied to any given measure of projected centrality to various kinds of analytical techniques that are
better equipped for efficiently absorbing large input sizes.
Appendix
3.A Table of Notations
Notation Description
K Top-K vertices with the highest centrality values.
G = (V,E,w) An undirected edge-weighted graph, where V is the set of vertices, E is the set of
edges, and for any edge e ∈ E, w(e) is the non-negative weight on it.
G = (V,E,w) An implicit graph, where V^P ⊆ V and V^A ⊆ V, for V^P ∪ V^A = V and V^P ∩ V^A = ∅, are the pertinent vertices and the auxiliary vertices, respectively.
G^P = (V^P, E^P, w^P) An explicit graph constructed from G = (V,E,w), where, for any two distinct vertices v^P_i, v^P_j ∈ V^P, the edge (v^P_i, v^P_j) ∈ E^P is annotated with the weight w((v^P_i, v^P_j)) = DG(v^P_i, v^P_j).
DG(·,·) A distance function that can be the shortest-path distance function or any other
domain-specific distance function.
M Any measure of centrality defined on explicit graphs.
MP A measure of projected centrality defined on implicit graphs equivalent to M
defined on explicit graphs.
dG(u, v) The shortest-path distance between two distinct vertices u, v ∈ V.
Cclo(v) The closeness centrality value of v ∈ V.
κ The user-specified number of dimensions of the FastMap embedding.
Char(v) The harmonic centrality value of v ∈ V.
Cc f c(v) The current-flow closeness centrality value of v ∈ V.
RG(u, v) The effective resistance between two vertices u, v ∈ V.
A The adjacency matrix of an undirected unweighted graph G(V,E), where Aij is set to 1 if (vi, vj) ∈ E and to 0 otherwise.
N A row-stochastic matrix that uses a row-wise normalization of the entries of A.
Table 3.A.1: Describes the notations used in Chapter 3.
Chapter 4
FastMap for Community Detection and Block Modeling
Community detection and the more general block modeling algorithms are used to discover important latent
structures in graphs. They are the graph equivalent of clustering algorithms. However, existing community
detection and block modeling algorithms work directly on the given graphs, making them computationally
expensive and less effective on large complex graphs. In this chapter, we propose a FastMap-based block
modeling algorithm, FMBM, on single-view undirected unweighted graphs. In the first phase, FMBM uses
FastMap with the PASPD function between vertices: Our novel PASPD function is explained in this chapter.
In the second phase, it uses GMMs for identifying clusters (blocks) in the resulting Euclidean space. We
show that FMBM outperforms other state-of-the-art methods on many benchmark and synthetic instances,
in terms of both efficiency and solution quality. It also enables a perspicuous visualization of the clusters
(blocks) in the graphs, not provided by other methods.
4.1 Introduction
Finding inherent groups in graphs, that is, the “graph” clustering problem, has important applications in
many real-world domains, such as identifying communities in social networks [42], analyzing the diffusion
of ideas in them [69], identifying functional modules in protein-protein interactions [65], and understanding
(a) a core-periphery graph in the airport domain (b) a FastMap embedding of the graph on the left
Figure 4.1: Shows a core-periphery graph in an airport domain with its FastMap embedding. (a) shows a
core-periphery graph in the airport domain with edges representing flight connections, red vertices representing “hub” airports at the core, and blue vertices representing “local” airports at the periphery. (b) shows
a FastMap embedding of the graph in Euclidean space, in which the red and blue vertices correctly appear
in the core and periphery, respectively.
the modular design of brain networks [3]. Identifying the groups involves mapping each vertex in the graph
to a group (cluster), where vertices in the same group share important properties in the underlying graph.
The conditions under which two vertices are deemed to be similar and therefore belonging to the same
group are popularly studied in community detection and block modeling [1]. In community detection, a
group (community) implicitly requires its vertices to be more connected to each other than to vertices of
other groups. Although this is justified in many real-world domains, such as social networks, it is not
always justified in general.
Block modeling uses more general criteria for identifying groups (blocks) where community detection
fails. For example, block modeling can be used to correctly identify groups in core-periphery graphs characterized by a core of vertices tightly connected to each other and a peripheral set of vertices loosely connected
to each other but well connected to the core.1 Core-periphery graphs are common in many real-world domains, such as financial networks and flight networks [1, 99]. Figure 4.1a shows a core-periphery graph in
an air transportation domain [23].
Existing community detection and block modeling algorithms work directly on the given graphs and
are mostly inefficient. Block modeling algorithms typically use matrix operations that incur cubic time
1The conditions used in community detection prevent the proper identification of peripheral groups.
complexities even within their inner loops. For example, FactorBlock [19], a state-of-the-art block modeling
algorithm, uses matrix multiplications in its inner loop and an expectation-maximization-style outer loop.
Due to their inefficiency, existing block modeling algorithms are not scalable and result in poor solution
qualities on large complex graphs.
In this chapter, we propose a FastMap-based algorithm for block modeling on single-view2 undirected
unweighted graphs. We also propose a novel distance function that probabilistically amplifies the shortest-path distances between vertices. We refer to this distance function as the PASPD function.
Our FastMap-based block modeling algorithm (FMBM) works in two phases. In the first phase,
FMBM uses FastMap to efficiently embed the vertices of a given graph in a Euclidean space, preserving
the probabilistically-amplified shortest-path distances between them. In the second phase, FMBM identifies clusters (blocks) in the resulting Euclidean space using standard methods from unsupervised learning.
Therefore, the first phase of FMBM efficiently reformulates the block modeling problem from a graphical space to a Euclidean space, as illustrated in Figure 4.1b; and the second phase of FMBM leverages
any technique that is already known or can be developed for clustering in Euclidean space. In our current
implementation of FMBM, we use GMMs for identifying clusters in the Euclidean space.
We empirically show that, in addition to the theoretical advantages of FMBM, it outperforms other state-of-the-art methods on many benchmark and synthetic test cases. We report on the superior performance
of FMBM in terms of both efficiency and solution quality. We also show that it enables a perspicuous
visualization of clusters in the graphs, beyond the capabilities of other methods.
2In a single-view graph, there is at most one edge between any two vertices.
4.2 Preliminaries of Block Modeling
In this section, we review some preliminaries of block modeling. Let G = (V,E) be an undirected unweighted graph with vertices V = {v1, v2 ... vn} and edges E = {e1, e2 ... em} ⊆ V × V. Let A ∈ {0,1}^{n×n} be
the adjacency matrix representation of G, where Aij = 1 iff (vi, vj) ∈ E.
A block model decomposes G into a set of k vertex partitions representing the blocks (groups), for a
given value of k. The partitions are represented by the membership matrix C ∈ {0,1}^{n×k}, where Cij = 0
and Cij = 1 represent vertex vi being absent from and being present in partition j, respectively. Therefore,
∑_{j=1}^{k} Cij = 1 for all 1 ≤ i ≤ n. An image matrix is a matrix M ∈ [0,1]^{k×k}, where Mij represents the likelihood
of an edge between a vertex in partition i and a vertex in partition j. The block model decomposition of G,
as discussed in [19], tries to approximate A by CMC⊤ with the best choice for C and M. In other words,
the objective is:

min_{C,M} ∥A − CMC⊤∥²_F,   (4.1)

where ∥·∥F is the Frobenius norm. An improved objective function is also considered in [19] to account for
the imbalance of edges to non-edges3 in A, since real-world graphs are typically sparse with significantly
more non-edges than edges. The revised objective is:

min_{C,M} ∥(A − CMC⊤) ◦ (A − R)∥²_F,   (4.2)

where R ∈ [0,1]^{n×n}, Rij = m/n², and ◦ represents element-wise multiplication.
The above formalization can be generalized to directed graphs and multi-view graphs [99]. It can also
be generalized to soft partitioning, where each vertex partially belongs to each partition, that is, C ∈ [0,1]^{n×k}
with ∑_{j=1}^{k} Cij = 1 for all 1 ≤ i ≤ n.
3that is, a pair of vertices not connected by an edge
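As a concrete reference, the objective in Equation 4.2 can be evaluated directly with NumPy, as in the sketch below; this dense formulation is shown only to make the definition explicit and is not efficient on large graphs.

```python
# A sketch that evaluates the objective in Equation 4.2 for given A, C, and M.
# A is the n x n adjacency matrix, C is the n x k membership matrix, and M is
# the k x k image matrix. Shown for clarity only; not efficient on large graphs.
import numpy as np

def block_model_objective(A, C, M):
    n = A.shape[0]
    m = int(A.sum()) // 2                    # number of undirected edges
    R = np.full((n, n), m / (n * n))         # R_ij = m / n^2
    residual = (A - C @ M @ C.T) * (A - R)   # element-wise (Hadamard) product
    return np.linalg.norm(residual, "fro") ** 2
```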
Figure 4.2: Shows two simple graphs that guide the design of a proper FastMap distance function for block
modeling. (a) shows a fully-connected bipartite graph with the red and blue vertices indicating the two
partitions. (b) shows a core-periphery graph with the red vertices indicating the core and the blue vertices
indicating the periphery. All pairs of red vertices are connected by edges (not all shown to avoid clutter).
(c) and (d) show the FastMap Euclidean embeddings of the graphs in (a) and (b), respectively, using the
shortest-path distance function. This naive FastMap distance function fails for block modeling. Red and
blue points correspond to red and blue vertices of the graphs, respectively. Many vertices are mapped to the
same point. (e) and (f) show the FastMap Euclidean embeddings of the graphs in (a) and (b), respectively,
when using the PASPD function. This FastMap distance function is appropriate for block modeling.
4.3 The FastMap-Based Block Modeling Algorithm (FMBM)
In this section, we describe FMBM, our novel algorithm for block modeling based on FastMap [68]. As
mentioned before, FMBM works in two phases. In the first phase, FMBM uses FastMap to efficiently embed vertices in a κ-dimensional Euclidean space, preserving the probabilistically-amplified shortest-path
distances between them. In the second phase, FMBM identifies the required blocks in the resulting Euclidean space using GMM clustering.
To facilitate the description of FMBM, we first examine what happens when FastMap is used naively
in the first phase, that is, when it is used to embed the vertices of a given undirected unweighted graph in
a κ-dimensional Euclidean space for preserving the pairwise shortest-path distances. This naive attempt
fails even in relatively simple cases. For example, Figures 4.2a-4.2d show that it fails on a bipartite graph
and a core-periphery graph. This is so because preserving the pairwise shortest-path distances in Euclidean
space does not necessarily help GMM clustering to identify the two blocks (partitions). In fact, in a bipartite
graph, the closest neighbors of a vertex are in the other partition.
4.3.1 Probabilistically-Amplified Shortest-Path Distances
From the foregoing discussion, it is clear that the shortest-path distance between two vertices vi and vj is not a
viable distance function for block modeling. Therefore, in this subsection, we create a new distance function
D(vi, vj) for pairs of vertices based on the following intuitive guidelines: (a) The smaller the shortest-path
distance between vi and vj, the smaller the distance D(vi, vj) should be; (b) The more paths exist between vi
and vj, the smaller the distance D(vi, vj) should be; and (c) The complement graph4 G¯ of the given graph G
should yield the same distance function as G: The distance function should be independent of the arbitrary
choice of representing a relationship between two vertices as either an edge or a non-edge. Intuitively,
these guidelines capture an effective “resistance” between vertices and facilitate the subsequent embedding
to represent relative “potentials” of vertices in Euclidean space. The effectiveness of these guidelines is
validated through test cases in this section and comprehensive experiments in the next section.
In adherence with these guidelines, we design the novel PASPD function DP(vi, vj) as follows:

DP(vi, vj) = ∑_{G ∈ Gset} dG(vi, vj).   (4.3)

Here, dG(vi, vj) represents the shortest-path distance between vi and vj in an undirected graph G. Gset
represents a collection of undirected graphs, each derived from either the given graph G or its complement
G¯. In particular, each graph in Gset is an edge-induced subgraph of either G or G¯.5 The edge-induced
subgraphs are created by probabilistically dropping edges from G or G¯.
4The complement graph G¯ has the same vertices as the original graph G but represents every edge in G as a non-edge and every non-edge in G as an edge.
Intuitively, the use of shortest-path distances on multiple graphs that are probabilistically derived from
the same input graph G accounts for DP(·,·). Indeed, the smaller dG(vi, vj), the smaller DP(vi, vj) also
is. Similarly, the more paths between vi and vj in G, the more likely it is for such paths to survive in its
edge-induced subgraphs, and the smaller DP(vi, vj) consequently is. Moreover, since the subgraphs in Gset
are derived from both G and G¯, DP(·,·) satisfies all the intuitive guidelines mentioned above. From an
efficiency perspective, the use of multiple graphs does not create much overhead if the number of graphs
does not depend on the size of G. However, G¯ can have significantly more edges than G if G is sparse.
In such cases, if G has n vertices and fewer than half of the n(n−1)/2 possible edges, G¯ itself is probabilistically
derived from G by randomly retaining only m out of the n(n−1)/2 − m edges that it would otherwise have.
This keeps the size of G¯ upper-bounded by the size of the input.
Although more details on FMBM are presented in the next subsection, the benefits of using the PASPD
function are visually apparent in Figures 4.2e and 4.2f. In both cases, the red and blue vertices are mapped
to linearly-separable red and blue points, respectively, in Euclidean space. Its benefits can also be seen
in Figure 4.1b, where the core red vertices are mapped to a core set of red points and the peripheral blue
vertices are mapped to a peripheral set of blue points, respectively, in Euclidean space. In this case, although
the red and blue points are not linearly separable, GMM clustering [83] in the second phase of FMBM is
capable of separating them using two overlapping but different Gaussian distributions.
4.3.2 Main Algorithm
Algorithm 3 shows the pseudocode for computing the PASPD function DP(·,·) parameterized by L and
F. Like the shortest-path distance function, it, too, can be computed efficiently (in one shot) for all pairs
5An edge-induced subgraph of G has the same vertices as G but a subset of its edges.
Algorithm 3: SS-PASPD: An algorithm for computing the single-source probabilistically-amplified shortest-path distances.
Input: G = (V,E) and vs ∈ V
Parameters: L and F
Output: dsi for each vi ∈ V
1: Let G¯ = (V,E¯) be the complement graph of G.
2: Gset ← {} and Tset ← {}.
3: for l = 1,2...L do
4:   G′ ← G and G¯′ ← G¯.
5:   if |E¯| > |E| then
6:     Drop |E¯| − |E| randomly chosen edges from G¯′.
7:   end if
8:   Gset ← Gset ∪ {G′} and f ← |E|/F.
9:   while G′ has edges do
10:     Drop f randomly chosen edges from G′ to obtain Gˆ.
11:     Gset ← Gset ∪ {Gˆ}.
12:     G′ ← Gˆ.
13:   end while
14:   Repeat lines 8-13 for G¯′.
15: end for
16: for Gi ∈ Gset do
17:   Ti ← SS-ShortestPathDistance(Gi, vs).
18:   Tset ← Tset ∪ {Ti}.
19: end for
20: for each vj ∈ V do
21:   dsj ← ∑Ti∈Tset Ti(vj).
22: end for
23: return dsi for each vi ∈ V.
Algorithm 4: FMBM: A FastMap-based block modeling algorithm.
Input: G = (V,E) and k
Parameters: L, F, T, κ, and ε
Output: ci for each vi ∈ V
1: objmin ← ∞ and Cbest ← ∅.
2: for t = 1,2...T do
3:   P ← FastMap(G, κ, ε).
4:   C ← GMM(P, k).
5:   obj ← GetObjectiveValue(G, C).
6:   if obj ≤ objmin then
7:     objmin ← obj.
8:     Cbest ← C.
9:   end if
10: end for
11: return ci for each vi ∈ V according to Cbest.
(vs, vi), for a specified source vs and all vi ∈ V. On Lines 3-15, the algorithm populates Gset with L lineages
of F nested edge-induced subgraphs of G and G¯. On Lines 5-7, the algorithm constructs the complement
graph G¯ but probabilistically retains at most |E| of its edges. On Lines 16-23, it uses the single-source
shortest-path distance function to compute and return the sum of the shortest-path distances from vs to vi in
all G ∈ Gset, for all vi ∈ V. If vs and vi are disconnected in any graph G ∈ Gset, dG(vs, vi) is technically equal
to ∞. However, for practical reasons in such cases, dG(vs, vi) is set to twice the maximum shortest-path
distance from vs to any other vertex connected to it in G. Ti on Line 17 refers to the array of shortest-path
distances from vs in Gi. Ti(vj) on Line 21 is the array element that corresponds to vertex vj.
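A minimal Python sketch along the lines of Algorithm 3, written with NetworkX for unweighted graphs, is given below; it follows the conventions described above, but some choices (for instance, exactly how f is derived from |E| and F) are simplified, and the function and variable names are illustrative only.

```python
# A sketch of SS-PASPD following Algorithm 3: build L lineages of nested
# edge-induced subgraphs of G and its (size-bounded) complement, then sum the
# single-source shortest-path distances from vs over all of them. Disconnected
# vertices receive twice the maximum finite distance, as described above.
import random
import networkx as nx

def ss_paspd(G, vs, L=4, F=10, seed=0):
    rng = random.Random(seed)
    G_bar = nx.complement(G)
    G_set = []
    for _ in range(L):
        for base in (G.copy(), G_bar.copy()):
            H = base
            excess = H.number_of_edges() - G.number_of_edges()
            if excess > 0:  # keep at most |E| edges of the complement graph
                H.remove_edges_from(rng.sample(list(H.edges()), excess))
            f = max(1, H.number_of_edges() // F)
            while H.number_of_edges() > 0:
                G_set.append(H.copy())
                drop = rng.sample(list(H.edges()), min(f, H.number_of_edges()))
                H.remove_edges_from(drop)
    dist = {v: 0.0 for v in G.nodes()}
    for H in G_set:
        T = nx.single_source_shortest_path_length(H, vs)
        cap = 2 * max(T.values()) if len(T) > 1 else 1.0
        for v in G.nodes():
            dist[v] += T.get(v, cap)
    return dist
```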
Algorithm 4 shows the pseudocode for FMBM. On Line 3, it essentially implements FastMap as described
in Algorithm 2 (from Chapter 1) but calls SS-PASPD(·,·) in Algorithm 3 instead of the regular
single-source shortest-path distance function. L and F are simply passed to Algorithm 3 in the function call
SS-PASPD(·,·). Because Algorithm 4 employs randomization, it qualifies as a Monte Carlo algorithm. It
implements an outer loop to boost the performance of FMBM using T independent trials. On Lines 3-9,
each trial invokes the GMM clustering algorithm and evaluates the results on the objective function in
Equation 4.2,6 keeping record of the best value. The results of the best trial, that is, the block assignment ci
for each vi ∈ V, are returned on Line 11.7
A formal time complexity analysis of FMBM is evasive since Line 4 of Algorithm 4 calls the GMM
clustering procedure, which has no defined time complexity. Therefore, we only claim to be able to reformulate
the block modeling problem on graphs to its Euclidean version in O(LFκ(|E| + |V|log|V|)) time in each
of the T iterations. Here, the factor LF comes from the cardinality of Gset in Algorithm 3, and the factor
κ(|E| + |V|log|V|) comes from the complexity of FastMap that uses SS-PASPD(·,·) on Line 3 of Algorithm 4.
The time complexity of GetObjectiveValue(·,·) on Line 5 is technically O(|V|²k + |V|k²), where k
is the user-specified number of blocks, also passed to the GMM clustering algorithm. This time complexity
6M can be computed from A and C in O(|E| + k²) time while evaluating the objective function in Equation 4.2.
7The domain of each ci is {1,2...k}. Block Bh refers to the collection of all vertices vi ∈ V such that ci = h.
Test Case | Size (|V|, |E|) | FMBM: Objective, NMI, Time | Graph-Tool: Objective, NMI, Time | DANMF: Objective, NMI, Time | CPLNS: Objective, NMI, Time
adjnoun (112, 425) 616.86 0.0025 6.11 612.98 0.2978 0.04 636.75 0.0083 1.62 591.76 0.0154 1.51
baboons (14, 23) 11.97 0.0158 0.54 11.49 0.2244 0.00 15.49 0.1341 0.97 12.81 0.0172 0.87
football (115, 613) 665.97 0.5608 9.22 343.32 0.9150 0.03 863.91 0.2574 1.55 558.94 0.6991 83.33
karate (34, 78) 74.66 0.6127 1.47 64.67 0.2512 0.00 81.94 0.1672 0.77 75.43 0.2228 1.06
polblogs (1490, 16715) 98788.53 0.0098 239.33 99014.21 0.4668 2.14 101195.89 0.0465 404.02 95859.73 0.0543 506.29
polbooks (105, 441) 522.33 0.5329 6.20 496.02 0.5462 0.02 590.20 0.3177 1.98 531.48 0.2073 2.09
Table 4.1: Shows the comparative results on real-world single-view undirected graphs.
comes from the matrix multiplication CMC⊤ in Equation 4.2. The factor |V|² in this matrix multiplication,
and more generally in Equation 4.2, can be reduced to O(|E|) by evaluating |E| entries corresponding to
edges and min(|E|, |V|(|V|−1)/2 − |E|) randomly chosen entries corresponding to non-edges in the matrix
expression (A − CMC⊤) ◦ (A − R). The matrix multiplication CM takes O(|V|k²) time and results in a |V| × k
matrix. |E| + min(|E|, |V|(|V|−1)/2 − |E|) entries in the multiplication of this matrix with C⊤ can be computed in
O(|E|k) time. Overall, therefore, the reformulation to Euclidean space can be done in near-linear time, that
is, linear in |V| and |E|, after ignoring logarithmic factors.
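The sparse evaluation strategy just described can be sketched as follows, assuming the edges and an equally sized sample of non-edges are supplied as index pairs; the function and variable names are illustrative.

```python
# A sketch of the sparse evaluation of Equation 4.2: only the entries that
# correspond to edges and a random sample of non-edges are evaluated, which
# avoids the |V|^2 factor. C and M are NumPy arrays; n and m are |V| and |E|.
import numpy as np

def sparse_objective(edges, nonedge_sample, C, M, n, m):
    CM = C @ M                      # |V| x k matrix, O(|V| k^2) time
    r = m / (n * n)                 # the constant entry R_ij
    total = 0.0
    for i, j in edges:              # entries with A_ij = 1
        approx = CM[i] @ C[j]       # (C M C^T)_ij in O(k) time
        total += ((1.0 - approx) * (1.0 - r)) ** 2
    for i, j in nonedge_sample:     # sampled entries with A_ij = 0
        approx = CM[i] @ C[j]
        total += (approx * r) ** 2
    return total
```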
4.4 Experimental Results
In this section, we present empirical results on the comparative performances of FMBM and three other
state-of-the-art solvers for block modeling: Graph-Tool, DANMF, and CPLNS. We also compared against
two other solvers for block modeling: FactorBlock [19] and ASBlock [99]. However, they are not competitive with the other solvers; and we exclude them from Tables 4.1, 4.2, 4.3, 4.4, and 4.5 to save column
space. Graph-Tool [93] uses an agglomerative multi-level Markov Chain Monte Carlo algorithm and has
been largely ignored in the computer science literature on block modeling; DANMF [132] uses deep autoencoders; and CPLNS [76] uses constraint programming with large neighborhood search.
We used the following hyperparameter values for FMBM: L = 4, F = 10, T = 10, κ = 4, and ε = 10⁻⁴.
These values are only important as ballpark estimates. We observed that the performance of FMBM often
stays stable within broad ranges of hyperparameter values, imparting robustness to FMBM. Moreover, only
a few different hyperparameter settings had to be examined to determine the best one. The value of k,
Test Case | Size (|V|, |E|) | FMBM: Objective, NMI, Time | Graph-Tool: Objective, NMI, Time | DANMF: Objective, NMI, Time | CPLNS: Objective, NMI, Time
adjnoun (112, 5791) 611.04 0.0048 25.34 636.34 0.0168 0.41 641.40 0.0000 8.34 591.54 0.0169 1.85
baboons (14, 68) 12.64 0.0547 0.62 12.86 0.0416 0.01 15.46 0.0500 1.23 13.35 0.0316 0.86
football (115, 5944) 595.52 0.5899 27.53 344.38 0.9111 0.17 815.71 0.2229 9.56 525.54 0.7040 82.11
karate (34, 483) 72.73 0.7625 2.46 77.31 0.2065 0.02 84.23 0.0914 1.78 75.00 0.2439 1.04
polblogs (1490, 1094951) 26896.52 0.0153 3155.33 26048.04 0.0454 49.90 - - > 1 hour 25871.42 0.0541 470.48
polbooks (105, 5019) 509.88 0.5409 21.55 606.96 0.0867 0.13 631.77 0.0141 8.09 531.65 0.2056 2.15
Table 4.2: Shows the comparative results on the complement graphs of the graphs in Table 4.1.
Test Case | Size (|E|) | FMBM: Objective, NMI, Time | Graph-Tool: Objective, NMI, Time | DANMF: Objective, NMI, Time | CPLNS: Objective, NMI, Time
V0400b04 6722 9355.91 0.1622 83.22 8385.61 0.6565 1.49 9465.48 0.0111 10.88 9400.46 0.0534 39.74
V0800b04 14723 24201.35 0.1775 199.14 22848.79 0.6599 3.99 24387.24 0.0019 62.50 24303.43 0.0394 195.08
V1600b04 25103 45849.52 0.2357 391.92 44598.02 0.9667 7.04 46379.74 0.0043 753.62 46292.59 0.0206 1018.54
V3200b04 70973 134217.82 0.0348 1376.41 131751.34 0.6654 108.08 - - > 1 hour - - > 1 hour
V0400b10 3246 5461.99 0.1217 40.9 4728.69 0.8542 0.63 5489.8 0.0509 7.25 5364.07 0.1289 101.01
V0800b10 7499 13623.69 0.0425 108.69 12612.44 0.9596 2.01 13636.29 0.0156 47.49 13485.93 0.0734 423.03
V1600b10 15118 28782.35 0.0691 272.65 27828.70 0.8556 4.82 28829.58 0.0117 537.27 28682.31 0.0384 2019.76
V3200b10 36170 70292.36 0.0369 782.44 68653.51 0.9173 17.62 70315.35 0.0074 3272.65 - - > 1 hour
V0400b20 2297 4048.03 0.1639 29.66 3632.92 0.6256 0.54 4064.77 0.1859 8.03 - - > 1 hour
V0800b20 5049 9451.30 0.0848 72.92 8960.16 0.5857 1.90 9460.80 0.0828 62.32 - - > 1 hour
V1600b20 11575 22305.01 0.0457 251.06 21591.83 0.6718 3.91 22315.63 0.0445 444.49 - - > 1 hour
V3200b20 24639 48321.14 0.0212 579.90 47650.73 0.6067 12.89 - - > 1 hour - - > 1 hour
Table 4.3: Shows the comparative results on sparse single-view undirected graphs using Generative Model
1.
Test Case | Size (|E|) | FMBM: Objective, NMI, Time | Graph-Tool: Objective, NMI, Time | DANMF: Objective, NMI, Time | CPLNS: Objective, NMI, Time
V0100b06 4125 815.58 0.1308 20.06 818.83 0.1291 0.46 829.20 0.0829 5.85 750.28 0.2727 3.50
V0300b06 41771 4702.60 0.2061 176.65 4819.43 0.0311 1.93 4828.49 0.0149 143.10 4746.32 0.0629 18.14
V0500b06 118536 10473.00 0.0537 513.31 10489.79 0.0193 4.86 10497.63 0.0053 658.45 10419.98 0.0411 41.13
V0100b08 4212 782.91 0.2182 19.23 761.68 0.3024 0.33 807.05 0.1532 6.02 712.92 0.3331 4.59
V0300b08 42078 4468.67 0.0944 175.55 4487.12 0.0443 1.48 4498.23 0.0151 144.96 4397.28 0.0895 19.95
V0500b08 120166 8188.55 0.1503 505.41 8278.07 0.0267 4.94 8290.82 0.0056 680.16 8175.52 0.0582 59.04
V0100b10 4268 731.61 0.3446 19.22 720.38 0.3290 0.32 779.34 0.1570 6.41 662.42 0.4100 6.93
V0300b10 42385 4069.42 0.1652 171.93 4126.01 0.0569 1.47 4141.94 0.0292 146.23 4009.88 0.1160 29.28
V0500b10 120366 7931.97 0.1174 516.16 7985.47 0.0347 4.03 8000.60 0.0000 616.74 7872.77 0.0761 60.22
Table 4.4: Shows the comparative results on dense single-view undirected graphs using Generative Model
1.
Test Case | Size (|E|) | FMBM: Objective, NMI, Time | Graph-Tool: Objective, NMI, Time | DANMF: Objective, NMI, Time | CPLNS: Objective, NMI, Time
V0400b04 7176 9843.84 0.0327 75.09 9444.48 0.8214 1.77 9855.97 0.0116 12.23 9786.67 0.0246 34.55
V0800b04 16780 27040.21 0.0087 192.97 26982.65 0.0698 5.67 27053.24 0.0008 69.68 26945.04 0.0087 243.28
V1600b04 28183 51531.41 0.0146 386.32 51556.38 0.0043 8.04 51559.96 0.0025 684.58 51467.00 0.0073 1098.12
V3200b04 72182 136376.53 0.0087 1196.30 135967.24 0.5287 32.92 - - > 1 hour - - > 1 hour
V0400b10 7069 9729.49 0.0639 73.76 9349.76 0.4827 1.52 9753.04 0.0625 10.69 9585.60 0.0837 127.68
V0800b10 15613 25528.33 0.0385 181.56 25022.25 0.4893 4.05 25553.52 0.0274 86.58 25353.08 0.0418 494.10
V1600b10 32294 58279.85 0.0195 428.40 58244.14 0.0305 13.92 58305.57 0.0157 687.84 58103.92 0.0224 2459.38
V3200b10 79173 148741.84 0.0082 1307.21 148745.65 0.0065 32.88 - - > 1 hour - - > 1 hour
V0400b20 6829 9453.81 0.1425 72.78 9139.14 0.2039 1.38 9506.35 0.1790 9.98 - - > 1 hour
V0800b20 15106 24810.19 0.0784 176.60 24521.35 0.1251 3.49 24866.54 0.0918 62.93 - - > 1 hour
V1600b20 30462 55265.42 0.0436 420.49 55207.39 0.0366 9.38 55310.70 0.0440 606.39 - - > 1 hour
V3200b20 67675 128267.45 0.0232 1199.32 128234.10 0.0168 40.91 - - > 1 hour - - > 1 hour
Table 4.5: Shows the comparative results on single-view undirected graphs using Generative Model 2.
that is, the number of blocks, was given as input for all the solvers in the experiments.8
8Although Graph-Tool does not require a user-specified value of k, it has a tendency to produce trivial solutions with k = 1, resulting in 0 NMI values when the value of k is not explicitly specified.
We used three metrics for comparison: the value of the objective function stated in Equation 4.2, the Normalized Mutual
Information (NMI) value with respect to the ground truth, and the running time in seconds. Unlike other
methods, FMBM is an anytime algorithm since it uses multiple trials. Each trial takes roughly (1/T)’th, that
is, one-tenth, of the time reported for FMBM in the experimental results. For each method and test case, we
averaged the results over 10 runs. All experiments were conducted on a laptop with a 3.1 GHz Quad-Core
Intel Core i7 processor and 16 GB LPDDR3 memory. Our implementation of FMBM was done in Python3
with NetworkX [49].
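For reference, the NMI between a computed block assignment and the ground truth can be obtained with scikit-learn, as in the short sketch below; the assignments shown are illustrative.

```python
# A sketch of computing the NMI between a computed block assignment and the
# ground-truth assignment, using scikit-learn.
from sklearn.metrics import normalized_mutual_info_score

ground_truth = [0, 0, 1, 1, 2, 2]   # illustrative ground-truth blocks
computed = [0, 0, 1, 2, 2, 2]       # illustrative computed blocks
print(normalized_mutual_info_score(ground_truth, computed))
```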
Although the underlying theory of FMBM can be generalized to directed edge-weighted graphs [43]
and to multi-view graphs, the current version of FMBM is operational only on single-view undirected unweighted graphs, sufficient to illustrate the power of FastMap embeddings. Therefore, only such test cases
are borrowed from other commonly used datasets [85, 99]. However, we also created new synthetic test
cases to be able to do a more comprehensive analysis.9
The synthetic test cases were generated according to two similar stochastic block models [1] as follows.
In Generative Model 1, given a user-specified number of vertices |V| and a user-specified number of blocks
k, we first assign each vertex to a block chosen uniformly at random to obtain the membership matrix C,
representing the ground truth. The image matrix M is drafted using certain “block structural characteristics”
designed for that instance with a parameter p. Each entry Mij is set to either p or 10p according to a rule
explained below. If Mij is set to p (10p), the two blocks Bi and Bj are weakly (strongly) connected to each
other with respect to p. The adjacency matrix A, representing the entire graph, is constructed from C and M
by connecting any two vertices vs ∈ Bi and vt ∈ Bj with probability Mij. In Generative Model 2, each entry
Mij is set to cp, where c is an integer chosen uniformly at random from the interval [1,10].
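A minimal sketch of Generative Model 1 is given below, assuming k ≥ 3 and omitting the noise-injection step; the block structural characteristics are simplified to the rule used for the sparse test cases described later, and the names are illustrative.

```python
# A sketch of Generative Model 1: assign vertices to blocks uniformly at
# random, build an image matrix whose entries are p or 10p (each block strongly
# connected to two other randomly chosen blocks), and connect vertex pairs with
# the corresponding probabilities. The noise-injection step is omitted here.
import math
import random
import networkx as nx

def generative_model_1(n, k, seed=0):
    rng = random.Random(seed)
    p = math.log(n) / n
    block = [rng.randrange(k) for _ in range(n)]          # membership (ground truth)
    M = [[p] * k for _ in range(k)]
    for i in range(k):
        for j in rng.sample([b for b in range(k) if b != i], 2):
            M[i][j] = M[j][i] = min(10 * p, 1.0)          # strong connection
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < M[block[u]][block[v]]:
                G.add_edge(u, v)
    return G, block
```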
Tables 4.1, 4.2, 4.3, 4.4, and 4.5 show the comparative performances of FMBM, Graph-Tool, DANMF,
and CPLNS.10 Table 4.1 contains commonly used real-world test cases from [85] and [99]. Here, FMBM
9https://github.com/leon-angli/Synthetic-Block-Modeling-Dataset
10DANMF did not assign any block membership to a few vertices in some synthetic test cases. We assign Block B1 by default to
such vertices.
outperforms DANMF and CPLNS with respect to the value of the objective function on 3 out of 6 instances,
despite the fact that it uses the expression in Equation 4.2 only for evaluation on Line 5 of Algorithm 4.
Graph-Tool performs well on all the instances. Table 4.2 shows the comparative performances on the complement graphs of the graphs in Table 4.1. This is done to test the robustness of the solvers against encoding
the same relationships between vertices as either edges or non-edges. While the value of the objective function and the running time are expected to change, the NMI value is expected to be stable. We observe that
FMBM and CPLNS are the only solvers that convincingly pass this test. Moreover, FMBM outperforms the
other solvers on more instances than in Table 4.1. Tables 4.1 and 4.2 do not test scalability since |V| is small
in these test cases.
Table 4.3 contains synthetic sparse test cases from Generative Model 1, named “Vnbk”, where n indicates the number of vertices and k indicates the number of blocks. These test cases have the following block
structural characteristics. Each block is strongly connected to two other randomly chosen blocks and weakly
connected to the remaining ones (including itself). We set p = (ln|V|)/|V|, making |E| = O(|V|log|V|) in
expectation. After generating A, we also add some noise to it by flipping each of its entries independently
with probability 0.05/|V|. FMBM outperforms DANMF and CPLNS with respect to both the value of
the objective function and the NMI value on 8 out of 12 instances. We also begin to see FMBM’s advantages in scalability. However, Graph-Tool outperforms all other methods by a significant margin on all the
instances. Table 4.4 contains synthetic dense test cases from Generative Model 1 constructed by setting
p = (ln|V|)/|V|, modifying each entry Mi j to 1 − Mi j, and adding noise, as before. We observe that the
performance of Graph-Tool is poor on such dense graphs. FMBM outperforms DANMF and CPLNS with
respect to the NMI value on 6 out of 9 instances. Although CPLNS produces marginally better values of the
objective function, its performance on large sparse graphs in Table 4.3 is bad.
Table 4.5 contains synthetic test cases from Generative Model 2 constructed by setting p = (ln|V|)/|V|.
FMBM outperforms DANMF and CPLNS with respect to the value of the objective function on 6 out of 12
(a) standard graph visualization of blocks in an instance with 1,600 vertices and 4,353 edges; (b) FMBM visualization of blocks in Euclidean space for the instance from 4.3a; (c) standard graph visualization of blocks in an instance with 1,600 vertices and 25,103 edges; (d) FMBM visualization of blocks in Euclidean space for the instance from 4.3c
Figure 4.3: Compares the visualization produced by FMBM against that of a competing method. (a) and (c)
show visualizations of two different instances with four blocks obtained by a standard graph visualization
procedure in NetworkX used with Graph-Tool. (b) and (d) show visualizations of the same two instances
obtained in the Euclidean embedding by FMBM. Four different colors are used to indicate the four different
blocks. The FMBM visualization is more helpful for gauging the spread of blocks, both individually and
relative to each other.
instances. It also outperforms DANMF and CPLNS with respect to the NMI value on a different set of 6
instances. Graph-Tool performs comparatively well on all the instances but occasionally produces low NMI
values.
4.4.1 Visualization
In addition to identifying blocks, their visualization is important for uncovering trends, patterns, and outliers in large graphs. A good visualization aids human intuition for gauging the spread11 of blocks, both
individually and relative to each other. In market analysis, for example, a representative element can be
chosen from each block with proper visualization. Figure 4.3 shows that FMBM provides a much more
perspicuous visualization compared to a standard graph visualization procedure in NetworkX12 used with
Graph-Tool, even though Graph-Tool shows good overall performance in Tables 4.1, 4.3, and 4.5. This is so
because FMBM solves the block modeling problem in Euclidean space, while other approaches use abstract
methods that are harder to visualize.
11The spread here refers to how a block extends from its center to its periphery.
4.5 Conclusions
In this chapter, we proposed FMBM, a FastMap-based algorithm for block modeling. In the first phase,
FMBM adapts FastMap to embed a given undirected unweighted graph into a Euclidean space in near-linear
time such that the pairwise Euclidean distances between vertices approximate a probabilistically-amplified
graph-based distance function between them. In doing so, it avoids having to directly work on the given
graphs and instead reformulates the graph block modeling problem to a Euclidean version. In the second
phase, FMBM uses GMM clustering for identifying clusters (blocks) in the resulting Euclidean space. Empirically, FMBM outperforms other state-of-the-art methods like FactorBlock, Graph-Tool, DANMF, and
CPLNS on many benchmark and synthetic test cases. FMBM also enables a perspicuous visualization of
the blocks in the graphs, not provided by other methods.
12https://networkx.org/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw.html
Appendix
4.A Table of Notations
Notation Description
G = (V,E) An undirected unweighted graph with vertices V = {v1, v2 ... vn} and edges E =
{e1, e2 ... em} ⊆ V ×V.
A The adjacency matrix representation of G, where Aij = 1 iff (vi, vj) ∈ E.
k The user-specified number of blocks.
C The membership matrix for block modeling on G, where Cij = 0 and Cij = 1 represent vertex vi being absent from and being present in partition j, respectively.
M The image matrix, where Mij represents the likelihood of an edge between a vertex in partition i and a vertex in partition j.
∥ · ∥F The Frobenius norm.
◦ The element-wise multiplication for two matrices.
R A matrix in [0,1]^{n×n}, where Rij = m/n².
κ The user-specified number of dimensions of the FastMap embedding.
G¯ The complement graph of G.
DP(·,·) The PASPD function.
dG(vi, vj) The shortest-path distance between vi and vj in G.
Gset A collection of undirected graphs, each derived from either the given graph G or
its complement G¯.
L The number of lineages of G and G¯ in Gset.
F The number of nested edge-induced subgraphs in each lineage.
ε The threshold parameter in FastMap that is used to detect large values of κ that
have diminishing returns on the accuracy of approximating the pairwise distances
between the vertices.
T The number of independent trials used in FMBM.
Bh The block that refers to the collection of all vertices with membership h.
Table 4.A.1: Describes the notations used in Chapter 4.
Chapter 5
FastMap for Graph Convex Hull Computations
Given an undirected edge-weighted graph G and a subset of vertices S in it, the graph convex hull CH^G_S
of S in G is the set of vertices obtained by the process of initializing CH^G_S to S and iteratively adding
until convergence all vertices on all shortest paths between all pairs of vertices in CH^G_S of one iteration
to constitute CH^G_S of the next iteration. Computing the graph convex hull has applications in shortest-path
computations, active learning, and in identifying geodesic cores in social networks, among others.
Unfortunately, computing it exactly is prohibitively expensive on large graphs. In this chapter, we present
a FastMap-based algorithm for efficiently computing approximate graph convex hulls. Using FastMap’s
ability to facilitate geometric interpretations, our approach invokes the power of well-studied algorithms
in Computational Geometry that efficiently compute the convex hull of a set of points in Euclidean space.
Through experimental studies, we show that our approach not only is several orders of magnitude faster than
the exact brute-force algorithm but also outperforms the state-of-the-art approximation algorithm, both in
terms of generality and the quality of the solutions produced.
5.1 Introduction
In Computational Geometry, the convex hull of a finite set of points in Euclidean space is defined as the
smallest convex polygon in that space that contains all of them. The problem of computing the convex hull
(a) geometric environment (b) graph environment
Figure 5.1: Illustrates a graph convex hull. (a) shows a geometric environment with obstacles (black regions)
that is discretized as a grid-world. (b) shows a graph representation G of the environment in (a), with vertices
representing the top-left corners of the free cells (non-black regions) and edges, weighted by their Euclidean
lengths, connecting pairs of vertices on the boundary of the same free cell. In both (a) and (b), the red dots
indicate S and the union of the red and orange dots indicates CH^G_S.
of a given finite set of points is a cornerstone problem with numerous applications: In discrete Geometry,
several results rely on convex hulls [108]. In Mathematics, convex hulls are used to study polynomials [96]
and matrix eigenvalues [56]. In Statistics, they are used to define risk sets [52]. They also play a key role in
polyhedral combinatorics [58].
While convex hulls have been traditionally studied in geometric spaces, they can also be defined on
graphs. In particular, given an undirected edge-weighted graph G = (V,E,w), where V is the set of vertices,
E is the set of edges, and for any edge e ∈ E, w(e) is the non-negative weight on it, and a subset of vertices
S ⊆ V, the graph convex hull of S in G is the smallest set of vertices CH^G_S that contains S and all vertices
that appear on any shortest path between any pair of vertices in CH^G_S. Procedurally, CH^G_S can be obtained
by the process of initializing it to S and iteratively adding until convergence all vertices on all shortest paths
between all pairs of vertices in CH^G_S of one iteration to constitute CH^G_S of the next iteration.
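The procedural definition above translates directly into the brute-force sketch below, written with NetworkX and assuming a connected graph; it is exact but, as discussed next, prohibitively expensive on large graphs, and the names are illustrative.

```python
# A brute-force sketch of the graph convex hull CH^G_S: repeatedly add every
# vertex that lies on some shortest path between a pair of hull vertices, until
# no new vertex is added. Assumes G is connected; exact but expensive.
from itertools import combinations
import networkx as nx

def graph_convex_hull(G, S):
    hull = set(S)
    dist = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))
    changed = True
    while changed:
        changed = False
        for u, v in combinations(list(hull), 2):
            for x in G.nodes():
                if x in hull:
                    continue
                # x lies on a shortest u-v path iff the triangle is tight.
                if abs(dist[u][x] + dist[x][v] - dist[u][v]) < 1e-9:
                    hull.add(x)
                    changed = True
    return hull
```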
Modulo discretization, graphs are capable of representing complex manifolds and geometric spaces.
Hence, graph convex hulls “generalize” geometric convex hulls. For example, Figure 5.1 shows a graph
convex hull computed on a graph that represents a 2-dimensional Euclidean space with obstacles. Graph
convex hulls have many important properties and applications that are analogous to those of geometric
convex hulls. Figure 5.2 shows one such important analogous property: Every shortest path between any
two query vertices on the graph intersects a graph convex hull in a single continuum. Moreover, graph
convex hulls have many other applications in active learning [121] and identifying geodesic cores in social
networks [109], among others.
While geometric convex hulls can be computed efficiently, little is known about efficient algorithms
for computing graph convex hulls. In particular, it is well known that geometric convex hulls can be
computed with the following complexities: Given input points S, Qhull [8] can compute the geometric
convex hull with corners C in O(|S|log|C|) time in 2-dimensional and 3-dimensional Euclidean spaces
and in O(|S| f(|C|)/|C|) time in higher-dimensional Euclidean spaces. Here, the function f(|C|) returns
the maximum number of faces of a convex polytope with |C| corners and is given by the expression
f(|C|) = O(|C|^⌊κ/2⌋/⌊κ/2⌋!) for a κ-dimensional Euclidean space. In contrast, computing graph convex
hulls may not be that efficient. In fact, it is folklore that computing graph convex hulls on a general graph
G = (V,E,w) takes at least O(|V||E|) time [94]. Hence, brute-force approaches for computing graph convex
hulls exactly are prohibitively expensive on large graphs.
In this chapter, we present a novel algorithm for efficiently computing approximate graph convex hulls
based on FastMap. Since FastMap facilitates geometric interpretations of graph-theoretic problems, our
proposed approach utilizes the efficient algorithms mentioned above for computing the geometric convex
hull of a set of points in Euclidean space, particularly in 2-dimensional and 3-dimensional Euclidean spaces.
Although our FastMap-based transformation of the graph convex hull problem to the geometric convex
hull problem is promising, it does not guarantee exactness. Thus, as a further contribution in this chapter,
we advance this approach using an iterative refinement procedure. This procedure significantly improves the
recall without compromising the precision. Hence, our iterative FastMap-based algorithm has much better
Jaccard scores compared to the naive FastMap-based algorithm. It also runs several orders of magnitude
(a) geometric convex hulls (b) graph convex hulls
Figure 5.2: Illustrates an important property of graph convex hulls analogous to geometric convex hulls. (a)
shows that the straight line (shown in green) joining two points in Euclidean space intersects a geometric
convex hull (one shown in red and one shown in blue) in exactly one continuum (shown in orange). Here, the
red and blue dots indicate the set of points for which the red and blue geometric convex hulls are computed,
respectively. (b) shows that any shortest path (shown in green) between two vertices intersects a graph
convex hull (one shown in red and one shown in blue) in exactly one continuum (shown in orange). Here,
the larger red and blue dots indicate the set of vertices for which the red and blue graph convex hulls are
computed, respectively.
faster than the exact brute-force algorithm. Moreover, we compare our iterative FastMap-based algorithm
against the state-of-the-art approximation algorithm and demonstrate two advantages over it. First, our
approach is applicable to undirected edge-weighted graphs while the competing approach is applicable only
to undirected unweighted graphs. Second, even on undirected unweighted graphs, our iterative FastMap-based algorithm experimentally produces higher-quality solutions and is faster on large graphs.
5.2 FastMap-Based Algorithms for Graph Convex Hull Computations
In this section, we first introduce a naive version of our FastMap-based algorithm. We then improve it to
an iterative version using certain geometric intuitions, which, in turn, are also enabled by FastMap. This
iterative version of our FastMap-based algorithm is the final product we use in our experimental comparisons
against the state-of-the-art competing approach.
The naive version of our FastMap-based algorithm computes an approximation of the graph convex hull as follows: (1) It embeds the vertices of the given graph G = (V,E,w) in a Euclidean space with κ dimensions, typically for κ = 2, 3, or 4; (2) It computes the geometric convex hull of the points corresponding to the vertices in S; and (3) It reports all the vertices that map to the interior^1 of this geometric convex hull as the required approximation of the graph convex hull. This algorithm is very efficient, especially if κ = 2 or 3: Step (1) runs in O(|E| + |V|log|V|) time; Step (2) runs in O(|S|log|C|) time; and Step (3) runs in O(|V||C|) time.^2 In Steps (2) and (3), |C| is upper-bounded by |S| since the computation is done in Euclidean space.
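For concreteness, the following Python sketch outlines this naive procedure. It assumes a routine fastmap_embed(G, kappa, epsilon) (in the spirit of Algorithm 2 from Chapter 1) that returns a dictionary mapping each vertex to its κ-dimensional point; this routine, the function name, and the exact signatures are illustrative assumptions rather than the implementation used in our experiments. The SciPy calls wrap the Qhull library [8].

import numpy as np
from scipy.spatial import ConvexHull, Delaunay

def naive_fm_graph_convex_hull(G, S, kappa=3, epsilon=1e-4):
    # fastmap_embed: assumed embedding routine (vertex -> kappa-dimensional point); not defined here.
    coords = fastmap_embed(G, kappa, epsilon)        # Step (1): embed all vertices
    vertices = list(G.nodes())
    P = np.array([coords[v] for v in vertices])
    PS = np.array([coords[v] for v in S])
    hull = ConvexHull(PS)                            # Step (2): geometric hull of the points of S
    # Step (3): report every vertex whose point lies inside (or on) the hull;
    # a Delaunay triangulation of the hull corners provides the membership test.
    corners = Delaunay(PS[hull.vertices])
    inside = corners.find_simplex(P) >= 0
    return {v for v, keep in zip(vertices, inside) if keep}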
The foregoing algorithm is not guaranteed to return an under-approximation or an over-approximation
of the vertices in the graph convex hull. Hence, we introduce the measures of precision, recall, and Jaccard
score. Here, the precision refers to the fraction of reported vertices that belong to the ground-truth graph
convex hull. The recall refers to the fraction of vertices in the ground-truth graph convex hull that are
reported. The Jaccard score refers to the ratio of the number of reported vertices that are in the ground-truth
graph convex hull to the number of vertices that are either reported or in the ground-truth graph convex
hull. Empirically, we observe that even this naive version of our FastMap-based algorithm generally yields
very high precision values on a wide variety of graphs. However, there is leeway for improving its recall
and, consequently, its Jaccard score. Towards this end, we design an iterative version of our FastMap-based
algorithm drawing intuitions from the geometric convex hull.
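The three measures can be computed directly from the reported vertex set and the ground-truth graph convex hull, as in the following small sketch (both arguments are Python sets of vertices):

def precision_recall_jaccard(reported, ground_truth):
    hits = len(reported & ground_truth)
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    jaccard = hits / len(reported | ground_truth) if (reported or ground_truth) else 0.0
    return precision, recall, jaccard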
The iterative version of our FastMap-based algorithm computes an approximation of the graph convex
hull as follows: In the first iteration, it follows Steps (1) and (2) mentioned above. However, it does not
terminate by merely identifying the vertices mapped to the interior of the geometric convex hull. Instead,
it identifies the vertices mapped to the corners of the geometric convex hull and computes all shortest paths
between all pairs of them directly on the input graph G. Vertices on any of these shortest paths that are not
already in S are considered in addition to S for the second iteration of computing the geometric convex hull.
^1 The interior includes the boundaries and the corners of the geometric convex hull.
^2 To check if a given point is inside a convex polytope, we have to check its relationship to each of the convex polytope's f(|C|) faces. f(|C|) = O(|C|) for κ = 2 or 3.
The algorithm may not terminate by identifying the vertices mapped to the interior of the new geometric
convex hull produced in the second iteration, either. In such a case, it identifies the vertices mapped to the
corners of the new geometric convex hull and aims to compute all shortest paths between all pairs of them
directly on the input graph G. In doing so, it avoids any redundant computations for pairs of vertices with cached results from the first iteration.^3 The algorithm continues this process until convergence, that is, no
new vertices are added to S for the next iteration. Upon convergence, it reports all the vertices mapped to
the interior of the geometric convex hull from the last iteration as the required approximation of the graph
convex hull. Moreover, the algorithm is guaranteed to converge since: (a) The set of vertices inducted
into the graph convex hull in any iteration subsumes that of the previous iteration; and (b) G has a finite
number of vertices. Before convergence, the algorithm can also be terminated after a user-specified number
of iterations.
Overall, our iterative FastMap-based algorithm hybridizes the exact brute-force algorithm and the naive
FastMap-based algorithm. While the exact brute-force algorithm performs all its convex hull-related computations directly on the input graph G, the naive FastMap-based algorithm performs all its convex hull-related
computations on the Euclidean embedding of G. The iterative FastMap-based algorithm hybridizes them
and performs its convex hull-related computations partly on G and partly on its Euclidean embedding, interleaving them intelligently so that the shortest paths on G are computed only between pairs of vertices
mapped to the corners of the geometric convex hull obtained in the previous iteration. On the one hand, it
is significantly more efficient than the exact brute-force algorithm that computes all shortest paths between
all pairs of vertices in every iteration until convergence. On the other hand, it is more informed than the
naive FastMap-based algorithm, which may occasionally miss qualifying vertices—and, consequently, their
substantial downstream effects—when they are placed even marginally outside the geometric convex hull in
the Euclidean embedding of G.
^3 A previous iteration, in general.
Figure 5.3: Shows the behavior of our iterative FastMap-based algorithm on the running example from
Figure 5.1. The individual panels are explained in the main text of the chapter. The precision, recall, and
Jaccard score are reported after each iteration.
Figure 5.3 shows the step-wise working of our iterative FastMap-based algorithm on an example graph.
Figure 5.3a shows the input graph G with the set S indicated in red. Figure 5.3b indicates the geometric convex hull of the red points from Figure 5.3a in the FastMap embedding of G. The corners of the geometric
convex hull are shown in the overriding color blue. Figure 5.3c shows the interior points of the geometric
convex hull from Figure 5.3b. The blue points are carried over from Figure 5.3b and the orange points indicate the rest of the interior points. Figure 5.3d shows the interior (blue and orange) points from Figure 5.3c
identified on G in orange. This set of orange vertices is the algorithm's approximation of CH^G_S after the first iteration.
Figure 5.3e marks the blue points from Figure 5.3c as larger green vertices. It also shows all vertices
that appear on any of the shortest paths between any pair of these larger green vertices as regular green
vertices. However, this set of regular green vertices excludes the vertices in S, since we compute only the
incremental update to the set of vertices that are deemed to be in CH^G_S. Figure 5.3f shows the green vertices from Figure 5.3e and the red vertices from Figure 5.3a as the vertices deemed to be in CH^G_S at this stage.
The corners of the geometric convex hull of the points corresponding to these vertices are shown in the
overriding color blue. Figure 5.3g is similar to Figure 5.3c and carries over the blue points from Figure 5.3f.
However, the new internal points identified in this iteration, compared to the previous one, are shown as
larger points. Figure 5.3h is similar to Figure 5.3d and is derived from Figure 5.3g. The new set of orange
vertices is the algorithm’s approximation of CHG
S
after the second iteration, where the new orange vertices
are larger.
Figure 5.3i marks the blue points from Figure 5.3g as larger green vertices. It also shows all vertices that
appear on any of the shortest paths between any pair of these larger green vertices as regular green vertices.
However, this set of regular green vertices excludes the vertices in S (red vertices from Figure 5.3a) and the
green vertices computed in the previous iterations (green vertices from Figure 5.3e), since we compute only
the incremental update to the set of vertices that are deemed to be in CH^G_S. Figure 5.3j shows the green vertices from Figures 5.3i and 5.3e and the red vertices from Figure 5.3a as the vertices deemed to be in CH^G_S at this stage. The corners of the geometric convex hull of the points corresponding to these vertices
are shown in the overriding color blue. Figure 5.3k is similar to Figure 5.3g and carries over the blue points
from Figure 5.3j. However, the new internal points identified in this iteration, compared to the previous
one, are shown as larger points. Figure 5.3l is similar to Figure 5.3h and is derived from Figure 5.3k. The
new set of orange vertices is the algorithm's approximation of CH^G_S after the third iteration, where the new orange vertices are larger. At this stage, the algorithm converges; in general, it would continue either until convergence or for a user-specified number of iterations.
Algorithm 5 shows the pseudocode for our iterative FastMap-based algorithm (FMGCH). It takes as
input the graph G = (V,E,w) and a set of vertices S ⊆ V, for which the graph convex hull needs to be
computed. The input parameters κ and ε are pertinent to the FastMap embedding, as in Algorithm 2 (from
Chapter 1). The output CH^G_S is the required graph convex hull or an approximation of it. The algorithm initializes and maintains S̃ to represent the set of vertices deemed to be in CH^G_S. In addition, it initializes and maintains CH_S̃ to represent the corners of the geometric convex hull of S̃ in the FastMap embedding of G. On Line 1, the algorithm calls Algorithm 2 and creates a κ-dimensional Euclidean embedding of G, with vertex v_i ∈ V mapped to point p_i ∈ R^κ. On Lines 3 and 4, the algorithm identifies the points corresponding to the specified vertices in S and computes their geometric convex hull. Here, the function ConvexHull(·) returns only the corners of the geometric convex hull. The algorithm then performs some initializations on Lines 5 and 6 and begins the iterative process on Lines 7-26 until convergence is detected on Lines 7 or 21. In each iteration, the old value of CH_S̃, that is, CH′_S̃, is first replaced by CH_S̃ on Line 8. Subsequently, CH_S̃ is updated on Lines 9-25. This update starts on Line 9 by identifying the necessary pairs of vertices between which all vertices on all shortest paths need to be computed, while avoiding any redundant computations with respect to the previous iterations. On Lines 12-17, the algorithm then computes the shortest-path dictionary rooted at v_i, for each necessary pair (p_i, p_j) with i < j, if this dictionary is not already cached
Algorithm 5: FMGCH (FastMap-Based Graph Convex Hull): A FastMap-based algorithm for computing graph convex hulls.

Input: G = (V,E,w) and S ⊆ V
Parameter: κ and ε
Output: CH^G_S

1: P ← FastMap(G, κ, ε).
2: S̃ ← S.
3: P_S̃ ← {p_i ∈ P : v_i ∈ S̃}.
4: CH_S̃ ← ConvexHull(P_S̃).
5: CH′_S̃ ← {}.
6: Dicts ← dictionary().
7: while CH_S̃ ≠ CH′_S̃ do
8:   CH′_S̃ ← CH_S̃.
9:   pairs ← {(p_i, p_j) : p_i, p_j ∈ CH′_S̃, i < j, and (p_i, p_j) is not cached}.
10:  S̃′ ← S̃.
11:  for (p_i, p_j) ∈ pairs do
12:    if v_i ∈ Dicts then
13:      SPD_vi ← Dicts[v_i].
14:    else
15:      SPD_vi ← ShortestPathDictionary(G, v_i).
16:      Dicts[v_i] ← SPD_vi.
17:    end if
18:    S_∆ ← VerticesOnAllShortestPaths(SPD_vi, v_j).
19:    S̃ ← S̃ ∪ S_∆.
20:  end for
21:  if S̃ = S̃′ then
22:    break
23:  end if
24:  P_S̃ ← {p_i ∈ P : v_i ∈ S̃}.
25:  CH_S̃ ← ConvexHull(P_S̃).
26: end while
27: CH^G_S ← {v_i : p_i ∈ PointsWithinHull(CH_S̃, P)}.
28: return CH^G_S.
in the ‘Dicts’ data structure. The function ShortestPathDictionary(·,·) returns a list of predecessors of
each vertex that lead to the root vertex v_i along a shortest path.^4 On Lines 18 and 19, the algorithm calls the function VerticesOnAllShortestPaths(·,·) to gather all vertices that appear on any of the shortest paths from v_i to v_j and adds them incrementally to S̃. On Lines 21-23, the algorithm checks for convergence and breaks the iterative loop if necessary.^5 On Lines 24 and 25, it updates the geometric convex hull in
^4 In Python3, this can be realized using the 'dijkstra_predecessor_and_distance()' function of NetworkX [49].
^5 In such a case, the convergence condition on Line 7 is also satisfied. However, the work on Lines 24 and 25 can be avoided.
preparation for the next iteration. Upon termination of the iterative loop, on Lines 27 and 28, the algorithm
computes and returns the entire interior of the geometric convex hull from the last iteration. The function
PointsWithinHull(·,·) determines which of the specified points belong to the interior of a geometric convex
hull specified by its corners. It does so by computing the faces of the geometric convex hull from its corners
and querying each point with respect to these faces.
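To make the two helper functions concrete, the following Python sketch shows one way to realize them with NetworkX, using the 'dijkstra_predecessor_and_distance()' function mentioned above; the function names and signatures mirror the pseudocode but are illustrative, not the exact implementation used in our experiments.

import networkx as nx

def shortest_path_dictionary(G, root, weight="weight"):
    # For every vertex, return the list of predecessors that lie on some
    # shortest path back to `root` (the shortest-path dictionary rooted at `root`).
    pred, _dist = nx.dijkstra_predecessor_and_distance(G, root, weight=weight)
    return pred

def vertices_on_all_shortest_paths(spd, target):
    # Gather all vertices on any shortest path from the root of `spd` to `target`
    # by walking the predecessor lists backwards from `target`.
    visited = {target}
    stack = [target]
    while stack:
        v = stack.pop()
        for u in spd.get(v, []):
            if u not in visited:
                visited.add(u)
                stack.append(u)
    return visited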
As analyzed before, the running time complexity of the naive FastMap-based algorithm is superior to
the folklore results for the exact computation of graph convex hulls. Although it is much harder to analyze
the running time complexity of our iterative FastMap-based algorithm, we provide an analysis here under
certain realistic assumptions. Let CH^G_S be the ground truth and κ = 2 or 3.^6 We assume that the algorithm runs for τ iterations, for a small constant τ, has a high precision in all iterations, with |S̃| ⪅ |CH^G_S|, and that CH_S̃ has at most c̄ corners in all iterations, with c̄ ≪ |CH^G_S|. These assumptions have been observed to be true in extensive experimental studies. Hence, in each iteration, the algorithm computes the geometric convex hull in O(|S̃| log c̄) time, in which |S̃| is upper-bounded by |CH^G_S|. In addition, in each iteration, the algorithm computes all vertices on the shortest paths between all pairs of vertices corresponding to the points in CH_S̃. It does so in two phases. First, it computes the shortest-path dictionaries rooted at each of the vertices corresponding to the points in CH_S̃ in O(c̄(|E| + |V|log|V|)) time. Second, it post-processes each such dictionary with respect to each of the destination vertices corresponding to the points in CH_S̃ in time linear in the size of the dictionary. This takes O(c̄^2(|E| + |V|)) time. Finally, the algorithm computes the interior of the geometric convex hull in O(c̄|V|) time. Therefore, the overall time complexity is O(τ(|CH^G_S| log c̄ + c̄(|V|log|V|) + c̄^2(|E| + |V|))). This complexity is still better than the folklore results
for the exact computation of graph convex hulls. Moreover, the real benefits of the iterative FastMap-based
algorithm become evident in the experimental studies presented in the next section.
^6 For higher values of κ, the analysis has to explicitly factor in the number of faces of κ-dimensional convex polytopes.
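Restating this analysis in display form, the cost of a single iteration decomposes as follows (this merely re-expresses the bound derived above under the same assumptions):

\[
\underbrace{O\big(|CH^{G}_{S}|\log\bar{c}\big)}_{\text{geometric convex hull}}
\;+\;
\underbrace{O\big(\bar{c}\,(|E|+|V|\log|V|)\big)}_{\text{shortest-path dictionaries}}
\;+\;
\underbrace{O\big(\bar{c}^{2}(|E|+|V|)\big)}_{\text{post-processing}}
\;+\;
\underbrace{O\big(\bar{c}\,|V|\big)}_{\text{interior test}}.
\]

Summing over the τ iterations and absorbing the c̄|E| term into the c̄^2|E| term yields the overall bound stated above.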
Algorithm 6: EXACTGRAPHCONVEXHULL: An exact algorithm with implementation-level optimizations for computing graph convex hulls.

Input: G = (V,E,w) and S ⊆ V
Output: CH^G_S

1: CH^G_S ← S.
2: Vunexpanded ← S.
3: Vexpanded ← {}.
4: Pairs ← {}.
5: while Vunexpanded ≠ {} do
6:   Randomly select v_s from Vunexpanded.
7:   SPD_vs ← ShortestPathDictionary(G, v_s).
8:   Vnew ← {}.
9:   for v_t ∈ CH^G_S do
10:    if v_s = v_t or (v_s, v_t) ∈ Pairs then
11:      Continue.
12:    end if
13:    V_s,t ← VerticesOnAllShortestPaths(SPD_vs, v_t).
14:    Pairs_s,t ← PairsOnAllShortestPaths(SPD_vs, v_t).
15:    Pairs ← Pairs ∪ Pairs_s,t.
16:    for v ∈ V_s,t do
17:      if v ∉ Vexpanded and v ∉ Vunexpanded then
18:        Add v to Vunexpanded.
19:        Add v to Vnew.
20:      end if
21:    end for
22:  end for
23:  CH^G_S ← CH^G_S ∪ Vnew.
24:  Add v_s to Vexpanded.
25:  Remove v_s from Vunexpanded.
26: end while
27: return CH^G_S.
5.3 An Efficient Implementation of the Exact Brute-Force Algorithm
In this section, we present an exact brute-force algorithm for computing graph convex hulls. This procedure not only serves as a baseline competing method but also produces the ground truth. Although it is
a brute-force algorithm, it incorporates several implementation-level optimizations derived from the use of
dictionaries and caching. Some of these optimizations are also used in FMGCH and, in fact, may further be
useful for other algorithms developed in the future.
Algorithm 6 presents the pseudocode of the brute-force procedure. It takes as input the graph G =
(V,E,w) and a set of vertices S ⊆ V, for which the graph convex hull needs to be computed. The output CH^G_S is the required graph convex hull. The algorithm uses shortest-path dictionaries as a critical data structure: A shortest-path dictionary rooted at the vertex v is denoted by SPD_v. It is similar to the shortest-path tree
rooted at v but represents all shortest paths from v to all other vertices. It does this by maintaining a list of
predecessors of each vertex that lead to the root vertex v along a shortest path. The algorithm also maintains
a set of vertices Vunexpanded, rooted at each of which it intends to compute a shortest-path dictionary. The set
Vexpanded gathers the vertices for which this task has been accomplished. Moreover, the algorithm uses Pairs
to maintain a set of pairs of vertices between which all shortest paths have been generated.
On Lines 1-4, Algorithm 6 initializes CH^G_S, Vunexpanded, Vexpanded, and Pairs. On Lines 5-26, the algorithm loops until convergence while computing a new shortest-path dictionary in each iteration. The new shortest-path dictionary provides information on some vertices that provably belong to CH^G_S: These vertices are gathered in the set Vnew; and all other data structures are appropriately updated. On Lines 6-8, the algorithm picks a random vertex v_s from Vunexpanded, calls the function ShortestPathDictionary(·,·) to compute SPD_vs, the shortest-path dictionary rooted at it, and initializes Vnew. On Lines 9-22, the algorithm considers every possible vertex v_t that is currently in CH^G_S and pairs it with v_s. On Lines 10-12, the algorithm skips the pair (v_s, v_t) if v_s = v_t or (v_s, v_t) has been processed before and, hence, has been recorded in Pairs. Otherwise, on Line 13, the algorithm calls the function VerticesOnAllShortestPaths(·,·) to gather into V_s,t all vertices that appear on any of the shortest paths from v_s to v_t.
Moreover, on Line 14, the algorithm calls the function PairsOnAllShortestPaths(·,·) to gather all pairs of vertices that both appear on any of the shortest paths from v_s to v_t. This is achieved by: (1) Creating a directed acyclic graph G_DAG on all vertices that are reachable via predecessor (parent) relationships from v_t in SPD_vs; (2) Repeating, until there are no more vertices, the process of identifying a vertex v_i with no predecessors, computing all of its descendants via breadth-first search, recording the pair (v_i, v_j) for each descendant v_j, and removing v_i with all its edges from G_DAG; and (3) Returning all pairs of vertices recorded in the previous step as the output. On Line 15, the algorithm updates Pairs accordingly: This is used in future iterations on Line 10 to avoid redundant work.
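A sketch of PairsOnAllShortestPaths following steps (1)-(3) above is shown below. It assumes the shortest-path dictionary is given as a predecessor dictionary (as produced by the sketch in Section 5.2); the function name and signature are illustrative rather than our exact implementation.

import networkx as nx

def pairs_on_all_shortest_paths(spd, v_t):
    # (1) Build the DAG on all vertices reachable from v_t via predecessor links;
    #     each edge points from a predecessor to its successor on a shortest path.
    dag = nx.DiGraph()
    dag.add_node(v_t)
    stack = [v_t]
    while stack:
        v = stack.pop()
        for u in spd.get(v, []):
            if u not in dag:
                stack.append(u)
            dag.add_edge(u, v)
    # (2)-(3) Repeatedly take a vertex with no predecessors, record a pair for
    #         each of its descendants, and remove it along with its edges.
    pairs = set()
    while dag.number_of_nodes() > 0:
        v_i = next(v for v in dag.nodes if dag.in_degree(v) == 0)
        for v_j in nx.descendants(dag, v_i):
            pairs.add((v_i, v_j))
        dag.remove_node(v_i)
    return pairs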
On Lines 16-21, the algorithm considers each vertex v in V_s,t. If v has not yet been considered for computing a shortest-path dictionary rooted at it, that is, if v does not appear in Vunexpanded or Vexpanded, it is added to Vunexpanded for future consideration. It is also added to Vnew. On Line 23, the algorithm adds Vnew to CH^G_S, the correctness of which is proved by the following arguments. From Lines 18, 19, and 23, it is easy to infer that any vertex added to Vunexpanded is also added to Vnew and CH^G_S. Hence, v_s chosen on Line 6 belongs to CH^G_S. Moreover, from Line 9, v_t also belongs to CH^G_S. Since all vertices in V_s,t and Vnew appear on a shortest path from v_s to v_t, Line 23 correctly adds Vnew to CH^G_S in accordance with the definition of the graph convex hull. On Lines 24 and 25, the algorithm updates Vexpanded and Vunexpanded, respectively. On Line 27, it returns CH^G_S upon convergence.
5.4 Experimental Results
In this section, we present tabular experimental results that compare FMGCH, that is, Algorithm 5, against
the state-of-the-art algorithm for computing graph convex hulls, which is encapsulated within GCoreApproximation (GCA) [109]^7. However, GCA is also an approximation algorithm. Hence, to produce the ground truth (GT), we invoked Algorithm 6 presented in Section 5.3. We do not include our naive FastMap-based algorithm in the tabular results to avoid clutter. However, as expected, it is more efficient than FMGCH, our iterative FastMap-based algorithm, but produces a lower recall and Jaccard score. We implemented FMGCH in Python3. For computing the geometric convex hull of a collection of points in
Euclidean space, we used the ‘Qhull’ library [8]. We invoked GCA using a simple Python wrapper function.
We conducted all experiments on a laptop with an Apple M2 Max chip and 96 GB memory.
^7 https://github.com/fseiffarth/GCoreApproximation
Instance | Size (|V|, |E|) | GT (s) | GCA: time (s), jacc, prec, recall | FMGCH (κ = 2): pre (s), query (s), jacc, prec, recall | FMGCH (κ = 3): pre (s), query (s), jacc, prec, recall | FMGCH (κ = 4): pre (s), query (s), jacc, prec, recall
miles1000 (128, 3216) 0.4754 0.0016 0.9922 1.0000 0.9922 0.0091 0.0146 1.0000 1.0000 1.0000 0.0172 0.0394 1.0000 1.0000 1.0000 0.0227 0.0657 1.0000 1.0000 1.0000
myciel7 (191, 2360) 0.5030 0.0009 1.0000 1.0000 1.0000 0.0074 0.0049 1.0000 1.0000 1.0000 0.0099 0.0151 1.0000 1.0000 1.0000 0.0131 0.0300 1.0000 1.0000 1.0000
queen16_16 (256, 6320) 1.1260 0.0034 1.0000 1.0000 1.0000 0.0154 0.0264 1.0000 1.0000 1.0000 0.0346 0.1073 1.0000 1.0000 1.0000 0.0392 0.1241 1.0000 1.0000 1.0000
le450_25d (450, 17425) 5.5958 0.0240 1.0000 1.0000 1.0000 0.0647 0.0601 0.9133 1.0000 0.9133 0.1187 0.2570 0.9889 1.0000 0.9889 0.0936 0.4376 0.9978 1.0000 0.9978
orz601d (1890, 3473) 48.2478 0.0307 0.7801 0.8664 0.8867 0.0156 0.0169 0.6456 0.6456 1.0000 0.0278 0.0862 0.7724 0.7731 0.9988 0.0373 0.2785 0.9115 0.9125 0.9988
lak106d (1909, 3589) 54.9086 0.0431 0.7510 0.8528 0.8628 0.0181 0.0410 0.8722 0.8736 0.9982 0.0319 0.1404 0.9104 0.9111 0.9991 0.0401 0.2190 0.9568 0.9576 0.9991
hrt001d (3705, 6914) 1342.2423 0.1715 0.6319 0.7551 0.7948 0.0276 0.0714 0.6375 0.6453 0.9815 0.0516 0.2105 0.7344 0.7344 1.0000 0.0595 0.6886 0.9316 0.9354 0.9956
orz000d (4057, 7744) 1936.5250 0.2134 0.9439 0.9711 0.9711 0.0315 0.0780 0.9683 0.9683 1.0000 0.0944 0.2351 0.9873 0.9884 0.9988 0.0774 0.9722 0.9891 0.9949 0.9941
ca-GrQc (4158, 13428) 26.1400 0.2220 0.1945 0.3254 0.3259 0.0548 0.1805 0.3307 0.3330 0.9792 0.1027 0.5365 0.3389 0.3403 0.9881 0.0959 1.1852 0.3554 0.3566 0.9903
ca-HepTh (8638, 24827) 219.1975 1.5639 0.2704 0.4258 0.4255 0.0975 0.2371 0.4219 0.4329 0.9432 0.2595 1.2079 0.4299 0.4313 0.9928 0.2274 3.5873 0.4372 0.4378 0.9967
wiki-Vote (7066, 100736) 862.9993 2.7505 0.5034 0.6698 0.6696 0.2689 0.4216 0.6173 0.6613 0.9027 0.4657 2.1628 0.6408 0.6550 0.9673 0.5820 5.9597 0.6696 0.6854 0.9666
ca-HepPh (11204, 117649) 799.8486 3.5258 0.3324 0.4989 0.4990 0.3205 1.1677 0.4352 0.4389 0.9810 0.6013 3.8834 0.4379 0.4404 0.9872 0.6396 13.0594 0.4446 0.4465 0.9907
wm04000 (4000, 20109) 207.3266 1.2004 0.9995 0.9997 0.9997 0.0651 0.1952 0.9945 0.9997 0.9947 0.1505 0.8467 0.9975 1.0000 0.9975 0.1726 2.7267 0.9992 1.0000 0.9992
wm08000 (8000, 40014) 1061.7808 5.3560 0.9998 0.9999 0.9999 0.2337 0.5732 0.9972 0.9999 0.9974 0.3414 2.3460 0.9986 0.9999 0.9987 0.3716 5.6938 0.9987 1.0000 0.9987
wm10000 (10000, 49980) 1755.0927 8.7739 0.9994 0.9997 0.9997 0.2436 0.5254 0.9991 0.9998 0.9993 0.3281 2.0855 0.9983 0.9997 0.9986 0.3831 7.6367 0.9987 0.9997 0.9990
wm12000 (12000, 59853) 2603.6225 12.9154 0.9991 0.9996 0.9995 0.2663 0.9198 0.9929 0.9997 0.9932 0.4050 3.1991 0.9992 0.9997 0.9994 0.4999 9.7890 0.9988 0.9999 0.9989
miles1000 (128, 3216) 0.2592 - - - - 0.0068 0.0056 0.9444 1.0000 0.9444 0.0130 0.0269 0.9683 1.0000 0.9683 0.0174 0.0696 0.9921 1.0000 0.9921
myciel7 (191, 2360) 0.2875 - - - - 0.0075 0.0099 0.8066 0.8957 0.8902 0.0096 0.0282 0.8268 0.9080 0.9024 0.0164 0.1033 0.9266 0.9266 1.0000
queen16_16 (256, 6320) 0.9702 - - - - 0.0127 0.0515 0.9922 1.0000 0.9922 0.0258 0.1423 0.9922 1.0000 0.9922 0.0370 0.2167 0.9961 1.0000 0.9961
le450_25d (450, 17425) 4.6087 - - - - 0.0580 0.1069 0.9376 1.0000 0.9376 0.0875 0.4325 0.9866 1.0000 0.9866 0.0978 0.6410 0.9889 1.0000 0.9889
orz601d (1890, 6746) 22.4946 - - - - 0.0255 0.0340 0.5961 0.5961 1.0000 0.0373 0.0930 0.7192 0.7192 1.0000 0.0516 0.3527 0.8906 0.8916 0.9987
lak106d (1909, 7029) 16.6713 - - - - 0.0268 0.0929 0.7819 0.7819 1.0000 0.0379 0.1915 0.8666 0.8675 0.9988 0.0571 0.7088 0.9330 0.9362 0.9964
hrt001d (3705, 13498) 318.5500 - - - - 0.0586 0.0769 0.8019 0.8109 0.9864 0.0801 0.4049 0.8297 0.8821 0.9332 0.1104 1.5441 0.9596 0.9629 0.9965
orz000d (4057, 15208) 439.4339 - - - - 0.0560 0.1739 0.6183 0.6183 1.0000 0.1362 0.7398 0.9270 0.9275 0.9995 0.1106 2.1894 0.9439 0.9439 1.0000
ca-GrQc (4158, 13428) 27.2310 - - - - 0.0560 0.0823 0.3304 0.3491 0.8604 0.0707 0.5626 0.3424 0.3454 0.9748 0.1090 1.6527 0.3586 0.3605 0.9859
ca-HepTh (8638, 24827) 219.0672 - - - - 0.0961 0.2631 0.4261 0.4339 0.9599 0.1570 1.0497 0.4288 0.4341 0.9723 0.1943 3.2234 0.4352 0.4379 0.9859
wiki-Vote (7066, 100736) 867.2150 - - - - 0.2691 0.2840 0.5799 0.6571 0.8316 0.4012 1.8959 0.6610 0.6735 0.9727 0.6367 5.1408 0.6642 0.6849 0.9566
ca-HepPh (11204, 117649) 805.3969 - - - - 0.3188 0.4995 0.4281 0.4569 0.8717 0.4761 4.9446 0.4385 0.4414 0.9853 0.6435 10.3627 0.4446 0.4466 0.9901
wm04000 (4000, 20109) 139.1687 - - - - 0.1148 0.2276 0.8095 0.9981 0.8107 0.1256 1.2747 0.9815 0.9980 0.9835 0.2004 3.2818 0.9900 0.9980 0.9920
wm08000 (8000, 40014) 628.2476 - - - - 0.1793 0.4662 0.8419 0.9988 0.8427 0.3319 3.6184 0.9895 0.9987 0.9907 0.3983 7.6990 0.9901 0.9989 0.9912
wm10000 (10000, 49980) 1021.6094 - - - - 0.2362 0.9538 0.9012 0.9979 0.9029 0.4013 4.3646 0.9922 0.9979 0.9943 0.5599 11.0225 0.9936 0.9980 0.9956
wm12000 (12000, 59853) 1544.6484 - - - - 0.4505 0.9459 0.8570 0.9986 0.8580 0.5633 6.6372 0.9795 0.9984 0.9811 0.6056 14.4132 0.9935 0.9984 0.9951
Table 5.1: Shows the performance results of FMGCH, GCA, and the GT procedure. FMGCH is shown with different values of κ. The columns
‘prec’, ‘recall’, and ‘jacc’ indicate the precision, recall, and Jaccard score, respectively. ‘GT (s)’, ‘time (s)’ under ‘GCA’, and ‘pre (s)’ and ‘query (s)’
under ‘FMGCH’ represent the running time of GT, the running time of GCA, the precomputation time of FMGCH, and the query time of FMGCH,
respectively, in seconds. The top half of the rows are unweighted graphs and the bottom half are their weighted counterparts. Within each half, the
rows are divided into the categories: DIMACS, movingAI, SNAP, and Waxman.
While FMGCH is applicable to both unweighted and (edge-)weighted graphs, GCA is applicable to only
unweighted graphs [109]. That is, FMGCH already has the advantage of being a more general algorithm
compared to GCA. Hence, we perform two kinds of experiments. First, we compare FMGCH against GCA
on unweighted graphs. Second, we study the performance of FMGCH on weighted graphs. In both cases,
the GT procedure is included as the baseline.
We used four datasets in our experiments from which both unweighted and weighted graphs can be
derived: the DIMACS, movingAI, SNAP, and the Waxman graphs. The DIMACS graphs^8 are a standard
benchmark collection of unweighted graphs. They can be converted to weighted graphs by assigning an
integer weight chosen uniformly at random from the interval [1,10] to each edge. The movingAI graphs
model grid-worlds with obstacles [118]. They can be used as unweighted graphs if the grid-worlds are
assumed to be four-connected (with only horizontal and vertical connections) or as weighted graphs if the
grid-worlds are assumed to be eight-connected (with additional diagonal connections). The SNAP dataset
refers to the Stanford Large Network Dataset Collection [67], some graphs from which were chosen as
undirected unweighted graphs for experimentation in [109]. We use the same graphs for a fair comparison
with GCA. Moreover, these graphs can be converted to weighted graphs by assigning an integer weight
chosen uniformly at random from the interval [1,10] to each edge. Waxman graphs are used to generate
realistic communication networks [125]. Here, we generated Waxman graphs using NetworkX [49] with
parameter values α = 100/|V| and β = 0.1, within a rectangular domain of 100×100, and with the weight
on each edge set to the Euclidean distance between its endpoints. These graphs are naturally weighted but
can be made unweighted by ignoring the weights.
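For reproducibility, the following sketch shows one way to generate such Waxman graphs with NetworkX; the mapping of the stated α and β values to the alpha and beta parameters of networkx.waxman_graph is an assumption on our part, and the helper name is illustrative.

import math
import networkx as nx

def make_waxman(n, seed=None):
    # Random geometric placement in a 100x100 domain with Waxman edge probabilities;
    # alpha = 100/|V| and beta = 0.1 follow the parameter values stated in the text.
    G = nx.waxman_graph(n, beta=0.1, alpha=100.0 / n,
                        domain=(0, 0, 100, 100), seed=seed)
    # Weight each edge by the Euclidean distance between its endpoints.
    for u, v in G.edges():
        G[u][v]["weight"] = math.dist(G.nodes[u]["pos"], G.nodes[v]["pos"])
    return G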
Table 5.1 shows the performance results of FMGCH, GCA, and the GT procedure. FMGCH is shown
with three subdivisions, for κ = 2, 3, and 4. In each case, the table reports the precomputation time and the
query time. The precomputation time is the time required by the FastMap component of FMGCH to generate
^8 https://mat.tepper.cmu.edu/COLOR/instances.html
the κ-dimensional Euclidean embedding. This precomputation is done only once per graph and can serve
the purpose of answering many queries on the same graph with different input S. Only representative results
are shown on selected graphs from each dataset. Queries were formulated on a given graph by randomly
choosing 10 vertices to constitute the input set S. In most cases, FMGCH required ≤ 10 iterations for
convergence.
In the first set of experiments (top half of Table 5.1), it is easy to observe that both FMGCH and GCA are
orders of magnitude faster than GT, with the efficiency gaps being more pronounced on large graphs. In fact,
FMGCH is also faster than GCA on large graphs. On the DIMACS dataset, both FMGCH and GCA produce
very high-quality solutions. On the movingAI dataset, GCA does not produce high-quality solutions. While
the same is true for FMGCH with κ = 2, the quality increases for κ = 3 and increases further for κ = 4.
In fact, FMGCH with κ = 4 produces very high-quality solutions. On the SNAP dataset, GCA produces
low-quality solutions and FMGCH produces better-quality solutions, particularly on recall. However, this
quality deterioration is not related to the larger sizes of the SNAP graphs. In fact, on the Waxman graphs,
which are also large, both FMGCH and GCA produce very high-quality solutions.
In the second set of experiments (bottom half of Table 5.1), GCA is not applicable at all. In contrast,
FMGCH is fully applicable and can be evaluated using GT. Even here, it is easy to observe that FMGCH is
orders of magnitude faster than GT. The qualities of the solutions that it produces follow the same patterns
as in the unweighted case. On some weighted graphs, FMGCH and/or GT may run faster than on their
unweighted counterparts. This happens because the number of shortest paths between a pair of vertices is
usually larger on unweighted graphs and, consequently, computing all of them is more expensive.
5.5 Conclusions
The graph convex hull problem is an important graph-theoretic problem that is analogous to the geometric
convex hull problem. The two problems also share many important analogous properties and real-world
applications. Yet, while the geometric convex hull problem is very well studied, the graph convex hull problem has not received much attention thus far. Moreover, while geometric convex hulls can be computed very
efficiently in low-dimensional Euclidean spaces, folklore results for algorithms that compute graph convex
hulls exactly make them prohibitively expensive on large graphs. In this chapter, we presented a FastMap-based algorithm for efficiently computing approximate graph convex hulls. Our FastMap-based algorithm
utilizes FastMap’s ability to facilitate geometric interpretations. While the naive version of our algorithm
uses a single shot of such a geometric interpretation, the iterative version of our algorithm repeatedly interleaves the graph and geometric interpretations to reinforce one with the other. This iterative version was
encapsulated in our solver, FMGCH, and experimentally compared against the state-of-the-art solver, GCA.
On a variety of graphs, we showed that FMGCH not only runs several orders of magnitude faster than a
highly-optimized exact algorithm but also outperforms GCA, both in terms of generality and the quality of
the solutions produced. It is also faster than GCA on large graphs.
Appendix
5.A Table of Notations
Notation: Description
G = (V,E,w): An undirected edge-weighted graph, where V is the set of vertices, E is the set of edges, and for any edge e ∈ E, w(e) is the non-negative weight on it.
S: A subset of the vertices V.
CH^G_S: The graph convex hull of S in G.
C: The corners of a geometric convex hull.
f(|C|): The function that returns the maximum number of faces of a convex polytope with |C| corners.
κ: The user-specified number of dimensions of the FastMap embedding.
ε: The threshold parameter in FastMap that is used to detect large values of κ that have diminishing returns on the accuracy of approximating the pairwise distances between the vertices.
CH^G_S: The required graph convex hull of S in G or an approximation of it.
CH_S: The corners of the geometric convex hull of a collection of points corresponding to S.
Table 5.A.1: Describes the notations used in Chapter 5.
Chapter 6
FastMapSVM: Combining FastMap and Support Vector Machines
NNs and related DL methods are currently at the leading edge of technologies used for classifying complex objects. However, they generally demand large amounts of data and time for model training; and
their learned models can sometimes be difficult to interpret. In this chapter, we introduce FastMapSVM
as an interpretable lightweight alternative ML framework for classifying complex objects. FastMapSVM
combines the complementary strengths of FastMap and SVMs while extending the applicability of SVMs to
domains with complex objects. It is particularly useful when it is easier to measure the dissimilarity between
pairs of objects in the domain via a well-defined distance function on them than it is to identify and reason
about complex characteristic features of individual objects. As a case study, we demonstrate the success of
FastMapSVM in the Earthquake Science domain, where the objects are seismograms that need to be classified as earthquake signals or noise signals. We show that FastMapSVM outperforms other state-of-the-art
methods for classifying seismograms when training data or time is limited. FastMapSVM also provides an
insightful visualization of seismogram clustering behavior and the earthquake classification boundaries. We
expect it to be viable for classification tasks in many other real-world domains as well.
6.1 Introduction
Various ML and DL methods, such as NNs, are popularly used for classifying complex objects. For example, a Convolutional NN (CNN) is used for classifying Sunyaev-Zel’dovich galaxy clusters [71], a densely
connected CNN is used for classifying images [53], and a deep NN is used for differentiating the chest
X-rays of Covid-19 patients from other cases [27, 104]. However, they generally demand large amounts of
data and time for model training; and their learned models can sometimes be difficult to interpret.
In this chapter, we introduce FastMapSVM—an interpretable ML framework for classifying complex
objects—as a lightweight alternative to NNs for classification tasks in which training data or time is limited
and a suitable distance function can be defined. While most ML algorithms learn to identify characteristic
features of individual objects in a class, FastMapSVM leverages a domain-specific distance function on pairs
of objects. It does this by combining the strengths of FastMap and SVMs. In its first phase, FastMapSVM
invokes FastMap for mapping complex objects to points in a Euclidean space while preserving pairwise
domain-specific distances between them. In its second phase, it invokes SVMs and kernel methods for
learning to classify the points in this Euclidean space.
Our FastMapSVM framework is conceptually similar to the SupFM-SVM method [7], the application
of which to complex objects was anticipated by the original authors. Here, we present the first such application to complex objects by using FastMapSVM to classify seismograms. We compare the performance
characteristics of FastMapSVM against state-of-the-art NN alternatives in the Earthquake Science domain
on a commonly used benchmark dataset. We show that FastMapSVM outperforms these NN alternatives
when training data or time is limited. FastMapSVM also provides an insightful visualization of seismogram
clustering behavior and the earthquake classification boundaries. We further demonstrate that FastMapSVM
can be easily deployed for different real-world classification tasks in the seismogram domain.
Figure 6.1: Illustrates the overall architecture of FastMapSVM. The top half illustrates the training period
of FastMapSVM. During this period, it ingests training instances in the form of complex objects with their
associated labels. It first invokes FastMap with a desired domain-specific distance function for efficiently
mapping the complex objects to points in a Euclidean space; it also stores explicit references to the pivots
during this process. It then invokes and trains SVMs with kernel methods for learning to classify the resulting
points in this Euclidean space. The bottom half illustrates the test period of FastMapSVM. During this
period, it takes a given test object as input. It first uses the same distance function to compute the distances
between the test object and the stored pivots generated in the training period. Using these distances, it maps
the test object to a point in the same Euclidean space generated in the training period. It then invokes the
trained SVMs to classify the point—and, consequently, the test object—in the Euclidean space.
Backed by our results in the Earthquake Science domain, we propound FastMapSVM as a potentially
advantageous alternative to NNs in other real-world domains as well when training data or time is limited
and a suitable distance function can be defined.
6.2 FastMapSVM
In this section, we propose FastMapSVM as a novel supervised ML framework that elegantly combines
the strengths of FastMap and SVMs. FastMapSVM gets as input training instances in the form of complex
objects with their associated labels. During training, it operates in two phases. In the first phase, it invokes
FastMap for efficiently mapping the complex objects to points in a Euclidean space while preserving pairwise domain-specific distances between them. In the second phase, it invokes SVMs and kernel methods for
learning to classify the points in this Euclidean space. Later, for classifying a test object, FastMapSVM once
again operates in two phases. In the first phase, it calls the same distance function to compute the distances
between the given test object and the pivots generated during the training period. These distances allow it
to map the test object to a point in the same Euclidean space in which the training instances reside. In the
second phase, FastMapSVM classifies this point using the trained SVMs. Figure 6.1 illustrates the overall
architecture of FastMapSVM during the training and test periods. Because of its methodology, it has several
advantages over other ML frameworks.
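The following condensed Python sketch illustrates this two-phase pipeline. It assumes only a user-supplied distance function on pairs of objects; the class name, the embedding helper, and all signatures are illustrative and do not correspond to the released implementation.

import numpy as np
from sklearn.svm import SVC

def fastmap_embed(objects, distance, n_dim):
    # Embed objects in R^n_dim, approximately preserving `distance`;
    # returns the coordinates and the chosen pivot index pairs.
    n = len(objects)
    X = np.zeros((n, n_dim))
    pivots = []

    def res2(i, j, k):  # squared residual distance after the first k coordinates
        d2 = distance(objects[i], objects[j]) ** 2 - float(np.sum((X[i, :k] - X[j, :k]) ** 2))
        return max(d2, 0.0)

    for k in range(n_dim):
        a = 0                                           # heuristic far-apart pivot pair
        b = max(range(n), key=lambda j: res2(a, j, k))
        a = max(range(n), key=lambda j: res2(b, j, k))
        d_ab2 = res2(a, b, k)
        if d_ab2 == 0.0:
            break
        pivots.append((a, b))
        for i in range(n):                              # project each object onto the pivot line
            X[i, k] = (res2(a, i, k) + d_ab2 - res2(b, i, k)) / (2.0 * np.sqrt(d_ab2))
    return X, pivots

class FastMapSVMSketch:
    def __init__(self, distance, n_dim=4):
        self.distance, self.n_dim = distance, n_dim
        self.svm = SVC(kernel="rbf")

    def fit(self, objects, labels):
        self.objects = list(objects)
        self.X, self.pivots = fastmap_embed(self.objects, self.distance, self.n_dim)
        self.svm.fit(self.X, labels)                    # phase 2: SVM in the embedding
        return self

    def _embed(self, o):                                # test time: distances to stored pivots only
        z = np.zeros(self.n_dim)
        for k, (a, b) in enumerate(self.pivots):
            da2 = max(self.distance(o, self.objects[a]) ** 2 - float(np.sum((z[:k] - self.X[a, :k]) ** 2)), 0.0)
            db2 = max(self.distance(o, self.objects[b]) ** 2 - float(np.sum((z[:k] - self.X[b, :k]) ** 2)), 0.0)
            d_ab2 = max(self.distance(self.objects[a], self.objects[b]) ** 2 - float(np.sum((self.X[a, :k] - self.X[b, :k]) ** 2)), 0.0)
            z[k] = (da2 + d_ab2 - db2) / (2.0 * np.sqrt(d_ab2)) if d_ab2 > 0 else 0.0
        return z

    def predict(self, test_objects):
        return self.svm.predict(np.array([self._embed(o) for o in test_objects]))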
First, FastMapSVM leverages domain-specific knowledge via a distance function. There are many real-world domains in which feature selection for individual objects is hard. While a domain expert can occasionally identify and incorporate domain-specific features of the objects to be classified, doing so becomes
increasingly hard with increasing complexity of the objects. Therefore, many existing ML algorithms for
classification find it hard to leverage domain knowledge when used off the shelf. However, in many real-world domains with complex objects, a distance function on pairs of objects is well defined and easy to
compute. In such domains, FastMapSVM is more easily applicable than other ML algorithms that focus on
the features of individual objects. FastMapSVM also enables domain experts to incorporate their domain
knowledge via a distance function instead of relying entirely on complex ML models to infer the underlying
structure in the data. As mentioned before, examples of such real-world objects include audio signals, seismograms, DNA sequences, ECGs, and MRIs. While these objects are complex and may have many subtle
features that are hard to recognize, there exists a well-defined distance function on pairs of objects that is
easy to compute. For instance, individual DNA sequences have many complex and subtle features but, as
discussed in Chapter 1, the edit distance [101] between two DNA sequences is well defined and easy to
compute. Similarly, as discussed in Chapter 1, the Minkowski distance [66] is well defined for images and
the cosine similarity [98] is well defined for text documents.
Second, FastMapSVM facilitates interpretability, explainability, and visualization. Many existing ML
algorithms produce results that are hard to interpret or explain. For example, in NNs, a large number of
interactions between neurons with nonlinearities makes a meaningful interpretation or explanation of the
results very hard. In fact, the very complexity of the objects in the domain can hinder interpretability and
explainability. FastMapSVM mitigates these challenges and thereby supports interpretability and explainability. While the objects themselves may be complex, FastMapSVM embeds them in a Euclidean space
by considering only the distance function defined on pairs of objects. In effect, it simplifies the description of the objects by assigning Euclidean coordinates to them. Moreover, since the distance function is
itself user-defined and encapsulates domain knowledge, FastMapSVM naturally facilitates interpretability
and explainability. In fact, it even provides a perspicuous visualization of the objects and the classification boundaries between them. This aids human interpretation of the data and results. It also enables a
human-in-the-loop framework for refining the processes of learning and decision making. As a hallmark,
FastMapSVM produces the visualization very efficiently since it invests only linear time in generating the
Euclidean embedding.
Third, FastMapSVM uses significantly smaller amounts of data and time for model training compared
to other ML algorithms. While NNs and other ML algorithms store abstract representations of the training data in their model parameters, FastMapSVM stores explicit references to the pivots. While making
predictions, objects in the test instances are compared directly to the pivots using the user-defined distance
function. Therefore, FastMapSVM obviates the need to learn a complex transformation of the input data
and thus significantly reduces the amount of data and time required for model training. Moreover, given N
training instances, that is, N objects and their classification labels, FastMapSVM leverages O(N^2) pieces of
information via the distance function that is defined on every pair of objects. In contrast, ML algorithms
that focus on individual objects leverage only O(N) pieces of information.
Fourth, FastMapSVM extends the applicability of SVMs and kernel methods to complex objects. Generally speaking, SVMs are particularly good for classification tasks. When combined with kernel methods,
they recognize and represent complex nonlinear classification boundaries very elegantly [105]. Moreover,
soft-margin SVMs with kernel methods [91] can be used to recognize both outliers and inherent nonlinearities in the data. While the SVM machinery is very effective, it requires the objects in the classification task to
be represented as points in a Euclidean space. Often, it is very difficult to represent complex objects as precise geometric points without introducing inaccuracy or losing domain-specific representational features. In
such cases, deep NNs have gained more popularity compared to SVMs for the reason that it is unwieldy for
SVMs to represent all the features of complex objects in Euclidean space. However, FastMapSVM revives
the SVM approach by leveraging a distance function and creating a low-dimensional Euclidean embedding
of the complex objects.
6.3 Case Study: FastMapSVM for Classifying Seismograms
In this section, we illustrate the advantages of FastMapSVM in the context of classifying seismograms.
This is a particularly illustrative domain because seismograms are complex objects with subtle features
indicating diverse energy sources such as earthquakes, ocean-Earth interactions, atmospheric phenomena,
and human-related activities. There are two fundamental, perennial questions in seismology: (a) Does a
given seismogram record an earthquake? and (b) Which type of wave motion, that is, compressional (P-wave) versus shear strain (S-wave), is predominant in an earthquake seismogram? In Earthquake Science,
answering these questions is referred to as “detecting earthquakes” and “identifying phases”, respectively.
The development of efficient, reliable, and automated solution procedures that can be easily adapted to new
environments is essential for modern research and engineering applications in this field, such as in building
Earthquake Early Warning Systems. Moreover, an ML framework that imposes only modest demands on
the training data aids the analysis of signal classes, such as “icequakes”, stick-slip events at the base of landslides, and nuisance signals recorded during temporary seismometer deployments, for which large training
datasets are unavailable.
Figure 6.2: Illustrates some quantities in Equation 6.1. The longer waveform in orange is aligned with a
shorter waveform in blue to examine the normalized cross-correlation with respect to τ = 0. The quantity ℓ
measures about half the length of the shorter waveform.
Towards this end, we have shown that FastMapSVM is indeed a viable ML framework [126]. Through
experiments, we have also demonstrated that it (a) outperforms state-of-the-art NNs for classifying seismograms when training data or time is limited, (b) can be rapidly deployed for different real-world classification
tasks, and (c) is robust against noisy perturbations to the input signals. However, for the purposes of this
chapter, we avoid a deep dive into the details of the problem domain: Instead, we focus only on a few
exemplary results that illustrate the benefits of FastMapSVM. As mentioned before, in-depth details of the
application of FastMapSVM in the Earthquake Science domain can be found in [126].
6.3.1 Distance Function on Seismograms
The FastMap component of FastMapSVM requires the distance function to be symmetric, yield non-negative
values for all pairs of seismograms, and yield 0 for identical seismograms. In this subsection, we describe
one such appropriate distance function on seismograms that can be used in the Earthquake Science domain.
In Earthquake Science, the normalized cross-correlation operator, denoted here by ⋆, is popularly used to measure similarity between two waveforms. For two 0-mean, single-component seismograms O_i and O_j with lengths n_i and n_j, respectively, and starting with index 0, the normalized cross-correlation is defined with respect to a lag τ as follows:

(O_i ⋆ O_j)[τ] ≜ (1 / (σ_i σ_j)) ∑_{m=0}^{n_i−1} O_i[m] Ô_j[m + ℓ − τ],     (6.1)

in which, without loss of generality, we assume that n_i ≥ n_j. σ_i and σ_j are the standard deviations of O_i and O_j, respectively. Moreover, ℓ and Ô_j are defined as follows:

ℓ ≜ (n_j − n_j (mod 2)) / 2 − (n_i (mod 2)) (1 − n_j (mod 2))     (6.2)

and

Ô_j[m] ≜ O_j[m] if 0 ≤ m < n_j, and 0 otherwise.     (6.3)

The quantity ℓ in Equation 6.2 is defined as a subtraction. The first term is approximately half of n_j. The second term is 0 or 1, depending on whether n_i and n_j are odd or even. Therefore, ℓ measures about half the length of the shorter waveform and ensures at least 50% overlap between the two waveforms for computing the normalized cross-correlation at any τ. Figure 6.2 shows a schematic illustration of some quantities in Equation 6.1.
Equipped with this knowledge, we first define the following distance function that is appropriate for waveforms with a single component:

D(O_i, O_j) ≜ 1 − max_{0≤τ≤n_i−1} (O_i ⋆ O_j)[τ].     (6.4)

Based on this, we define the following distance function that is appropriate for waveforms with L components:

D(O_i, O_j) ≜ 1 − (1/L) max_{0≤τ≤n_i−1} ∑_{l=1}^{L} (O^l_i ⋆ O^l_j)[τ].     (6.5)

Here, each component O^l_i of O_i, or O^l_j of O_j, is a channel representing a 1-dimensional data stream. A channel is associated with a single standalone sensor or a single sensor in a multi-sensor array. L is typically equal to 3, since three-component (3C) seismograms are popularly used in earthquake datasets. The same value of L is also extensively used in matched filters in Earthquake Science [41, 112, 113, 110].
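A direct NumPy sketch of Equations 6.1-6.5 is shown below; it is written for clarity rather than speed (a production version would typically use FFT-based correlation), and the function names are illustrative rather than those of our released code.

import numpy as np

def ncc(oi, oj):
    # Equation 6.1 for all lags 0 <= tau < len(oi), assuming 0-mean traces with len(oi) >= len(oj).
    ni, nj = len(oi), len(oj)
    ell = (nj - nj % 2) // 2 - (ni % 2) * (1 - nj % 2)   # Equation 6.2
    si, sj = oi.std(), oj.std()
    out = np.zeros(ni)
    for tau in range(ni):
        acc = 0.0
        for m in range(ni):
            k = m + ell - tau
            if 0 <= k < nj:                              # Equation 6.3: zero outside [0, nj)
                acc += oi[m] * oj[k]
        out[tau] = acc / (si * sj)
    return out

def seismogram_distance(Oi, Oj):
    # Equation 6.5 for L-channel seismograms given as arrays of shape (L, n).
    L = len(Oi)
    stacked = sum(ncc(Oi[l], Oj[l]) for l in range(L))
    return 1.0 - stacked.max() / L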
We can also use a variety of other distance functions on seismograms. In fact, three other distance
functions, the Wasserstein distance, a maxflow-based distance, and the Minkowski distance, are studied
in [111]. We can even potentially derive new distance functions from the Jensen-Shannon [77] or the
Kullback-Leibler [122] measures of divergence. Furthermore, we can also encapsulate more domain-specific
knowledge in the distance functions, if required.
6.3.2 Earthquake Datasets
We assess the performance of FastMapSVM using seismograms from two datasets. All seismograms used
in this section record ground velocity at a sampling rate of 100 s^-1 and are bandpass filtered between 1 Hz
and 20 Hz before analysis using a 0-phase Butterworth filter with four poles; we refer to this frequency band
as our passband.
6.3.2.1 The Stanford Earthquake Dataset
The Stanford Earthquake Dataset (STEAD) [81] is a benchmark dataset with more than 1.2×10^6 carefully
curated 3C 60 s seismograms for training and testing algorithms in Earthquake Science. We select a balanced subset of 65,536 seismograms from the STEAD comprising 32,768 earthquake seismograms and
32,768 noise seismograms. Earthquake seismograms record ground motions induced by a nearby earthquake; whereas noise seismograms record no known earthquake-related ground motions. We randomly
select earthquake seismograms using a selection probability that is inversely proportional to the kernel density estimate of the 5-dimensional joint distribution over (a) the epicentral distance, (b) the event magnitude,
(c) the event depth, (d) the time interval between the P- and S-wave arrivals, and (e) the signal-to-noise ratio
(SNR). This scheme is designed to yield a broad distribution of seismograms. All earthquake seismograms
are recorded by a seismometer within 100 km of the epicenter, have a hypocentral depth of less than 30 km,
and have P- and S-wave arrival times manually identified by a trained analyst. Distributions in this subset that are skewed towards shallow earthquakes with low magnitude and low SNR reflect the distribution
of natural earthquakes and their recordings. Noise seismograms are randomly selected to maximize the
geographic diversity of recording locations.
From this base dataset of 65,536 seismograms, we draw a simple random sample (SRS) of 16,384
earthquake seismograms and an equal-sized SRS of noise seismograms for model training. The 32,768
remaining seismograms make up the test dataset. The test seismograms are trimmed to 30 s per seismogram,
including an amount of time uniformly distributed between 4 s and 15 s preceding the P-wave arrival for
earthquake seismograms. We note that both the training dataset and the test dataset have balanced numbers
of earthquake seismograms and noise seismograms. We recursively draw SRSs from the training dataset to
create multiple smaller balanced training datasets, each of which is half the size of the set from which it is
drawn. Thus, we have nested balanced training datasets with sample sizes 2^15, 2^14, ..., 2^6.
6.3.2.2 Ridgecrest Dataset
The Ridgecrest dataset comprises data recorded by station CI.CLC of the Southern California Seismic
Network on July 5th, 2019, the first day of the aftershock sequence following the 2019 Ridgecrest, CA,
earthquake pair and on December 5th, 2019, five months after the mainshocks. We use the earthquake
catalog published by the Southern California Earthquake Data Center to extract 512 3C 8 s seismograms,
256 of which record both P- and S-wave phase arrivals from a nearby—that is, with epicentral distance between 4.5 km and 27.6 km—aftershock and the remaining 256 of which record only noise. All 512 of these
seismograms are recorded on July 5th, 2019. Earthquake magnitudes represented in the Ridgecrest dataset
range between 0.5 and 4.0, earthquake depths range between 900 m above sea level and 9.75 km below sea
level, and SNRs range between −8 dB and 73 dB. The maximum peak ground acceleration recorded in the
Ridgecrest dataset is 0.197 m/s^2.
6.3.3 Experimental Results
We now present experimental results on the STEAD and the Ridgecrest dataset. Whereas the analysis on the
STEAD demonstrates FastMapSVM’s performance on a benchmark, the analysis on the Ridgecrest dataset
provides an example of a more realistic use case of FastMapSVM: After handpicking a small number of
earthquake and noise signals—a task that even a novice analyst can perform in a few hours—continually
arriving seismic data can be automatically scanned for additional earthquake signals. Hence, even when
earthquake signals are difficult to discern by the human eye, FastMapSVM can often reliably detect them.
6.3.3.1 Results on the Stanford Earthquake Dataset
The EQTransformer model [80] is a DL model trained for simultaneously detecting earthquakes and identifying phase arrivals. It is arguably the most accurate publicly available model for this pair of tasks. The
authors of EQTransformer report perfect precision and recall scores for detecting earthquakes in 10% of the
STEAD waveforms after training its roughly 3.72×10^5 model parameters with 85% of the STEAD waveforms; 5% of the STEAD waveforms were reserved for model validation.^1 The CRED model [82] is another
DL model trained for detecting earthquakes, which scored perfect precision and 0.96 recall using the same
^1 We note that the authors of EQTransformer used a version of the STEAD with 1×10^6 and 3×10^5 earthquake and noise waveforms, respectively, which differs slightly from the newer version of the STEAD that we use.
Figure 6.3: Shows the performance statistics of FastMapSVM, EQTransformer, and CRED on the STEAD
with varying training data size. Error bars represent the standard deviation of the measurements. (The recall
is identical to the accuracy in our case.)
training and test data as EQTransformer [80]. However, the CRED model does not identify phase arrivals.
We choose these two DL models for comparison because EQTransformer is popularly used [54, 55] and
represents the state of the art in general practice, CRED is designed specifically for detecting earthquakes,
and the pretrained models are readily available through the SeisBench package [128].
To compare the performance of FastMapSVM against EQTransformer and CRED, we train multiple
instances of each model using varying amounts of training data and test them on the same test data: the
32,768 test seismograms selected from the STEAD, as described above. All models are trained and tested
using an NVIDIA RTX A6000 GPU. We train and test each model multiple times for each training data
size to estimate the statistics for the following performance measures: F1 score, accuracy, precision, recall,
training time, and testing time. For FastMapSVM, we set the number of trials to 20 for each training data
size; but, for EQTransformer and CRED, we limit the number of trials to 10 for each training data size
because training them is prohibitively time-consuming for large training data sizes. The training data for
FastMapSVM are trimmed to 30 s per seismogram, including 4 s of data preceding the P-wave arrival for
earthquake seismograms. The FastMapSVM model uses a 4-dimensional Euclidean embedding. We set
the probability threshold for the decision boundary to 0.5. The performance scores are averaged over both
labels. Figure 6.3 shows the statistics of the various performance measures.
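For concreteness, the following is a minimal Python sketch of how these performance measures can be computed from predicted probabilities under the 0.5 decision threshold. The function and variable names are illustrative choices of ours and are not taken from the FastMapSVM implementation; we use scikit-learn's metrics with macro averaging so that the scores are averaged over both labels.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(probabilities, y_true, threshold=0.5):
    # Apply the probability threshold to obtain 0/1 labels ('noise'/'earthquake')
    # and average each score over both labels via macro averaging.
    y_pred = (np.asarray(probabilities) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
    }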
FastMapSVM consistently outperforms EQTransformer and CRED using less training time for all
training data sizes. FastMapSVM training times are 1-3 orders of magnitude smaller than those for EQTransformer and CRED. The respective performances of EQTransformer and CRED approach that of
FastMapSVM as the training data size increases; however, they do so at the cost of rapidly increasing
training times. The testing times of EQTransformer and CRED are 1.33-2.08 and 3.17-4.82 times the testing time of FastMapSVM, respectively. FastMapSVM also exhibits more stable performance, that is, less
variance between trials, than the other two models because the final performance of these models is sensitive
to the initial random values of their model parameters.
The performance of FastMapSVM can be further improved by increasing the dimensionality of the
Euclidean embedding, as demonstrated in [126].
6.3.3.2 Results on the Ridgecrest Dataset
We demonstrate a use case-inspired application of FastMapSVM using a 32-dimensional model trained on
256 earthquake seismograms and 256 noise seismograms from the Ridgecrest dataset. We apply the model
to 8 s windows extracted from a 24 h continuous 3C seismogram with 25% overlap and register detections for
windows with detection probability > 0.95. The test seismogram was recorded by station CI.CLC between
00:00:00 and 23:59:59 (UTC) on December 5th, 2019. We also apply the pretrained EQTransformer [80]
and CRED [82] models on the same data.
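The following is a minimal Python sketch of this automatic scanning procedure. The classifier interface, a predict_proba method that returns the probability that a segment contains an earthquake signal, is a hypothetical placeholder and is not the actual FastMapSVM interface; the window length, overlap, and detection threshold follow the values stated above.

import numpy as np

def scan_continuous_data(waveform, sampling_rate, model,
                         window_s=8.0, overlap=0.25, threshold=0.95):
    # waveform: 3C continuous data with shape (3, n_samples).
    # Slide an 8 s window with 25% overlap and register detections for
    # windows whose detection probability exceeds the threshold.
    window = int(window_s * sampling_rate)
    step = int(window * (1.0 - overlap))
    detections = []
    for start in range(0, waveform.shape[1] - window + 1, step):
        segment = waveform[:, start:start + window]
        probability = model.predict_proba(segment)  # hypothetical interface
        if probability > threshold:
            detections.append((start / sampling_rate, probability))
    return detections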
Figure 6.4: Shows a comparison of the automatic scanning results produced by EQTransformer, CRED, and
FastMapSVM on the Ridgecrest dataset. The panels show the joint distribution of the maximum SNR and
the maximum normalized cross-correlation coefficient for detections registered by EQTransformer, CRED,
and FastMapSVM in an automatic scan of 24 h of data.
For each detection, we compute two quantities: (1) the maximum SNR and (2) the maximum normalized
cross-correlation coefficient measured against the 256 earthquake seismograms used to train FastMapSVM.
We define the SNR as follows:
SNR = 10 log₁₀(P_signal / P_noise). (6.6)
Here, P_signal and P_noise represent the average power of the signal and noise, respectively, which are measured in 1 s and 10 s sliding windows, respectively.
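The following is a minimal Python sketch of Equation (6.6). The placement of the windows, a 10 s noise window immediately preceding a 1 s signal window that begins at sample idx, is an illustrative assumption of ours; the text above specifies only the window lengths.

import numpy as np

def snr_db(trace, idx, sampling_rate, signal_s=1.0, noise_s=10.0):
    # SNR = 10 * log10(P_signal / P_noise), where P_signal and P_noise are the
    # average powers in the signal and noise windows, respectively.
    n_signal = int(signal_s * sampling_rate)
    n_noise = int(noise_s * sampling_rate)
    p_signal = np.mean(trace[idx:idx + n_signal] ** 2)
    p_noise = np.mean(trace[idx - n_noise:idx] ** 2)
    return 10.0 * np.log10(p_signal / p_noise)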
Figure 6.4 shows a comparison of the automatic scanning results produced by EQTransformer, CRED,
and FastMapSVM. CRED registers the largest number of detections (1,831), EQTransformer registers the
fewest (805), and FastMapSVM registers an intermediate number (1,453). Although CRED registers the
largest number of detections, many of them correspond to signals with very low SNR (< 2.5) and low
normalized cross-correlation coefficients (< 0.2). This implies that many of its detections are likely false
positives. Indeed, visual inspection of an SRS of these detections confirms this. FastMapSVM also registers
a significant number of detections with relatively low SNR (< 5); however, these detections are generally
associated with higher normalized cross-correlation coefficients (> 0.2) and marginally higher SNR (> 2.5).
The majority of detections registered by FastMapSVM are associated with low to moderate SNR between 2.5 and 10 and normalized cross-correlation coefficients between 0.2 and 0.4. This is an expected
consequence of the Gutenberg-Richter statistics that describe earthquake magnitude-frequency distributions.
However, perhaps surprisingly, FastMapSVM also registers a greater number of detections with high SNR
(> 10) than both EQTransformer and CRED. Visual inspection confirms that FastMapSVM seldom misses a
high-SNR event detected by EQTransformer or CRED, whereas EQTransformer and CRED do occasionally
miss high-SNR events detected by FastMapSVM.
Results presented in Figure 6.4 suggest that (1) EQTransformer has the lowest detection and false detection rates, (2) CRED has the highest detection and false detection rates, (3) FastMapSVM has relatively
high detection and low false detection rates, and (4) FastMapSVM detects high-SNR events with greater
fidelity than EQTransformer and CRED.
6.4 Conclusions
In this chapter, we introduced FastMapSVM as an interpretable ML framework that combines the complementary strengths of FastMap and SVMs. We posited that it is an advantageous, lightweight alternative
to existing methods, such as NNs, for classifying complex objects when training data or time is limited.
FastMapSVM offers several advantages. First, it enables domain experts to incorporate their domain knowledge using a distance function. This avoids relying entirely on complex ML models to infer the underlying structure in the data. Second, because the distance function encapsulates domain knowledge, FastMapSVM
naturally facilitates interpretability and explainability. In fact, it even provides a perspicuous visualization of
the objects and the classification boundaries between them. Third, FastMapSVM uses significantly smaller
amounts of data and time for model training compared to other ML algorithms. Fourth, it extends the
applicability of SVMs and kernel methods to domains with complex objects.
We demonstrated the efficiency and effectiveness of FastMapSVM in the context of classifying seismograms. On the STEAD, we showed that FastMapSVM performs comparably to state-of-the-art NN models
in terms of the precision, recall/accuracy, and the F1 score. It also uses significantly smaller amounts of
data and time for model training compared to other methods. Moreover, it can also be faster at testing time.
On the Ridgecrest dataset, we demonstrated its ability to reliably detect new “microearthquakes” that are
otherwise difficult to detect even by the human eye.
Chapter 7
FastMapSVM in the Constraint Satisfaction Problem Domain
Recognizing the satisfiability of CSPs is NP-hard. Although several ML approaches have attempted this
task by casting it as a binary classification problem, they have had only limited success for a variety of
challenging reasons. First, the NP-hardness of the task does not make it amenable to straightforward approaches. Second, CSPs come in various forms and sizes while many ML algorithms impose the same form
and size on their training and test instances. Third, the representation of a CSP instance is not unique since
the variables and their domain values are unordered. In this chapter, we demonstrate the success of the
FastMapSVM framework—proposed in the previous chapter—on the task of predicting the satisfiability of
CSP instances. Since FastMapSVM leverages a distance function on pairs of objects in the problem domain,
we define a novel distance function between two CSP instances using maxflow computations. This distance
function is well defined for CSPs of different sizes. It is also invariant to the ordering on the variables and
their domain values. Therefore, our framework has broader applicability compared to other approaches. We
discuss various representational and combinatorial advantages of FastMapSVM. Through experiments, we
also show that it outperforms other state-of-the-art ML approaches.
7.1 Introduction
Constraints constitute a very natural and general means for formulating regularities in the real world. A
fundamental combinatorial structure used for reasoning with constraints is that of the CSP. The CSP formally
models a set of variables, their corresponding domains, and a collection of constraints between subsets of the
variables. Each constraint restricts the set of allowed combinations of values of the participating variables.
A solution of a given CSP instance is an assignment of values to all the variables from their respective
domains such that all the constraints are satisfied. Technologies for efficiently solving CSPs bear immediate
and important implications on how fast we can solve computational problems that arise in several other areas
of research, including Computer Vision, spatial and temporal reasoning, model-based diagnosis, planning
and scheduling, and language understanding.
Unfortunately, solving CSPs is NP-hard since they generalize the Satisfiability (SAT) problem. Although
many technologies have been developed for solving CSPs in practice [25], they do not sufficiently harness
the power of ML techniques. While there have been many attempts to apply ML techniques to CSPs, none of these attempts has yielded spectacular results: They do not consistently produce high-quality outcomes.
Some ML approaches used in the CSP domain include the application of SVMs [4], linear regression [131],
decision tree learning [45, 39], clustering [97, 57], k-nearest neighbors [88], and others [61]. However, most
of these approaches have had limited success for a variety of challenging reasons.
First, from a complexity theory perspective, the NP-hardness of the task does not make it amenable to
straightforward ML approaches. For example, since an NN is essentially a continuous differentiable form of
a circuit, it is not straightforward to make NNs effective in the CSP domain. Second, CSPs come in various
forms and sizes while many ML algorithms use an architectural framework that imposes the same form and
size on their training and test instances. For example, an NN may have a fixed input layer that it uses for
the training and test instances alike. Third, the representation of a CSP instance is not unique since the
variables and their domain values are unordered. This poses a significant combinatorial challenge for ML
algorithms since they have to learn the permutation invariance with respect to orderings on the variables and
their domain values. For example, an NN may pose the overhead of having to be trained on all permutations
of the same CSP instance to become effective.
In this chapter, we consider the problem of predicting the satisfiability of CSP instances using ML. In
ML terminology, this is essentially a binary classification problem defined on CSPs with the two possible
classification labels ‘satisfiable’ and ‘unsatisfiable’. This classification problem is a cornerstone task for
addressing the combinatorics of CSPs. It also serves as a stepping stone for the task of solving CSPs. In
fact, any ML framework expected to be viable for solving CSPs should likely first demonstrate its success
on solving the aforementioned classification problem on CSPs.
We propose to solve the above classification problem on binary CSPs1 using FastMapSVM. In applying
FastMapSVM to the CSP domain, we define a novel distance function between two CSP instances. This
distance function uses maxflow computations and is well defined for CSP instances of different sizes. It
is also invariant to the ordering on the variables and their domain values in the CSP instances. Therefore,
FastMapSVM has broader applicability compared to other ML approaches in the CSP domain. Moreover,
since it encapsulates the intelligence of FastMap, SVMs, kernel methods, and maxflow computations, it
is able to significantly outperform competing ML approaches. It is also able to outperform procedures that
invest polynomial time in establishing local consistency—such as arc-consistency—to discover unsatisfiable
CSP instances. This demonstrates that a trained FastMapSVM model acquires an intelligence beyond that
of polynomial-time procedures.2 We discuss various other representational and combinatorial advantages of
FastMapSVM and, through experiments, we also demonstrate its superior performance.
1Binary CSPs have at most two variables per constraint but are allowed to have non-Boolean variables. Binary CSPs are
representationally as powerful as general CSPs.
2This is an important hallmark of an ML algorithm. In [129], a deep NN model is presented to recognize the satisfiability of
CSP instances with Boolean variables and binary constraints. However, this class of CSP instances is equivalent to 2-SAT and can
be solved in polynomial time, diminishing the advantages of an ML framework over polynomial-time reasoning.
Figure 7.1: Shows the (0,1)-matrix representation of a constraint and a CSP instance. The left panel shows
the (0,1)-matrix representation of a single constraint on X1 and X2. The right panel shows the (0,1)-matrix
representation of an entire binary CSP instance.
7.2 Preliminaries and Definitions
A CSP instance is defined by a triplet ⟨X, D, C⟩, where X = {X1, X2 ... XN} is a set of variables and C = {C1, C2 ... CM} is a set of constraints on subsets of them. Each variable Xi is associated with a discrete-valued domain Di ∈ D, and each constraint Ci is a pair ⟨Si, Ri⟩ defined on a subset of variables Si ⊆ X, called the scope of Ci. Ri ⊆ DSi (DSi = ×_{Xj∈Si} Dj) denotes all compatible tuples of DSi allowed by the constraint. The absence of a constraint on a certain subset of the variables is equivalent to a constraint on the same subset of the variables that allows all combinations of values to them. A solution of a CSP instance is an assignment of values to all the variables from their respective domains such that all the constraints are satisfied. A binary CSP instance has at most two variables per constraint. Binary CSPs are representationally as powerful as general CSPs [24].
A binary CSP is arc-consistent if and only if for all variables Xi and Xj, and for every instantiation of Xi, there exists an instantiation of Xj such that the direct constraint between them is satisfied. Similarly, a binary CSP is path-consistent if and only if for all variables Xi, Xj, and Xk, and for every instantiation of Xi and Xj that satisfies the direct constraint between them, there exists an instantiation of Xk such that the direct constraints between Xi and Xk and between Xj and Xk are also satisfied.
For a given binary CSP instance, we can build a matrix representation for it using a simple mechanism.
First, we assume that the domain values of each variable are ordered in some way. (We can simply use the
order in which the domain values of each variable are specified.) Under such an ordering, we can represent
each binary constraint as a 2-dimensional matrix with all its entries set to either 1 or 0 based on whether the
corresponding combination of values to the participating variables is allowed or not by that constraint. The
left panel of Figure 7.1 shows the (0,1)-matrix representation of a binary constraint between two variables
X1 and X2 with domain sizes of 5 each. The combination of values (X1 ← d12,X2 ← d21) is an allowed
combination, and the corresponding entry in the matrix is therefore set to 1. However, the combination of
values (X1 ← d14,X2 ← d22) is a disallowed combination, and the corresponding entry is therefore set to 0.
In general, dip denotes the p-th domain value of Xi, assuming an index ordering on the domain values of Xi.
The (0,1)-matrix representation of an entire binary CSP instance can be constructed simply by stacking
up the (0,1)-matrix representations of the individual constraints into a bigger “block” matrix. The right panel
of Figure 7.1 illustrates how a binary CSP instance on three variables X1, X2 and X3 can be represented as a
“mega-matrix” with three sets of rows and three sets of columns. Each block-entry inside this mega-matrix is
the (0,1)-matrix representation of the direct constraint between the corresponding row and column variables.
In essence, therefore, the (0,1)-matrix representation of an entire binary CSP instance has ∑_{i=1}^{N} |Di| rows and ∑_{i=1}^{N} |Di| columns.
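The following is a minimal Python sketch of this construction. The function and argument names are illustrative; we also assume, as suggested by the figures, that the diagonal block of a variable with itself is the identity matrix, that is, a domain value co-occurs only with itself.

import numpy as np

def csp_matrix(domain_sizes, constraints):
    # domain_sizes[i] is |D_i| for variable X_i; constraints maps a pair
    # (i, j), with i < j, to a |D_i| x |D_j| 0/1 array. A missing pair means
    # the absence of a direct constraint, which allows all combinations.
    offsets = np.concatenate(([0], np.cumsum(domain_sizes)))
    m = np.ones((offsets[-1], offsets[-1]), dtype=int)
    for i, size in enumerate(domain_sizes):
        # Assumption: the diagonal block is the identity matrix.
        m[offsets[i]:offsets[i + 1], offsets[i]:offsets[i + 1]] = np.eye(size, dtype=int)
    for (i, j), block in constraints.items():
        block = np.asarray(block)
        m[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]] = block
        m[offsets[j]:offsets[j + 1], offsets[i]:offsets[i + 1]] = block.T
    return m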
7.3 Distance Function on Constraint Satisfaction Problems
In this section, we describe a novel distance function on binary CSPs. As explained in Chapter 6, the
FastMap component of FastMapSVM requires the distance function to be symmetric, yield non-negative
values for all pairs of CSP instances, and yield 0 for identical CSP instances. Satisfying these requirements,
our distance function is based on maxflow computations and is illustrated in Figure 7.2. It is well defined
for CSP instances I1 and I2 that may have different sizes. The maxflow computations are utilized in: (a) a single high-level ‘maximum matching of minimum cost’ problem posed on the variables of I1 and I2; and (b) multiple low-level ‘maximum matching of minimum cost’ problems posed on the domain values of pairs of variables, one from each of I1 and I2.
Figure 7.2: Illustrates the distance function between two binary CSP instances. The top panel shows two CSP instances with variables {X1, X2, X3, X4} (left) and {X′1, X′2, X′3} (middle), respectively. A dummy variable X′4 with a singleton domain is added to the CSP instance with fewer variables (right). The bottom panel (left) shows how a ‘maximum matching of minimum cost’ problem is posed on a complete bipartite graph with the variables of the two CSP instances in each partition. The cost annotating the edge between Xi and X′j is itself derived from a ‘maximum matching of minimum cost’ problem posed on the domain values of Xi and X′j. The bottom panel (right) shows this ‘maximum matching of minimum cost’ problem for the variables X1 and X′2. It is posed on a complete bipartite graph with the domain values {d11, d12, d13} and {d′21, d′22, d′23, d′24, d′25, d′26} constituting the two partitions. The cost annotating the edge between d11 and d′21 is the absolute value of the difference between the average compatibility of d11 and that of d′21.
The high-level ‘maximum matching of minimum cost’ problem is posed on a complete bipartite graph,
in which the two partitions of the bipartite graph correspond to the variables of I1 and I2, respectively. If the
number of variables in I1 does not match the number of variables in I2, dummy variables are added to the
CSP instance with fewer variables. The top panel of Figure 7.2 illustrates this for I1 and I2 with variables
{X1, X2, X3, X4} and {X′1, X′2, X′3}, respectively. A dummy variable X′4 is added to I2. The dummy variable
has a single domain value that is designed to be consistent with all domain values of all other variables, since
this does not change the CSP instance.
The distance between I1 and I2 is defined to be the cost of the ‘maximum matching of minimum cost’ on the high-level bipartite graph. This bipartite graph has an edge between every Xi in I1 and every X′j in I2. The cost annotating an edge between Xi and X′j is itself set to be the cost of the ‘maximum matching of minimum cost’ posed at the low level on the domain values of Xi and X′j. The bottom-left panel of Figure 7.2 shows the high-level bipartite graph and highlights an edge between X1 and X′2 for explanation of the low-level ‘maximum matching of minimum cost’.
The low-level ‘maximum matching of minimum cost’ problem posed on the domain values of Xi and X′j also uses a complete bipartite graph. The two partitions consist of the domain values of Xi and X′j, respectively. The cost annotating the edge between dip and d′jq is the absolute value of the difference between the average compatibility of dip and the average compatibility of d′jq. The bottom-right panel of Figure 7.2 shows the low-level ‘maximum matching of minimum cost’ problem posed on the domain values of X1 and X′2. The domains of X1 and X′2 are {d11, d12, d13} and {d′21, d′22, d′23, d′24, d′25, d′26}, respectively. Consider the edge between d11 and d′21. The average compatibility of d11 is the fraction of ‘1’s in the column ‘d11’ in the (0,1)-matrix representation of I1. This fraction is equal to 8/16. The average compatibility of d′21 is the fraction of ‘1’s in the column ‘d′21’ in the (0,1)-matrix representation of I2 after adding the dummy variable X′4. This fraction is equal to 4/11. Therefore, the cost annotating the edge between d11 and d′21 is equal to |8/16 − 4/11|.
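The following is a minimal Python sketch of this two-level distance function, assuming that the average compatibilities of the domain values of each variable have already been computed from the (0,1)-matrix representations. We use scipy.optimize.linear_sum_assignment, which computes a minimum-cost matching of maximum cardinality on a (possibly rectangular) cost matrix. As a simplification of the construction above, the lone domain value of a padded dummy variable is given an average compatibility of 1.0 instead of being recomputed on the padded matrix.

import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_cost(costs):
    # Cost of a 'maximum matching of minimum cost' on a complete bipartite
    # graph specified by its (possibly rectangular) cost matrix.
    rows, cols = linear_sum_assignment(costs)
    return costs[rows, cols].sum()

def csp_distance(compat_1, compat_2):
    # compat_k has one entry per variable of instance I_k: the vector of
    # average compatibilities of that variable's domain values.
    n = max(len(compat_1), len(compat_2))
    compat_1 = list(compat_1) + [np.ones(1)] * (n - len(compat_1))  # dummy variables
    compat_2 = list(compat_2) + [np.ones(1)] * (n - len(compat_2))
    high = np.zeros((n, n))
    for i, ci in enumerate(compat_1):
        for j, cj in enumerate(compat_2):
            # Low-level cost matrix on the domain values of X_i and X'_j.
            low = np.abs(np.asarray(ci)[:, None] - np.asarray(cj)[None, :])
            high[i, j] = matching_cost(low)
    return matching_cost(high)  # high-level matching cost is the distance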
The ‘maximum matching of minimum cost’ problems in the high level and the low level are posed on
bipartite graphs. Since the two partitions of any bipartite graph can be viewed interchangeably without
affecting the ‘maximum matching of minimum cost’, the overall distance function is symmetric. Moreover,
since the cost annotating any edge in the high level or the low level is non-negative, the distance function
always yields non-negative values. Similarly, it is also easy to observe that the distance function yields 0
for two identical CSP instances. These properties satisfy all the conditions on the distance function required
by the FastMap component of FastMapSVM. In addition, the high-level bipartite graph is invariant to the
orderings on the elements within each partition. That is, it is invariant to the orderings on the variables of I1
and I2. For a similar reason, all low-level bipartite graphs are also invariant to the orderings on the domain
values of the participating variables. Therefore, the overall distance function has the additional property of
being invariant to variable-orderings as well as domain value-orderings.
The above properties of the distance function allow us to bypass data augmentation methods typically
required for training other ML models. Data augmentation refers to transforming data without changing
their labels, known as label-preserving transformations. For example, to generate more training data serving
object recognition tasks in Computer Vision applications, an image can be augmented by translating it or
reflecting it horizontally without changing its label [62]. In the context of CSPs, a CSP instance is typically
augmented by changing the ordering on its variables or the ordering on the domain values of individual
variables. However, doing so generates an exponential number of CSP training instances within the same
equivalence class. This drawback of traditional ML algorithms of having to learn equivalence classes is
now intelligently addressed within the framework of FastMapSVM by utilizing a distance function that is
invariant to both variable-orderings and domain value-orderings.
We note that the above distance function could have been defined in many other ways. For example,
we could have introduced dummy domain values in the low-level ‘maximum matching of minimum cost’
problems to equalize the domain sizes of the participating variables. We could have also chosen not to use
dummy variables in the high-level ‘maximum matching of minimum cost’ problem. In addition, we could
have defined the costs annotating the edges of the bipartite graphs using many other characteristics of the
CSP instances. These variations of the distance function are not of fundamental importance to this chapter.
Figure 7.3: Shows the graphical representation of a binary CSP instance obtained from its (0,1)-matrix
representation. The left panel shows the (0,1)-matrix representation of a binary CSP instance. The right
panel shows its graphical representation. The vertices represent domain values and are clustered into four
groups, corresponding to the four variables {X1,X2,X3,X4}.
Instead, in this chapter, we focus on the advantages of the FastMapSVM framework as a whole. The study
of more refined distance functions is delegated to future work.
7.4 Experimental Results
In this section, we describe the comparative performance of FastMapSVM against other state-of-the-art ML
approaches for predicting CSP satisfiability.
7.4.1 Experimental Setup
We evaluate FastMapSVM against three competing approaches. The first is a state-of-the-art deep graph convolutional neural network (DGCNN) [134]. The second is a state-of-the-art graph isomorphism network
(GIN) [130]. Both these networks ingest a CSP instance in the form of a graph, as shown in Figure 7.3. In
the graphical representation of a binary CSP instance, a vertex represents a domain value and is tagged with
the name of the variable that it belongs to. Information in these tags is utilized by DGCNN and GIN. An
edge between two vertices v1 and v2 with tags Xi and Xj
, respectively, represents the compatible combination
(Xi ← v1, Xj ← v2) allowed by the direct constraint between Xi and Xj.3 DGCNN and GIN do not require the CSP training and test instances to be of the same size.
The third is a polynomial-time algorithm based on establishing arc-consistency. This algorithm first establishes arc-consistency and then checks whether any variable’s domain is annihilated. If so, it declares the
CSP instance to be ‘unsatisfiable’. Otherwise, it declares the CSP instance to be ‘satisfiable’. This algorithm
is used in our evaluation to demonstrate that FastMapSVM’s capabilities go beyond those of a polynomial-time algorithm.4 Of course, a polynomial-time algorithm based on establishing path-consistency also could
have been used. But we excluded this algorithm since arc-consistency already provides the required proof
of concept and establishing path-consistency is prohibitively expensive.
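The following is a minimal AC-3-style sketch, in Python, of this arc-consistency baseline. The function and argument names are illustrative, and the representation of the constraints, a set of allowed value-index pairs per constrained pair of variables, is our own choice.

def arc_consistency_satisfiable(domains, constraints):
    # domains: dict i -> set of value indices (pruned in place).
    # constraints: dict (i, j) -> set of allowed (p, q) value-index pairs;
    # a missing pair of variables means all combinations are allowed.
    def allowed(i, j, p, q):
        if (i, j) in constraints:
            return (p, q) in constraints[(i, j)]
        if (j, i) in constraints:
            return (q, p) in constraints[(j, i)]
        return True

    queue = [(i, j) for i in domains for j in domains if i != j]
    while queue:
        i, j = queue.pop()
        unsupported = {p for p in domains[i]
                       if not any(allowed(i, j, p, q) for q in domains[j])}
        if unsupported:
            domains[i] -= unsupported
            if not domains[i]:
                return False  # a domain is annihilated: declared 'unsatisfiable'
            queue.extend((k, i) for k in domains if k not in (i, j))
    return True  # arc-consistent with nonempty domains: declared 'satisfiable'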
We implemented FastMapSVM and arc-consistency in Python3 and ran them on a laptop with an Apple
M2 chip and 16 GB memory. We ran DGCNN and GIN on a Linux system with an Intel(R) Xeon(R)
Silver 4216 CPU at 2.10 GHz. The different platforms are inconsequential to the comparative performances
of these algorithms with respect to effectiveness. For each dataset, we trained DGCNN and GIN for 100
epochs with a learning rate of 0.0001 and a minibatch size of 100 to obtain representative results.
7.4.2 Instance Generation
We generate the binary CSP instances for both training and testing using the Model A method in [114, 129].
We generate a CSP instance by first picking the number of variables N uniformly at random to be an integer
within the range [1,100]. Then, we pick the domain size of each variable independently and uniformly
at random to be an integer within the range [1,10]. We use a probability parameter P1 to independently
determine the existence of a direct constraint between every pair of distinct variables. That is, for every pair
of distinct variables Xi and Xj
, we introduce a direct constraint between them with probability P1. Moreover,
3The graphical representation of a binary CSP is obtained by parsing its (0,1)-matrix representation. Thus, we correctly represent the compatible tuples of domain values between every pair of variables, even if there does not exist a direct constraint between
those variables.
4This is done to avoid the pitfalls of [129], as mentioned before.
we use a probability parameter P2 to determine the compatible tuples of a direct constraint. For a pair of
variables Xi and Xj with a direct constraint between them, each tuple (Xi ← dip,Xj ← djq) is independently
deemed to be compatible with probability 1−P2. We set P1 = 1 and P2 = 0.4 to obtain representative results
for all approaches.
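The following is a minimal Python sketch of this generation procedure; the function and argument names are illustrative, and the output representation (domain sizes plus a dictionary of allowed tuples per constrained pair of variables) matches the arc-consistency sketch above.

import random

def generate_model_a(p1=1.0, p2=0.4, max_vars=100, max_domain=10):
    # Model A: N ~ U[1, max_vars]; each domain size ~ U[1, max_domain]; a direct
    # constraint between each pair of distinct variables with probability P1;
    # each tuple of a direct constraint is compatible with probability 1 - P2.
    n = random.randint(1, max_vars)
    domain_sizes = [random.randint(1, max_domain) for _ in range(n)]
    constraints = {}
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p1:
                constraints[(i, j)] = {
                    (p, q)
                    for p in range(domain_sizes[i])
                    for q in range(domain_sizes[j])
                    if random.random() < 1.0 - p2
                }
    return domain_sizes, constraints

The hidden solution method described in the next paragraph can be layered on top of this procedure by reserving the corresponding tuples as compatible before the remaining tuples are sampled.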
Model A has a tendency to produce mostly unsatisfiable CSP instances with increasing N [114, 129].
Therefore, we use a “hidden solution” method to generate satisfiable CSP instances whenever required.
In this method, a set of hidden solutions of the CSP instance are chosen a priori.5 A hidden solution (X1 ← d1p1, X2 ← d2p2 ... XN ← dNpN) is utilized as follows: While generating the direct constraints using Model A, a direct constraint between variables Xi and Xj reserves the tuple (Xi ← dipi, Xj ← djpj) as being compatible before the other tuples are set using the probability parameter P2. Therefore, (X1 ← d1p1, X2 ← d2p2 ... XN ← dNpN) satisfies all the direct constraints and, consequently, qualifies as a solution. Similarly, multiple hidden solutions can be utilized with the following modification in the generation procedure: A direct constraint between variables Xi and Xj reserves multiple tuples as being compatible. For generating satisfiable CSP instances, we pick the number of hidden solutions uniformly at random to be an integer within the range [1,10]. We pick a hidden solution itself by assigning a domain value chosen independently and uniformly at random for each variable from its domain.
We generate three datasets: Dataset-1, Dataset-2, and Dataset-3. For each dataset, we generate 1,000
training instances and 1,000 test instances. Each training and test set has an equal number of satisfiable and
unsatisfiable instances.
In Dataset-1, we generate the instances using Model A. Since Model A frequently generates unsatisfiable
instances, we use a complete CSP solver to identify and collect such instances. We generate the satisfiable
instances using the hidden solution method, as described above.
5The CSP instance can have other solutions as well.
In Dataset-2, we generate the satisfiable instances as in Dataset-1. However, we design and generate the unsatisfiable instances to be more challenging. We do this by hiding two complementary pseudo-solutions (X1 ← d1p1, X2 ← d2p2 ... XN ← dNpN) and (X1 ← d1q1, X2 ← d2q2 ... XN ← dNqN). We identify a pair of distinct variables Xi and Xj such that dipi ≠ diqi and djpj ≠ djqj. All direct constraints between distinct variables Xs and Xt such that {Xs, Xt} ≠ {Xi, Xj} are generated as before by reserving the tuples (Xs ← dsps, Xt ← dtpt) and (Xs ← dsqs, Xt ← dtqt) as being compatible. However, the direct constraint between Xi and Xj reserves the tuples (Xi ← dipi, Xj ← djqj) and (Xi ← diqi, Xj ← djpj) as being compatible and reserves the tuples (Xi ← dipi, Xj ← djpj) and (Xi ← diqi, Xj ← djqj) as being not compatible. We finally use a complete CSP solver to verify that such a CSP instance is indeed unsatisfiable.6
In Dataset-3, we generate the satisfiable instances as in Dataset-1. However, we design and generate the unsatisfiable instances differently from in Dataset-2. We do this by first hiding two complementary pseudo-solutions (X1 ← d1p1, X2 ← d2p2 ... XN ← dNpN) and (X1 ← d1q1, X2 ← d2q2 ... XN ← dNqN), as in Dataset-2. However, we gather all variables Xr1, Xr2 ... XrM̄ for which the two pseudo-solutions have different assignments of domain values, that is, drmprm ≠ drmqrm, for all 1 ≤ m ≤ M̄. For any two distinct variables Xi and Xj in {Xr1, Xr2 ... XrM̄}, we reserve the tuples (Xi ← dipi, Xj ← djqj) and (Xi ← diqi, Xj ← djpj) as being not compatible. Finally, we pick two distinct variables Xs and Xt from {Xr1, Xr2 ... XrM̄} and reserve the tuples (Xs ← dsps, Xt ← dtqt) and (Xs ← dsqs, Xt ← dtpt) as being compatible and reserve the tuples (Xs ← dsps, Xt ← dtpt) and (Xs ← dsqs, Xt ← dtqt) as being not compatible. As before, we use a complete CSP solver to verify that such a CSP instance is indeed unsatisfiable.
7.4.3 Results
We show three sets of results pertaining to FastMapSVM. First, we show the 2-dimensional and the 3-
dimensional embeddings that FastMapSVM produces to aid visualization. Second, we show the behavior
6This procedure frequently generates unsatisfiable instances, as required. However, satisfiable instances that are generated
occasionally are filtered out by the CSP solver.
of FastMapSVM with respect to the hyperparameter κ, that is, the number of dimensions, and with respect
to the size of the training data. Third, we show the comparative performance of FastMapSVM against
DGCNN, GIN, and arc-consistency.
Figure 7.4 shows a perspicuous visualization of the CSP test instances for all three datasets. This visualization capability is unique to FastMapSVM. We note that while the accuracy, recall, precision, and the F1
score of FastMapSVM typically increase with increasing κ, κ = 2 and κ = 3 are the only two values that
support visualization. Still, in most cases, Figure 7.4 shows a clear separation between the satisfiable and
unsatisfiable instances. Moreover, the separation is clearer in the 3-dimensional embeddings compared to
their 2-dimensional counterparts.
Figure 7.5a shows the behavior of FastMapSVM with respect to the number of dimensions κ on Dataset-1. Its behavior on the other datasets is similar. The performance metrics, that is, the accuracy, recall, precision, and the F1 score, improve with increasing κ. This is intuitively expected since the distances between
the CSP instances can be embedded with lower distortion in higher dimensions. However, Figure 7.5a also
shows that a point of diminishing returns is attained rather quickly at around κ = 8. This shows that κ = 8,
9, or 10 is good enough for the CSP domain. Finally, Figure 7.5a also shows that the improvements in the
performance metrics are significant between κ = 2 and κ = 8.
Figure 7.5b shows the behavior of FastMapSVM with respect to the size of the training data on Dataset-1. Its behavior on the other datasets is similar. The performance metrics improve with increasing size of
the training data. Figure 7.5b also shows that the improvements in the performance metrics are significant
between 128 and 256 training data instances. Further improvements are gradual between 256 and 1,000
training data instances. This shows that FastMapSVM has the capability to achieve good performance from
relatively small amounts of training data and training time.
(a) Dataset-1 CSP test instances embedded in a 2-dimensional Euclidean space by FastMapSVM
(b) Dataset-1 CSP test instances embedded in a 3-dimensional Euclidean space by FastMapSVM
(c) Dataset-2 CSP test instances embedded in a 2-dimensional Euclidean space by FastMapSVM
(d) Dataset-2 CSP test instances embedded in a 3-dimensional Euclidean space by FastMapSVM
(e) Dataset-3 CSP test instances embedded in a 2-dimensional Euclidean space by FastMapSVM
(f) Dataset-3 CSP test instances embedded in a 3-dimensional Euclidean space by FastMapSVM
Figure 7.4: Shows the 2-dimensional and 3-dimensional Euclidean embeddings produced by FastMapSVM for classifying CSP instances. Mostly, there is a clear separation of satisfiable instances (blue) and unsatisfiable instances (red).
(a) influence of the number of dimensions on the performance of FastMapSVM
(b) influence of the size of the training data on the performance of FastMapSVM
Figure 7.5: Shows the behavior of FastMapSVM in the CSP domain with respect to the number of dimensions and with respect to the size of the training data. The performance metrics include the accuracy, recall, precision, and the F1 score.
Dataset Model Accuracy Recall Precision F1
Dataset-1
FastMapSVM 96.7% 97.0% 96.4% 96.7%
AC 99.3% 100.0% 96.7% 98.3%
DGCNN (unlabeled) 94.2% 92.2% 96.0% 94.1%
DGCNN (labeled) 82.3% 82.0% 82.5% 82.2%
GIN (unlabeled) 84.1% 67.0% 99.4% 80.0%
GIN (labeled) 56.4% 52.6% 56.9% 54.7%
Dataset-2
FastMapSVM 82.9% 72.8% 91.2% 81.0%
AC 50.1% 100.0% 50.1% 66.8%
DGCNN (unlabeled) 73.4% 61.4% 80.8% 69.8%
DGCNN (labeled) 53.9% 51.2% 54.1% 52.6%
GIN (unlabeled) 71.9% 49.0% 90.4% 63.6%
GIN (labeled) 54.4% 51.6% 54.7% 53.1%
Dataset-3
FastMapSVM 95.4% 94.4% 96.3% 95.3%
AC 50.0% 100.0% 50.0% 66.7%
DGCNN (unlabeled) 90.3% 86.6% 93.5% 89.9%
DGCNN (labeled) 74.7% 71.8% 76.2% 73.9%
GIN (unlabeled) 78.4% 53.6% 98.5% 69.4%
GIN (labeled) 57.4% 53.8% 58.0% 55.8%
Table 7.1: Shows all performance metrics of all competing methods on all datasets. ‘AC’ represents arc-consistency.
Table 7.1 shows a comparison of all the competing methods on all three datasets with respect to all of
the performance metrics. It uses κ = 8 for FastMapSVM. It also shows two versions of DGCNN and GIN:
the ‘labeled’ version and the ‘unlabeled’ version.
We recollect that in the graphical representation of a binary CSP instance, a vertex represents a domain
value and is tagged with the name of the variable that it belongs to. Information in these tags is available to
be utilized by DGCNN and GIN. The labeled versions of DGCNN and GIN utilize this information while the
unlabeled versions ignore this information. Table 7.1 shows that the unlabeled versions perform better than
their labeled counterparts. While this is a little surprising, it is likely that the unlabeled versions perform
permutation reasoning on the tags (names of variables) much more efficiently.
Table 7.1 shows that FastMapSVM generally outperforms all other competing methods by a significant
margin. Even on a particular dataset where it is not the top performer with respect to a particular performance
metric, it is a close second. Overall, Dataset-1 seems to be the easiest for all methods and Dataset-2 seems
to be the hardest for all methods.
In comparison to arc-consistency, FastMapSVM is significantly better on Dataset-2 and Dataset-3. On
these datasets, arc-consistency declares all test instances as being ‘satisfiable’, leading to a perfect recall
score but very poor accuracy, precision, and F1 scores. On the one hand, this shows that arc-consistency is
ineffective in recognizing unsatisfiable CSP instances. On the other hand, it also shows that CSP instances
generated as in Dataset-1 are insufficient to conclusively evaluate competing ML methods. In contrast,
FastMapSVM performs well on all three datasets.
In comparison to DGCNN and GIN, FastMapSVM is significantly better on all three datasets. On the
accuracy, recall, and F1 scores, FastMapSVM is better than DGCNN, which in turn is better than GIN. GIN
generally has high precision scores but very poor recall scores. This shows that it is poor in identifying
satisfiable instances but is mostly correct when it does so. FastMapSVM does not have this drawback.
Moreover, on the accuracy and F1 scores, FastMapSVM outperforms the closest competitor (DGCNN) by
larger margins with increasing hardness of the CSP instances, that is, in the order of Dataset-1, Dataset-3,
and Dataset-2.
Even on the metric of efficiency, FastMapSVM outperforms DGCNN and GIN.7 For example,
FastMapSVM took only 2,965 s for training and testing on Dataset-1 while DGCNN and GIN took 5,440 s
and 10,865 s, respectively, for the same task.
7.5 Conclusions
In this chapter, we demonstrated the success of FastMapSVM on the task of predicting CSP satisfiability.
FastMapSVM overcomes the hurdles faced by other ML approaches in the CSP domain. It leverages a
distance function on CSPs that is defined via maxflow computations. FastMapSVM is applicable to CSP
training and test instances of different sizes and is invariant to both variable-orderings and domain valueorderings. This allows it to bypass the onus of having to learn equivalence classes of CSP instances and,
7DGCNN and GIN ran on a different platform compared to FastMapSVM. However, the ballpark results are still conclusive.
therefore, requires significantly smaller amounts of data and time for model training compared to other
ML algorithms. FastMapSVM also encapsulates the intelligence of FastMap, SVMs, kernel methods, and
maxflow computations, accounting for its superior empirical performance, even over state-of-the-art graph
neural networks. Moreover, it facilitates a perspicuous visualization of the CSP instances, their distribution,
and the classification boundaries between them. Overall, the FastMapSVM framework for CSPs has broader
applicability and various representational and combinatorial advantages compared to other ML approaches.
Appendix
7.A Table of Notations
Notation Description
⟨X, D, C⟩: A CSP instance, where X = {X1, X2 ... XN} is a set of variables and C = {C1, C2 ... CM} is a set of constraints on subsets of them. Each variable Xi is associated with a discrete-valued domain Di ∈ D.
dip: The p-th domain value of variable Xi, assuming an index ordering on the domain values of Xi.
I: A binary CSP instance.
κ: The user-specified number of dimensions of the FastMap embedding.
Table 7.A.1: Describes the notations used in Chapter 7.
Chapter 8
Conclusions and Future Work
In this chapter, we present our conclusions on the foregoing chapters of this dissertation and the directions
of our future work that are intended to go beyond this dissertation. The ‘Summary’ section reiterates our
contributions and validates the overall hypothesis of this dissertation: The ability of FastMap to efficiently
embed complex objects or graphs in a Euclidean space harnesses many powerful algorithmic techniques
developed in AI, ML, Computational Geometry, Mathematics, Operations Research, and Theoretical Computer Science towards efficiently and effectively solving important combinatorial problems that arise in
various real-world problem domains. The ‘Future Work’ section lists some major directions in which further relevant research can be conducted. The ‘Concluding Remarks’ section presents some final words on
this dissertation.
8.1 Summary
In this section, we summarize our contributions from each of the previous chapters for the benefit of the
reader. These are as follows:
• In Chapter 1, we revisited the popular Data-Mining version of FastMap and a graph version of it:
also conveniently referred to as FastMap. FastMap’s ability to embed the vertices of a graph as
points in a Euclidean space is complemented by the ability of LSH to map any point of interest
in the Euclidean space back on the graph. Hence, we proposed our FastMap+LSH framework as
an important way to harness the intelligence of algorithms that work in Euclidean space towards
solving combinatorial problems on graphs. Such algorithms include clustering algorithms from ML,
algorithms from Computational Geometry, and analytical techniques, among others.
• In Chapter 2, we studied four representative FLPs: the MAM, VKM, WVKM, and the CVKM problems. We used FastMap to reformulate FLPs defined on a graph to FLPs defined in a Euclidean space
without obstacles. Subsequently, we used standard clustering algorithms to solve the problems in the
resulting Euclidean space and LSH to interpret the solutions back on the original graph. We showed
that our FastMap+LSH approach produces high-quality solutions with orders-of-magnitude speedup
over state-of-the-art competing algorithms.
• In Chapter 3, we generalized various measures of centrality on explicit graphs to corresponding measures of projected centrality on implicit graphs. We used our FastMap+LSH framework to compute
the top-K pertinent vertices with the highest projected centrality values. We designed different distance functions to be used by FastMap for different measures of projected centrality and invoked
various procedures for computing analytical solutions in the FastMap embedding. We experimentally demonstrated that our FastMap+LSH framework is both efficient and effective for many popular
measures of centrality and their generalizations to projected centrality. In addition, and unlike other
methods, our FastMap framework is not tied to a specific measure of projected centrality.
• In Chapter 4, we proposed FMBM, a FastMap-based algorithm for block modeling. In the first
phase, FMBM adapts FastMap to embed a given undirected unweighted graph into a Euclidean space
such that the pairwise Euclidean distances between vertices approximate a probabilistically-amplified
graph-based distance function between them. In the second phase, FMBM uses GMM clustering for
identifying clusters (blocks) in the resulting Euclidean space. We showed that FMBM empirically
outperforms other state-of-the-art methods like FactorBlock, Graph-Tool, DANMF, and CPLNS on
many benchmark and synthetic test cases. FMBM also enables a perspicuous visualization of the
blocks in the graphs, not provided by other methods.
• In Chapter 5, we presented a FastMap-based algorithm for efficiently computing approximate graph
convex hulls by utilizing FastMap’s ability to facilitate geometric interpretations of graphs. While
the naive version of our algorithm uses a single shot of such a geometric interpretation, the iterative
version of our algorithm repeatedly interleaves the graph and geometric interpretations to reinforce
one with the other. This iterative version was encapsulated in our solver, FMGCH, and experimentally
compared against the state-of-the-art solver, GCA. On a variety of graphs, we showed that FMGCH
not only runs several orders of magnitude faster than a highly-optimized exact algorithm but also
outperforms GCA, both in terms of generality and the quality of the solutions produced. It is also
faster than GCA on large graphs.
• In Chapter 6, we introduced FastMapSVM as an interpretable ML framework that combines the complementary strengths of FastMap and SVMs. We posited that it is an advantageous, lightweight alternative to existing methods, such as NNs, for classifying complex objects when training data or time
is limited. FastMapSVM offers several advantages. First, it enables domain experts to incorporate
their domain knowledge using a distance function. This avoids relying entirely on complex ML models to infer the underlying structure in the data. Second, because the distance function encapsulates
domain knowledge, FastMapSVM naturally facilitates interpretability and explainability. In fact, it
even provides a perspicuous visualization of the objects and the classification boundaries between
them. Third, FastMapSVM uses significantly smaller amounts of data and time for model training
compared to other ML algorithms. Fourth, it extends the applicability of SVMs and kernel methods
to domains with complex objects. We demonstrated the efficiency and effectiveness of FastMapSVM
in the context of classifying seismograms using significantly smaller amounts of data and time for
model training compared to other methods. We also demonstrated its ability to reliably detect new
microearthquakes that are otherwise difficult to detect even by the human eye.
• In Chapter 7, we demonstrated the success of FastMapSVM on the task of predicting CSP satisfiability
by leveraging a distance function on CSPs that is defined via maxflow computations. In this context,
our FastMapSVM framework encapsulates the intelligence of FastMap, SVMs, kernel methods, and
maxflow computations, accounting for its superior empirical performance, even over state-of-the-art
graph neural networks. Moreover, it is applicable to CSP training and test instances of different sizes
and is invariant to both variable-orderings and domain value-orderings. This allows it to bypass the
onus of having to learn equivalence classes of CSP instances and, therefore, requires significantly
smaller amounts of data and time for model training compared to other ML algorithms.
8.2 Future Work
In this section, we describe some directions of future work. First, we describe a few directions that are
relevant to each of the previous chapters. Second, we describe four broader directions that are relevant to
the dissertation as a whole. The chapter-wise future directions are as follows:
• In future work relevant to Chapter 2, we will use our FastMap framework to solve many other kinds
of FLPs. We will also consider FLPs that arise in the real world. Moreover, we will try to use our
techniques for supply chain optimization in manufacturing and distribution management.
• In future work relevant to Chapter 3, we will apply our FastMap framework to various other measures
of projected centrality not discussed in that chapter. Such measures may include the betweenness
centrality, Katz centrality, and the PageRank centrality.
• In future work relevant to Chapter 4, we will generalize FMBM to work on directed graphs and multiview graphs. We will also apply FMBM and its generalizations to real-world graphs from various
domains, including social and biological networks.
• In future work relevant to Chapter 5, we will enhance FMGCH with other geometric intuitions derived
from the FastMap embedding of the graphs. We will also explore the importance of the graph convex
hull problem in relation to other graph-theoretic problems. This can be done by following the analogy
of the importance of the geometric convex hull problem in Computational Geometry.
• In future work relevant to Chapter 6, we will apply FastMapSVM in Earthquake Science to analyze
and learn from data obtained during temporary deployments of large-N nodal arrays and distributed
acoustic sensing. The efficiency and effectiveness of FastMapSVM also make it suitable for real-time
deployment in dynamic environments in applications such as Earthquake Early Warning Systems. In
Computational Astrophysics, we anticipate the use of FastMapSVM for identifying galaxy clusters
based on cosmological observations.
• In future work relevant to Chapter 7, we will enhance our current distance function on CSPs with
local consistency algorithms. We will also apply FastMapSVM to optimization variants of CSPs and
generally facilitate the integration of constraint reasoning and ML algorithms.
8.2.1 Downsampling Graphs
Downsampling generally refers to the idea of reducing high-resolution spatial and/or temporal data to a
lower resolution that is determined by storage, transmission bandwidth, or other computational restrictions
of the problem domain. As a general concept, it is extensively used in many fields: in the analysis of time
series data for summarization [115], in image processing for size reduction of images or videos [70], and in
acoustic and signal processing for approximation and compression [48], among many others.
(a) input graph with 168 vertices and 546 edges (b) FastMap embedding of the input graph
(c) downsampled FastMap embedding (d) downsampled graph with 68 vertices and 185 edges
Figure 8.1: Illustrates the process of downsampling graphs via FastMap. (a) shows the input graph. (b)
shows the 3-dimensional FastMap embedding of the input graph. (c) shows the downsampled points obtained from (b) by using a point cloud downsampling procedure on it with a voxel size of 2. (d) shows a
graph reconstructed from (a) and (c). End to end, (d) represents a downsampled version of the input graph
from (a). We color each vertex of the downsampled graph using a randomly chosen color from the set red,
orange, green, blue, purple, and black. The same colors are traced back from (d) to (c), from (c) to (b),
and from (b) to (a), for each vertex. The colors are used to aid a visual mapping of various regions on the
input graph to their corresponding regions on the downsampled graph. In (c) and (d), the radius of a dot is
proportional to the number of vertices of the input graph that are mapped to it. In (d), the length of an edge
represents the minimum length over all edges of the input graph that are mapped to it.
Downsampling can also be defined on graphs. Given an undirected edge-weighted graph G = (V, E, w), where w(e) is the non-negative weight on edge e ∈ E, the downsampling task is to create an undirected edge-weighted graph G^d = (V^d, E^d, w^d), where w^d(e^d) is the non-negative weight on edge e^d ∈ E^d, and a vertex mapping function f^d : V → V^d, such that: (a) |V^d| is within a user-specified range ≤ |V|; (b) the edge (v^d_s, v^d_t) ∈ E^d summarizes all edges (vi, vj) ∈ E, where f^d(vi) = v^d_s and f^d(vj) = v^d_t; and (c) G^d retains all the fundamental “characteristics” of G. The characteristics of G that need to be retained in G^d depend on the problem domain and may include graph-theoretic measures such as the diameter of the graph, the pairwise distances between its vertices, its spectral properties, and the behavior of certain kinds of algorithms on the graph, among many others.
Although some work has been done in [40, 84, 87] for downsampling graphs, we propose a FastMap-based framework for doing so as one major direction of our future work. Figure 8.1 illustrates our proposed
framework. The main idea is to first embed the given graph G in a 3-dimensional Euclidean space using
FastMap and then view it as a point cloud for downsampling.
A point cloud is a collection of data points in a 3-dimensional Euclidean space that often represent point
locations on the surface of an object or a landscape obtained by optical scanning or photogrammetry [127].
Downsampling of point clouds while retaining their fundamental shape characteristics is well studied in
Computer Vision [135]. Such procedures require a user-specified parameter referred to as the voxel size:
All points that are within the same voxel are aggregated to a single point in the downsampling procedure.
Such a procedure applied on the FastMap embedding of G would readily yield the downsampled points, that is, the vertices V^d of the downsampled graph G^d. However, the edges E^d of G^d should be constructed in cognizance of the edges E of G. There are many ways to do this. One way is to iterate over each edge (vi, vj) ∈ E and incorporate it into the aggregate edge (f^d(vi), f^d(vj)) ∈ E^d. In doing so, w^d((v^d_s, v^d_t)) can be defined in many ways. For example, w^d((v^d_s, v^d_t)) can be defined to be the sum, average, or minimum over all w(vi, vj) such that f^d(vi) = v^d_s and f^d(vj) = v^d_t.
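The following is a minimal Python sketch of the proposed pipeline, assuming that a 3-dimensional FastMap embedding of the input graph is already available. The function and argument names are illustrative, and a simple voxel-grid aggregation stands in for a full-fledged point cloud downsampling procedure; the edge weights are aggregated with the ‘minimum’ operator here, and the ‘sum’ or ‘average’ operators can be substituted in its place.

import numpy as np

def downsample_graph(vertices, edges, embedding, voxel_size=2.0):
    # embedding: dict vertex -> 3-dimensional coordinates (the FastMap embedding).
    # edges: dict (u, v) -> non-negative weight w((u, v)).
    # f_d maps each vertex of G to its voxel, i.e., to a vertex of G^d.
    f_d = {v: tuple(np.floor(np.asarray(embedding[v]) / voxel_size).astype(int))
           for v in vertices}
    v_d = set(f_d.values())
    e_d = {}
    for (u, v), w in edges.items():
        a, b = f_d[u], f_d[v]
        if a == b:
            continue  # both endpoints collapse to the same downsampled vertex
        key = (a, b) if a <= b else (b, a)
        e_d[key] = min(w, e_d.get(key, float("inf")))  # 'minimum' aggregation
    return v_d, e_d, f_d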
We will use our FastMap-based downsampling algorithm on graphs for various applications. First, if w^d((v^d_s, v^d_t)) is defined using the ‘minimum’ operator, G^d can be used to boost shortest-path computations. In particular, the all-pairs shortest-path distances on G^d can be precomputed and used as admissible and consistent heuristic values to boost A* search at query time. Second, if w^d((v^d_s, v^d_t)) is defined using the ‘average’ operator, G^d can potentially be used to significantly speed up the computation of the centrality values of the vertices of G, albeit with a little distortion. Third, our ability to downsample graphs can also overcome some of the most important limitations of ML algorithms on them. Many existing ML and DL frameworks, including deep NNs, require the training and test graph instances to be of the same size (in terms of the number of vertices). This serious limitation can be overcome by using our downsampling method as a normalization procedure for the training and test graph instances. Fourth, our downsampling method can be used to create an appropriate testbed of smaller graphs on which various algorithms can be evaluated before their deployment on larger graphs. For instance, different network slicing algorithms can first be comparatively evaluated on a small substrate network with 50-60 compute nodes; and the best algorithm can then be chosen for deployment on the real-world substrate network, which may be several orders of magnitude larger.
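As a minimal sketch of the first application above, the precomputed shortest-path distances on G^d can be looked up through f^d to obtain heuristic values for A* search on G. The function names below are illustrative, we use networkx for the shortest-path computation, and, for a single goal vertex, a single-source computation on G^d is shown in place of the all-pairs precomputation.

import networkx as nx

def downsampled_heuristic(v_d, e_d, f_d, goal):
    # Build G^d, compute shortest-path distances from the image of the goal,
    # and return a heuristic function h(v) for A* search on the original graph G.
    g_d = nx.Graph()
    g_d.add_nodes_from(v_d)
    g_d.add_weighted_edges_from((u, v, w) for (u, v), w in e_d.items())
    dist_from_goal = nx.single_source_dijkstra_path_length(g_d, f_d[goal])
    return lambda v: dist_from_goal.get(f_d[v], 0.0)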
8.2.2 FastMap Enhancements
In another major direction of our future work, we will consider various enhancements to the algorithmic
core of FastMap.
One such enhancement has been presented in [75] for the L1 version of FastMap. The L1 version of
FastMap serves as a preprocessing technique to boost A* search for shortest-path computations [21]. In this
context, it differs slightly from Algorithm 2 (from Chapter 1) to ensure the admissibility and the consistency
of the Euclidean distances that approximate the shortest-path distances on the graph. The enhancement to
this L1 version of FastMap comes from the incorporation of differential heuristics in the last iteration [75].
In future work, we will adapt the same enhancement to Algorithm 2 and evaluate it for better accuracy of
the Euclidean embedding. If such an enhancement is indeed beneficial, we will likely be able to improve
our performance metrics on all the problems discussed in the previous chapters.
A second enhancement has been presented in [43] for directed graphs. Since Euclidean distances are
inherently symmetric, Euclidean embeddings cannot be used as such for directed graphs. Hence, FastMap-D [43] generalizes FastMap to directed graphs by using a potential field to capture the asymmetry in the
pairwise distances between their vertices. It uses a self-supervised ML module and learns a potential function, by which it defines the potential field. Our future work will be motivated by FastMap-D: We will
consider generalizing our FastMap-based algorithms presented in the previous chapters to directed graphs.
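To convey the flavor of the potential-field idea, the following decomposition is a sketch and not necessarily the exact formulation used by FastMap-D [43]: any asymmetric distance function splits into a symmetric part, which a Euclidean embedding can approximate, and an antisymmetric part, which a learned potential function can model.

```latex
% A sketch of the decomposition idea behind potential-field embeddings of directed
% graphs; illustrative only, not necessarily the exact formulation of FastMap-D.
\[
d(u, v)
= \underbrace{\tfrac{1}{2}\bigl(d(u, v) + d(v, u)\bigr)}_{\text{symmetric part}}
+ \underbrace{\tfrac{1}{2}\bigl(d(u, v) - d(v, u)\bigr)}_{\text{antisymmetric part}},
\]
\[
\tfrac{1}{2}\bigl(d(u, v) + d(v, u)\bigr) \approx \lVert p_u - p_v \rVert_2,
\qquad
\tfrac{1}{2}\bigl(d(u, v) - d(v, u)\bigr) \approx \phi(v) - \phi(u),
\]
% so that d(u, v) is approximated by the Euclidean distance between the embedded
% points plus a difference of learned potentials, phi(v) - phi(u).
```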
A third possible enhancement is impelled by the task of having to embed graphs in manifolds. It is well
known that many kinds of real-world graphs can be embedded in nonlinear manifolds with better accuracy
than in Euclidean spaces. For example, hyperbolic spaces are more suitable for embedding social networks
compared to Euclidean spaces [123]. In future work, we will generalize FastMap to generate manifold
embeddings of graphs. A promising way to do this is to examine every step of the FastMap algorithm and
accomplish it using only dot products. If this can be done successfully, kernel functions, popularly used with SVMs, can also be used with FastMap to implicitly create the nonlinear transformations required for complex
manifold embeddings.
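To sketch why expressing FastMap in terms of dot products would open the door to kernels, recall the projection formula of the original FastMap algorithm [29] and the expansion of squared Euclidean distances into dot products; substituting a kernel for each dot product is the enhancement we intend to explore, not an established result.

```latex
% The FastMap projection of object o_i onto the line through pivots o_a and o_b [29]:
\[
x_i = \frac{d_{a,i}^2 + d_{a,b}^2 - d_{b,i}^2}{2\, d_{a,b}}.
\]
% If the distances arise from an inner product, every squared distance expands into
% dot products, which is the step a kernel function could replace implicitly:
\[
d(x, y)^2 = \langle x, x \rangle - 2 \langle x, y \rangle + \langle y, y \rangle
\;\;\longrightarrow\;\;
k(x, x) - 2\, k(x, y) + k(y, y).
\]
```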
8.2.3 Mirroring and Solving Computational Geometry Problems on Graphs
Although the field of Computational Geometry is in many ways as old as Geometry itself, it has evolved
rapidly in the past thirty years. These rapid strides have come from advancements in the design and analysis
of algorithms and their interplay with Complexity Theory, Geometry, and other mathematical techniques.
There are many cornerstone problems in Computational Geometry, the studies of which have been motivated
by their frequent occurrence in various real-world problem domains. Hence, Computational Geometry has
many important problem formulations as well as techniques to offer to other areas of research. Indeed,
such problems and techniques have already found plenty of relevance in AI and ML [74], Robotics [50],
Graphics [106], Computer-Aided Design [32], and Statistics [103], among others.
In the third major direction of our future work, we will mirror and solve Computational Geometry
problems on graphs. First, we will formulate the cornerstone problems from Computational Geometry in
graph-theoretic terms. A good example of this is the graph convex hull problem that was already discussed
in Chapter 5. We will consider the graph-theoretic counterparts of other Computational Geometry problems
such as the generation of Voronoi diagrams, coresets¹, triangulations, meshes, and other tessellations. Second, we will study their relevance in real-world problem domains across different areas of research. For
example, graph Voronoi diagrams may have applications in identifying the closest points of interest on a
traffic network or the closest influencers on a social network; and graph coresets may be intimately related
to downsampling graphs. Third, we will study the complexity of solving them directly on the input graph
and develop baseline methods for them. Fourth, we will develop FastMap-based algorithms for solving them
and compare our algorithms to the baseline methods and other state-of-the-art procedures. Finally, we will
develop various connections between the graph counterparts of these problems by following the analogy of
how they relate to each other in Computational Geometry.
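As an example of the kind of baseline we will develop, the following Python sketch computes graph Voronoi cells directly on the input graph using exact shortest-path distances; our FastMap-based algorithms would instead approximate these distances in the embedding.

```python
import networkx as nx

def graph_voronoi_cells(G, sites, weight="weight"):
    """Partition the vertices of G into Voronoi cells: each vertex is assigned to
    the site (a designated vertex) closest to it under shortest-path distance.
    A minimal baseline sketch; ties are broken arbitrarily by iteration order."""
    best_dist = {}
    best_site = {}
    for s in sites:
        dist = nx.single_source_dijkstra_path_length(G, s, weight=weight)
        for v, d in dist.items():
            if v not in best_dist or d < best_dist[v]:
                best_dist[v] = d
                best_site[v] = s
    cells = {s: set() for s in sites}
    for v, s in best_site.items():
        cells[s].add(v)
    return cells
```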
8.2.4 FastMapSVM: Further Applications and Enhancements
FastMapSVM is nascent: It allows for the use of very sophisticated distance functions and offers fertile ground for general applicability. Hence, in the fourth major direction of our future work, we will not only
apply FastMapSVM in many other problem domains but will also enhance it to be able to produce complex
outputs when required.
¹ In Computational Geometry, a coreset is a smaller set of points that approximates the shape of a larger set of points.
Figure 8.2: Illustrates a possible FastMapSVM enhancement that not only maps the input space to a Euclidean space but also maps the output space to a Euclidean space. The complex objects in the input space are mapped to a Euclidean space R^κ using the distance function D(·,·). Similarly, the complex objects in the output space are mapped to a Euclidean space R^κ′ using the distance function D′(·,·). The overall framework reduces the task of learning the mapping from the complex objects in the input space to the complex objects in the output space to the simpler task of learning a function that maps R^κ to R^κ′. For a test instance, the point in R^κ′ identified by the learned function can be transformed to a complex object in the required output space via a local search procedure that is guided by D′(·,·).
In the field of Optimization, we will apply FastMapSVM on Weighted CSPs since they can represent a
wide range of optimization problems [63]. A FastMapSVM-based classifier on Weighted CSPs can yield an
algorithm selector: a meta-level procedure that chooses, from a portfolio of algorithms, the one best suited to solve a given problem instance. Similarly, a FastMapSVM-based
regressor on Weighted CSPs can yield guidance for branch-and-bound search. In the more fundamental
sciences, such as in Biochemistry, FastMapSVM can play a key role in facilitating ML techniques for tasks
such as structure prediction and automated drug discovery. In such problem domains, a primary reason for
its potential advantage over other methods is its ability to utilize “chemical” distance functions that have
been developed by experts in those sciences [44].
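The following Python sketch outlines the algorithm-selector idea under stated assumptions: a domain-specific distance function D between Weighted CSP instances is available, and the compact fastmap_embed routine below is a simplified stand-in for a full FastMap implementation over arbitrary objects.

```python
import numpy as np
from sklearn.svm import SVC

def fastmap_embed(objects, D, kappa):
    """Embed arbitrary objects into R^kappa with FastMap, given a distance function D.
    A compact sketch of the pivot-and-project procedure; the residual distance after
    r iterations is d_r(x, y)^2 = D(x, y)^2 minus the squared coordinate differences."""
    n = len(objects)
    X = np.zeros((n, kappa))

    def dist(i, j, r):
        d2 = D(objects[i], objects[j]) ** 2 - np.sum((X[i, :r] - X[j, :r]) ** 2)
        return np.sqrt(max(d2, 0.0))

    for r in range(kappa):
        # Heuristically pick two far-apart pivots for this iteration.
        a = 0
        b = max(range(n), key=lambda j: dist(a, j, r))
        a = max(range(n), key=lambda j: dist(b, j, r))
        d_ab = dist(a, b, r)
        if d_ab < 1e-12:
            break
        for i in range(n):
            X[i, r] = (dist(a, i, r) ** 2 + d_ab ** 2 - dist(b, i, r) ** 2) / (2.0 * d_ab)
    return X

# Selector wiring (a sketch): embed Weighted CSP instances with a domain-specific
# distance function D, then train an SVM to predict the best portfolio solver.
# X = fastmap_embed(train_instances, D, kappa=8)
# selector = SVC(kernel="rbf").fit(X, best_solver_labels)
```

For a test instance, the same pivots would be reused to compute its coordinates before calling the trained selector; that projection step is omitted here for brevity.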
While classifiers output discrete classification labels and regressors output real-valued numbers, there are
many problem domains in which the output is also required to be a complex object. A good example of such
a domain is Multi-Agent Path Finding (MAPF) [117]. In the MAPF problem, we are given a team of agents
and an undirected graph that models their shared environment. Each agent has to move from a distinct start
vertex to a distinct goal vertex while avoiding collisions with the other agents. Solving the MAPF problem
optimally for minimum cost or minimum makespan is NP-hard [72]. In this problem domain, the input space
has complex objects in the form of graphs and the concomitant agents’ start and goal vertices. Moreover,
the output space also has complex objects in the form of entire plans for the coordinating team of agents.
In the previous chapters, while we used FastMapSVM to simplify the representation of the complex
objects in the input space, the output space was restricted to be a set of classification labels. However, in
future work, we will enhance FastMapSVM on the output side as well to make it effective in MAPF and other
domains where the output space also has complex objects. That is, we will not only use FastMap to simplify
the representation of the complex objects in the input space but will also use it to simplify the representation
of the complex objects in the output space. While the input space can be simplified to the Euclidean space R^κ using a distance function D(·,·), the output space can be simplified to a different Euclidean space R^κ′ using a different distance function D′(·,·). Figure 8.2 illustrates this possible enhancement.
The above generalization of FastMapSVM has an important potential benefit: It reduces the task of learning the mapping from the complex objects in the input space to the complex objects in the output space to the simpler task of learning a function that maps R^κ to R^κ′. Hence, it may require significantly smaller amounts of data and time for model training compared to other ML and DL methods. For a test instance in this framework, the learned function identifies a point in R^κ′ that still needs to be transformed to a complex object in the required output space. Towards this end, we can use a local search procedure that explores the subspace of the output space that maps to points close to the identified point in R^κ′. Local search operators in the output space can be used to morph a complex object to “nearby” but valid complex objects in the same space, with neighborhoods defined using the distance function D′(·,·). This FastMapSVM framework can also potentially serve as a “generative” AI framework because of its ability to produce complex objects.
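The following Python sketch outlines one possible realization of this two-sided enhancement, under the assumption that the fastmap_embed routine from the earlier sketch is available; projecting new test inputs against the stored pivots, as well as the local search in the output space, are omitted for brevity.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

def train_two_sided_model(train_inputs, train_outputs, D_in, D_out, kappa, kappa_out):
    """Sketch of the proposed two-sided enhancement: embed inputs with D_in into
    R^kappa, embed outputs with D_out into R^kappa_out, and learn a simple
    vector-to-vector mapping between the two Euclidean spaces."""
    X = fastmap_embed(train_inputs, D_in, kappa)        # from the earlier sketch
    Y = fastmap_embed(train_outputs, D_out, kappa_out)  # from the earlier sketch
    model = MultiOutputRegressor(SVR()).fit(X, Y)
    return model, Y

def decode_output(y_pred, Y_train, train_outputs):
    """Crude decoding step: return the training output whose embedding is closest to
    the predicted point in R^kappa_out. A local search guided by D_out could then
    morph this object toward the predicted point (not shown)."""
    idx = int(np.argmin(np.linalg.norm(Y_train - y_pred, axis=1)))
    return train_outputs[idx]
```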
8.3 Concluding Remarks
In this dissertation, we showed several advantages of FastMap and successfully leveraged them in many real-world applications. However, the big question of whether FastMap is the holy grail for solving large-scale combinatorial problems on graphs, or on complex objects in other domains, remains open.
We hope that future research will enable us to conclusively answer this big question. At the same time, it is
also evident that FastMap’s ability to efficiently generate simplified representations of the vertices of a graph,
or complex objects in other domains, enables many powerful downstream algorithms developed in diverse
research communities such as AI, ML, Computational Geometry, Mathematics, Operations Research, and
Theoretical Computer Science. Hence, we envision that FastMap can facilitate and harness the confluence
of these algorithms and find future applications in many other problem domains that are not necessarily
discussed here.
Bibliography
[1] Emmanuel Abbe. “Community detection and stochastic block models: Recent developments”. In:
Journal of Machine Learning Research (2017).
[2] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. “Practical
and optimal LSH for angular distance”. In: Advances in Neural Information Processing Systems
(2015).
[3] Chris G. Antonopoulos. “Dynamic range in the C. elegans brain network”. In: Chaos: An
Interdisciplinary Journal of Nonlinear Science (2016).
[4] Alejandro Arbelaez, Youssef Hamadi, and Michele Sebag. “Continuous search in constraint
programming”. In: Proceedings of the 22nd IEEE International Conference on Tools with Artificial
Intelligence. 2010.
[5] Dor Atzmon, Ariel Felner, Jiaoyang Li, Shahaf Shperberg, Nathan Sturtevant, and Sven Koenig.
“Conflict-tolerant and conflict-free multi-agent meeting”. In: Artificial Intelligence (2023).
[6] Franz Aurenhammer. “Voronoi diagrams–A survey of a fundamental geometric data structure”. In:
ACM Computing Surveys (1991).
[7] Tao Ban, Youki Kadobayashi, and Shigeo Abe. “Sparse kernel feature analysis using FastMap and
its variants”. In: Proceedings of the International Joint Conference on Neural Networks. 2009.
[8] C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. “The quickhull algorithm for convex
hulls”. In: ACM Transactions on Mathematical Software (1996).
[9] Elisabetta Bergamini, Michele Borassi, Pierluigi Crescenzi, Andrea Marino, and
Henning Meyerhenke. “Computing top-k closeness centrality faster in unweighted graphs”. In:
ACM Transactions on Knowledge Discovery from Data (2019).
[10] Elisabetta Bergamini and Henning Meyerhenke. “Fully-dynamic approximation of betweenness
centrality”. In: Proceedings of Algorithms-ESA 2015: The 23rd Annual European Symposium.
2015.
[11] Elisabetta Bergamini, Henning Meyerhenke, and Christian L. Staudt. “Approximating betweenness
centrality in large evolving networks”. In: Proceedings of the 17th Workshop on Algorithm
Engineering and Experiments. 2015.
[12] Paolo Boldi and Sebastiano Vigna. “Axioms for centrality”. In: Internet Mathematics (2014).
[13] Phillip Bonacich. “Power and centrality: A family of measures”. In: American Journal of Sociology
(1987).
[14] Francesco Bonchi, Gianmarco De Francisci Morales, and Matteo Riondato. “Centrality measures
on big graphs: Exact, approximated, and distributed algorithms”. In: Proceedings of the 25th
International Conference Companion on World Wide Web. 2016.
[15] Paul S. Bradley, Kristin P. Bennett, and Ayhan Demiriz. Constrained k-means clustering. Tech. rep.
Microsoft Research, Redmond, 2000.
[16] Ulrik Brandes. “A faster algorithm for betweenness centrality”. In: Journal of Mathematical
Sociology (2001).
[17] Ulrik Brandes. “On variants of shortest-path betweenness centrality and their generic
computation”. In: Social Networks (2008).
[18] Ulrik Brandes and Daniel Fleischer. “Centrality measures based on current flow”. In: Proceedings
of the 22nd Annual Symposium on Theoretical Aspects of Computer Science. 2005.
[19] Jeffrey Chan, Wei Liu, Andrey Kan, Christopher Leckie, James Bailey, and
Kotagiri Ramamohanarao. “Discovering latent blockmodels in sparse and noisy graphs using
non-negative matrix factorisation”. In: Proceedings of the 22nd ACM International Conference on
Information & Knowledge Management. 2013.
[20] Edith Cohen, Daniel Delling, Thomas Pajor, and Renato F. Werneck. “Computing classic closeness
centrality, at scale”. In: Proceedings of the 2nd ACM Conference on Online Social Networks. 2014.
[21] Liron Cohen, Tansel Uras, Shiva Jahangiri, Aliyah Arunasalam, Sven Koenig, and
T. K. Satish Kumar. “The FastMap algorithm for shortest path computations”. In: Proceedings of
the 27th International Joint Conference on Artificial Intelligence. 2018.
[22] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. “Locality-sensitive hashing
scheme based on p-stable distributions”. In: Proceedings of the 20th Annual Symposium on
Computational Geometry. 2004.
[23] Tim Davis. USAir97. 2014. URL:
https://www.cise.ufl.edu/research/sparse/matrices/Pajek/USAir97.
[24] Rina Dechter. “Constraint networks”. In: Artificial Intelligence (1992).
[25] Rina Dechter. Constraint processing. Morgan Kaufmann, 2003.
[26] Roland Durier and Christian Michelot. “Geometrical properties of the Fermat-Weber problem”. In:
European Journal of Operational Research (1985).
[27] Mohamed Elgendi, Muhammad Umer Nasir, Qunfeng Tang, Richard Ribon Fletcher,
Newton Howard, Carlo Menon, Rabab Ward, William Parker, and Savvas Nicolaou. “The
performance of deep neural networks in differentiating chest X-rays of COVID-19 patients from
other bacterial and viral pneumonias”. In: Frontiers in Medicine (2020).
[28] David Eppstein and Joseph Wang. “Fast approximation of centrality”. In: Graph Algorithms and
Applications (2006).
[29] Christos Faloutsos and King-Ip Lin. “FastMap: A fast algorithm for indexing, data-mining and
visualization of traditional and multimedia datasets”. In: Proceedings of the 1995 ACM SIGMOD
International Conference on Management of Data. 1995.
[30] Reza Zanjirani Farahani, Samira Fallah, Rubén Ruiz, Sara Hosseini, and Nasrin Asgari. “OR
models in urban service facility location: A critical review of applications and future
developments”. In: European Journal of Operational Research (2019).
[31] Reza Zanjirani Farahani and Masoud Hekmatfar. Facility location: Concepts, models, algorithms
and case studies. Springer Science & Business Media, 2009.
[32] T. H. Fay. Computational geometry and computer-aided design. National Aeronautics, Space
Administration, Scientific, and Technical Information Branch, 1985.
[33] Sándor P. Fekete, Joseph S. B. Mitchell, and Karin Beurer. “On the continuous Fermat-Weber
problem”. In: Operations Research (2005).
[34] Robert W. Floyd. “Algorithm 97: Shortest path”. In: Communications of the ACM (1962).
[35] Michael L. Fredman and Robert Endre Tarjan. “Fibonacci heaps and their uses in improved
network optimization algorithms”. In: Journal of the ACM (1987).
[36] Linton C. Freeman. “A set of measures of centrality based on betweenness”. In: Sociometry (1977).
[37] Linton C. Freeman. “Centrality in social networks conceptual clarification”. In: Social Networks
(1979).
[38] Robert Geisberger, Peter Sanders, and Dominik Schultes. “Better approximation of betweenness
centrality”. In: Proceedings of the 10th Workshop on Algorithm Engineering and Experiments.
2008.
[39] Ian Philip Gent, Christopher Anthony Jefferson, Lars Kotthoff, Ian James Miguel,
Neil Charles Armour Moore, Peter Nightingale, and Karen Petrie. “Learning when to use lazy
learning in constraint solving”. In: Proceedings of the 19th European Conference on Artificial
Intelligence. 2010.
[40] David Gfeller and Paolo De Los Rios. “Spectral coarse graining of complex networks”. In:
Physical Review Letters (2007).
[41] Steven J. Gibbons and Frode Ringdal. “The detection of low magnitude seismic events using
array-based waveform correlation”. In: Geophysical Journal International (2006).
[42] Michelle Girvan and Mark E. J. Newman. “Community structure in social and biological
networks”. In: National Academy of Sciences (2002).
[43] Sriram Gopalakrishnan, Liron Cohen, Sven Koenig, and T. K. Satish Kumar. “Embedding directed
graphs in potential fields using FastMap-D”. In: Proceedings of the 13th International Symposium
on Combinatorial Search. 2020.
[44] A. V. Grigoryan, Irina Kufareva, Maxim Totrov, and R. A. Abagyan. “Spatial chemical distance
based on atomic property fields”. In: Journal of Computer-Aided Molecular Design (2010).
[45] Alessio Guerri and Michela Milano. “Learning techniques for automatic algorithm portfolio
selection”. In: Proceedings of the 16th European Conference on Artificial Intelligence. 2004.
[46] Lin Guo-Hui and Guoliang Xue. “K-center and K-median problems in graded distances”. In:
Theoretical Computer Science (1998).
[47] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual. 2023. URL:
https://www.gurobi.com.
[48] Igor Guskov, Wim Sweldens, and Peter Schröder. “Multiresolution signal processing for meshes”.
In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques.
1999.
[49] Aric Hagberg, Pieter J. Swart, and Daniel A. Schult. Exploring network structure, dynamics, and
function using NetworkX. Tech. rep. Los Alamos National Lab, Los Alamos, NM (United States),
2008.
[50] Dan Halperin, Lydia E. Kavraki, and Kiril Solovey. “Robotics”. In: Handbook of Discrete and
Computational Geometry. CRC Press, 2017.
[51] Daniel Damir Harabor, Alban Grastien, Dindar Öz, and Vural Aksakalli. “Optimal any-angle
pathfinding in practice”. In: Journal of Artificial Intelligence Research (2016).
[52] Bernard Harris. “Mathematical models for statistical decision theory”. In: Optimizing Methods in
Statistics. Elsevier, 1971.
[53] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. “Densely connected
convolutional networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2017.
[54] Ce Jiang, Lihua Fang, Liping Fan, and Boren Li. “Comparison of the earthquake detection abilities
of PhaseNet and EQTransformer with the Yangbi and Maduo earthquakes”. In: Earthquake Science
(2021).
[55] Chengxin Jiang, Ping Zhang, Malcolm C. A. White, Robert Pickle, and Meghan S. Miller. “A
detailed earthquake catalog for Banda Arc–Australian plate collision zone using machine-learning
phase picker and an automated workflow”. In: The Seismic Record (2022).
[56] Charles R. Johnson. “Normality and the numerical range”. In: Linear Algebra and its Applications
(1976).
[57] Serdar Kadioglu, Yuri Malitsky, Meinolf Sellmann, and Kevin Tierney. “ISAC–Instance-specific
algorithm configuration”. In: Proceedings of the 19th European Conference on Artificial
Intelligence. 2010.
[58] Naoki Katoh. “Bicriteria network optimization problems”. In: IEICE Transactions on
Fundamentals of Electronics, Communications and Computer Sciences (1992).
[59] Leo Katz. “A new status index derived from sociometric analysis”. In: Psychometrika (1953).
[60] L. Kaufman and Peter J. Rousseeuw. “Clustering by means of medoids”. In: Proceedings of the
Statistical Data Analysis Based on the L1 Norm Conference, Neuchatel, Switzerland. 1987.
[61] Lars Kotthoff. “Algorithm selection for combinatorial search problems: A survey”. In: Data
Mining and Constraint Programming: Foundations of a Cross-Disciplinary Approach (2016).
[62] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep
convolutional neural networks”. In: Communications of the ACM (2017).
[63] T. K. Satish Kumar. “A framework for hybrid tractability results in Boolean weighted constraint
satisfaction problems”. In: Proceedings of the 14th International Conference on Principles and
Practice of Constraint Programming. 2008.
[64] Gregory F. Lawler. Random walk and the heat equation. American Mathematical Soc., 2010.
[65] Juyong Lee, Steven P. Gross, and Jooyoung Lee. “Improved network community structure
improves function prediction”. In: Scientific Reports (2013).
[66] David Legland, Kiên Kiêu, and Marie-Françoise Devaux. “Computation of Minkowski measures
on 2D and 3D binary images”. In: Image Analysis & Stereology (2007).
[67] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection.
2014. URL: http://snap.stanford.edu/data.
[68] Jiaoyang Li, Ariel Felner, Sven Koenig, and T. K. Satish Kumar. “Using FastMap to solve graph
problems in a Euclidean space”. In: Proceedings of the 29th International Conference on
Automated Planning and Scheduling. 2019.
[69] Shuyang Lin, Qingbo Hu, Guan Wang, and Philip S. Yu. “Understanding community effects on
information diffusion”. In: Proceedings of the 19th Pacific-Asia Conference on Knowledge
Discovery and Data Mining. 2015.
[70] Weisi Lin and Li Dong. “Adaptive downsampling to improve image compression at low bit rates”.
In: IEEE Transactions on Image Processing (2006).
[71] Zhen Lin, Nicholas Huang, Camille Avestruz, W. L. Kimmy Wu, Shubhendu Trivedi,
João Caldeira, and Brian Nord. “DeepSZ: Identification of Sunyaev–Zel’dovich galaxy clusters
using deep learning”. In: Monthly Notices of the Royal Astronomical Society (2021).
[72] Hang Ma, Craig Tovey, Guni Sharon, T. K. Satish Kumar, and Sven Koenig. “Multi-agent path
finding with payload transfers and the package-exchange robot-routing problem”. In: Proceedings
of the 30th AAAI Conference on Artificial Intelligence. 2016.
[73] Martin Maechler. “Cluster: Cluster analysis basics and extensions”. In: R Package Version 2.0.7–1
(2018).
[74] Rafael Magdalena-Benedicto, Sonia Pérez-Díaz, and Adrià Costa-Roig. “Challenges and
opportunities in machine learning for geometry”. In: Mathematics (2023).
[75] Reza Mashayekhi, Dor Atzmon, and Nathan Sturtevant. “Analyzing and improving the use of the
FastMap embedding in pathfinding tasks”. In: Proceedings of the 37th AAAI Conference on
Artificial Intelligence. 2023.
[76] Alex Mattenet, Ian Davidson, Siegfried Nijssen, and Pierre Schaus. “Generic constraint-based
block modeling using constraint programming”. In: Journal of Artificial Intelligence Research
(2021).
[77] M. L. Menéndez, J. A. Pardo, L. Pardo, and M. C. Pardo. “The Jensen-Shannon divergence”. In:
Journal of the Franklin Institute (1997).
[78] Nenad Mladenović, Martine Labbé, and Pierre Hansen. “Solving the p-Center problem with Tabu Search and Variable Neighborhood Search”. In: Networks: An International Journal (2003).
[79] Peter R. Monge and Noshir S. Contractor. Theories of communication networks. Oxford University
Press, USA, 2003.
[80] S. Mostafa Mousavi, William L. Ellsworth, Weiqiang Zhu, Lindsay Y. Chuang, and
Gregory C. Beroza. “Earthquake transformer–An attentive deep-learning model for simultaneous
earthquake detection and phase picking”. In: Nature Communications (2020).
[81] S. Mostafa Mousavi, Yixiao Sheng, Weiqiang Zhu, and Gregory C. Beroza. “STanford EArthquake
dataset (STEAD): A global data set of seismic signals for AI”. In: IEEE Access (2019).
[82] S. Mostafa Mousavi, Weiqiang Zhu, Yixiao Sheng, and Gregory C. Beroza. “CRED: A deep
residual network of convolutional and recurrent units for earthquake signal detection”. In: Scientific
Reports (2019).
[83] Kevin P. Murphy. Machine learning: A probabilistic perspective. The MIT Press, 2012.
[84] Sunil K. Narang and Antonio Ortega. “Downsampling graphs using spectral theory”. In:
Proceedings of the 36th IEEE International Conference on Acoustics, Speech and Signal
Processing. 2011.
[85] Mark E. J. Newman. “Finding community structure in networks using the eigenvectors of
matrices”. In: Physical Review E (2006).
[86] Mark E. J. Newman and Duncan J. Watts. “Renormalization group analysis of the small-world
network model”. In: Physics Letters A (1999).
[87] Ha Q. Nguyen and Minh N. Do. “Downsampling of signals on graphs via maximum spanning
trees”. In: IEEE Transactions on Signal Processing (2014).
[88] Eoin O’Mahony, Emmanuel Hebrard, Alan Holland, Conor Nugent, and Barry O’Sullivan. “Using
case-based reasoning in an algorithm portfolio for constraint solving”. In: Proceedings of the 19th
Irish Conference on Artificial Intelligence and Cognitive Science. 2008.
[89] Susan Hesse Owen and Mark S. Daskin. “Strategic facility location: A review”. In: European
Journal of Operational Research (1998).
[90] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation
ranking: Bringing order to the web. Tech. rep. Stanford InfoLab, 1999.
[91] Arti Patle and Deepak Singh Chouhan. “SVM kernel functions for classification”. In: Proceedings
of the International Conference on Advances in Technology and Engineering. 2013.
[92] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, and E. Duchesnay. “Scikit-learn: Machine learning in Python”. In: Journal of Machine
Learning Research (2011).
[93] Tiago P. Peixoto. “Efficient Monte Carlo and greedy heuristic for the inference of stochastic block
models”. In: Physical Review E (2014).
[94] Ignacio M. Pelayo. Geodesic convexity in graphs. Springer, 2013.
[95] S. Unnikrishna Pillai, Torsten Suel, and Seunghun Cha. “The Perron-Frobenius theorem: Some of
its applications”. In: IEEE Signal Processing Magazine (2005).
[96] Victor V. Prasolov. Polynomials. Springer Science & Business Media, 2004.
[97] Luca Pulina and Armando Tacchella. “A multi-engine solver for quantified Boolean formulas”. In:
Proceedings of the 13th International Conference on Principles and Practice of Constraint
Programming. 2007.
[98] Faisal Rahutomo, Teruaki Kitasuka, and Masayoshi Aritsugi. “Semantic cosine similarity”. In:
Proceedings of the 7th International Student Conference on Advanced Science and Technology.
2012.
[99] Rishabh Ramteke, Peter Stuckey, Jeffrey Chan, Kotagiri Ramamohanarao, James Bailey, Christopher Leckie, and Emir Demirović. “Improving single and multi-view blockmodelling by algebraic simplification”. In: Proceedings of the International Joint Conference on Neural Networks. 2020.
[100] Gerhard Reinelt. “TSPLIB–A traveling salesman problem library”. In: ORSA Journal on
Computing (1991).
[101] Eric Sven Ristad and Peter N. Yianilos. “Learning string-edit distance”. In: IEEE Transactions on
Pattern Analysis and Machine Intelligence (1998).
[102] Peter J. Rousseeuw. “Silhouettes: A graphical aid to the interpretation and validation of cluster
analysis”. In: Journal of Computational and Applied Mathematics (1987).
[103] Peter J. Rousseeuw and Mia Hubert. “Statistical depth meets computational geometry: A short
survey”. In: arXiv preprint arXiv:1508.03828 (2015).
[104] Monjoy Saha, Sagar B. Amin, Ashish Sharma, T. K. Satish Kumar, and Rajiv K. Kalia. “AI-driven
quantification of ground glass opacities in lungs of COVID-19 patients using 3D computed
tomography imaging”. In: PLoS One (2022).
[105] V. David Sánchez A. “Advanced support vector machines and kernel methods”. In:
Neurocomputing (2003).
[106] Philip Schneider and David H. Eberly. Geometric tools for computer graphics. Elsevier, 2002.
[107] Erich Schubert and Peter J. Rousseeuw. “Fast and eager k-medoids clustering: O(k) runtime
improvement of the PAM, CLARA, and CLARANS algorithms”. In: Information Systems (2021).
[108] Raimund Seidel. “Convex hull computations”. In: Handbook of Discrete and Computational
Geometry. Chapman and Hall/CRC, 2017.
[109] Florian Seiffarth, Tamás Horváth, and Stefan Wrobel. “A fast heuristic for computing geodesic
cores in large networks”. In: arXiv preprint arXiv:2206.07350 (2022).
[110] Nader Shakibay Senobari, Gareth J. Funning, Eamonn Keogh, Yan Zhu, Chin-Chia Michael Yeh,
Zachary Zimmerman, and Abdullah Mueen. “Super-efficient cross-correlation (SEC-C): A fast
matched filtering code suitable for desktop computers”. In: Seismological Research Letters (2019).
[111] Kushal Sharma, Ang Li, Malcolm C. A. White, and T. K. Satish Kumar. “A study of distance
functions in FastMapSVM for classifying seismograms”. In: Proceedings of the 22nd International
Conference on Machine Learning and Applications. 2023.
[112] David R. Shelly, Gregory C. Beroza, and Satoshi Ide. “Non-volcanic tremor and low-frequency
earthquake swarms”. In: Nature (2007).
[113] David R. Shelly, William L. Ellsworth, and David P. Hill. “Fluid-faulting evolution in high
definition: Connecting fault structure and frequency-magnitude variations during the 2014 Long
Valley Caldera, California, earthquake swarm”. In: Journal of Geophysical Research: Solid Earth
(2016).
[114] Barbara M. Smith and Martin E. Dyer. “Locating the phase transition in binary constraint
satisfaction problems”. In: Artificial Intelligence (1996).
[115] Sveinn Steinarsson. “Downsampling time series for visual representation”. PhD thesis. 2013.
[116] Karen Stephenson and Marvin Zelen. “Rethinking centrality: Methods and examples”. In: Social
Networks (1989).
[117] Roni Stern, Nathan Sturtevant, Ariel Felner, Sven Koenig, Hang Ma, Thayne Walker, Jiaoyang Li,
Dor Atzmon, Liron Cohen, T. K. Satish Kumar, Eli Boyarski, and Roman Barták. “Multi-agent
pathfinding: Definitions, variants, and benchmarks”. In: Proceedings of the 12th International
Symposium on Combinatorial Search. 2019.
[118] Nathan Sturtevant. “Benchmarks for grid-based pathfinding”. In: Transactions on Computational
Intelligence and AI in Games (2012).
[119] Nathan Sturtevant, Ariel Felner, Max Barrer, Jonathan Schaeffer, and Neil Burch. “Memory-based
heuristics for explicit state spaces”. In: Proceedings of the 21st International Joint Conference on
Artificial Intelligence. 2009.
[120] Omkar Thakoor, Ang Li, Sven Koenig, Srivatsan Ravi, Erik Kline, and T. K. Satish Kumar. “The
FastMap pipeline for facility location problems”. In: Proceedings of the 24th International
Conference on Principles and Practice of Multi-Agent Systems. 2022.
[121] Maximilian Thiessen and Thomas Gärtner. “Active learning of convex halfspaces on graphs”. In:
Advances in Neural Information Processing Systems (2021).
[122] Tim Van Erven and Peter Harremos. “Rényi divergence and Kullback-Leibler divergence”. In:
IEEE Transactions on Information Theory (2014).
[123] Kevin Verbeek and Subhash Suri. “Metric embedding, hyperbolic space, and social networks”. In:
Proceedings of the 30th Annual Symposium on Computational Geometry. 2014.
[124] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. “A theoretical analysis of NDCG
type ranking measures”. In: Proceedings of the 26th Annual Conference on Learning Theory. 2013.
[125] Bernard M. Waxman. “Routing of multipoint connections”. In: IEEE Journal on Selected Areas in
Communications (1988).
[126] Malcolm C. A. White, Kushal Sharma, Ang Li, T. K. Satish Kumar, and Nori Nakata. “Classifying
seismograms using the FastMap algorithm and support-vector machines”. In: Communications
Engineering (2023).
[127] H. Woo, E. Kang, Semyung Wang, and Kwan H. Lee. “A new segmentation method for point cloud
data”. In: International Journal of Machine Tools and Manufacture (2002).
[128] Jack Woollam, Jannes Münchmeyer, Frederik Tilmann, Andreas Rietbrock, Dietrich Lange,
Thomas Bornstein, Tobias Diehl, Carlo Giunchi, Florian Haslinger, Dario Jozinović,
Alberto Michelini, Joachim Saul, and Hugo Soto. “SeisBench–A toolbox for machine learning in
seismology”. In: Seismological Society of America (2022).
[129] Hong Xu, Sven Koenig, and T. K. Satish Kumar. “Towards effective deep learning for constraint
satisfaction problems”. In: Proceedings of the 24th International Conference on Principles and
Practice of Constraint Programming. 2018.
[130] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. “How powerful are graph neural
networks?” In: arXiv preprint arXiv:1810.00826 (2018).
[131] Lin Xu, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. “SATzilla: Portfolio-based
algorithm selection for SAT”. In: Journal of Artificial Intelligence Research (2008).
[132] Fanghua Ye, Chuan Chen, and Zibin Zheng. “Deep autoencoder-like nonnegative matrix
factorization for community detection”. In: Proceedings of the 27th ACM International Conference
on Information and Knowledge Management. 2018.
[133] Yuichi Yoshida. “Almost linear-time algorithms for adaptive betweenness centrality using
hypergraph sketches”. In: Proceedings of the 20th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. 2014.
[134] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. “An end-to-end deep learning
architecture for graph classification”. In: Proceedings of the 32nd AAAI Conference on Artificial
Intelligence. 2018.
[135] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. “Open3D: A modern library for 3D data
processing”. In: arXiv preprint arXiv:1801.09847 (2018).
Appendices
A Table of Abbreviations
Abbreviation Expansion
3C three-component
AI Artificial Intelligence
CNN Convolutional Neural Network
CSPs Constraint Satisfaction Problems
CVKM Capacitated Vertex K-Median
DGCNN deep graph convolutional neural network
DL Deep Learning
DNA deoxyribonucleic acid
ECGs electrocardiograms
FMBM FastMap-Based Block Modeling
FMGCH FastMap-Based Graph Convex Hull
GIN graph isomorphism network
GMM Gaussian Mixture Model
ILP Integer Linear Programming
LSH Locality Sensitive Hashing
MAM Multi-Agent Meeting
MAPF Multi-Agent Path Finding
ML Machine Learning
MRIs magnetic resonance images
nDCG normalized Discounted Cumulative Gain
NLP Natural Language Processing
NMI Normalized Mutual Information
NNs Neural Networks
PAM Partition Around Medoids
PASPD probabilistically-amplified shortest-path distance
PCA Principal Component Analysis
SAT Satisfiability
SNR signal-to-noise ratio
SRS simple random sample
STEAD Stanford Earthquake Dataset
SVMs Support Vector Machines
VKC Vertex K-Center
VKM Vertex K-Median
WVKM Weighted Vertex K-Median
Table A.1: Describes the abbreviations used in the dissertation.
Abstract
FastMap was first introduced in the Data Mining community for generating Euclidean embeddings of complex objects. In this dissertation, we first present a graph version of FastMap that generates Euclidean embeddings of graphs in near-linear time: The pairwise Euclidean distances approximate a desired graph-based distance function on the vertices. We then apply this graph version of FastMap to efficiently solve various graph-theoretic problems of significant interest in AI, including facility location, top-K centrality computations, community detection and block modeling, and graph convex hull computations. We also present a novel learning framework, called FastMapSVM, that combines FastMap and Support Vector Machines. We then apply FastMapSVM to predict the satisfiability of Constraint Satisfaction Problems and to classify seismograms in Earthquake Science.