TOWARDS THE EFFICIENT AND FLEXIBLE LEVERAGING OF
DISTRIBUTED MEMORIES
by
Chao Wang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
December 2022
Copyright 2022 Chao Wang
I dedicate this thesis to my family
for their endless caring and support.
Acknowledgements
The author acknowledges the partial financial support of NSF to complete this dissertation.
I would like to thank my advisor, Xuehai Qian, for helping me clarify my ideas and justify the
underlying logic behind the designs. Whenever necessary, he did his best to revise my manuscripts.
I learned a lot from Qian not only academically but also from his view of life, which reshaped many
of my perspectives on how to succeed academically, how to collaborate with others, and what good
research work should look like. I would not have been able to complete this dissertation without
Qian's financial support and his wealth of valuable advice.
I would like to thank my family. They have always cared for me from across the ocean through
phone and video calls, and they backed me up throughout the years I studied abroad. When I
struggled with my experiments, they did not understand exactly what I did, but they were still there
listening to me talk about my study and research. It is from my parents and sister that I obtained
the never-ending courage and strength to move forward in my research.
Finally, I would like to thank all of my friends for their support during my Ph.D. study. They
offered their brilliant minds and energy to help me tackle difficult situations in my life and research.
From them, I learned many skills, such as time management and stress management. Without their
support, this dissertation would surely have been delayed.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Chapter 2: Leveraging RDMA - using distributed transaction systems as a case study 3
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 RDMA Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 RCC System Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Overall Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Transaction Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 RDMA Communication and Optimizations . . . . . . . . . . . . . . . . . 10
2.4 RDMA-Based Concurrency Control . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 Transaction Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.2 NOWAIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.3 WAITDIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.4 MVCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.5 SUNDIAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.6 CALVIN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Hybrid Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.2 Implementation Requirements . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.1 Workloads and Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.2 Overall Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.3 Effect of Co-routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6.4 Effect of Contention Level . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6.5 Effect of Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.6 Scalability of QPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 3: Enabling memory/compute dis-aggregation on distributed graph systems . 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.2 SlimGraph Execution Paradigm . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 D-Gemini . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 Delegation-based propagation model . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 Delegating SparseSlot Computation . . . . . . . . . . . . . . . . . . . . . 48
3.4.3 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.4 Optimized Edge Requesting of D-Gemini . . . . . . . . . . . . . . . . . . 49
3.5 D-Kudu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5.1 Explore the potential of one-sided primitives . . . . . . . . . . . . . . . . . 54
3.5.2 Displacement Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.3 Delegating Memory Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7.1 Platform Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7.2 Datasets and Evaluated Applications . . . . . . . . . . . . . . . . . . . . . 59
3.7.3 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.7.4 Elasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.7.5 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7.6 Impact of Bitmap Caching, Index Caching, and Edge Caching . . . . . . . 64
3.7.7 Communication Volume Reduction . . . . . . . . . . . . . . . . . . . . . 65
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Chapter 4: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
List of Tables
2.1 Stage-wise hybrid strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Lines of code modified in Gemini applications (⋆) and Kudu applications (†) . . . . . 59
3.2 Graph Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
List of Figures
2.1 RCC Framework Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Two-sided versus one-sided . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Legend for RCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 NOWAIT Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 WAITDIE: the FETCH stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 MVCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.8 SUNDIAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.9 CALVIN : Txn Input Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.10 RDMA-enabled Buffer Organization . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.11 Latency breakdown: RPC vs. one-sided . . . . . . . . . . . . . . . . . . . . . . . 26
2.12 Performance of all stage-wise hybrid implementations compared with two-sided
or one-sided implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.13 Overall throughput, latency, abort rate and # of network round trips . . . . . . . . . 28
2.14 Throughput (M txns/s) and latency (ms) when varying #co-routines . . . . . . . . 31
2.15 CALVIN throughput (K txns/s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.16 Effect of contention level on throughput . . . . . . . . . . . . . . . . . . . . . . . 33
2.17 Effect of computation on throughput . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.18 Throughput on emulated large EDR clusters. . . . . . . . . . . . . . . . . . . . . . 34
3.1 Graph Representations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 An overview of SlimGraph in the distributed graph framework stack . . . . . . . . 42
3.3 The compute/memory dis-aggregated execution paradigm of SlimGraph . . . . . . 43
3.4 D-Gemini’s push update propagation model . . . . . . . . . . . . . . . . . . . . . 45
3.5 D-Gemini’s delegation-based workload propagation model . . . . . . . . . . . . . 46
3.6 D-Gemini’s optimized edge requesting . . . . . . . . . . . . . . . . . . . . . . . . 50
3.7 D-Gemini’s overlapping of SparseSlot computations with communication . . . . 53
3.8 D-Kudu’s communication pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.9 An example of Neighbors(v) displacement broadcasting . . . . . . . . . . . . . . 56
3.10 Communication volume reduction . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.11 Performance of D-Gemini and D-Kudu with full compute nodes . . . . . . . . . . 61
3.12 Compute/Memory Dis-aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.13 Scalability of D-Gemini and D-Kudu . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.14 The impact of bitmap cache (BC), index cache (IC), and edge cache (EC) as
important optimizations in D-Gemini. . . . . . . . . . . . . . . . . . . . . . . . . 64
3.15 D-Kudu’s Communication Volume . . . . . . . . . . . . . . . . . . . . . . . . . 65
Abstract
In the cloud era, data volume is growing exponentially beyond the storage capacity of single
machines. With a broad spectrum of distributed systems thriving, it is critical to access remote
data effectively, efficiently, and flexibly, such that compute and memory resources can be easily
disaggregated. We investigated two typical types of distributed systems as case studies:
(a) distributed transaction systems and (b) distributed graph systems.
To allow ultra-low-latency remote data access in distributed transaction systems, we revisited
the idea of Remote Direct Memory Access (RDMA) and its effectiveness in terms of building clas-
sical and state-of-the-art concurrency control protocols. Specifically, we developed RCC, the first
unified and comprehensive RDMA-enabled distributed transaction processing framework contain-
ing six serializable concurrency control protocols. Based on the insights obtained from analyzing
pure two-sided and one-sided protocol implementations, RCC is able to prototype more efficient
hybrid deployments automatically.
The efficient leveraging of remote memory provides an opportunity for compute/memory dis-
aggregation. We proposed a novel execution paradigm for processing graph workloads in which
compute and memory resources can be disproportionately scaled. Furthermore, we designed Slim-
Graph, a layer that transforms Gemini and Kudu, two state-of-the-art distributed graph frameworks,
into ones supporting compute/memory dis-aggregation. Our work shows that, with all types of
optimizations enabled, SlimGraph allows the use of fewer compute machines to complete
distributed graph computation with reasonable performance degradation.
Chapter 1
Introduction
1.1 Background
Data volume is growing exponentially nowadays. According to www.statista.com, the world’s
data will grow from about 59 zettabytes (ZB) in 2020 to 149 ZB by 2024. Due to this surge in
input data, many applications today struggle to handle large inputs on a single machine, which
causes memory pressure given the limited local memory of the host. There has been much research
on solving this problem so that applications can handle huge inputs beyond local memory boundaries.
There are basically two lines of research: 1) partition the data set so that each partition can reside
in the local memory of a host. In this direction, each host processes its own partition of the data,
making the whole application or framework distributed. Many applications or frameworks follow
this paradigm, for example, distributed graph processing/analytics frameworks [51, 27, 85, 33], dis-
tributed transaction processing systems [39, 75, 40], etc. 2) place a portion of the data remotely on
another machine and allow a host to access the remote data. The placement of remote data can ei-
ther be transparent or close to transparent to applications: applications either do not need to worry
about where to access the remote data or only need to expose very limited information to the lower-
level frameworks or runtimes. The remote access can be implemented by leveraging the OS-based
paging mechanism [29, 3]: when the CPU accesses an address in a page that is not located in local
memory, a page fault is triggered and the page fault handler retrieves that page from either
secondary storage or the remote memory of another host. This simple method can provide
full transparency to any application running on top by taking advantage of the Linux virtual memory
subsystem. Another simple method for implementing remote access is to leverage a user-space
runtime to manipulate remote data and avoid the context-switching overhead that the paging-based
method introduces [74, 61].
1.2 Motivation
For a given distributed system, two broad research topics are of interest in this dissertation:
a) How can we make a distributed system faster by using Remote Direct Memory Access (RDMA)?
b) How can we dis-aggregate the computation and memory resources of a distributed system so
that it is more adaptive to the cloud setting?
In chapter 2, we investigated the issues of leveraging RDMA in distributed applications, the first
line of research. We chose distributed transaction systems as a case study. Specifically,
chapter 2 builds a framework that can answer one important question that has been discussed for
many years in the database community: which types of RDMA primitives are better? We built
RCC and obtained definitive answers for RDMA-based concurrency control protocols. Moreover,
inspired by the hybrid OCC in DrTM+H [75], we built RCC such that it can generate any stage-
wise hybrid implementation of any protocol given a specific workload profile.
In addition to performance issues, it is difficult for many current distributed systems to adapt to
disproportional computation and memory resources, which is the norm in today’s cloud.
Although some techniques, such as OS-based paging and user-space runtimes, have been proposed
for a single host to access remote data on another host, allowing distributed systems to use remote
memory poses extra challenges. Using two recent distributed graph frameworks as examples,
chapter 3 proposed an extra computation paradigm and designed a new graph framework layer
called SlimGraph to achieve the computation and memory dis-aggregation of existing distributed
graph frameworks.
Chapter 2
Leveraging RDMA - using distributed transaction systems as a
case study
2.1 Introduction
Online transaction processing (OLTP) has ubiquitous applications in many important domains,
including banking, stock trading, e-commerce, etc. Many cloud infrastructure providers such as
Microsoft Azure, Amazon Web Services and Google Cloud Platform have established their cloud
database services to support OLTP. With data volume growing exponentially, the cloud database
is partitioned and managed by different instance servers. Upon receiving a request, each server is
responsible for accessing a subset of the database. Partitioning data such that all queries of one
request access only one partition is challenging [19, 58]. Therefore, a transaction inevitably has to
access multiple partitions in a distributed manner.
Distributed transactions should guarantee two key properties: (a) atomicity: either all or none
of the machines agree to apply the updates; and (b) serializability: all transactions must commit in
some serializable order. To ensure these properties, concurrency control protocols have been inves-
tigated for decades [1, 6, 7, 43, 52, 73]. The challenge of multi-partition serializable concurrency
control protocols is the significant performance penalty due to communication and coordination
among distributed machines [45, 67, 71]: when a transaction accesses multiple records over the
network, it needs to be serialized with all conflicting transactions [5], making a high-performance
network critical.
Remote Direct Memory Access (RDMA) is a technology that enables the network interface
controller (NIC) to access the memory of remote servers. It can be used in many domains such as
data prefetching [36]. Due to its high bandwidth and low latency, RDMA has been recently used
to support distributed transaction systems [37, 76, 39, 12, 22], and has enhanced the performance
by orders of magnitude compared to traditional systems using TCP. RDMA network supports both
TCP-like two-sided communication using primitives SEND/RECV, and one-sided communication
using primitives READ/WRITE/ATOMIC, which are capable of accessing remote memory while by-
passing traditional network stack, the kernel, and the remote CPUs.
Extensive studies have been conducted in understanding the performance implication of each
primitive using micro-benchmarks [37, 21, 22, 75, 72]. Moreover, RDMA has been used to im-
plement the Optimistic Concurrency Control (OCC) protocol [39, 21, 75]. Two takeaways from
DrTM+H [75] are: (1) the best performance of OCC cannot be simply achieved by solely using
two-sided or one-sided communication; and (2) different communication primitives are best suited
for each execution stage.
We claim that the state-of-the-art RDMA-based system DrTM+H [75] is not sufficient for two
important reasons. First, in real-world applications, various concurrency control protocols [57, 52,
60, 24, 18, 16, 25] are used; the understanding of RDMA implications on OCC may not transfer
to other protocols. Second, building a standalone framework for each individual protocol does
not allow fair and unbiased cross-protocol performance comparison. In a complete system for
distributed transaction execution, the concurrency control protocol is only one component; the system
organization, optimizations, and transaction execution model can vary a lot. Having a common
execution environment for all the various protocols is critical for drawing meaningful conclusions [78,
79]. Clearly, DrTM+H does not provide such capability. Compared to DrTM+H, Deneva [31]
studied six concurrency control protocols based on TCP, affirming the importance of cross-protocol
comparison. However, Deneva is not based on RDMA.
In this chapter, we take an important step to close this gap. We develop RCC, the first uni-
fied and comprehensive RDMA-enabled distributed transaction processing framework supporting
multiple concurrency control protocols with different properties. It includes protocols across a wide
spectrum: (1) classical protocols, i.e., the two-phase locking (2PL) protocols NOWAIT [6] and WAITDIE [6],
and OCC [43]; (2) more advanced protocols such as MVCC [7], which has been adopted by many
modern cloud database providers, and the recent SUNDIAL [80], which allows dynamic
adjustment of the commit order with logical leases to reduce aborts; and (3) CALVIN [71], a shared-
nothing protocol that ensures deterministic transaction execution.
RCC enables us to perform unbiased and fair comparison of the protocols in a common exe-
cution environment with the concurrency control protocol being the only changeable component.
We develop the correct and efficient RDMA-based implementation using known techniques, i.e.,
co-routines, outstanding requests, and doorbell batching, with two-sided and one-sided communi-
cation primitives. To validate the benefits of RDMA, RCC also provides reference implementations
based on TCP. As a common infrastructure for RDMA-enabled transaction execution, RCC allows
the fast prototyping of existing protocols or new implementations.
With RCC, we are able to answer several key questions that could not be answered before. First,
while the protocol specifications are known, we answer the question of how to leverage RDMA
to construct different protocols with concrete, executable, and efficient implementations. Second,
we can perform both apples-to-apples cross-protocol comparisons and stage-level same-protocol
studies of performance and various execution characteristics in the context of the same system
organization. The answers to which primitives are best suited for which execution stage can be
used to further optimize performance. Third, for CALVIN, which is a shared-nothing protocol
and has never been studied in the context of RDMA, we answer the question of whether one-
sided primitives bring benefits similar to those seen for the shared-everything protocols.
The implementation of the current RCC with the six protocols has around 25,000 lines of code
written in C++. We intend to open-source the framework in the near future. We try our best to fairly
optimize the performance of each protocol without bias using known techniques such as co-routines [39],
outstanding requests [75], and doorbell batching [38]. We evaluate all protocol designs on a cluster
with ConnectX-4 EDR InfiniBand RDMA support using three typical workloads: SmallBank [69],
TPC-C [17], and YCSB [15].
Using RCC, we conduct the first cross-protocol performance comparison with RDMA and
observe a number of interesting performance characteristics: 1) OCC does not always achieve
the best performance. In fact, simple 2PL protocols such as NOWAIT and WAITDIE outperform
other, more complicated protocols with a high-performance RDMA network on a network-intensive
workload (SmallBank). This indicates the promise of enhancing the RDMA network stack instead of
fine-tuning CC protocols in future transaction processing systems.
We obtain the execution stage latency breakdowns with one-sided and two-sided primitives for
each protocol on all three workloads, and they are analyzed to develop hybrid implementations,
which achieve equivalent or better performance under the given workload characteristics. Our
experiments show that by cherry-picking the communication type that incurs lower latency for each
protocol stage, we can find new protocol implementations that reach up to 17.8% throughput
speedup, compared to the better of the RPC and one-sided implementations.
To provide a user-friendly interface, we designed and implemented a simple interface in RCC that
allows both common and advanced users to quickly evaluate any hybrid implementation of an
existing or new protocol given a workload characteristic. In addition, for a given protocol, RCC
can exhaustively enumerate all combinations of hybrid protocols and provide substantial evidence
that a certain hybrid design is the best among all possibilities when varying stage communication
styles. Our experiments show that, throughput-wise, hybrid designs are better than their RPC or
one-sided counterparts by 32.2% on average and up to 67%.
We believe that RCC is a significant advance over the state of the art, as it can both provide
performance insights and be used as a common infrastructure for fast prototyping of new
implementations.
2.2 RDMA Basics
Remote Direct Memory Access (RDMA) is a network technology featuring high bandwidth and
low latency data transfer with low CPU overhead. It is particularly suitable for large data cen-
ters. RDMA operations, i.e., verbs, can be classified into two types: (1) two-sided primitives
SEND/RECV; and (2) one-sided primitives READ/WRITE/ATOMIC. The latter provides the unique ca-
pability to directly access the memory of remote machines without involving remote CPUs. This
feature makes one-sided operations suitable for distributed applications with high CPU utilization.
Although having similar semantics to TCP’s send/receive over bound sockets, RDMA two-sided
operations bypass the traditional network stack and the OS kernel, making the performance of RPC
implementation over RDMA much higher than that over TCP.
To perform RDMA communication, queue pairs (QPs) must be set up. A QP consists of a send
queue and a receive queue. When a sender posts a one-sided RDMA request to the send queue, the
local QP will transfer data to some remote QP, and the sender can poll for completion information
from the completion queue associated with the QP. The receiver’s CPU is not aware of the one-
sided operations performed by its RNIC unless it checks for changes in memory. For
a sender to post a two-sided operation, the receiver QP has to post a RECV for the corresponding
SEND in advance; the receiver then polls the receive queue to obtain the data. To set up a reliable connection, a
node has to maintain at least a cluster-size number of QPs in its RDMA-enabled NIC (RNIC), each
connected with one remote node.
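To make these verbs concrete, the sketch below posts a single one-sided READ on an RC queue pair using the standard ibverbs API and then polls for its completion. It is a minimal, blocking illustration and not RCC's actual wrapper code; it assumes the QP, completion queue, and memory registrations have already been established.

```cpp
#include <infiniband/verbs.h>
#include <cstdint>
#include <cstring>

// Post a one-sided RDMA READ on a connected (RC) queue pair and wait for its
// completion. lbuf/lkey describe a locally registered buffer; raddr/rkey describe
// the remote registered region exposed by the peer.
bool rdma_read_sync(ibv_qp* qp, ibv_cq* cq,
                    void* lbuf, uint32_t lkey,
                    uint64_t raddr, uint32_t rkey, uint32_t len) {
  ibv_sge sge;
  std::memset(&sge, 0, sizeof(sge));
  sge.addr   = reinterpret_cast<uint64_t>(lbuf);
  sge.length = len;
  sge.lkey   = lkey;

  ibv_send_wr wr, *bad = nullptr;
  std::memset(&wr, 0, sizeof(wr));
  wr.opcode              = IBV_WR_RDMA_READ;   // one-sided verb: remote CPU not involved
  wr.send_flags          = IBV_SEND_SIGNALED;  // request a completion entry
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.wr.rdma.remote_addr = raddr;
  wr.wr.rdma.rkey        = rkey;

  if (ibv_post_send(qp, &wr, &bad) != 0) return false;

  // The sender polls its own completion queue; in RCC this polling would be done
  // by the event-handling co-routine rather than by spinning in place.
  ibv_wc wc;
  while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin */ }
  return wc.status == IBV_WC_SUCCESS;
}
```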
2.3 RCC System Organization
2.3.1 Overall Architecture
Figure 2.1 shows the overview of RCC, which runs on multiple symmetric distributed nodes, each
containing a configurable number of server threads to process transactions. A client thread sends
transaction requests to a random local or remote transaction processing thread in the cluster. The
stats thread is used to collect the statistics (e.g., the number of committed transactions) generated
Figure 2.1: RCC Framework Overview
by each processing thread. The QP thread is used to bootstrap RDMA connections by establishing
the QP pairing via TCP connections.
RCC uses co-routines as an essential optimization technique [39] to hide network latency.
Specifically, each thread starts an event handling co-routine and some transaction coordination
co-routines. An event handling co-routine continuously checks and handles network-triggered
events such as polled completions or memory-triggered events such as the release of a lock. A
transaction coordination co-routine is where a transaction logically executes.
In RCC, the distributed in-memory database is implemented as a distributed key-value store
that can be accessed either locally or remotely via a key and a table ID. We leveraged DrTM+H’s [75]
key-value store as RCC’s back-end. Since RCC supports multiple protocols, each protocol has its
protocol-specific metadata and RDMA buffer to ensure the correct execution leveraging RDMA
primitives.
2.3.2 Transaction Execution Model
RCC employs a symmetric model to execute transactions: each node serves as both a client and
a transaction processing server. As shown in Figure 2.1, each transaction coordination co-routine
is responsible for executing a transaction at any time. We use coordinator to refer to the co-
routine that receives transaction requests from some local or remote client thread and orchestrates
transactional activities in the cluster. We use participant to refer to a machine where there is a
record to be accessed by some transaction. When a participant receives an RPC request, its event
handling co-routine will be invoked to process the request locally. When a participant receives an
RDMA one-sided operation, its RNIC is responsible for accessing the memory without interrupting
the CPU.
In RCC, a record refers to the actual data, and a tuple refers to a record associated with the
relevant metadata. All tuples are located in RDMA-registered memory. A distributed in-memory
key-value store keeps all tuples partitioned among all machines. Since one-sided operations can
only access remote memory by leveraging the pre-computed remote offsets, to reduce the num-
ber of one-sided operations involved in retrieving metadata, the metadata are placed physically
together with the record, as shown in Figure 2.3. Currently, RCC only supports a fixed record size;
variable-sized records can be supported by placing an extra pointer in the record field that points
to an RDMA-registered region, similar to [81]. With one-sided READ, the remote offset of a tuple is
fetched before the actual tuple is fetched and the offset is then cached locally to avoid unnecessary
one-sided operations.
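As a rough illustration of this co-located layout, the sketch below shows possible fixed-size tuple definitions whose field names follow Figure 2.3. The widths, ordering, record size, and number of version slots are illustrative assumptions, not RCC's actual definitions.

```cpp
#include <cstdint>

constexpr int kRecordSize  = 64;  // assumed fixed record size
constexpr int kNumVersions = 4;   // version slots used by the MVCC variant (Section 2.4.4)

// Metadata is stored physically next to the record so that a single one-sided
// READ at a cached remote offset can fetch both in one round trip.
struct alignas(8) SundialTuple {
  uint64_t lock;                  // lock word
  uint64_t wts;                   // write timestamp (start of lease)
  uint64_t rts;                   // read timestamp (end of lease)
  uint8_t  record[kRecordSize];   // the actual data
};

struct alignas(8) MvccTuple {
  uint64_t tts;                   // timestamp of the uncommitted writer, 0 if unlocked
  uint64_t rts;                   // largest committed reader timestamp
  uint64_t wts[kNumVersions];     // timestamps of recently committed writers
  uint8_t  record[kNumVersions][kRecordSize];  // one slot per committed version
};
```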
A transaction has a read set (RS) and a write set (WS) that are known before the execution.
The records in RS are read-only. The execution of a transaction is conceptually divided into three
primary stages: fetching, which gets the tuples of the records in RS and WS (the metadata is used for
protocol operations); execution, in which a transaction performs the actual computation locally using
the fetched records; and commit, in which a transaction checks whether it is serializable and, if so,
logs all writes to remote backup machines for high availability and recovery and updates the remote
records. Our implementations apply to transactions with one or more fetching and execution stages.
Figure 2.2: Two-sided versus one-sided
2.3.3 RDMA Communication and Optimizations
We use one-sided RDMA primitives, i.e., READ, WRITE, and ATOMIC, over RC QPs and two-sided RDMA
primitives, i.e., SEND and RECV, over UD QPs to implement RCC. According to [39], two-sided prim-
itives over UD QPs outperform one-sided primitives in symmetric transaction systems, and UD
mode is much more reliable than expected thanks to the RDMA network’s lossless link layer. [75] further
confirms the unsuitability of one-sided primitives for implementing fast RPC.
Figure 2.2 illustrates the two types of communications in RCC employed by each concurrency
control protocol. In two-sided RPC, a coordinator first sends a memory-mapped IO (MMIO) to the
RNIC, which in turn SENDs an RPC request to the receiver’s RNIC. After the corresponding par-
ticipant RECVs the request, its CPU polls a completion event, which later triggers a pre-registered
handler function to process the request and send back a reply using similar verbs. In one-sided
communication, after the participant receives a one-sided request, i.e., READ, WRITE, ATOMIC, its
RNIC will access local memory using a Direct Memory Access (DMA). The completion is sig-
naled when the coordinator polls if it is interested in the completion event.
MMIO is an expensive operation for notifying the RNIC of a request-fetching event. Using one MMIO
for a batch of RDMA requests can effectively save PCIe bandwidth and improve the performance
of transaction systems [75], which is called doorbell batching. Meanwhile, having multiple out-
standing requests in flight can save the waiting time for request completion, thus reducing the
latency of remote transactions [75]. Leveraging co-routines serves to interleave RDMA communi-
cation with computation. RCC uses similar techniques as important optimizations.
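To make doorbell batching concrete, the sketch below chains two RDMA WRITE work requests (for example, a write-back followed by an unlock) through their next pointers and posts them with a single ibv_post_send call, i.e., a single doorbell, with only the last one signaled. The helper name and buffer parameters are illustrative, not RCC's actual code.

```cpp
#include <infiniband/verbs.h>
#include <cstring>

// Doorbell batching: chain two one-sided WRITEs into one posted list so the RNIC is
// rung once; only the last work request generates a completion entry.
int post_two_writes_batched(ibv_qp* qp,
                            ibv_sge* sge1, uint64_t raddr1, uint32_t rkey1,
                            ibv_sge* sge2, uint64_t raddr2, uint32_t rkey2) {
  ibv_send_wr wr[2];
  ibv_send_wr* bad = nullptr;
  std::memset(wr, 0, sizeof(wr));

  wr[0].opcode              = IBV_WR_RDMA_WRITE;
  wr[0].sg_list             = sge1;
  wr[0].num_sge             = 1;
  wr[0].wr.rdma.remote_addr = raddr1;
  wr[0].wr.rdma.rkey        = rkey1;
  wr[0].next                = &wr[1];           // chain: one doorbell for both requests

  wr[1].opcode              = IBV_WR_RDMA_WRITE;
  wr[1].sg_list             = sge2;
  wr[1].num_sge             = 1;
  wr[1].wr.rdma.remote_addr = raddr2;
  wr[1].wr.rdma.rkey        = rkey2;
  wr[1].send_flags          = IBV_SEND_SIGNALED; // only the last WR is signaled
  wr[1].next                = nullptr;

  return ibv_post_send(qp, &wr[0], &bad);        // single MMIO for the whole chain
}
```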
2.4 RDMA-Based Concurrency Control
In RCC, we implement six concurrency control (CC) protocols with two-sided and one-sided
RDMA primitives. Among these protocols, NOWAIT [6] and WAITDIE [6] are examples of two-
phase locking (2PL) [6] CC algorithms. They differ in conflict resolution, i.e., how conflicts are
resolved to ensure serialization. Compared to 2PL, Optimistic Concurrency Control (OCC) [43]
reads records speculatively without locking and validates data upon transaction commits—the only
time to use locks. MVCC [7] optimizes the performance of read-heavy transactions by allowing the
read of the correct recently committed records instead of aborting. SUNDIAL [80] leverages the dy-
namically adjustable logical leases to order transaction commits and reduce aborts. CALVIN [71]
introduces determinism with a shared-nothing protocol, demonstrating very different communica-
tion behavior.
Figure 2.3: Metadata (per-protocol tuple layouts: NOWAIT: lock, record; WAITDIE: tts, record; OCC: lock, record, seq; MVCC: tts, rts, wts[1..N], record[1..N]; SUNDIAL: lock, rts, wts, record)
While the protocols themselves are known, we rethink their correct and efficient implementa-
tions in the context of RDMA. Each protocol requires techniques to implement specific protocol
requirements, particularly atomic tuple read (for MVCC) and update (for SUNDIAL). We describe
Figure 2.4: Legend for RCC
two implementations of each protocol: the RPC version, which mostly uses RPCs enabled by RDMA’s
two-sided communication primitives; and the one-sided version, which mainly uses RDMA’s unique
one-sided communication primitives.
2.4.1 Transaction Operations
We consider the following common operations used in one or multiple concurrency control proto-
cols. They can be implemented with either RPC or one-sided primitives.
Fetching Tuples are fetched during transaction execution. The read-only records are fetched
into RS; the other accessed records are fetched into WS.
Locking All RCC protocols need locking to enforce a certain logical serialization order. For
remote locking, the better implementation choice is affected by the load of the remote threads that
execute transaction co-routines. A higher load may affect the capability of handling RPCs; thus,
one-sided primitives can be better.
Validation This operation is needed in OCC, MVCC, and SUNDIAL in different stages. The
RPC implementation typically requires only one network operation, while the one-sided version
may lead to one or more requests. Similar to locking, the best primitive choice is determined by
the workload of remote co-routines.
Logging To support high availability and recovery, each protocol logs its updates to some
backup servers. Similar to DrTM+H and FaSST, RCC employs coordinator log [66] for two-phase
commit. Only after successful logging and the reception of acknowledgments from all replicas can
the transaction write the updates back to the remote machines. Logs are lazily reclaimed in the
background by the backup machines when they are notified by the coordinator using two-sided RPC.
Logging strongly prefers one-sided WRITE to log to backup servers for OCC according to [75]; our
stage-wise latency results support this claim for the other protocols.
Update This operation writes back the updated data and metadata. Two-sided RPCs can finish the update
in one round trip; one-sided primitives need two without doorbell batching. The indexes of the write-
set entries can be cached in advance to reduce the overhead of this operation when using one-sided
primitives. Next, we describe the implementation of each protocol in RCC except for OCC, which
is implemented based on DrTM+H [75]. We choose to base our OCC implementation on DrTM+H
because it outperforms the other RDMA-based OCC implementations of [21] and [39]. Figure 2.4
shows the legend for protocol operations in this section.
2.4.2 NOWAIT
NOWAIT [6] is a basic concurrency control algorithm based on 2PL that prevents deadlocks. A
transaction in NOWAIT tries to lock all the records accessed; if it fails to lock a record that is
already locked by another transaction, the transaction aborts immediately and releases all locks
that have been successfully acquired. Figure 2.5 shows the operations of NOWAIT for both RPC
and one-sided implementations.
With RPC, a coordinator locks records by sending an RPC locking request to the corresponding
participant, whose RPC handler locks the record using a local CAS. If the CAS fails, a failure message
is sent back to the coordinator which will release all read and write locks by posting RPC release
requests before aborting the transaction. Otherwise, the participant’s handler has already locked
the tuple locally, and it returns a success message with the record in response. On transaction
Figure 2.5: NOWAIT Implementations
commit, with all locks acquired, a write-back request associated with the updated records is sent to
each participant, where an RPC handler performs the write-back of the record and releases the lock.
With one-sided primitives, we use the doorbell batching mechanism as an efficient way to issue
multiple outstanding requests from the sender. With this optimization, only one yield is needed
after the last request is posted, thus reducing latency and context-switching overhead. On locking,
the coordinator needs to perform two operations, RDMA CAS and READ, to lock and read the
remote record. Logically, they should be performed one after another, but in fact, the coordinator
can issue the READ immediately after the CAS to overlap the communication. This is because the two will
be performed remotely in issue order, and if the lock acquisition fails, the coordinator can simply
ignore the data returned by the READ. Note that the read offsets are collected and cached by the coor-
dinator before transaction execution starts and do not incur much overhead. With high contention,
this optimization tends to add wasted network traffic. However, for a network-intensive application
with low contention, i.e., SmallBank, the throughput increases by 25.1% while the average latency
decreases by 22.7%. Similarly, two RDMA WRITEs are posted to update and unlock the record
at the commit stage. Only the second RDMA WRITE is signaled, to avoid sending multiple MMIOs
and wasting PCIe bandwidth. Different from lock & read, the doorbell-batched update & unlock is
always beneficial.
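Putting these pieces together, the following sketch shows the one-sided lock-and-fetch path of NOWAIT. OneSidedBatch and its methods are hypothetical stand-ins for RCC's doorbell-batched one-sided wrappers and its co-routine yield; they are introduced only for illustration and are not RCC's actual API.

```cpp
#include <cstdint>

// Hypothetical wrapper (not RCC's real API) standing in for doorbell-batched
// one-sided verbs issued from a transaction co-routine.
struct OneSidedBatch {
  void add_cas(int node, uint64_t off, uint64_t expect, uint64_t desire,
               uint64_t* old_val_buf, bool signaled);
  void add_read(int node, uint64_t off, void* dst, uint32_t len, bool signaled);
  void post_and_yield();  // one MMIO for the chain; co-routine yields until completion
};

// NOWAIT fetch of one WS record with one-sided primitives (sketch).
bool nowait_lock_and_fetch(int node, uint64_t lock_off, uint64_t tuple_off,
                           uint64_t my_id, void* tuple_buf, uint32_t tuple_len) {
  uint64_t old_lock = 0;
  OneSidedBatch b;
  // CAS and READ are chained; the RNIC executes them in issue order, so the READ
  // result is only used if the CAS succeeded.
  b.add_cas(node, lock_off, /*expect=*/0, /*desire=*/my_id, &old_lock, /*signaled=*/false);
  b.add_read(node, tuple_off, tuple_buf, tuple_len, /*signaled=*/true);
  b.post_and_yield();

  if (old_lock != 0) return false;  // already locked: NOWAIT aborts immediately
  return true;                      // locked and fetched in one network round trip
}
```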
Key RCC insight: A doorbell-batched CAS and READ, which fetches the tuple immediately after the
CAS operation, can effectively save network round trips and improve performance.
2.4.3 WAITDIE
Figure 2.6: WAITDIE: the FETCH stage
Different from NOWAIT, which unconditionally aborts any transaction accessing conflicting
records, WAITDIE resolves conflicts with a global consensus priority. On start, each transaction
obtains a globally unique timestamp, which can be stored in the locks of the records it accesses. Upon
detecting a conflict, the timestamp logged in the lock is compared with the current transaction’s
timestamp to determine whether to immediately abort the transaction or let it wait. In RCC, we
construct the unique timestamp of a transaction by appending the machine ID, thread ID, and
coroutine ID to the low-order bits of the local clock time [7]. This avoids the overhead of global
clock synchronization as in NTP [56] and PTP [46]. The timestamp is stored in the 64-bit lock
record.
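The sketch below shows one way to pack such a timestamp into a 64-bit word: the local clock occupies the high-order bits and the machine, thread, and co-routine IDs are appended in the low-order bits, so comparing two packed values orders transactions by clock first and breaks ties by IDs. The specific bit widths are illustrative assumptions, not RCC's exact layout.

```cpp
#include <cstdint>

// Pack a globally unique 64-bit WAITDIE/MVCC timestamp (sketch; field widths assumed).
inline uint64_t make_txn_ts(uint64_t local_clock, uint32_t machine_id,
                            uint32_t thread_id, uint32_t coroutine_id) {
  return (local_clock << 22) |                      // clock dominates the comparison
         (uint64_t(machine_id   & 0xFF) << 14) |    // 8 bits of machine ID
         (uint64_t(thread_id    & 0xFF) << 6)  |    // 8 bits of thread ID
         (uint64_t(coroutine_id & 0x3F));           // 6 bits of co-routine ID
}
```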
Compared to NOWAIT, the new operation in WAITDIE is transaction wait. Figure 2.6 shows
the WAITDIE operations in the fetch stage. With RPC, it can be implemented intuitively: when
an accessed record is locked, the lock request handler decides based on the request transaction’s
15
timestamp whether to let it wait until unlocked, or send back a failure immediately. Note that
the handler does not busy-wait for the lock, which blocks other incoming requests. Instead, the
transaction is added to the lock’s waiting list, which is checked in the event loop periodically by
the handler thread. On lock release, the handler thread removes the transaction from the waiting
list and replies to the coordinator with a success message and the locked record.
With one-sided primitives, the implementation is less straightforward. The key difference is that
the current transaction needs to obtain the record’s timestamp, even if it is locked, and decide
whether to abort or wait by itself. Similar to NOWAIT, we use an RDMA CAS followed by an RDMA READ
to retrieve the remote lock together with its timestamp and record, as shown in Figure 2.6. If the
record is not locked, the CAS succeeds, atomically writes the transaction’s timestamp into the
remote lock, and returns 0. If the CAS fails, i.e., the record is locked, rather than aborting immediately,
the current transaction compares its timestamp with the returned timestamp, which identifies the
lock-holding transaction, to determine whether to abort itself or wait. If the decision is to wait, the
co-routine keeps posting RDMA CAS with READ requests, yielding after every unsuccessful trial,
until it succeeds.
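The wait-or-die decision can be sketched as the retry loop below, reusing the hypothetical OneSidedBatch helper from the NOWAIT sketch; it is an illustration of the logic just described, not RCC's actual code.

```cpp
#include <cstdint>

// One-sided WAITDIE lock attempt (sketch). my_ts is the packed 64-bit timestamp;
// an older transaction has a smaller timestamp.
bool waitdie_lock(int node, uint64_t lock_off, uint64_t tuple_off,
                  uint64_t my_ts, void* tuple_buf, uint32_t tuple_len) {
  while (true) {
    uint64_t holder_ts = 0;
    OneSidedBatch b;
    b.add_cas(node, lock_off, /*expect=*/0, /*desire=*/my_ts, &holder_ts, /*signaled=*/false);
    b.add_read(node, tuple_off, tuple_buf, tuple_len, /*signaled=*/true);
    b.post_and_yield();                  // yield the co-routine; the CPU serves others

    if (holder_ts == 0) return true;     // CAS won: locked and record fetched
    if (my_ts > holder_ts) return false; // younger than the lock holder: die (abort)
    // Older than the holder: wait by retrying the doorbell-batched CAS + READ.
  }
}
```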
Our current one-sided implementation of WAITDIE is not starvation-free for old transactions:
when the oldest transaction fails to lock, the lock may be released and reacquired by another
younger transaction, making the oldest transaction starve. One potential solution is to put a counter
along with the timestamp, initialized to 0. When an old transaction detects a failure after the first
CAS & READ, it increments the counter once by issuing an RDMA FETCH_AND_ADD operation;
all future younger transactions accessing the record will then abort until the counter is reset to 0.
Another FETCH_AND_ADD is needed to decrement the counter when the old transaction successfully
grabs the lock.
Key RCC insight: With the use of co-routines, repeatedly posting a doorbell-batched CAS and READ
until success does not prevent the CPU from processing requests in other co-routines.
Figure 2.7: MVCC ((a) Write; (b) Read)
2.4.4 MVCC
MVCC (Multi-Version Concurrency Control) [7] reduces read-write conflicts by keeping multiple
versions of each record and providing a recently committed version when possible. As shown in Fig-
ure 2.3, the metadata of each tuple in MVCC consists of three parts: 1. the write lock, which contains
the timestamp of the current transaction holding the lock that has not committed yet (tuple.tts);
2. the read timestamp (tuple.rts), which is the latest (largest) transaction timestamp that has suc-
cessfully read the record; and 3. the write timestamps (tuple.wts), which are the timestamps of
recently committed transactions that have performed writes on the record. These versions are kept
in the participants. We also denote the timestamp of the current transaction trying to fetch records
as ctts.
To access a record in RS, we check Cond R1: there is a proper record version based on the
tuple.wts of recently committed transactions—it should choose the largest tuple.wts smaller
than ctts; and Cond R2: tuple.tts is 0 or larger than ctts. Cond R2 means there is no un-
committed transaction writing the record, or the write happens after the read, in which case the read
transaction can still correctly get one of the committed versions of the record. If both Cond R1
and R2 are satisfied, the version chosen by Cond R1 can be returned.
To access a record in WS, we check Cond W1: the transaction’s timestamp is larger than the
maximum tuple.wts and the current tuple.rts; and Cond W2: the record is not locked. If
either check fails, the transaction is aborted; otherwise, the record is locked with tuple.tts updated
to ctts, and a new record is created and sent back.
Conceptually, MVCC maintains the following properties: a write of transaction ctts cannot
be “inserted” among the committed transactions indicated by tuple.wts, and the write should be
ordered after any performed read; a read should always return the most recent committed version
of a record. The key requirement for correctness is that the condition check for an RS or WS record
should be atomic.
The original MVCC requires using a linked list to maintain the set of record versions. However,
the nature of one-sided primitives makes it costly to traverse a remote linked list: in the worst
case, the number of one-sided operations for a single remote read is proportional to the number of
versions in the list. Thus, we use a static number of memory slots allocated for each record to store
the stale versions. A transaction simply aborts when it cannot find a suitable version among the
slots available for a read operation. The number of slots determines the trade-off between the extra
read aborts and the reduced memory/traversal overhead. We choose four slots because our preliminary
experiments show that at most 4.2% of read aborts are due to slot overflow. A typical one-sided
MVCC run over SmallBank reads the four version slots with likelihoods of 80%, 13%, 3%, and 2%.
While our design incurs extra space to allocate a static number of version slots compared to using
a linked list, it avoids the more expensive performance penalty of multiple round trips: 20% of
the requests would otherwise incur more than one round trip.
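The condition checks can be sketched as follows; the field names follow the text, the four-slot layout follows the MvccTuple sketch given earlier, and the functions are an illustration of Cond R1/R2 and Cond W1/W2 rather than RCC's actual code.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

constexpr int kNumVersions = 4;

struct MvccMeta {
  uint64_t tts;                 // writer lock: 0 if unlocked, otherwise the writer's ctts
  uint64_t rts;                 // largest committed reader timestamp
  uint64_t wts[kNumVersions];   // committed write timestamps, one per version slot
};

// Cond R1 + R2: pick the newest committed version older than ctts, provided no
// conflicting uncommitted writer exists. Returns the chosen slot, or nullopt to abort.
std::optional<int> mvcc_read_check(const MvccMeta& m, uint64_t ctts) {
  if (m.tts != 0 && m.tts < ctts) return std::nullopt;  // Cond R2 violated
  int best = -1;
  for (int i = 0; i < kNumVersions; ++i)                 // Cond R1: largest wts < ctts
    if (m.wts[i] != 0 && m.wts[i] < ctts && (best < 0 || m.wts[i] > m.wts[best]))
      best = i;
  if (best < 0) return std::nullopt;                     // no suitable version: abort
  return best;
}

// Cond W1 + W2: the write must be ordered after all committed writes and all reads,
// and the tuple must not be locked.
bool mvcc_write_check(const MvccMeta& m, uint64_t ctts) {
  uint64_t max_wts = 0;
  for (uint64_t w : m.wts) max_wts = std::max(max_wts, w);
  return m.tts == 0 && ctts > max_wts && ctts > m.rts;   // W2 && W1
}
```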
In MVCC, we use the same timestamp organization as WAITDIE. The local clock reduces the band-
width overhead of a global clock but may introduce significant bias. While not affecting correct-
ness, a large time gap between different machines may lead to a long waiting time. To mitigate
this issue, each transaction co-routine maintains a local time and adjusts it whenever it
finds a larger tuple.wts or tuple.rts in any tuple received: the remote time encapsulated in
tuple.wts or tuple.rts is extracted and the local time is adjusted accordingly if the extracted
remote time is larger. This mechanism limits the gap between the local timers of different machines
and reduces the chance of an abort due to the lack of a suitable version among the fixed version slots.
While it is not hard to conceptually understand MVCC, the implementation with RDMA needs
to ensure atomicity. Let us first consider accessing records in WS. One way is to first check Cond
W2 and lock the record; at this point, the metadata cannot be accessed by other writes, so we can
reliably check Cond W1. If it is not satisfied, the lock is released and the write transaction aborts.
However, in this way we need to perform a lock for every write, even if the write transaction
cannot be properly serialized. This is particularly a problem for one-sided primitives, because the
lock is implemented with an RDMA ATOMIC CAS. The better approach is to first check Cond W1
and then acquire the lock. However, a subtle issue arises because Cond W1 and Cond W2 are not
checked atomically. Between the point that Cond W1 is satisfied and the point the lock is acquired,
another transaction that writes the record can lock the record and commit (unlock). According to
the protocol property, the current transaction should be aborted, but it will find both Cond W1 and
Cond W2 satisfied. To ensure atomicity while avoiding the overhead of locking, we propose the
double-read mechanism: after the lock is acquired, Cond W1 is checked again; if it still holds, the
write can proceed, otherwise the transaction is aborted.
As in Figure 2.7 (a), with RPC, the write protocol can be implemented by the handler on the
participant. With one-sided primitives, the coordinator posts an RDMA READ to read the metadata
of the record (tuple.rts and tuple.wts) on the participant, then checks Cond W1 locally. If it
is satisfied, the coordinator posts an RDMA ATOMIC CAS to lock the record, and a second RDMA
READ to fetch the tuple. Cond W1 is checked again based on the just-returned tuple.rts and
tuple.wts; if it still holds, the returned record tuple.record is kept locally in the coordinator.
Otherwise, the transaction aborts and the lock is released.
When accessing records in RS, the tuples need to be fetched atomically. The double-read
mechanism discussed before can be generalized to two consecutive reads of the same tuple:
if the contents of the two returned copies are the same, then we are sure that atomicity is not violated.
Based on the atomically read tuple, Cond R1 and Cond R2 can be checked to produce the appro-
priate committed version for the record in RS. If the second returned tuple is different from the
first, the transaction is simply aborted. We apply a small optimization to reduce unnecessary
aborts: among the two copies of the metadata, we only need to ensure that tuple.wts matches. The
tuple.tts can differ, since the transaction corresponding to the first tuple.tts may have aborted
between the two reads. But as long as Cond R2 is satisfied, the read can still get a version among
tuple.wts.
As in Figure 2.7 (b), with RPC, the read procedure can be implemented in a straightforward
manner with the handler on the participant. With one-sided primitives, the two reads are implemented
by two doorbell-batched RDMA READs. The only additional operation is an RDMA ATOMIC
CAS to update the rts of the record in the participant. If it fails, we can simply retry until it succeeds.
Note that such a failure does not imply a conflict, but just multiple concurrent reads.
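The RS fetch can be sketched as below, reusing the hypothetical OneSidedBatch helper, the MvccMeta layout, and mvcc_read_check() from the earlier sketches; remote_cas() is likewise a hypothetical wrapper around a one-sided RDMA CAS that reports the observed value on failure. Only the metadata handling is shown; the record travels in the same READs.

```cpp
#include <cstdint>
#include <cstring>

// Double-read fetch of an RS tuple with one-sided primitives (sketch).
bool mvcc_fetch_rs(int node, uint64_t tuple_off, uint64_t rts_off, uint64_t ctts) {
  MvccMeta m1, m2;
  OneSidedBatch b;                 // two READs of the same tuple, one doorbell
  b.add_read(node, tuple_off, &m1, sizeof(m1), /*signaled=*/false);
  b.add_read(node, tuple_off, &m2, sizeof(m2), /*signaled=*/true);
  b.post_and_yield();

  // Atomicity check: only the committed-version timestamps must match; tts may
  // differ because the writer holding the first tts could simply have aborted.
  if (std::memcmp(m1.wts, m2.wts, sizeof(m1.wts)) != 0) return false;
  if (!mvcc_read_check(m2, ctts)) return false;   // Cond R1 and Cond R2

  // Advance the remote rts toward ctts with an RDMA CAS; a failure only means a
  // concurrent reader raced us, so take the observed value and retry.
  uint64_t observed = m2.rts;
  while (observed < ctts && !remote_cas(node, rts_off, observed, ctts, &observed)) {}
  return true;
}
```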
On commit, with one-sided primitives, the coordinator locally overwrites the oldest wts with
its own ctts and updates the corresponding record to the locally created one for the write. Then
it posts two RDMA WRITEs. The first write puts the locally prepared new record+metadata to
the participant; the second write releases the lock. With RPC, the procedure can be implemented
similarly.
Garbage collection & memory management Since our MVCC uses a static number of slots
instead of employing a linked list, all slots are pre-allocated both for the use of two-sided RPC
function calls and for one-sided RDMA access. Therefore, it is unnecessary to garbage collect
stale versions when they are out of visibility of any read/write.
Clock synchronization MVCC uses a local clock plus adjustment instead of global clock syn-
chronization to avoid wasting network bandwidth. Global synchronization protocols like NTP [56]
are typically used to keep machines synchronized with the Internet within a millisecond-level skew.
PTP [46] can synchronize networked computers within a sub-millisecond skew by employing a Best
Master Clock (BMC) algorithm. We integrate the adjustment within the MVCC protocol, making
the synchronization on-demand.
Key RCC insight: One-sided primitives make traversing a remote linked list costly; using a
static number of version slots can effectively reduce communication overhead.
Figure 2.8: SUNDIAL ((a) Write; (b) Read)
2.4.5 SUNDIAL
SUNDIAL [80] is an elegant protocol based on logical leases that avoids unnecessary aborts while still
maintaining serializability by dynamically adjusting the timestamps of transactions, and hence the commit order.
Based on the tuple format in Figure 2.3, the lease of a tuple is specified by tuple.[wts,rts].
Each transaction has a commit_tts, which indicates the timestamp required for the transaction
to satisfy the current leases of the accessed records. When accessing a record in RS, the transaction
atomically reads the tuple and updates commit_tts to Max(commit_tts, tuple.wts). This is because,
to correctly read the record, the transaction has to be logically ordered after the most recent writer
transaction. When accessing a record in WS, the transaction tries to lock the tuple and, if the record is
also in RS, checks whether tuple.wts is the same as RS[key].wts. The second condition
ensures that no transaction writing the record has committed since the read. If both conditions
pass, the transaction’s commit_tts is updated to Max(commit_tts, tuple.rts+1). This ensures that
the transaction is logically ordered after the current lease of the record. Since other transactions
may have read the record during the lease, without such an update, the transaction would have to be
aborted.
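These two adjustments can be written as a small sketch; the field names follow the text, and the functions are illustrative rather than RCC's actual code.

```cpp
#include <algorithm>
#include <cstdint>

struct SundialMeta { uint64_t lock, wts, rts; };

// Reading a record orders the transaction after the most recent writer.
inline void on_rs_access(uint64_t& commit_tts, const SundialMeta& t) {
  commit_tts = std::max(commit_tts, t.wts);
}

// Writing a record orders the transaction after the record's current lease, i.e.,
// after every reader that may have observed [wts, rts].
inline void on_ws_access(uint64_t& commit_tts, const SundialMeta& t) {
  commit_tts = std::max(commit_tts, t.rts + 1);
}
```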
Although the update of commit_tts during execution tries to satisfy the current lease of each
individual record, at commit time the transaction needs to be validated to ensure that its current
commit_tts falls into the leases of all records in RS. If this is not satisfied, SUNDIAL allows the
transaction to attempt to renew the lease by adjusting the tuple.rts in the data store at the
participant (note that commit_tts must be greater than the wts of the record in RS, based on how it
is updated). The renewal fails if (1) the current wts is not the same as the current tuple.wts, meaning
that a later committed transaction has written the record, which invalidates the previously read
record; or (2) the record is locked, meaning that a transaction is trying to write the record,
which prevents the lease extension. Otherwise, the transaction can adjust the lease by updating
tuple.rts to commit_tts. The key requirement is that the lease renewal operation be
performed atomically. If all RS records are validated and all necessary lease renewals are suc-
cessful, the transaction is committed, updating tuple.wts and tuple.rts of all records in WS to
commit_tts.
For records in WS with one-sided primitives, the tuples can be easily checked after a doorbell-
batched CAS and READ to lock and retrieve the tuple, as in Figure 2.8 (a). Yet to implement SUNDIAL
in RCC, we need to solve two problems. First, for records in RS, the tuple needs to be accessed
atomically. This can be done using the double doorbell-batched reads with one-sided primitives or
simply the double read with RPC introduced in MVCC, as shown in Figure 2.8 (b). Second, we need
to ensure atomic lease renewal, which is more challenging than an atomic read. To implement
this, we first atomically read the tuple from the participant, then use an atomic operation to update
tuple.rts. With these two ideas, we can implement both the RPC and the one-sided SUNDIAL.
In the RPC version, the atomic tuple read and lease renewal are all performed by the handler in the
participant. The coordinator just posts the read and renewal requests and processes the responses
according to the protocol. In the one-sided version, the fetch of tuples in RS and WS is similar to
MVCC, with double doorbell-batched reads. Based on the fetched tuples, the coordinator locally
performs the SUNDIAL protocol operations. For lease renewal, the coordinator first atomically
reads the tuple, then checks the lease-extension condition; if the extension is allowed, it posts an RDMA
ATOMIC CAS with the previous tuple.rts as the old value and its commit_tts as the new value.
In this way, the lease renewal is performed atomically. It is worth noting that we can implement
it in this manner because the SUNDIAL protocol only requires updating one variable, tuple.rts, to
renew the lease. If multiple variables needed to be updated, then more sophisticated mechanisms would be
needed, which is beyond the scope of this dissertation.
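The one-sided renewal path can be sketched as follows; SundialMeta follows the earlier sketch, and atomic_read_tuple() and remote_cas() are hypothetical wrappers (the former standing for the double-read technique, the latter for a one-sided RDMA CAS), not RCC's actual API.

```cpp
#include <cstdint>

// Sketch of one-sided lease renewal at commit time. read_wts is the wts observed
// when the record was originally read into RS; returns true if commit_tts now
// falls within the (possibly extended) lease.
bool renew_lease(int node, uint64_t tuple_off, uint64_t rts_off,
                 uint64_t read_wts, uint64_t commit_tts) {
  SundialMeta cur = atomic_read_tuple(node, tuple_off);

  if (cur.wts != read_wts)   return false;  // a later writer committed: read is stale
  if (cur.lock != 0)         return false;  // a writer holds the tuple: cannot extend
  if (cur.rts >= commit_tts) return true;   // lease already covers commit_tts

  // Extend the lease atomically: CAS rts from its observed value to commit_tts.
  // Updating this single word suffices because SUNDIAL only needs rts changed; a
  // failed CAS can be handled by re-reading the tuple and retrying or aborting.
  return remote_cas(node, rts_off, /*expect=*/cur.rts, /*desire=*/commit_tts);
}
```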
Key RCC insight: We can efficiently implement atomic tuple read by using double doorbell
batched READs of one-sided primitives or double read with RPC. Atomically renew the lease in
SUNDIAL by atomically updatingtuple.rts.
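As a concrete illustration, the following is a minimal C++ sketch of how the one-sided lease renewal could be posted with standard ibverbs. The tuple layout, the field offsets, and the helper signature are illustrative assumptions for this sketch rather than RCC's actual code.

// Sketch of SUNDIAL's one-sided lease renewal using an RDMA ATOMIC CAS (assumptions noted above).
#include <infiniband/verbs.h>
#include <cstddef>
#include <cstdint>

// Assumed tuple header layout: rts is an 8-byte field reachable by RDMA atomics.
struct TupleHeader {
  uint64_t lock;  // lock word (taken by a separate CAS for records in WS)
  uint64_t wts;   // write timestamp
  uint64_t rts;   // read timestamp, i.e., the end of the current lease
};

// Post a CAS that replaces tuple.rts with commit_tts only if it still equals the
// rts observed by the preceding atomic read. Returns 0 if the work request is posted.
int post_lease_renewal(ibv_qp* qp, uint64_t remote_tuple_addr, uint32_t rkey,
                       uint64_t observed_rts, uint64_t commit_tts,
                       uint64_t* local_result, uint32_t lkey) {
  ibv_sge sge{};
  sge.addr   = reinterpret_cast<uint64_t>(local_result);  // old rts is returned here
  sge.length = sizeof(uint64_t);
  sge.lkey   = lkey;

  ibv_send_wr wr{}, *bad = nullptr;
  wr.sg_list    = &sge;
  wr.num_sge    = 1;
  wr.opcode     = IBV_WR_ATOMIC_CMP_AND_SWP;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.atomic.remote_addr = remote_tuple_addr + offsetof(TupleHeader, rts);
  wr.wr.atomic.rkey        = rkey;
  wr.wr.atomic.compare_add = observed_rts;  // expected old rts
  wr.wr.atomic.swap        = commit_tts;    // new lease end
  return ibv_post_send(qp, &wr, &bad);
}
// After polling the completion queue, the renewal succeeded iff *local_result == observed_rts.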
2.4.6 CALVIN
Unlike all other protocols, CALVIN [71] enforces a deterministic order among the transactions in an epoch, i.e., all transactions received by the system during a certain time period. Readers can refer to the original paper for the complete motivation and advantages of this approach; here we are interested in how the communication happens and how it can be implemented with RDMA for such a protocol.
Figure 2.9: CALVIN: Txn Input Broadcasting (RPC vs. one-sided paths between the sequencer (SEQ), CPU, and RNIC; the one-sided path uses two doorbell-batched RDMA WRITEs).
In RCC, CALVIN works as follows. For each epoch, each machine node receives a set of transactions. The sequencing layer in each machine determines the order of the locally received transactions and broadcasts them to all other machines. After the transaction dispatch, each machine has the whole set of transactions in the epoch in an agreed-upon, deterministic order. The transaction dispatch incurs CALVIN's first source of communication: each transaction's inputs, i.e., its RS and WS, are delivered to all other machines. With RPC, such information can be sent in a batch and the receiver nodes will store the data locally. With one-sided primitives, the implementation is more challenging, since the sender node needs to be aware of the location to write to on remote nodes. We design a specific buffer structure in each node, whose layout is known to all machines, so that the sender can directly use doorbell-batched RDMA WRITEs to deliver the transaction information to all other nodes and update metadata. The procedure is shown in Figure 2.9.
Figure 2.10 exemplifies the buffer organization design for CALVIN. RCC CALVIN uses two memory buffers that enable RDMA remote access. 1. CALVIN Request Buffer (CRB). Each CRB contains one CALVIN Header (CH) and a list of CALVIN Requests (CR). Each CH has control information for CALVIN's scheduler to decide whether it has collected all transaction inputs in one epoch, and whether all transactions in a batch have finished execution and all threads should move on to the next epoch. 2. CALVIN Forward Buffer (CFB). Each execution co-routine uses one CFB to receive forwarded values from other machines. We will discuss CALVIN's value forwarding in later paragraphs. Besides these two buffers, to support asynchronous replication, each backup machine has a list of CRBs for receiving asynchronous backup requests for each epoch.
Figure 2.10: RDMA-enabled buffer organization with batch size per epoch = 4 and the maximum number of read/write sets per transaction = 3. (Each CALVIN Header (CH) records the Epoch, Status, Batch Size, Chunk Size, and Received Size; each CALVIN Request (CR) carries req_idx/req_seq, a timestamp, and req_info; each RWSets entry holds (len, val) pairs; each partition keeps per-epoch CRBs and a CFB.)
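To make the buffer layout concrete, the following C++ sketch spells out one plausible encoding of the CH and CR fields named in Figure 2.10. The field widths, names, and constants are assumptions for illustration, not RCC's actual definitions.

// Rough sketch of the CALVIN Request Buffer layout implied by Figure 2.10 (assumed field widths/names).
#include <cstdint>

constexpr int kMaxRWSetsPerTxn   = 3;  // maximum read/write sets per transaction (example value)
constexpr int kBatchSizePerEpoch = 4;  // batch size per epoch (example value)

struct CalvinHeader {       // CH: control information for the scheduler
  uint64_t epoch;           // which epoch this batch belongs to
  uint32_t status;          // e.g., whether all inputs are collected / all txns finished
  uint32_t batch_size;      // number of CRs in the batch
  uint32_t chunk_size;      // size of each chunk written by the sender
  uint32_t received_size;   // bytes received so far, updated by RDMA WRITEs
};

struct RWSet {              // forwarded read/write set entries: (len, val) pairs
  uint32_t len[kMaxRWSetsPerTxn];
  uint64_t val[kMaxRWSetsPerTxn];
};

struct CalvinRequest {      // CR: one transaction input
  uint32_t req_idx;         // a.k.a. req_seq: position within the batch
  uint64_t timestamp;       // sequencing timestamp
  RWSet    req_info;        // transaction input (RS/WS)
};

struct CalvinRequestBuffer {             // CRB: one per (sender partition, epoch)
  CalvinHeader  header;
  CalvinRequest requests[kBatchSizePerEpoch];
};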
CALVIN has a unique execution model in which each transaction is executed by multiple machines. Specifically, the machines that, according to the data partition, hold records in WS will execute the operations of the transaction that write these records. These machines are called active participants. The machines that hold records only in RS are called passive participants: since they do not contain records in WS, they do not execute the transaction but only provide data to the active participants. To start the execution on active participants, they need to obtain all records in RS/WS. This leads to the second communication in CALVIN.
First, the passive participants need to send their local records in RS to all active participants. Second, the active participants need to send their local records in WS to the other active participants. Active participants wait and collect all the needed records forwarded from other machines. The two-sided implementation is easier since we can simply use a data structure holding the mappings from tuple keys to their values in the epoch. The one-sided version needs two doorbell-batched RDMA WRITEs to forward values and notify the receiver. After the communication, the transactions can execute on the active participants.
We have only described the key operations in CALVIN that are relevant to communication and omitted many details, which can be found in [71]. The main challenge of implementing CALVIN is to design the sophisticated data structures that facilitate correct communication between machines, especially for one-sided primitives. We choose not to discuss them in detail since they are mainly engineering efforts. Compared to the other five protocols, we do not need to consider many subtle issues to ensure correctness, because after transaction dispatch and RS/WS preparation, the execution is mostly local. We believe including CALVIN in RCC is important because we can understand the communication implementation and cost for a shared-nothing protocol. As far as we know, it is also the first implementation of CALVIN with RDMA.
Key RCC insight: We can implement the input broadcasting and value forwarding by two
doorbell-batched WRITEs in the one-sided version with a careful design and handling of RDMA-
enabled buffer data structures.
2.5 Hybrid Protocols
Since RCC can evaluate all protocol stages, a natural question is: what would be the best implementation if we could use different primitives for different stages? DrTM+H [75] only provides the answer for OCC; what about the others?
2.5.1 Methodology
RCC uses two methods for exploiting the potential of hybrid protocols. The first method is based on the stage-wise latency breakdown produced by RCC. Accordingly, hybrid designs for protocols can be constructed straightforwardly by cherry-picking the better communication type, two-sided or one-sided, for each operation. Figure 2.11 shows the latency breakdown of all five protocols in RCC using one co-routine under various workloads. As one example, we can see that for SmallBank: 1. a hybrid design of MVCC which includes RPC Read & Lock and one-sided Log & Release & Commit can be a good candidate for hybrid MVCC; 2. a hybrid design of SUNDIAL which includes RPC Read & Renew and one-sided Lock & Log & Commit will incur shorter latency and thus may improve its throughput on SmallBank. From the analysis of the latency results, we see that: 1. Log, Commit, and Release operations prefer one-sided primitives; 2. SUNDIAL's renew operation prefers two-sided RPC; 3. for complex read/lock operations as in MVCC and SUNDIAL, two-sided RPC may be rewarding; and 4. the best hybrid design of any protocol is workload-dependent.
Figure 2.11: Latency breakdown: RPC vs. one-sided (stages: Read, Lock, Validate, Log, Release, Commit, Renew; protocols: nowait, waitdie, occ, mvcc, sundial; workloads: SmallBank, YCSB, TPC-C).
Alternatively, RCC implements all protocols in a way that makes it possible to conduct an exhaustive search over all combinations of hybrid protocols. This is useful when multiple co-routines are involved or when the system load is high. RCC provides a configurable framework that can comprehensively evaluate the two-sided, one-sided, and any hybrid implementation of the included protocols. To achieve this goal, we provide a "code" for each hybrid implementation, with each binary digit in the code specifying the primitive to use for one stage. This interface allows RCC to be friendly to both common and expert users: common users can find the best hybrid implementation given the protocol and the workload specification, while expert users can specify their own hybrid code to indicate the primitive used in each stage and verify their intuitions quickly. By leveraging RCC, we aim to find solid evidence of the best hybrid design instead of asking users to guess and try based on suggestive guidelines. Figure 2.12 shows a comparison of stage-wise hybrid protocols against their purely two-sided RPC or one-sided implementations when 10 co-routines are used. Each blue dot in Figure 2.12 corresponds to a hybrid design with one combination of primitives across stages. It can be seen that most hybrid designs lie between the purely two-sided and purely one-sided designs. Yet there are hybrids that do outperform the better of the two, both latency-wise and throughput-wise.
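As a small illustration of this interface, the sketch below shows one way such a per-stage "code" could be encoded and queried in C++. The stage list, the bit assignment, and the parsing helper are assumptions for illustration; RCC's actual encoding may differ.

// Sketch of a per-stage hybrid "code": one bit per stage, 0 = RPC, 1 = one-sided.
#include <cstdint>
#include <string>

enum Stage { READ = 0, LOCK, VALIDATE, LOG, RELEASE, COMMIT, RENEW, NUM_STAGES };

struct HybridCode {
  uint8_t bits = 0;  // bit i == 1 -> use one-sided primitives for stage i

  bool one_sided(Stage s) const { return (bits >> s) & 1u; }

  static HybridCode parse(const std::string& s) {  // e.g., "0011110"
    HybridCode c;
    for (int i = 0; i < NUM_STAGES && i < static_cast<int>(s.size()); ++i)
      if (s[i] == '1') c.bits |= static_cast<uint8_t>(1u << i);
    return c;
  }
};
// An exhaustive search then simply enumerates all 2^NUM_STAGES codes (skipping
// stages a protocol does not have) and benchmarks each resulting hybrid design.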
Figure 2.12: Performance of all stage-wise hybrid implementations compared with the two-sided and one-sided implementations: RPC (green), one-sided (red), hybrid (blue), for three workloads from top to bottom: SmallBank, YCSB, TPC-C (latency vs. throughput for nowait, waitdie, occ, mvcc, sundial).
2.5.2 Implementation Requirements
Figure 2.13: Overall throughput, latency, abort rate and # of network round trips (TCP, RPC, one-sided, and hybrid implementations of nowait, waitdie, occ, mvcc, and sundial under SmallBank, YCSB, and TPC-C).

The design of a universal hybrid implementation generator needs to satisfy several requirements to ensure correctness. First, the remote tuple address must be recorded for RPC Read or Lock, since future one-sided stages may need the offset to access the tuple. Second, any two-sided or one-sided stage must work correctly under the assumption that it may work with another stage using a different primitive. We rely on a shared RDMA-enabled memory region for every tuple in the read/write set to maintain correct communication between stages. Third, the heterogeneous stages must reach a consensus on whether each stage has finished its work correctly. This can be tricky if not handled carefully: the RPC handler must notify lock requesters of the completion of the lock not only by sending back a success reply, but also by writing a success bit in the agreed region of the RDMA-enabled memory of the locked tuple, so that a one-sided Release stage can successfully release the lock.
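One way to realize the shared RDMA-enabled per-tuple region described above is a small metadata record kept in RDMA-registered memory; the exact field set below is an assumption based on these requirements, not RCC's actual layout.

// Sketch of per-tuple metadata shared between heterogeneous (RPC and one-sided) stages.
#include <cstdint>

struct TupleStageMeta {
  uint64_t remote_addr;   // remote tuple address recorded at RPC Read/Lock time, so that
                          // later one-sided stages know the offset to access the tuple
  uint32_t rkey;          // remote key of the memory region holding the tuple (assumed field)
  uint8_t  lock_success;  // success bit written by the RPC lock handler, so that a
                          // one-sided Release stage knows the lock was actually taken
  uint8_t  logged;        // set when the Log stage has completed (assumed field)
};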
2.6 Evaluation
2.6.1 Workloads and Setups
We use three popular OLTP benchmarks, SmallBank [69], YCSB [15], and TPC-C [17], to evaluate
protocols using two-sided RPC, one-sided primitives, and stage-wise hybrid primitives. For each
protocol and benchmark, we leverage the cherry-picking methodology described in section 2.5 to
find the best hybrid design, which can happen to be purely one-sided across all stages. We include
the traditional TCP-based protocols in the evaluation section. For all benchmarks, records are
partitioned across nodes.
SmallBank [69] is a banking application. Each transaction performs reads and writes on the
account data. SmallBank features a small number of writes and reads in one transaction (< 3) with
simple arithmetic operations, making SmallBank a network-intensive application.
YCSB [15] (The Yahoo! Cloud Serving Benchmark) is designed to evaluate large-scale Inter-
net applications. There is just one table in the database. YCSB parameters such as record size, the
number of writes or reads involved in a transaction, the ratio of read/write, the contention level, and
time spent at the computation phase are all configurable. In all our experiments, the record length
is set to 64 bytes. The number of records in the YCSB table is proportional to the cluster size and
the number of transaction threads used. By default, each transaction contains 10 operations: 20%
write, and 80% read, and it spends 5 microseconds in its execution phase. The hot area accounts
for 0.1% of total records. The contention in YCSB is controlled by allowing a configurable percentage of reads/writes to access the hot area; we call this percentage the Hot Access Probability, which is 10% by default. In Section 2.6.4, we study the effects of contention.
TPC-C [17] simulates the processing of warehouse orders and is representative of CPU-intensive
workloads. In our evaluation, we run the new-order transaction since other transactions primarily
focus on local operations. New-order accounts for 45% of TPC-C and involves longer and more complex transactions with up to 15 distributed writes.
We evaluate RCC on four nodes of an RDMA-capable EDR cluster, each node equipped with two 12-core Intel Xeon E5-2670 v3 processors, 128GB RAM, and one ConnectX-4 EDR 100Gb/s InfiniBand MT27700. With one RNIC on each node, we run evaluations on the CPU on the same NUMA node as the RNIC to prevent NUMA effects from affecting our results. By default, we use ten transaction execution threads pinned to ten cores (12 threads in total, with one stat thread and one QP thread), and use 1 co-routine in Section 2.6.2 and 10 co-routines in Sections 2.6.4, 2.6.5, and 2.6.6. We enable 3-way replication for RCC. The implementations in RCC are evaluated on three
metrics: throughput, latency and abort rate.
RCC ensures an unbiased cross-protocol comparison by enforcing that 1) all protocol implementations share the same RDMA library and thus the same set of RDMA-related parameters, such as max packet size and max receive queue size; 2) other system-wide parameters, such as the number of co-routines and threads, are the same across protocol comparisons; and 3) the same set of workload-related parameters is applied across comparisons. These three factors leave the protocols themselves as the only variable, thus making the comparison unbiased.
2.6.2 Overall Results
Figure 2.13 shows the results of all three implementations of the six protocols. The results show the effects of different implementations and enable cross-protocol comparisons.
For the same protocol, the performance of one-sided is generally better than RPC, except for MVCC under TPC-C. MVCC does not benefit from one-sided primitives on TPC-C, either latency-wise or throughput-wise. As TPC-C contains long 100%-write operations, all protocols incur over 50% abort rate. Therefore, latency is determined by how quickly an abort decision can be made. One-sided MVCC does not outperform RPC in this scenario since a one-sided MVCC transaction may need two round trips to decide to abort.
Across all one-sided implementations, OCC is the best choice for YCSB, yet it becomes second to worst for SmallBank. In fact, one-sided 2PL has better performance on SmallBank than one-sided OCC, MVCC, and SUNDIAL. Besides, the best protocol choice depends not only on workload characteristics but also on communication types. For YCSB, the performance of the RPC implementations is similar across protocols, while the one-sided ones peak at OCC.
Stage      H-MVCC⋆    H-SUNDIAL⋆   H-SUNDIAL†
Read       R          R            O
Lock       R          R            O
Validate   N/A        N/A          N/A
Log        O          O            O
Release    O          O            O
Commit     O          O            O
Renew      N/A        R            R

Table 2.1: Stage-wise hybrid strategy for the three cases that achieve better performance compared to their RPC and one-sided counterparts. (R for RPC and O for one-sided; ⋆ indicates SmallBank and † indicates YCSB.)
Figure 2.14: Throughput (M txns/s) and latency (ms) when varying #co-routines for SmallBank (up) and YCSB (down); RPC, one-sided, and hybrid implementations of nowait, waitdie, occ, mvcc, and sundial.
2.6.3 Effect of Co-routines
The chosen hybrid implementations generally have higher throughput than their RPC or one-sided counterparts, by 32.2% and up to 67%; we found three hybrid implementations performing better than both, throughput-wise. On SmallBank, the hybrid MVCC performs 17.8% and 21.7% better than the RPC and one-sided implementations, respectively; the hybrid SUNDIAL performs 14.8% and 8.6% better than its RPC and one-sided counterparts. On YCSB, the hybrid SUNDIAL performs 51.6% and 4.5% better than the RPC and one-sided implementations. Table 2.1 lists the hybrid strategies for these three occurrences. It is clear that the best hybrid choice for SUNDIAL is workload-dependent.
Figure 2.14 shows how latency and throughput change when increasing the number of co-routines from 1 to 11 in steps of 2, for both SmallBank and YCSB. We see that the latency always increases with more co-routines, because with more co-routines each of them waits longer to be served in turn by its serving thread. The throughput also increases, since more co-routines can hide the latency of network operations. However, we also observe that throughput starts to plateau after a certain number of co-routines. This is due to the higher contention caused by longer latency. The performance of the hybrid implementations lies between the RPC and one-sided ones for SmallBank, and is similar to the one-sided implementations for YCSB, as more co-routines are used. Note that in Figure 2.14, hybrid designs may not perform equivalently or better with more co-routines, since these best hybrid designs are statically determined when #co-routines is 1. For any #co-routines greater than 1, the best hybrid design needs to be re-evaluated by enumeration.
Figure 2.15: CALVIN throughput (K txns/s) on SmallBank and YCSB when varying #co-routines from 1 to 11, for TCP, RPC, and one-sided implementations.
Figure 2.15 shows the results for CALVIN. Due to its shared-nothing architecture, it is not directly comparable to the others. For both RPC and one-sided, we see that increasing #co-routines may or may not improve throughput. This is because CALVIN requires RDMA-based epoch synchronization among all sequencer co-routines on all machines; therefore, the network latency caused by staggered co-routines cannot be hidden by increasing #co-routines. Note that RDMA-based CALVIN does not achieve as large an improvement over its TCP counterpart as the other protocols do, due to the high synchronization cost.
2.6.4 Effect of Contention Level
Figure 2.16 shows the throughput of the RPC and one-sided implementations of different protocols under different contention levels using YCSB. We control the contention levels by limiting the number of hot records to 0.1% of total records and varying the probability of reads/writes visiting hot records. We have several key observations. With low contention, the throughput differences are small, and the worst one-sided implementation is better than the best RPC one. As the contention increases, the throughput of all protocols decreases, but OCC always drops most significantly because of a larger probability of aborting and a high abort cost, both due to its optimistic assumption, under a high contention level. The performance of NOWAIT and WAITDIE also decreases considerably due to the intensive conflicts between read and write locks. MVCC and SUNDIAL are less affected when the conflict rate increases. As a result, with high contention the throughput of different protocols becomes quite different, but the gaps between RPC and one-sided are much smaller. We also notice that one-sided SUNDIAL and MVCC, although featuring advanced read-write conflict management, are worse than one-sided OCC at low contention. This is because these two have more complicated and costly operations to maintain more information in order to reduce the abort rate; after all, every access to remote data triggers a network operation in their one-sided versions. A key conclusion is that OCC is not the best choice; in fact, it is always the worst under high contention. Such a conclusion can only be justified with a common framework.
Figure 2.16: Effect of contention level on throughput (M txns/s), for the RPC (left) and one-sided (right) implementations of nowait, waitdie, occ, mvcc, and sundial, with Hot Access Probability varying from 0% to 99%.
2.6.5 Effect of Computation
Figure 2.17: Effect of computation on throughput (M txns/s), for the RPC and one-sided implementations with per-transaction computation varying from 1 to 256 μs.
To study the effect of different computation times over the whole lifetime of transaction execution, we add dummy computation in the execution stage of YCSB, ranging from 1 to 256 µs. The results are shown in Figure 2.17. We observe that (1) RPC and one-sided share a similar decreasing trend as computation increases; and (2) the advantage of one-sided over RPC diminishes as the computation increases. For RPC, more computation adds latency to handling RPC requests; for one-sided, whose advantage over RPC comes from not involving the remote CPU in communication, more computation makes that advantage relatively smaller.
Figure 2.18: Throughput on emulated large EDR clusters: YCSB throughput (M txns/s) of the RPC and one-sided implementations on 4 to 160 emulated machines.
2.6.6 Scalability of QPs
To understand how protocols perform with a much larger cluster, we run all protocols against the
YCSB benchmark (90% hot access probability and 0.1% hot area) on several emulated larger EDR
clusters, as shown in Figure 2.18. Each RDMA operation selects the sending QP from multiple
same-destination QPs in a round-robin manner to emulate the network traffic on large clusters.
We observe that on emulated larger clusters, one-sided implementations maintain their superiority
over RPC ones up to a 160-node cluster, yet the advantage gap closes as cluster size increases. We
attribute this behavior to the fact that an increasing number of QPs needed for larger clusters will
cause performance loss due to limited NIC capabilities.
2.7 Related Work
Distributed transaction processing requires dataset partitioned into multiple machines [53]. Dis-
tributed data is also common in other areas such as WBAN [68] and cloud systems [83]. Deneva [31]
is the most recent work comparing distributed concurrency control protocols in a single unified
framework; however, all protocols in Deneva are based on TCP. In terms of comparing RDMA primitives, FaRM [21] finds that polling for RDMA WRITEs outperforms the SEND and RECV verbs.
FaSST [39] shows that UD-based RPC outperforms one-sided primitives. DrTM+H [75] did more
primitive-level comparisons with different payload sizes. Compared to them, RCC compares prim-
itives by constructing a wide range of concurrency control algorithms using both primitives. High
performance transaction systems have been investigated intensively [71, 16, 73, 22, 12, 75, 45].
Most of them focus on distributed transaction systems [16, 22, 12, 75] since it is more challenging
to implement a high performance transaction system with data partitioned across the nodes. Some
works, e.g., [45, 22, 12, 75, 39], focus only on one protocol (i.e., some variants of OCC). Other
works like [76, 80, 71] explore novel techniques like determinism or leasing. However, these
works did not explore the opportunity of using RDMA networks. Compared to NAM-DB [81],
which implements a snapshot protocol based on one-sided primitives, RCC builds serializable pro-
tocols with both primitives. Besides RCC, many domains, such as machine learning, use hybrid methods [59] to obtain better performance compared to pure approaches.
2.8 Summary
In order to leverage distributed memories efficiently, we re-visit the idea of Remote Direct Memory
Access (RDMA). As a case study, we develop RCC, the first unified and comprehensive RDMA-
enabled distributed transaction processing framework containing six serializable concurrency con-
trol protocols. One goal of RCC is to unbiasedly compare RDMA-based protocols on OLTP work-
loads in a common execution environment with the concurrency control protocol being the only
changeable component. Based on RCC, we obtain insights that cannot be obtained from any existing systems. Most importantly, we obtained and analyzed stage-wise latency breakdowns to develop
efficient hybrid implementations. Moreover, RCC can enumerate all hybrid implementations of
a protocol under a given workload characteristic. RCC is a significant advance over the state-
of-the-art; it can provide performance insights and be used as the common infrastructure for fast
prototyping new hybrid implementations.
Chapter 3
Enabling memory/compute dis-aggregation on distributed
graph systems
3.1 Introduction
With the advance of ultra-fast networks enabled by RDMA, resource dis-aggregation has become a hot topic in cloud computing and in designing cloud database management systems [35, 48], in both industry and academia. Hardware resource dis-aggregation provides many benefits, such as better elasticity and higher hardware resource utilization. However, most existing software was built without resource dis-aggregation in mind, not to mention that much of it consists of frameworks and middleware that many applications have been running on top of. Therefore, the assumption of a monolithic machine, in which all hardware resources must be used together, may waste hardware resources or slow down applications. Researchers in the community have proposed and implemented distributed software artifacts that try to fill the gap between the prospect of ideal hardware dis-aggregation and the current software reality, with the assistance of ultra-fast networks enabled by RDMA [9]. One example is LegoOS [63], an operating system designed specially for dis-aggregated hardware resources that views them as network-attached components. However, besides traditional operating systems, there is much other existing distributed software that also needs to be re-visited with the idea of resource dis-aggregation. In this chapter, we re-visit two distributed graph frameworks, Gemini [85] and Kudu [10]. Gemini is a recent distributed graph processing system and Kudu is the latest distributed graph-mining engine. Our goal is to understand their task computation patterns and memory access patterns, and to enable these frameworks to use computation and memory machines disproportionately.
The design of such memory/compute dis-aggregation for graph frameworks raises several key challenges. First, without decoupling computation and memory resources, the original graph framework may assume that the distributed computation workload is only done against the local graph partition. Yet with such dis-aggregation enabled, all graph workloads must be handled only on compute machines, irrespective of where the graph data is actually stored, and the far-memory machines where partitions of the graph data are located cannot execute local graph computation. Second, graph workloads may need special attention in handling the graph data, since each framework has its own computation pattern and framework-specific features, which introduces extra complexity and trickiness. Third, certain optimizations are required to make the dis-aggregation-enabled graph frameworks feasible and reasonable, performance-wise, compared to the original frameworks.
To address the first challenge, we propose to equip each compute node with a delegation set so that compute nodes can take care of the tasks that would otherwise be done on memory nodes. To address the second challenge, we design different delegation techniques by making use of framework-specific features. For D-Gemini, we design a delegation-based propagation model that is tightly coupled with Gemini's mini-step-based communication per iteration. For D-Kudu, we leverage Kudu's fine-grained task-stealing idea to enable task delegation for memory nodes. To tackle the third challenge, we use a series of caching methods, including bitmap caching, index caching, and edge pre-fetching and caching, to reduce the communication overhead for D-Gemini. For D-Kudu, we leverage the insight that more local-memory copies can be used instead of resorting to inter-node communication. In later sections, we will re-visit these challenges and provide detailed solutions.
In this chapter, we make the following contributions:
• We proposed a novel execution paradigm for graph processing and graph mining in a dis-
tributed setting, which realizes the decoupling of memory and computation resources for
executing de-centralized graph tasks.
• We designed SlimGraph, a thin layer targeting graph processing with resource dis-aggregation,
on top of which we ported existing distributed graph systems with small re-engineering ef-
fort. Two dis-aggregation based systems named D-Gemini and D-Kudu are built by porting
Gemini [85] and Kudu [10] upon SlimGraph.
• We proposed several optimizations to make the altered dis-aggregation-aware systems reach
a reasonable performance compared to the original ones.
3.2 Background and Related Work
Graph Representation. Two commonly used methods for representing a graph are Compressed
Sparse Row (CSR) and Compressed Sparse Column (CSC). As seen in Figure 3.1, both represen-
tations contain an index array and an edge array. For a certain vertex v, the index array contains the
index of v into the edge array where Neighbors(v) are located. For CSR, Neighbors(v) is defined
as all the u where v→ u is an edge in the graph. For CSC, Neighbors(v) is defined as all the u
where u → v is an edge. We use the CSR format for representing graphs in SlimGraph.
Figure 3.1: Graph Representations: (a) the CSR representation and (b) the CSC representation of an example four-vertex graph, each consisting of an index array and an edge array.
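For reference, a minimal CSR container matching the description above could look as follows in C++; the type choices and names are illustrative, not SlimGraph's actual data structures.

// Minimal CSR graph: index array plus edge array, as described above.
#include <cstdint>
#include <utility>
#include <vector>

struct CSRGraph {
  std::vector<uint64_t> index;  // index[v] .. index[v+1] delimit Neighbors(v) in edges (size |V|+1)
  std::vector<uint32_t> edges;  // concatenated outgoing neighbor lists

  // All u such that v -> u is an edge (the outgoing Neighbors(v)).
  std::pair<const uint32_t*, const uint32_t*> neighbors(uint32_t v) const {
    return { edges.data() + index[v], edges.data() + index[v + 1] };
  }
};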
Graph Processing. There are many key graph computation algorithms like breadth first search
(BFS) and SSSP (Single Source Shortest Path). Given a large input graph, it is critical to execute these graph computation workloads efficiently, especially when the size of the graph is beyond the memory limit of a single machine. A simple abstraction for expressing graph algorithms is another critical goal. There has been a large body of graph processing research. Powergraph [28] provides
a new abstraction for graph processing by leveraging the internal structure of power-law graphs.
Powerlyra [11] uses a differentiated processing technique for handling low-degree and high-degree
vertices to reduce the overhead of computation and communication brought by the challenge of
skewed graphs. Likewise, we used a different technique for processing huge-degree vertices in
D-Kudu. Goffish [65] provides a sub-graph-centric programming abstraction for processing large-scale graphs, while our D-Gemini employs a vertex-centric model similar to Pregel [54] and Gemini [85] for chunk-based partitioning and edge processing. PGX.D [32] implements a distributed graph processing framework that is faster than single-machine execution through large traffic reduction and good workload balance. We use bitmap/index caches and a well-managed edge cache to reduce the communication traffic, and also a well-defined delegation set for a balanced workload among compute/memory nodes. Graphlab [50] provides a vertex-centric model for processing graphs on
a single machine. Wonderland [82] provides an abstraction for users to maintain sufficient control
over processing graphs in an out-of-core manner. Compared to all these graph processing sys-
tems, SlimGraph can easily stretch graph processing into a fully distributed manner, leveraging
computation and memory resources over multiple nodes in a cluster; alternatively, SlimGraph can
also use only a small set of nodes or even a single node for processing graphs while maintaining
distributed graph partitions. Like Gluon [20], the latest communication-optimized substrate for distributed heterogeneous graph analytics, SlimGraph optimizes communication heavily; however, SlimGraph does not focus on leveraging a heterogeneous cluster with different processor types
(CPUs or GPUs) on each node. Instead, SlimGraph aims to allow distributed graph frameworks to
leverage dis-aggregated compute and memory resources in a cluster.
Graph Pattern Mining (GPM). GPM is another important type of graph processing workload.
Given an input graph and a user-defined pattern, a GPM system should enumerate all the subgraphs (a.k.a. embeddings) in the input graph that are isomorphic to the input pattern. AutoMine [55] is a single-machine GPM system that generates efficient code for the mining tasks. Peregrine [34] is a fully single-machine, pattern-aware GPM system that is capable of eliminating a large amount of unnecessary intermediate computation. Arabesque [70] is the first distributed GPM system; it automates the process of generating a large number of subgraphs with the application's subgraph extension decisions in the loop. GraphPi [64] is the latest distributed GPM system; it leverages asymmetric restrictions generated from 2-cycles in group theory to eliminate redundant computations. Kudu [10] is the distributed GPM execution engine that leverages the idea of extendable embeddings. Like existing graph computation frameworks, all these GPM systems use either one machine or a fixed number of machines for both computation and graph partition storage.
Dis-aggregated Memory and Far Memory. [14] linked remote memory with the design of
distributed applications. In an era of fast networks, dis-aggregated memory or far memory is an effective idea for bridging the gap between large data sizes and limited local memory. Dis-aggregated memory usually means placing all physical memory in a fabric and allowing hosts to access the memory pool whenever they need it, as illustrated by [2]. However, this requires extra effort to develop new hardware architectures. Compared to dis-aggregated memory, far memory is a pure software technique. With the assistance of the operating system, virtual memory pages can be swapped into the memory of a remote host via the network, as designed by Infiniswap [30], Mojim [84], [49], and [3]. Some works approach far memory by developing user-level systems like Dodo [42], avoiding the need to modify the OS. AIFM [62] proposed application-integrated far memory, where users can leverage remote-able data structures to construct far-memory-access enabled applications. SlimGraph instead tries to hide from applications the extra complexity incurred by enabling access to far memory nodes. Compared with all existing work, Slim-
Graph is the first one that allows distributed applications to make use of far memory, which brings
new challenges and design complexities due to coordination among distributed tasks.
Resource Dis-aggregation and Network Requirements. Resource dis-aggregation splits resources into network-attached components, which is a hot topic in the cloud era since it can be used to increase the flexibility and elasticity of cloud providers. With dis-aggregation, resources can scale to different magnitudes according to demand. However, the use of resource dis-aggregation inevitably increases the network traffic. [26] provides the network requirements to realize resource dis-aggregation. New technologies and developments in Remote Direct Memory Access (RDMA) have enabled ultra-low-latency, ultra-high-bandwidth networks, which make resource dis-aggregation more realistic and practical. There are many hardware dis-aggregation projects like [4]. Firebox [4] maintains warehouse-scale computers built out of commercial off-the-shelf components. dReDBox [41] prototyped a dis-aggregated data center and allows hardware components to be interconnected via buses [13]. LegoOS [63] is the first OS that dis-aggregates hardware resources, but it needs a dis-aggregated cluster to run on. Compared to hardware-level or OS-level resource dis-aggregation, our work, SlimGraph, focuses on the software dis-aggregation of the two most critical resources (i.e., computation and memory) in the context of distributed graph frameworks.
3.3 System Design
3.3.1 Overview
Figure 3.2: An overview of SlimGraph in the distributed graph framework stack: graph processing apps run on Gemini and graph pattern mining apps run on Kudu, both layered over SlimGraph, MPI, and RDMA Verbs.
Figure 3.2 shows an overview of SlimGraph and its software stack. SlimGraph is designed to be a thin layer on top of distributed graph engines, targeting either graph processing (enabled by Gemini) or graph pattern mining (enabled by Kudu). Both graph frameworks originally leverage MPI APIs to realize lower-level RDMA-based communication between nodes in a cluster. SlimGraph likewise leverages MPI primitives to implement cross-node communication when necessary, be they two-sided or one-sided. We designed SlimGraph such that most of its code stays in the graph framework without touching too much code in the graph applications, to maintain the best portability of legacy graph applications on distributed graph frameworks. In the later sections, we describe the design of D-Gemini and D-Kudu in detail.
3.3.2 SlimGraph Execution Paradigm
Figure 3.3: The compute/memory dis-aggregated execution paradigm of SlimGraph. A graph is evenly partitioned as GP-(i, j) on each NUMA memory node (S-i) of each machine. Nodes 0 and 1 are compute nodes running graph workload threads; nodes 2 and 3 are memory nodes running only offloaded memory access threads.
Figure 3.3 exemplifies the task execution paradigm for designing SlimGraph. Basically, we follow the original frameworks' design of how a graph is partitioned onto the memory of different nodes: for both frameworks, graphs are first partitioned across nodes and further partitioned onto NUMA memory nodes. SlimGraph respects the original partition policies of the graph frameworks: partition by either graph vertex IDs or randomly. Each thread is entitled to access the specific graph sub-partition it attaches to. SlimGraph allows sparing an elastic number of nodes as special memory nodes, where only graph partitions are located and no graph workload tasks are executed. For performance considerations, memory access computations can be offloaded to the memory nodes when necessary. Such an execution paradigm decouples the graph storage from the graph workload tasks. The benefit is twofold: 1) graphs can continue to scale with an increasing number of nodes in the cluster, and 2) graph tasks can be, or may only need to be, executed on a smaller number of machines without much performance degradation. In the cloud setting, this enables users to replace some compute-intensive nodes with lightweight ones or even a dedicated far-memory machine to reduce costs.
To adapt the original frameworks into ones following our proposed execution paradigm, we
noted several key questions:
• The original distributed graph frameworks have their own execution models for processing
graph tasks. How can we fit SlimGraph so that it is compatible with the original design while
providing the extra compute/memory dis-aggregation capability?
• With fewer nodes used for graph task execution, how can we reach a reasonable performance compared to the original frameworks?
We address these two questions in later sections of this chapter. Specifically, we provide detailed
designs for D-Gemini in section 3.4 and for D-Kudu in section 3.5. In section 3.6, we provide
some more implementation details. In section 3.7, we evaluate the performance of SlimGraph and
we summarize this chapter in section 3.8.
3.4 D-Gemini
D-Gemini is a version of Gemini [85] respecting SlimGraph's execution paradigm. D-Gemini supports the push update propagation model for now; support for the pull update propagation model is left as future work. As in Figure 3.4, the master node first sends an update message containing the state of vertex v to v's mirror on another node, using a user-defined SparseSignal function, and then the mirror on the remote node updates the states of Neighbors(v) along v's outgoing edges as required by the graph application, according to a user-defined SparseSlot function.
Figure 3.4: D-Gemini's push update propagation model: the master of vertex v on node-0 sends a message via the user-defined SparseSignal, and the mirror on node-1 performs the SparseSlot computation.
In Gemini, every edge processing iteration is divided into mini-steps for better overlapping
computation and inter-node communication. To enable the SlimGraph execution paradigm, D-
Gemini adopts a new delegation-based workload propagation model and a novel algorithm for SparseSlot computation delegation, which we describe in the following subsections.
3.4.1 Delegation-based propagation model
Figure 3.5 shows how the workload is propagated in each mini-step in D-Gemini and the original Gemini. Gemini co-schedules communication and computation in a cyclic ring order [85]: in an N-node cluster, each node i sends its batched workload (graph vertices together with their states) to be processed to node (i − K) mod N at the Kth mini-step (K = 1, 2, 3, ..., N), as shown in Figure 3.5. For each node, the Nth step is to process the local workload in its own memory (not shown). After N mini-steps, each node should have the batched vertices from every node.
Figure 3.5: D-Gemini's delegation-based workload propagation model with two compute nodes (green) and two memory nodes (yellow), compared to Gemini's ring-based mini-step-wise communication per iteration (mini-steps 1 to 3 shown). Note that memory nodes will not receive or send any workloads in D-Gemini.
However, with SlimGraph’s execution paradigm, not all nodes are able to do graph-related
computations: there are nodes serving only as memory nodes shown as yellow rectangles. There-
fore, the traditional ring-based co-scheduling cannot work properly in SlimGraph’s setting. To
cope with this obstacle, we propose a workload propagation method based on delegation. The key idea is to redistribute the graph compute workload that was originally assigned to a memory node to some compute node in the mini-step. In each iteration, before any mini-steps, each compute node i determines a coherent global view of a set of memory nodes DSet(i) as the nodes it wants to delegate. To maintain a balanced workload, we use round-robin delegation sets in SlimGraph:

DSet(i) := { j | j = i + kC, j ≤ N, k = 1, 2, 3, ... }    (3.1)

where C is the number of compute nodes.
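Equation (3.1) can be transcribed directly; the C++ sketch below assumes nodes are numbered 0..N-1 with compute nodes 0..C-1, which matches the example in Figure 3.5 (this numbering is an assumption of the sketch).

// Round-robin delegation set of Equation (3.1): DSet(i) = { i + kC | k = 1, 2, 3, ... }.
#include <vector>

std::vector<int> delegation_set(int i /*compute node id*/, int C /*#compute nodes*/,
                                int N /*#nodes in total*/) {
  std::vector<int> dset;
  for (int j = i + C; j < N; j += C)  // j = i + kC for k = 1, 2, 3, ...
    dset.push_back(j);                // each memory node gets exactly one delegate
  return dset;
}
// Example: with N = 4 and C = 2, delegation_set(0, 2, 4) == {2} and delegation_set(1, 2, 4) == {3}.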
In this way, the graph compute workload on every memory node can be delegated and each
memory node only has a unique delegate. Each compute node then prepares the workload in the
send buffer for itself as well as for the memory nodes it delegates before entering mini-steps. At
each mini-step, if node i needs to receive workload from node j, where j = (i + K) mod N, it tests whether node j is a compute node or a memory node. If j is a compute node, node i receives j's workload message from node j as before. Otherwise, there are two situations: 1) if node i delegates node j, then it has already self-generated j's workload before entering the mini-steps, and hence no communication is needed; 2) if node i does not delegate node j, then it receives j's workload from the compute node that delegates j. As seen in Figure 3.5, at mini-step 1, node 0 receives the workload message from node 1, and node 1 receives the message from the node that delegates node 2, which is node 0. At mini-step 2, node 0 is supposed to receive the workload of node 2, but since node 0 delegates node 2, node 0 does not need any communication. Instead, it just copies node 2's workload from its own send buffer to its own receive buffer. It is similar for node 1 at this mini-step. At mini-step 3, node 0 receives node 3's workload from node 1, which delegates node 3, and node 1 directly receives from node 0.
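Put together, the receive-side decision a compute node makes at each mini-step could be sketched as follows; is_compute(), delegate_of(), and the buffer operations are assumed helper functions, not D-Gemini's actual API.

// Sketch of compute node i's receive logic at mini-step K (N nodes in total).
bool is_compute(int node);                    // assumed helpers, declared for completeness
int  delegate_of(int node);
void recv_workload_from(int node);
void copy_send_buffer_to_recv_buffer(int node);

void receive_workload_at_ministep(int i, int K, int N) {
  int j = (i + K) % N;                        // node whose workload is due at this mini-step
  if (is_compute(j)) {
    recv_workload_from(j);                    // normal ring-based receive, as in Gemini
  } else if (delegate_of(j) == i) {
    copy_send_buffer_to_recv_buffer(j);       // self-generated earlier; no communication needed
  } else {
    recv_workload_from(delegate_of(j));       // receive j's workload from j's delegate
  }
}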
Note that in each mini-step, memory nodes do not receive or send any workloads: they only fulfill remote edge requests. In contrast, a compute node not only receives workloads, but also executes the user-defined SparseSlot handler to update the states of the outgoing neighbors, which constitutes the main computation for graph processing applications. For every compute node, completing these mini-steps guarantees that the updates from each vertex v in the workload are propagated to the portion of v's outgoing neighbors stored at that node. However, for v's outgoing neighbors located at memory nodes, the updates must be handled by a unique delegate according to SlimGraph's execution paradigm in Figure 3.3.
3.4.2 Delegating SparseSlot Computation
Algorithm 1 Basic algorithm for delegating memory nodes for handling SparseSlot with one NUMA partition per node
Require: i ← MachineID
Require: WL(i) is fulfilled for i = 1...N
Require: |DSet(i)| > 0
1: step ← 1
2: while step ≤ N do
3:   for d ∈ DSet(i) do
4:     r ← (d + step) mod N
5:     for {v, msg} ∈ WL(r) do ▷ This for loop is parallel
6:       ext(v) ← fetched from d
7:       if ext(v) then
8:         {idx0(v), idx1(v)} ← fetched from d
9:         Neighbors(v) ← fetched from d
10:        SparseSlot(v, msg, Neighbors(v))
11:      end if
12:     end for
13:   end for
14:   step ← step + 1
15: end while
Algorithm 1 shows our basic algorithm for delegating memory nodes for SparseSlot computation. Before executing this algorithm, it is required that the workload from each node has been prepared, via either self-generation or inter-node communication, and cached locally, as shown in Figure 3.5. Like the mini-steps used in workload propagation, we also split the SparseSlot computation delegation process into N steps. In each step (line 2), a compute node i traverses all its delegated nodes in its DSet(i) (line 3). For each delegated node d, we use r to denote the remote node at this step whose vertex workloads were cached (line 4). These vertex workloads are evenly distributed to threads and processed in parallel, as in line 5. For each workload vertex and its update, an outgoing-neighbor existence bit is fetched from the delegated node d (line 6). If outgoing neighbor(s) of v exist (line 7), the two indices of v into the CSR edge array on d are fetched (line 8) and are further used to fetch the neighbors of v (line 9). Once the neighbors of v are locally cached, the SparseSlot computation updates v's neighbors.
3.4.3 Optimizations
The basic algorithm for delegating SparseSlot computation incurs extra communication overhead compared to the original Gemini. Given a graph G of size (|V|, |E|), line 6 would incur O(V) communications, and lines 8 and 9 also incur O(V) communications in the worst case. To avoid unnecessary fetching from remote delegated memory nodes, D-Gemini pre-caches the neighbor existence bitmap and the CSR indices into the CSR edge array when loading a graph, each of which takes O(V) extra space on each compute/memory node. With the bitmap and index caches, lines 6 and 8 of Algorithm 1 can be replaced with local cache queries, effectively reducing communication overhead without incurring excessive local storage.
The fetch operation at line 9 of Algorithm 1 could also be naively realized as one-sided fetch operations guided by local cache inquiries. However, there are two drawbacks to this naive design:
a) The one-sided fetch operation happens O(V) times. Our experiments show that one-sided fetch operations on graph neighboring data incur extra network overhead proportional to the number of fetch operations, implying an ineffective use of network bandwidth.
b) The user-defined SparseSlot computation of v is immediately data-dependent on the neighbors of v and thus cannot be effectively overlapped with the network communication.
To overcome these obstacles, we optimized the edge requesting process of D-Gemini and leveraged a list of techniques to ensure that only a close-to-minimum number of network operations, determined by the local edge buffer size, are required, and that the computations are mostly overlapped with the communications they depend on. The optimized edge requesting process between a compute node and its delegated memory node is described in the following subsection.
3.4.4 Optimized Edge Requesting of D-Gemini
The key insight for optimizing edge requesting is to avoid sending O(V) edge fetching requests and to hide the network communication as much as possible. Our overall design for the optimized edge
requesting leverages per-thread NUMA-aware bounded buffers to buffer local requests such that
these requests can be handled by corresponding threads on a remote memory machine as a batch.
Figure 3.6: D-Gemini's optimized edge requesting between a compute node i and one of its delegated memory nodes per mini-step of an iteration: NRG threads fill per-thread bounded buffers (BB), a requesting thread sends batched requests, and neighbour serving threads on the memory node return edges into the local edge cache used by the SparseSlot computation threads.
To hide the network latency, we move all network operations off the critical path of the SparseSlot computation, so that a workload vertex v only has to wait for its neighbours to appear in the local edge cache when they are needed.
Figure 3.6 shows the idea behind our optimizations. In D-Gemini, each compute node starts a list of neighbour request generation (NRG) threads after selecting a delegated memory node d at each mini-step of an iteration; these threads generate all neighbour requests into per-thread bounded buffers. Another thread is responsible for collecting all NRG threads' requests from their bounded buffers and sending them to the remote memory machine, where an array of neighbour serving threads accesses the local CSR edge array to fulfill the neighbour requests and send the edges back. Upon receiving edges from the remote memory node d, the local thread that sent the neighbour request for v is responsible for filling the local edge cache and marking v's local edges from d as present.
For generating edge requests, as in Algorithm 2, each NRG thread checks whether a workload vertex v has a non-zero number of outgoing edges (line 1) and whether the neighbours of v from d are not already in the cache (line 4). If so, the NRG thread adds the corresponding request to its per-thread bounded buffer (line 5). Note that each request includes the two cached CSR indices of v, the thread id, and whether the request is intended for outgoing or incoming edges. This information is used by a remote neighbour serving thread to decide whether it is responsible for handling the request, and which local CSR array and which portion of that array should be accessed.
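The fields enumerated above (and written by BB(tid).write in Algorithm 2) suggest a request record roughly like the following; the exact types are assumptions for illustration.

// A neighbour request as buffered by an NRG thread (cf. Algorithm 2); field types are assumed.
#include <cstdint>

struct NeighbourRequest {
  uint32_t v;           // workload vertex whose neighbours are requested
  uint32_t d;           // delegated memory node holding the edges
  uint64_t idx0, idx1;  // cached CSR indices delimiting Neighbors(v) on d
  uint16_t tid;         // requesting thread id (selects the serving thread and edge buffer)
  bool     is_outgoing; // whether outgoing or incoming edges are requested
};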
As illustrated in Algorithm 3, a request sending thread traverses each thread's bounded buffer and determines the number of requests to be sent. The number of requests is upper-bounded by a) the number of requests in the thread-local bounded buffer (line 4), and b) the size of the thread-local edge buffer (line 5). Condition a) ensures that as many requests as have been generated can be accumulated into a batch, and condition b) ensures that the thread-local edge buffer will not overflow when the remote neighbour serving thread fulfills any request batch. It is required that the thread-local edge buffer has a minimum size that can fit the neighbours of the largest-degree vertex.
With the previous constructs, the SparseSlot computation w.r.t. v can simply be data-dependent on v's cached neighbours, and line 10 of Algorithm 1 can be replaced by Algorithm 4, where each computation thread executes the computation only when a workload vertex v's neighbours on the remote memory node d have already been cached. Note that since a neighbour request for workload vertex v has been issued, D-Gemini guarantees that v's neighbours will eventually be cached. The cached edges for v on d are marked as used after the computation so that the space can be freed and re-used when the local edge cache is full. The computation on line 3 of Algorithm 4 essentially overlaps with the network communication. Figure 3.7 exemplifies how such overlapping works in an ideal scenario. In Figure 3.7, Req(v), N(v), and SS(v) are the buffered request, the locally cached neighbouring edges, and the SparseSlot computation for workload vertex v, respectively. Five neighbour requests are buffered and sent in three batches. SS(0) and SS(1) execute after their neighbouring edges are cached, concurrently with the communication for N(2) and N(3). Similarly, SS(2) and SS(3) execute concurrently with the communication for N(4). We leveraged two insights in our design: 1) the cached neighbouring edges of any two workload vertices v and u become locally available in the same order as they are executed, whether inside a batch or across different batches; and 2) the execution time of SS(v) is proportional to O(N(v)), which leads to the proposition that the overall execution time of k vertices in one batch, Σ_{i=1}^{k} SS(v_i), is also proportional to Σ_{i=1}^{m} N(u_i) for m vertices in a different batch, provided that Σ_{i=1}^{k} N(v_i) = Σ_{i=1}^{m} N(u_i), which is the condition D-Gemini tries to satisfy or approach in Algorithm 3.
Algorithm 2 Edge Requests Replay: generating neighbor requests into the per-thread Bounded Buffer, assuming one NUMA node.
Require: v ← workload vertex.
Require: d ← delegated memory node.
Require: tid ← current thread id.
Require: BB(tid) ← per-thread bounded buffer
1: ext(v) ← BitMap Cache of v on d.
2: if ext(v) then
3:   {idx0(v), idx1(v)} ← Index Cache of v on d.
4:   if Neighbors NOT CACHED(v, d) then
5:     BB(tid).write({v, d, idx0(v), idx1(v), tid, is_outgoing})
6:   end if
7: end if
8: Update BB(tid).WP atomically.
Algorithm 3 Request Sending and Edge Caching
Require: d ← delegated memory node.
Require: BB(tid) ← per-thread bounded buffer
1: while True do
2:   for tid ∈ all threads do
3:     wp ← BB(tid).WP
4:     max_reqs_1 ← wp − BB(tid).RP
5:     max_reqs_2 ← #requests capped by edge buffer.
6:     n_requests ← MIN(max_reqs_1, max_reqs_2)
7:     requests ← BB(tid).read(n_requests)
8:     Update BB(tid).RP atomically.
9:     Send requests to d.
10:   end for
11:   Wait for completing one-sided communication.
12:   for tid ∈ all threads do
13:     edges ← Parsed v's neighboring edges.
14:     Update Local Edge Cache with edges.
15:   end for
16: end while
Algorithm 4 Edge Cache Access and SparseSlot computation, assuming one NUMA node.
Require: v ← workload vertex.
Require: d ← delegated memory node.
Require: tid ← current thread id.
1: while True do
2:   if LocalEdgeCache(v, d) then
3:     SparseSlot(v, Neighbors(v)) ▷ overlapping w/ comm.
4:     Mark LocalEdgeCache(v, d) as used.
5:   else
6:     thread_wait(tid)
7:   end if
8: end while
Figure 3.7: D-Gemini's overlapping of SparseSlot (SS) computations with communication: Req(v), N(v), and SS(v) denote the buffered request, the locally cached neighbouring edges, and the SparseSlot computation for workload vertex v; requests are sent in batches, and SS computations run concurrently with the communication for later batches.
3.5 D-Kudu
Kudu [10] is the latest graph pattern mining (GPM) engine; it is able to enumerate all subgraphs isomorphic to some user-defined pattern(s) (a.k.a. embeddings) given an input graph. Like Gemini, Kudu is distributed and NUMA-aware: it can partition a large graph across multiple machines, and further across different NUMA memory nodes. A Kudu engine is associated with each NUMA memory node and takes care of a tree of fine-grained local tasks, thus making Kudu scalable in both computation and memory. However, as a general distributed graph engine, Kudu is not able to leverage computation and memory resources disproportionately, which motivates us to develop D-Kudu.
3.5.1 Exploring the Potential of One-Sided Primitives
Kudu employs a symmetric design for inter-node communication: there is one sender thread and
one handler thread on each machine. The sender thread collects the next batch of vertices B
pertaining to locally extendable embedding jobs, and then communicates with the remote handler
thread on the machine where B's neighbors are located to retrieve B's neighbor sets, which
are later used by Kudu's graph mining engine to execute the jobs (i.e., to grow the embeddings).
When the handler thread receives a remote request pertaining to the batch, it accesses local memory
to complete the request.
In order for D-Kudu to respect SlimGraph's execution paradigm and enable compute/memory
dis-aggregation, we modified the workload distribution in D-Kudu such that all embedding
extension jobs are executed only by compute nodes. D-Kudu leverages two modes of communication.
As seen in Figure 3.8, D-Kudu's communication between compute nodes is similar
to Kudu's and takes four steps: 1) the request sender/fetcher thread first SENDs the metadata,
including the graph data displacement and the number of requested vertices; 2) the sender then
SENDs all the requested vertices B; 3) upon RECVing this information, the remote handler thread
batches all the neighboring vertices of B and WRITEs them to the given displacement in the
sender's memory; 4) the handler thread finally notifies the requester that the one-sided WRITEs
have completed by SENDing a completion token. Unlike Kudu's request sender thread, when
a compute node communicates with a memory node, the sender/fetcher thread directly fetches
the neighbors of B using pure one-sided READs instead of relying on the combination of two-sided
SEND/RECV primitives and one-sided WRITEs.
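As an illustration of the memory-node path, the following sketch uses MPI one-sided primitives (which our implementation builds on, see Section 3.6) to fetch Neighbors(v) with a single read. The window, displacement array, and degree array names are assumptions rather than D-Kudu's actual symbols, and the displacement itself is obtained as described in Section 3.5.2.

// A minimal sketch of the compute-node-to-memory-node path: one one-sided read
// per target vertex against a window exposing the memory node's CSR edge array.
// edge_win, disp, and deg are hypothetical names, not D-Kudu's actual symbols.
#include <mpi.h>
#include <cstdint>
#include <vector>

std::vector<uint64_t> fetch_neighbors(uint64_t v, int mem_rank, MPI_Win edge_win,
                                      const std::vector<MPI_Aint>& disp,
                                      const std::vector<int>& deg) {
  std::vector<uint64_t> nbrs(deg[v]);                   // local landing buffer
  MPI_Win_lock(MPI_LOCK_SHARED, mem_rank, 0, edge_win);
  MPI_Get(nbrs.data(), deg[v], MPI_UINT64_T,            // local buffer, count, type
          mem_rank, disp[v], deg[v], MPI_UINT64_T,      // remote rank and displacement
          edge_win);
  MPI_Win_unlock(mem_rank, edge_win);                   // completes the one-sided READ
  return nbrs;
}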
3.5.2 Displacement Broadcasting
For a target remote vertex v in the requesting batch B, one challenge in retrieving Neighbors(v) via
a single one-sided READ operation is that we need to know the displacement of Neighbors(v) in
the remote CSR edge array, which is unknown to the sender/fetcher thread. Therefore, the remote
displacement of Neighbors(v) should be broadcast when the CSR edge array is constructed.
Figure 3.8: D-Kudu's communication pattern. (a) Retrieving remote graph edge data from another compute node: the requesting thread and the remote handler exchange metadata and vertices in steps 1–4, with the handler filling the requester's memory via one-sided WRITEs. (b) Retrieving remote graph edge data from a memory node: the requesting thread reads the memory node's memory directly via one-sided READs.
Ideally, each graph partition maintains an equal number of vertices. As seen in Figure 3.9, each
vertex together with its outgoing edge(s) is located at one NUMA node as a partition of the
example graph, in which case the displacements of all Neighbors(v) can be obtained by the
sender/fetcher thread via the AllGather collective operation.
Since it is infeasible to partition most graphs perfectly equally, D-Kudu AllGathers |V|
displacements at each invocation, buffers the displacement results, and then merges the results
locally to finalize the displacement broadcasting process. The merging step is shown
in Algorithm 5. We initialize an AllGather result buffer of length (P * |V|), where P
is the number of partitions. For every vertex v (line 1), we traverse each portion of the displacement
results corresponding to a partition (lines 4 and 6) and update the offset of Neighbors(v) (line 5).
Figure 3.9: An example of Neighbors(v) displacement broadcasting. OFF(v) represents the displacement of Neighbors(v) in the local CSR edge list. In the example graph, vertices 0–3 are split across Node-0 and Node-1, and the AllGather invocations leave every node with OFF(0), OFF(1), OFF(2), and OFF(3).
Algorithm 5 Merging AllGather Results
Require: P ← #Partitions
Require: all_ptr[P * |V|] ← AllGather results.
1: for v ∈ V do
2:   start ← all_ptr
3:   offset ← 0
4:   for p ← 1..P do
5:     offset ← MAX(offset, start[v])
6:     start ← start + |V|
7:   end for
8:   disp[v] ← offset
9: end for
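A minimal MPI sketch of this step is shown below. It assumes each partition fills its local offset array only for the vertices it owns (zero elsewhere), so that taking the maximum across partitions recovers the owner's displacement, matching line 5 of Algorithm 5; the names (local_off, all_ptr, disp) are illustrative.

// A minimal sketch of Algorithm 5 plus the AllGather that feeds it.
#include <mpi.h>
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint64_t> broadcast_displacements(const std::vector<uint64_t>& local_off,
                                              int num_partitions, MPI_Comm comm) {
  const int V = static_cast<int>(local_off.size());
  std::vector<uint64_t> all_ptr(static_cast<size_t>(num_partitions) * V);  // AllGather result buffer
  MPI_Allgather(local_off.data(), V, MPI_UINT64_T,
                all_ptr.data(),   V, MPI_UINT64_T, comm);
  std::vector<uint64_t> disp(V, 0);
  for (int v = 0; v < V; ++v)                                               // line 1
    for (int p = 0; p < num_partitions; ++p)                                // lines 4, 6
      disp[v] = std::max(disp[v], all_ptr[p * static_cast<size_t>(V) + v]); // lines 5, 8
  return disp;
}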
3.5.3 Delegating Memory Nodes
In Kudu, embedding extension tasks are categorized into different levels based on the number
of vertices in the embeddings. D-Kudu leverages one nice property of Kudu: all levels of
tasks can be stolen by remote nodes. Compared to Gemini, where tasks can only be stolen by
local threads on the same node, D-Kudu naturally uses this feature to realize task delegation.
However, for some large graphs, Kudu handles huge-degree vertices differently from
low-degree vertices: multiple distributed engines, instead of one, are used to grow one-vertex
embeddings if the degree of that vertex is overwhelmingly high. Yet for processing two-vertex
embeddings, Kudu needs to transition back to the original processing mode: only one engine will be
used to process embeddings with more than one vertex. To accommodate this transition, D-Kudu guarantees
that embeddings with more than one vertex are never scattered to memory nodes. By checking the
master partition ID of the vertex associated with a two-vertex embedding, we use a predicate
to validate whether the current engine on a compute/memory node should work as the delegate for
processing the embedding, as sketched below. We use a delegation set similar to the one defined by Equation 3.1 to ensure that
all embeddings can be processed by an engine located on a unique compute/memory node.
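A minimal sketch of such a predicate follows. It assumes a per-partition delegation map in the spirit of Equation 3.1 (partition to delegate node); all names are hypothetical and not D-Kudu's actual symbols.

// A minimal sketch of the delegation predicate for embeddings with more than
// one vertex: only the delegate node of the owning (master) partition grows
// them, so such embeddings are never scattered to memory nodes.
#include <cstdint>
#include <vector>

bool should_delegate_embedding(uint64_t anchor_vertex,
                               const std::vector<int>& master_partition, // vertex -> master partition ID
                               const std::vector<int>& delegate_node,    // partition -> unique delegate node
                               int my_node_id) {
  int owner = master_partition[anchor_vertex];
  return delegate_node[owner] == my_node_id;
}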
3.5.4 Optimization
Figure 3.10 shows an example of how D-Kudu is able to reduce the communication volume.
Consider that at some point t, a series of extendable embeddings are scattered to the local engine
of a node, awaiting the retrieval of the neighbors of the vertices associated with these embeddings.
A request sender in Kudu must first send a request so that the handler can retrieve the neighbors
from the local CSR edge list and batch them into a contiguous array before issuing one-sided
WRITEs to fill the local buffers associated with the embedding extension tasks. As the request
sender/fetcher in D-Kudu, however, only a portion of the neighbors needs to be fetched by one-sided READs.
The intuition behind this reduction is that different extendable embeddings may share the same vertex.
Therefore, we can maintain a mapping from an embedding's vertices to the local addresses of their
corresponding neighbor sets. Knowing that the neighbors of some embedding have already been cached
at a certain address, we can directly copy from that address instead of initiating a one-sided READ,
thus saving communication volume (a sketch is given below). In Figure 3.10, E_1(4) and E_2(4) share the same vertex
4, and the same holds for E_1(7) and E_2(7). Therefore, the second N(7) and N(4) can be produced directly
via memory copy instead of being fetched remotely. We are able to save 33.3% of the communication volume
using this optimization in this example.
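The reuse can be sketched as follows; the per-batch map, the NeighborSlot type, and the fall-back fetch function are hypothetical names standing in for D-Kudu's internal structures.

// A minimal sketch of the communication-volume reduction: consult a per-batch
// map from vertex to an already-fetched neighbor set before issuing a READ.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct NeighborSlot { std::vector<uint64_t> edges; };

NeighborSlot fetch_neighbors_one_sided(uint64_t v);   // falls back to a one-sided READ

void fill_neighbor_slot(uint64_t v, NeighborSlot& out,
                        std::unordered_map<uint64_t, const NeighborSlot*>& fetched) {
  auto it = fetched.find(v);
  if (it != fetched.end()) {
    out.edges = it->second->edges;       // memory copy: no communication (green boxes in Figure 3.10)
  } else {
    out = fetch_neighbors_one_sided(v);  // remote fetch from the memory node
    fetched.emplace(v, &out);            // remember the local address for embeddings sharing v
  }
}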
Figure 3.10: Communication volume reduction. N(v) is the neighbor set of v and E(v) is an embedding containing vertex v. In Kudu (top), the request sender receives all batched neighbors N(4), N(7), N(3), N(2), N(7), N(4) via one-sided WRITEs from the remote CSR edge list. In D-Kudu (bottom), the request sender/fetcher issues one-sided READs only for N(2), N(3), N(4), and N(7); the second N(7) and N(4) (green boxes) are produced locally by memory copy without communication.
3.6 Implementation
We implemented D-Gemini and D-Kudu such that their behaviors are similar to Gemini's and
Kudu's, respectively, when no memory node is involved, while respecting the SlimGraph execution
paradigm otherwise. We use MPI primitives for inter-node communication. In particular,
we use one-sided MPI primitives to transmit edges from remote memory nodes, since one-sided MPI
primitives can bypass the remote CPUs and the operating system by leveraging the underlying one-sided
RDMA verbs. We also distilled some common operations in both projects into a runtime,
which we believe will also be helpful when transforming other distributed graph processing frameworks
into SlimGraph's execution paradigm. One goal of our implementations is that the original graph
processing applications on Gemini and Kudu can run with D-Gemini and D-Kudu without much
modification. We reached this goal by hiding all of the complexities inside the modified frameworks.
At the application level, a distributed graph application only needs to be dispatched, via SlimGraph's
wrapper function, to the original workflow on compute nodes and to the memory-serving workflow on
memory nodes (a sketch is given after Table 3.1).
Table 3.1 shows the lines of code (LOC) that we changed in the applications.
Graph Processing Apps LOC ∆LOC %
BFS⋆ 131 5 3.8%
SSSP⋆ 127 5 3.9%
CC⋆ 130 5 3.8%
PR⋆ 151 6 3.9%
BC⋆ 434 6 1.4%
TC† 222 6 2.7%
3-MC† 274 7 2.6%
4-CC† 259 6 2.3%
Table 3.1: Lines of code modified in Gemini applications (⋆) and Kudu applications (†). See Section 3.7.2 for details.
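The application-level dispatch mentioned above can be pictured with the following sketch; slimgraph::run and the two helper functions are hypothetical names used only to illustrate how an unmodified application body is routed to compute nodes while memory nodes only serve edge requests.

// A minimal sketch of the application-level dispatch (illustrative names only).
namespace slimgraph {

bool is_compute_node();              // derived from the (N, M) node configuration
void serve_remote_edge_requests();   // memory-node serving loop; no graph workload runs here

template <typename Workload>
void run(Workload&& original_workload) {
  if (is_compute_node())
    original_workload();             // unmodified Gemini/Kudu application code
  else
    serve_remote_edge_requests();
}

}  // namespace slimgraph

// Example (hypothetical application call):
//   slimgraph::run([&] { pagerank(graph, num_iterations); });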
3.7 Evaluation
3.7.1 Platform Configuration
We evaluate SlimGraph on four nodes in our in-house cluster, where nodes are interconnected by a
Mellanox InfiniBand FDR (56 Gbps) network. Each node contains two 8-core Intel Xeon E5-2630
v3 CPUs and 64 GB of DDR4 RAM split into two NUMA nodes, and runs CentOS 7.4. We
used MPICH-3.4.2 as our MPI library for inter-node communication. In this section, we denote
D-Gemini running on N nodes with M (M ≤ N) compute nodes as D-Gemini-(N,M), and D-Kudu
running on N nodes with M (M ≤ N) compute nodes as D-Kudu-(N,M). On the remaining N − M nodes,
which are used only as memory nodes, we only run the neighbour-serving threads for D-Gemini without
executing actual graph workloads.
3.7.2 Datasets and Evaluated Applications
The graph datasets for evaluating SlimGraph are shown in Table 3.2. For D-Gemini, we evaluated
the same set of graph analytics applications as Gemini [85]: 1) Breadth-First Search (BFS), a basic
algorithm for traversing the nodes of a graph; 2) Single-Source Shortest Path (SSSP), which finds
the shortest path from a single vertex to every other vertex in a graph with weighted edges;
3) PageRank (PR), an essential algorithm for analyzing the importance of web pages; 4) Connected
Components (CC), which finds the number of connected components in undirected graphs; and
5) Betweenness Centrality (BC), which evaluates the importance of vertices based on the shortest paths
between other vertices. For D-Kudu, the graph mining applications that we run include 1) Triangle Counting
(TC), which counts the number of triangles in the graph; 2) 3-Motif Counting (3-MC), which
counts the number of size-3 patterns, each of which can be either a triangle or a chain of three vertices;
and 3) 4-Clique Counting (4-CC), which counts the number of 4-clique patterns (i.e., complete
subgraphs of four vertices).
Graph Abbr. |V| |E|
wiki-vote WV 7115 100762
mico[23] MC 96638 1080156
live-journal-2008[47] LJ 5363260 42851237
enwiki-2013 WK 4206785 101355853
uk-2005[8] UK 39454463 783027125
twitter-2010[44] TW 41652230 1202513046
friendster[77] FR 65608366 1806067135
sk-2005 SK 50636073 1949412601
Table 3.2: Graph Datasets
3.7.3 Overall Performance
To get an overview of how our transformed frameworks perform compared to their original versions,
which do not support compute/memory dis-aggregation, we run both Gemini and D-Gemini on our
four-node cluster. D-Gemini is configured as D-Gemini-(4,4), i.e., using four nodes, all of which are
compute nodes. Since our work only performs the transformation for the sparse mode of Gemini, we also
disabled Gemini's dual update model, leaving only the sparse mode of Gemini as an apples-to-apples
baseline. Unless otherwise noted, this modification applies to all evaluations in this chapter.
We leave the dense-mode transformation of Gemini as future work. Figure 3.11
shows the overall performance of D-Gemini compared to Gemini. As we can see, D-Gemini only
incurs a small overhead compared to Gemini when all nodes are used as compute nodes, which is
as expected since all SparseSlot computations are executed on local edges on each node.
Likewise, Figure 3.11 also shows the overall performance of D-Kudu compared to the Kudu
baseline when D-Kudu uses all four nodes for computation. Their performance is comparable
for most graphs, which is as expected since we tried to make D-Kudu behave similarly to Kudu
when all nodes are compute nodes. For some runs on twitter-2010 and wiki-vote,
D-Kudu performs better than Kudu by up to 48%, while when counting triangles on wiki-vote,
D-Kudu performs 31% worse than Kudu. We attribute this difference to the fact that the number of
embedding extension tasks may be imbalanced among different compute nodes, causing such
instability.
Figure 3.11: Performance of D-Gemini (left) and D-Kudu (right) with full compute nodes compared to their baselines. The left panel plots the execution time in seconds of BFS, SSSP, BC, PR, and CC on WK, LJ, TW, and SK for Gemini and D-Gemini; the right panel plots the execution time in milliseconds of TC, 3-MC, and 4-CC on WV, LJ, TW, MC, UK, and FR for Kudu and D-Kudu.
3.7.4 Elasticity
To evaluate the elasticity of both frameworks, we use fewer nodes as compute nodes and compare
the results with those obtained when all nodes are used for computation. To balance the computation
workload on each node, we use half of the nodes as compute nodes for D-Gemini and use the default
delegation sets.
Figure 3.12: Compute/Memory dis-aggregation: using fewer nodes to compute with D-Gemini (left) and D-Kudu (right). The left panel compares the execution time in seconds of D-Gemini-(4,4) and D-Gemini-(4,2) for BFS, SSSP, BC, PR, and CC on WK, LJ, TW, and SK; the right panel compares the execution time in milliseconds of D-Kudu-(4,4), D-Kudu-(4,3), D-Kudu-(4,2), and D-Kudu-(4,1) for TC, 3-MC, and 4-CC on WV, LJ, MC, UK, FR, and TW.
Figure 3.12 (left) shows how D-Gemini performs with two compute nodes
in the four-node cluster (i.e., D-Gemini-(4,2)). With half of the nodes used for computation,
D-Gemini incurs a 1.9x to 8.3x slowdown. This is mostly because the communication
overhead for retrieving remote edges cannot always be perfectly hidden under the graph workload
computations. While each one-sided operation is efficient at transferring a batch of edges, a real
graph inherently follows a power-law distribution, where most vertices have only a small number of
neighbours but a few vertices have a huge number of neighbours. It is therefore hard for batches whose
vertices vary vastly in degree to have similar communication times. One straightforward solution
is to overlap the communication of a batch with the computation of another batch whose vertex degrees
are similar, which we leave as future work.
Figure 3.12 (right) shows how D-Kudu performs with a smaller set of compute nodes in
the four-node cluster. As the number of compute nodes decreases, it gradually takes more
time to execute the graph mining workloads; the slowdowns of using two compute nodes relative
to using four range from 1.82x to 16.67x, which is comparably better than D-Gemini. This is
because D-Kudu leverages the fine-grained task-stealing technique, which automatically balances
the workload among compute nodes, whereas D-Gemini uses static delegation assignments. One
interesting fact is that with only half of the compute nodes, the slowdown can be smaller than 2x
for twitter-2010, suggesting that counting triangles on twitter-2010 does not fully utilize the
computation resources and the mining task is essentially bounded by other resources such as the network.
Similarly, the slowdowns of using a single compute node relative to using four range from 3.125x to 40x,
and twitter-2010 incurs a slowdown of less than 4x, from which we can draw a similar conclusion.
3.7.5 Scalability
Figure 3.13: Scalability of D-Gemini (left) and D-Kudu (right). The left panel compares the execution time in seconds of D-Gemini-(2,1) and D-Gemini-(4,2) for BFS, SSSP, BC, PR, and CC on WK, LJ, TW, and SK (the SSSP run on SK under D-Gemini-(2,1) takes 1079.83 s); the right panel compares the execution time in milliseconds of D-Kudu-(2,1) and D-Kudu-(4,2) for TC and 3-MC on UK and FR.
Figure 3.13 shows how D-Gemini and D-Kudu, in the half-compute-node setting, scale with
the number of nodes in the cluster. Except for SSSP on the SK graph, compared to D-Gemini-(2,1),
D-Gemini-(4,2) achieves a 1.83x speedup at best and a 1.23x speedup at worst. D-Gemini scales with an
increase in computation capability and memory capacity, similarly to Gemini, because the workload
vertices can be executed on more machines. SSSP on the SK graph is an outlier with a 20.1x speedup:
with only two compute nodes, local edge caching uses more virtual memory pages than are
physically available, which incurs excessive memory paging. As for D-Kudu, on uk-2005 and
friendster, D-Kudu-(4,2) obtains a speedup of 2.03x to 2.16x over D-Kudu-(2,1). We
saw similar trends on other graphs (not shown), indicating that D-Kudu scales perfectly up to a
four-node cluster.
3.7.6 Impact of Bitmap Caching, Index Caching, and Edge Caching
Figure 3.14: The impact of the bitmap cache (BC), index cache (IC), and edge cache (EC) as important optimizations in D-Gemini. Execution times in seconds of BFS, SSSP, BC, PR, and CC on WK, LJ, TW, and SK are compared under the NO_CACHE, w/BC, w/BC+IC, and w/BC+IC+EC configurations.
To lower the impact of the extra inter-node communication overhead in D-Gemini, we leverage
various caching techniques, including bitmap caching (BC), index caching (IC), and edge caching
(EC), as our optimizations. The bitmap/index caching is done when the graph is loaded, and
edges are pre-fetched and cached during the execution of the graph computation workloads. The
bitmap and index caches are both O(N) in size, and we use the whole local memory as the edge cache by default.
Figure 3.14 shows the impact of these caches. Compared to D-Gemini without any optimization,
bitmap caching contributes a 1.24x - 2.37x speedup; index caching further contributes a 1.91x
- 4.62x speedup; and edge caching contributes another 34x - 177x speedup, bringing a
total speedup of 164x - 587x. We can see that edge pre-fetching and caching effectively hide most
of the communication traffic for edge inquiries.
3.7.7 Communication Volume Reduction
Figure 3.15: D-Kudu's communication volume. The network communication volume in MB of TC, 3-MC, and 4-CC on WV, LJ, MC, FR, UK, and TW is compared across D-Kudu-(4,4), D-Kudu-(4,3), D-Kudu-(4,2), and D-Kudu-(4,1).
As one of our optimization techniques, D-Kudu is able to reduce the communication volume
by filling the neighbor set of an embedding by looking for other embeddings containing the same
vertex. If the neighbor set of the other embedding has already been fetched over the network, we can
do a memory copy instead, thus saving communication volume. Figure 3.15 shows the
communication volume of D-Kudu with different numbers of compute nodes in the cluster. Note that with
four compute nodes (D-Kudu-(4,4)), we use the original requesting threads as in Kudu,
which cannot save communication volume; therefore, we can safely treat D-Kudu-(4,4)
as the baseline. As fewer nodes are used for computation, more fetcher threads bring increasing
opportunities to save communication volume. Compared to D-Kudu-(4,4), the other settings indeed
save some communication volume on wiki-vote, mico, and live-journal-2008, but they have little
effect on larger graphs like friendster, uk-2005, and twitter-2010. In other words, larger
graphs do not seem to benefit much from this optimization. This is explainable, since the effectiveness
of reducing communication volume depends on whether we can find some embedding with the
same vertex in the same batch. As the size of the graph increases, embeddings in the same batch are
increasingly unlikely to share the same vertex; they tend instead to be neighbors of a third vertex. For
smaller graphs, this optimization works much better. From Figure 3.15, the communication
volume is reduced by at most 26%, 42%, and 85%, respectively, as the number of compute
nodes is decreased from four to one.
3.8 Summary
In this chapter, we challenged the underlying assumption that graph processing workloads need to
scale memory and computation resources equally, an assumption that underlies the design
of many existing distributed graph processing and graph mining frameworks. We proposed a novel
execution paradigm for processing graph workloads in which compute and memory resources can
be scaled disproportionately. We designed SlimGraph and, specifically, D-Gemini and D-Kudu,
which are transformations of the original Gemini and Kudu that respect our proposed execution
paradigm. To make our design practical, we employed optimization techniques to reduce the cost
incurred by the extra communication to memory nodes. Experimental results show that SlimGraph
can use fewer compute machines to complete distributed graph computation with a reasonable
performance degradation.
Chapter 4
Conclusion
In this dissertation, we broadly investigated two directions for leveraging distributed memories:
making remote memory access faster by leveraging RDMA as the underlying communication
medium, and making memory access more flexible by enabling the disaggregation of compute and
memory resources. To leverage RDMA efficiently, we investigated the use of two-sided and one-sided
RDMA verbs in distributed transaction systems. Specifically, we built RCC to explore the
potential performance gains of using these two kinds of verbs in various representative concurrency control
protocols, and, for the first time, compared these RDMA-based protocols within a single framework.
With RCC, we are able to generate hybrid concurrency control protocol designs that benefit
from choosing distinct RDMA verbs for different protocol stages. To enable more flexible remote memory access, we
designed SlimGraph, a layer that supports the translation of existing distributed graph frameworks
into ones supporting memory/compute dis-aggregation. With framework-specific optimizations,
we are able to reduce the network overhead caused by the dis-aggregation, allowing the translated
distributed graph systems to run with only a reasonable performance degradation.
Bibliography
[1] Rakesh Agrawal, Michael J Carey, and Miron Livny. Concurrency control performance
modeling: Alternatives and implications. ACM Transactions on Database Systems (TODS),
12(4):609–654, 1987.
[2] Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Pratap
Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei.
Remote memory in the age of fast networks. In Proceedings of the 2017 Symposium on
Cloud Computing, SoCC ’17, page 121–127, New York, NY , USA, 2017. Association for
Computing Machinery.
[3] Emmanuel Amaro, Christopher Branner-Augmon, Zhihong Luo, Amy Ousterhout, Mar-
cos K. Aguilera, Aurojit Panda, Sylvia Ratnasamy, and Scott Shenker. Can far memory
improve job throughput? In Proceedings of the Fifteenth European Conference on Computer
Systems, EuroSys ’20, New York, NY , USA, 2020. Association for Computing Machinery.
[4] Krste Asanović. FireBox: A hardware building block for 2020 Warehouse-Scale computers. Santa Clara, CA, February 2014. USENIX Association.
[5] Peter Bailis, Alan Fekete, Michael J Franklin, Ali Ghodsi, Joseph M Hellerstein, and Ion
Stoica. Coordination avoidance in database systems. Proceedings of the VLDB Endowment,
8(3):185–196, 2014.
[6] Philip A. Bernstein, Philip A. Bernstein, and Nathan Goodman. Concurrency control in
distributed database systems. ACM Comput. Surv., 13(2):185–221, June 1981.
[7] Philip A Bernstein and Nathan Goodman. Multiversion concurrency control—theory and
algorithms. ACM Transactions on Database Systems (TODS), 8(4):465–483, 1983.
[8] Paolo Boldi and Sebastiano Vigna. The webgraph framework i: compression techniques. In
WWW ’04, 2004.
[9] Matthew Burke, Sowmya Dharanipragada, Shannon Joyner, Adriana Szekeres, Jacob Nel-
son, Irene Zhang, and Dan R. K. Ports. Prism: Rethinking the rdma interface for distributed
systems. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Prin-
ciples, SOSP ’21, page 228–242, New York, NY , USA, 2021. Association for Computing
Machinery.
[10] Jingji Chen and Xuehai Qian. Kudu: An efficient and scalable distributed graph pattern
mining engine, 2021.
[11] Rong Chen, Jiaxin Shi, Yanzhe Chen, Binyu Zang, Haibing Guan, and Haibo Chen. Pow-
erlyra: Differentiated graph computation and partitioning on skewed graphs. ACM Trans.
Parallel Comput., 5(3), jan 2019.
[12] Yanzhe Chen, Xingda Wei, Jiaxin Shi, Rong Chen, and Haibo Chen. Fast and general dis-
tributed transactions using RDMA and HTM. In Proceedings of the Eleventh European Con-
ference on Computer Systems, page 26. ACM, 2016.
[13] I-Hsin Chung, Bulent Abali, and Paul Crumley. Towards a composable computer system.
In Proceedings of the International Conference on High Performance Computing in Asia-
Pacific Region , HPC Asia 2018, page 137–147, New York, NY , USA, 2018. Association for
Computing Machinery.
[14] Douglas Comer and Jim Griffioen. A new design for distributed systems: The remote memory
model. In USENIX Summer, 1990.
[15] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears.
Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium
on Cloud computing, pages 143–154. ACM, 2010.
[16] James C Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost,
Jeffrey John Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter
Hochschild, et al. Spanner: Google’s globally distributed database. ACM Transactions on
Computer Systems (TOCS), 31(3):8, 2013.
[17] The Transaction Processing Council. TPC-C Benchmark V5.11. http://www.tpc.org/
tpcc/, 2018.
[18] James Cowling and Barbara Liskov. Granola: low-overhead distributed transaction coordina-
tion. In Presented as part of the 2012 USENIX Annual Technical Conference (USENIX ATC
12), pages 223–235, 2012.
[19] Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. Schism: a workload-driven ap-
proach to database replication and partitioning. Proceedings of the VLDB Endowment, 3(1-
2):48–57, 2010.
[20] Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden,
Marc Snir, and Keshav Pingali. Gluon: A communication-optimizing substrate for distributed
heterogeneous graph analytics. SIGPLAN Not., 53(4):752–768, jun 2018.
[21] Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM:
Fast remote memory. In 11th USENIX Symposium on Networked Systems Design and Imple-
mentation (NSDI 14), pages 401–414, 2014.
[22] Aleksandar Dragojević, Dushyanth Narayanan, Edmund B Nightingale, Matthew Renzel-
mann, Alex Shamis, Anirudh Badam, and Miguel Castro. No compromises: distributed
transactions with consistency, availability, and performance. In Proceedings of the 25th Sym-
posium on Operating Systems Principles, pages 54–70. ACM, 2015.
[23] Mohammed Elseidy, Ehab Abdelhamid, Spiros Skiadopoulos, and Panos Kalnis. Grami:
Frequent subgraph and pattern mining in a single large graph. Proc. VLDB Endow.,
7(7):517–528, mar 2014.
[24] Robert Escriva, Bernard Wong, and Emin Gün Sirer. Warp: Lightweight multi-key transac-
tions for key-value stores. arXiv preprint arXiv:1509.07815, 2015.
[25] Jose M Faleiro and Daniel J Abadi. Rethinking serializable multiversion concurrency control.
arXiv preprint arXiv:1412.2324, 2014.
[26] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agar-
wal, Sylvia Ratnasamy, and Scott Shenker. Network requirements for resource disaggre-
gation. In Proceedings of the 12th USENIX Conference on Operating Systems Design and
Implementation, OSDI’16, page 249–264, USA, 2016. USENIX Association.
[27] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Pow-
ergraph: Distributed graph-parallel computation on natural graphs. In Presented as part of
the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 17–30, 2012.
[28] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Power-
graph: Distributed graph-parallel computation on natural graphs. In 10th USENIX Symposium
on Operating Systems Design and Implementation (OSDI 12), pages 17–30, Hollywood, CA,
October 2012. USENIX Association.
[29] Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G Shin. Effi-
cient memory disaggregation with infiniswap. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 649–667, 2017.
[30] Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin.
Efficient memory disaggregation with infiniswap. In Proceedings of the 14th USENIX Con-
ference on Networked Systems Design and Implementation, NSDI’17, page 649–667, USA,
2017. USENIX Association.
[31] Rachael Harding, Dana Van Aken, Andrew Pavlo, and Michael Stonebraker. An evaluation
of distributed concurrency control. Proceedings of the VLDB Endowment, 10(5):553–564,
2017.
[32] Sungpack Hong, Siegfried Depner, Thomas Manhardt, Jan Van Der Lugt, Merijn Verstraaten,
and Hassan Chafi. Pgx.d: a fast distributed graph processing engine. In SC ’15: Proceedings
of the International Conference for High Performance Computing, Networking, Storage and
Analysis, pages 1–12, 2015.
[33] Imranul Hoque and Indranil Gupta. Lfgraph: Simple and fast distributed graph analytics. In
Proceedings of the First ACM SIGOPS Conference on Timely Results in Operating Systems,
pages 1–17, 2013.
[34] Kasra Jamshidi, Rakesh Mahadasa, and Keval Vora. Peregrine: A pattern-aware graph mining
system. In Proceedings of the Fifteenth European Conference on Computer Systems, EuroSys
’20, New York, NY , USA, 2020. Association for Computing Machinery.
[35] Ylber Januzaj, Jaumin Ajdari, and Besnik Selimi. Dbms as a cloud service: Advantages and
disadvantages. Procedia - Social and Behavioral Sciences, 195:1851–1859, 2015. World
Conference on Technology, Innovation and Entrepreneurship.
[36] Yan Jiang, Wei Liu, Xuanhua Shi, and Weizhong Qiang. Optimizing the copy-on-write mech-
anism of docker by dynamic prefetching. Tsinghua Science and Technology, 26(3):266–274,
2021.
[37] Anuj Kalia, Michael Kaminsky, and David G. Andersen. Using rdma efficiently for key-value
services. In Proceedings of the 2014 ACM Conference on SIGCOMM, SIGCOMM ’14, page
295–306, New York, NY , USA, 2014. Association for Computing Machinery.
[38] Anuj Kalia, Michael Kaminsky, and David G Andersen. Design guidelines for high perfor-
mance RDMA systems. In 2016 USENIX Annual Technical Conference (USENIX ATC 16),
pages 437–450, 2016.
[39] Anuj Kalia, Michael Kaminsky, and David G Andersen. Fasst: Fast, scalable and simple
distributed transactions with two-sided RDMA datagram RPCs. In 12th USENIX Symposium
on Operating Systems Design and Implementation (OSDI 16), pages 185–201, 2016.
[40] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stan-
ley Zdonik, Evan PC Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, et al. H-
store: a high-performance, distributed main memory transaction processing system. Proceed-
ings of the VLDB Endowment, 1(2):1496–1499, 2008.
[41] K. Katrinis, D. Syrivelis, D. Pnevmatikatos, G. Zervas, D. Theodoropoulos, I. Koutsopoulos,
K. Hasharoni, D. Raho, C. Pinto, F. Espina, S. Lopez-Buedo, Q. Chen, M. Nemirovsky,
D. Roca, H. Klos, and T. Berends. Rack-scale disaggregated cloud data centers: The dredbox
project vision. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE),
pages 690–695, 2016.
[42] S. Koussih, A. Acharya, and S. Setia. Dodo: a user-level system for exploiting idle mem-
ory in workstation clusters. In Proceedings. The Eighth International Symposium on High
Performance Distributed Computing (Cat. No.99TH8469), pages 301–308, 1999.
[43] Hsiang-Tsung Kung and John T Robinson. On optimistic methods for concurrency control.
ACM Transactions on Database Systems (TODS), 6(2):213–226, 1981.
[44] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is twitter, a social net-
work or a news media? In Proceedings of the 19th International Conference on World Wide
Web, WWW ’10, page 591–600, New York, NY , USA, 2010. Association for Computing
Machinery.
[45] Collin Lee, Seo Jin Park, Ankita Kejriwal, Satoshi Matsushita, and John Ousterhout. Imple-
menting linearizability at large scale and low latency. In Proceedings of the 25th Symposium
on Operating Systems Principles, pages 71–86. ACM, 2015.
[46] Kang Lee, John C Eidson, Hans Weibel, and Dirk Mohl. Ieee 1588-standard for a precision
clock synchronization protocol for networked measurement and control systems. In Confer-
ence on IEEE, volume 1588, page 2, 2005.
[47] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. Community
structure in large networks: Natural cluster sizes and the absence of large well-defined clus-
ters. Internet Mathematics, 6(1):29–123, 2009.
[48] Feng Li, Sudipto Das, Manoj Syamala, and Vivek R. Narasayya. Accelerating relational
databases by leveraging remote memory and rdma. In Proceedings of the 2016 International
Conference on Management of Data, SIGMOD ’16, page 355–370, New York, NY , USA,
2016. Association for Computing Machinery.
[49] Shuang Liang, Ranjit Noronha, and Dhabaleswar K. Panda. Swapping to remote memory
over infiniband: An approach using a high performance network block device. In 2005 IEEE
International Conference on Cluster Computing, pages 1–10, 2005.
[50] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph
Hellerstein. Graphlab: A new framework for parallel machine learning. In Proceedings of
the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence , UAI’10, page 340–349,
Arlington, Virginia, USA, 2010. AUAI Press.
[51] Yucheng Low, Joseph E Gonzalez, Aapo Kyrola, Danny Bickson, Carlos E Guestrin, and
Joseph Hellerstein. Graphlab: A new framework for parallel machine learning. arXiv preprint
arXiv:1408.2041, 2014.
[52] Hatem A Mahmoud, Vaibhav Arora, Faisal Nawab, Divyakant Agrawal, and Amr El Abbadi.
Maat: Effective and scalable coordination of distributed transactions in the cloud. Proceed-
ings of the VLDB Endowment, 7(5):329–340, 2014.
[53] Mohammad Sultan Mahmud, Joshua Zhexue Huang, Salman Salloum, Tamer Z. Emara, and
Kuanishbay Sadatdiynov. A survey of data partitioning and sampling methods to support big
data analysis. Big Data Mining and Analytics, 3(2):85–101, 2020.
[54] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty
Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In
Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’10, page 135–146, New York, NY , USA, 2010. Association for Computing Ma-
chinery.
[55] Daniel Mawhirter and Bo Wu. Automine: Harmonizing high-level abstraction and high per-
formance for graph mining. In Proceedings of the 27th ACM Symposium on Operating Sys-
tems Principles, SOSP ’19, page 509–523, New York, NY , USA, 2019. Association for Com-
puting Machinery.
[56] David L. Mills. A brief history of ntp time: Memoirs of an internet timekeeper. SIGCOMM
Comput. Commun. Rev., 33(2):9–21, April 2003.
[57] Shuai Mu, Yang Cui, Yang Zhang, Wyatt Lloyd, and Jinyang Li. Extracting more concurrency
from distributed transactions. In 11th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 14), pages 479–494, 2014.
[58] Andrew Pavlo, Carlo Curino, and Stanley Zdonik. Skew-aware automatic database partition-
ing in shared-nothing, parallel OLTP systems. In Proceedings of the 2012 ACM SIGMOD
International Conference on Management of Data, pages 61–72. ACM, 2012.
[59] Guo Pu, Lijuan Wang, Jun Shen, and Fang Dong. A hybrid unsupervised clustering-based
anomaly detection method. Tsinghua Science and Technology, 26(2):146–153, 2021.
[60] Sudip Roy, Lucja Kot, Gabriel Bender, Bailu Ding, Hossein Hojjat, Christoph Koch, Nate
Foster, and Johannes Gehrke. The homeostasis protocol: Avoiding transaction coordination
through program analysis. In Proceedings of the 2015 ACM SIGMOD International Confer-
ence on Management of Data, pages 1311–1326. ACM, 2015.
[61] Zhenyuan Ruan, Malte Schwarzkopf, Marcos K. Aguilera, and Adam Belay. AIFM: High-
performance, application-integrated far memory. In 14th USENIX Symposium on Operat-
ing Systems Design and Implementation (OSDI 20), pages 315–332. USENIX Association,
November 2020.
[62] Zhenyuan Ruan, Malte Schwarzkopf, Marcos K. Aguilera, and Adam Belay. AIFM: High-
Performance, Application-Integrated far memory. In 14th USENIX Symposium on Operat-
ing Systems Design and Implementation (OSDI 20), pages 315–332. USENIX Association,
November 2020.
[63] Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. Legoos: A disseminated, dis-
tributed OS for hardware resource disaggregation. In 13th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 18), pages 69–87, Carlsbad, CA, October 2018.
USENIX Association.
[64] Tianhui Shi, Mingshu Zhai, Yi Xu, and Jidong Zhai. Graphpi: High performance graph pat-
tern matching through effective redundancy elimination. In Proceedings of the International
Conference for High Performance Computing, Networking, Storage and Analysis, SC ’20.
IEEE Press, 2020.
[65] Yogesh Simmhan, Alok Kumbhare, Charith Wickramaarachchi, Soonil Nagarkar, Santosh
Ravi, Cauligi Raghavendra, and Viktor Prasanna. Goffish: A sub-graph centric framework
for large-scale graph analytics. In European Conference on Parallel Processing, pages 451–
462. Springer, 2014.
[66] James W. Stamos and Flaviu Cristian. Coordinator log transaction execution protocol. Dis-
trib. Parallel Databases, 1(4):383–408, October 1993.
[67] Michael Stonebraker. The case for shared nothing. IEEE Database Eng. Bull., 9(1):4–9,
1986.
[68] Xiao Tan, Jiliang Zhang, Yuanjing Zhang, Zheng Qin, Yong Ding, and Xingwei Wang. A
puf-based and cloud-assisted lightweight authentication for multi-hop body area network.
Tsinghua Science and Technology, 26(1):36–47, 2021.
[69] The H-Store Team. SmallBank Benchmark. http://hstore.cs.brown.edu/
documentation/deployment/benchmarks/smallbank/, 2018.
[70] Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J.
Zaki, and Ashraf Aboulnaga. Arabesque: A system for distributed graph mining. In Proceed-
ings of the 25th Symposium on Operating Systems Principles, SOSP ’15, page 425–440, New
York, NY , USA, 2015. Association for Computing Machinery.
[71] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and
Daniel J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. In
Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’12, pages 1–12, New York, NY , USA, 2012. ACM.
[72] Shin-Yeh Tsai and Yiying Zhang. Lite Kernel RDMA support for Datacenter Applications. In
Proceedings of the 26th Symposium on Operating Systems Principles, pages 306–324. ACM,
2017.
[73] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. Speedy
transactions in multicore in-memory databases. In Proceedings of the Twenty-Fourth ACM
Symposium on Operating Systems Principles, pages 18–32. ACM, 2013.
[74] Chenxi Wang, Haoran Ma, Shi Liu, Yuanqi Li, Zhenyuan Ruan, Khanh Nguyen, Michael D.
Bond, Ravi Netravali, Miryung Kim, and Guoqing Harry Xu. Semeru: A memory-
disaggregated managed runtime. In 14th USENIX Symposium on Operating Systems Design
and Implementation (OSDI 20), pages 261–280. USENIX Association, November 2020.
[75] Xingda Wei, Zhiyuan Dong, Rong Chen, and Haibo Chen. Deconstructing RDMA-enabled
distributed transactions: Hybrid is better! In 13th USENIX Symposium on Operating Systems
Design and Implementation (OSDI 18), pages 233–251, 2018.
[76] Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. Fast in-memory transac-
tion processing using RDMA and HTM. In Proceedings of the 25th Symposium on Operating
Systems Principles, pages 87–104. ACM, 2015.
[77] Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on
ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics,
MDS ’12, New York, NY , USA, 2012. Association for Computing Machinery.
[78] Xiangyao Yu. An evaluation of concurrency control with one thousand cores. PhD thesis,
Massachusetts Institute of Technology, 2015.
[79] Xiangyao Yu, George Bezerra, Andrew Pavlo, Srinivas Devadas, and Michael Stonebraker.
Staring into the abyss: An evaluation of concurrency control with one thousand cores. Pro-
ceedings of the VLDB Endowment, 8(3):209–220, 2014.
[80] Xiangyao Yu, Yu Xia, Andrew Pavlo, Daniel Sanchez, Larry Rudolph, and Srinivas Devadas.
Sundial: harmonizing concurrency control and caching in a distributed OLTP database man-
agement system. Proceedings of the VLDB Endowment, 11(10):1289–1302, 2018.
[81] Erfan Zamanian, Carsten Binnig, Tim Kraska, and Tim Harris. The end of a myth: Distributed
transactions can scale. arXiv preprint arXiv:1607.00655, 2016.
[82] Mingxing Zhang, Yongwei Wu, Youwei Zhuo, Xuehai Qian, Chengying Huan, and Kang
Chen. Wonderland: A novel abstraction-based out-of-core graph processing system. SIG-
PLAN Not., 53(2):608–621, mar 2018.
[83] Wei Zhang, Xiao Chen, and Jianhui Jiang. A multi-objective optimization method of initial
virtual machine fault-tolerant placement for star topological data centers of cloud systems.
Tsinghua Science and Technology, 26(1):95–111, 2021.
[84] Yiying Zhang, Jian Yang, Amirsaman Memaripour, and Steven Swanson. Mojim: A reli-
able and highly-available non-volatile memory system. SIGARCH Comput. Archit. News,
43(1):3–18, mar 2015.
[85] Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. Gemini: A computation-
centric distributed graph processing system. In 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 16), pages 301–316, Savannah, GA, November
2016. USENIX Association.