Building Straggler-Resilient and Private Machine Learning Systems in the Cloud
by
Hema Venkata Krishna Giri Narra
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
December 2020
Copyright 2020 Hema Venkata Krishna Giri Narra
Dedication
To my mother, Padmaja Narra
and
my grandfather, the late Venkata Krishna Rao Potluri
Acknowledgements
I want to express sincere gratitude to my advisor Professor Murali Annavaram. He has been incredibly
supportive, forgiving of mistakes, and patient throughout my Ph.D. From him, I learned how to find and
how to tackle challenging research problems practically. With his calm personality, he guided me during
the challenging periods of my Ph.D. During the many instances when I got stuck working on a problem,
he worked with and guided me to make progress. I appreciate his time and tremendous effort while writing
and submitting research papers. I am grateful for the immense freedom he provided me in exploring vastly
different research topics and our many stimulating discussions. I am always amazed at the speed with which
he grasps key concepts and problems when discussing new research topics. Without his guidance and
advice, I cannot imagine finishing my Ph.D.
I want to thank Professor Salman Avestimehr for his guidance and insights on using Coded Computing
techniques that make up the major part of this dissertation. I want to thank my thesis and qualifying
committee members Professor Bhaskar Krishnamachari, Professor Barath Raghavan, Professor Yan Liu,
and Professor Xuehai Qian, for their valuable suggestions for improving this dissertation.
I am immensely grateful to my research collaborator and friend Zhifeng Lin, for his contributions to the
projects in this dissertation. It has been a pleasure working together. I learned a more experiments-driven
approach to solving research problems while working with him.
I am thankful to my friend, collaborator, and fellow SCIP group member Kiran Matam, for his time
and support during my Ph.D. Our conversations around research, life, and spirituality formed an integral
part of my Ph.D. journey.
I had an enjoyable learning experience interning at Samsung, where I worked under the guidance of Dr.
Yang-Seok Ki and Dr. Stephen Choi. I am grateful to Dr. Yang-Seok Ki for his insights and suggestions
on the computer systems research.
I am thankful to fellow SCIP group members and alumni: Qiumin Xu, Yongqin Wang, Mohammad
Abdel-Majeed, Daniel Wong, Gunjae Koo, Hanieh Hashemi, Haipeng, Tingting Tang, Keshav Balasub-
ramaniam, Ruisheng Wang, Lakshmi Kumar, Hyeran Jeon, Abdulaziz Tabbakh, Abdulla Alshabanah,
Rachit, Sangwon, Waleed for their camaraderie, and their valuable suggestions during our group discus-
sions.
I am thankful to Professor Murali Annavaram’s wife, Sok, and their two children, Amrit and Akshar,
for hosting all the SCIP group members multiple times in their house. They made us feel at home.
I am thankful for my friends Deepan Muthirayan, Jaymin, Mohammad Noormohammadpour, Saurav,
Naveen, Sanmukh, Jitin, Anup, Shashi Kiran, Arvind, L Kiran, Jitendra, Ravi Teja, Sundar, Shreyas,
Karishma, Rama Reddy, Nathan, Hiteshi, Mehdi, Mahdi, Chao Wang, Qinyi, Manjunath, Dhananjay,
Mayank, Shivani, Navya Dasaratha, Vishnu, Sanjay, Amar Sagar Reddy, Sudheer, Shankar Ganesh, CSLV
Rohit, Gaurav, Amit Padhy, Vaibhav Sharma, and many others, for their warmth, lovely conversations
about life, research and everything else throughout my Ph.D.
I am grateful to my friends from school, Anil, and Chaitanya, who were sources of support and en-
couragement during the difficult times in my Ph.D.
This dissertation work would not have been possible without the regular support from Diane Demetras,
department student services director, and Tracy Charles, doctoral programs coordinator. Diane has been
like a departmental motherly figure to all electrical engineering Ph.D. students, including me. Tracy
has been immensely helpful whenever I approached her about my academic requirements and fellowship
related questions. I am thankful to Estela Lopez, our associate research administrator, for her continuous
assistance with my funding.
I am grateful to M. Vasanta Devi, the late M. Janardhana Rao, and G.K.B. Chowdary for their constant
encouragement and support since middle school.
I want to thank my grandmother, Jayaprada, for her affection throughout my life. My mother and
father, Padmaja and Subrahmanyeswara Rao, and my brothers, Sesha Giri and Uday Nag, supported me
through the ups and downs during my Ph.D. Without my family’s love and continuous support, pursuing
a Ph.D. and finishing this dissertation would not have been possible.
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables x
List of Figures xi
Abstract xiii
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Primary contributions of this dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Mitigating stragglers in distributed training . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Straggler-resilient distributed inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Privacy-preserving inference in the cloud . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Dissertation organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Background on Coded Computing 14
2.1 Coded distributed computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Slack Squeeze Coded Computing 17
3.1 Overheads of straggler mitigation techniques . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Speed prediction and coded data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 S²C² algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.1 Basic S²C² algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.2 General S²C² algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.3 Dealing with mis-prediction or failures . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.4 Robustness of S²C² . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Extension to bi-linear coded computing . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.5.1 LSTM based speed prediction model . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5.2 S²C² specifics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5.3 Computing applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.4 System setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.5 Verification in a controlled cluster . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6.1 Results from controlled cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6.1.1 Logistic Regression and SVM . . . . . . . . . . . . . . . . . . . . . . . 36
3.6.1.2 Page Rank and Graph Filtering . . . . . . . . . . . . . . . . . . . . . . 39
3.6.2 Results from industrial cloud deployment . . . . . . . . . . . . . . . . . . . . . . 39
3.6.2.1 Results in low mis-prediction rate environment . . . . . . . . . . . . . . 41
3.6.2.2 Results in high mis-prediction rate environment . . . . . . . . . . . . . 42
3.6.2.3 Results with S²C² on polynomial coding . . . . . . . . . . . . . . 44
3.6.2.4 Scalability studies on a larger cluster . . . . . . . . . . . . . . . . . . . 44
3.7 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.8 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4 Background on CNN based image classification 49
4.0.1 Image classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.0.2 Object detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.0.3 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 Collage inference 52
5.1 Characterizing tail latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Existing techniques and their limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3 Collage inference algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Collage-cnn architecture and implementation . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4.1 S-cnn architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4.2 Collage-cnn architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4.3 Training of collage-cnn models . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.5.1 Training Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.5.2 Accuracy of collage-cnn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.5.3 Stand alone inference using collage-cnn . . . . . . . . . . . . . . . . . . . . . . . 69
5.5.4 End-to-end system performance with collage-cnn . . . . . . . . . . . . . . . . . . 69
5.5.5 Comparison to alternate backup models . . . . . . . . . . . . . . . . . . . . . . . 75
5.6 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.7 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6 Origami inference 80
6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.1.1 TEEs and Intel SGX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2 Overheads of secure execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.3 Origami inference and two key ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3.1 Key idea 1: Model partitioning and offloading . . . . . . . . . . . . . . . . . . . . 84
6.3.2 Key idea 2: Reducing SGX execution further with Slalom . . . . . . . . . . . . . 87
6.3.3 Origami: Combining model splitting with blinding . . . . . . . . . . . . . . . . . 88
6.4 Conditional GAN threat models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.4.1 Formal privacy definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.4.2 C-GAN adversary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.3 Training the c-GAN adversary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4.4 Model partitioning algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.5.1 c-GAN architecture and implementation . . . . . . . . . . . . . . . . . . . . . . . 94
6.5.2 Implementation of models in SGX . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.6.1 Hardware configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.6.2 Partitioning and input privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.6.2.1 VGG-16 c-GAN reconstruction results . . . . . . . . . . . . . . . . . . 98
6.6.2.2 Inception-v3 c-GAN reconstruction results . . . . . . . . . . . . . . . . 101
6.6.3 Performance evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.6.3.1 SGX enclave memory usage . . . . . . . . . . . . . . . . . . . . . . . . 103
6.6.3.2 Inference runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.6.3.3 Power event recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.6.3.4 Comparison with non-private inference . . . . . . . . . . . . . . . . . . 108
6.6.3.5 Performance evaluation summary . . . . . . . . . . . . . . . . . . . . . 109
6.7 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.8 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7 Conclusion 113
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.2.1 Coded computation at parameter servers in distributed training . . . . . . . . . . . 114
7.2.2 Coded computation and hardware enclaves . . . . . . . . . . . . . . . . . . . . . 118
7.2.3 Adding theoretical guarantees to Origami inference . . . . . . . . . . . . . . . . . 118
Reference List 121
List of Tables
5.1 Usage of the 3x3 collage-cnn model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Usage of the 4x4 collage-cnn model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1 Enclave memory requirements for VGG16 . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.2 Recovery time from power events for VGG16 . . . . . . . . . . . . . . . . . . . . . . . . 107
List of Figures
1.1 Collage inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 An illustration of MDS-coded computation scheme presented in [64]. In this case we
assume worker node 2 is a straggler. In this figure, the blue color shows the portion of task
executed by worker nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Logistic regression experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Measured speeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Storage overhead of uncoded computation with accurate speed predictions . . . . . . . . . 23
3.4 S²C² illustration on MDS codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5 General S²C² on polynomial codes . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 LR execution time comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7 PR execution time comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.8 Execution time comparison on cloud when S²C² has low mis-prediction rate . . . . 39
3.9 Per worker wasted computation effort with low mis-prediction rate . . . . . . . . . . . . . 40
3.10 Execution time comparison on cloud when S²C² has high mis-prediction rate . . . 42
3.11 Per worker wasted computation effort with high mis-prediction rate . . . . . . . . . . . . 43
3.12 S²C² on polynomial codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.13 S²C² performance on a 51 node cluster . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 Architecture of Alexnet [61] Convolutional Neural Network (CNN) . . . . . . . . . . . . 50
5.1 Inference latency distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Collage inference algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3 Collage decoding scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4 Collage-cnn architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.5 Accuracy on 100 classes of ImageNet-1k . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.6 Comparison using 9 s-cnn models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.7 Comparison using 16 s-cnn models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.8 Comparison using 25 s-cnn models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1 Unsecure system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Secure system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 Comparison of runtimes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.5 Runtime variation with different partitioning points . . . . . . . . . . . . . . . . . . . . . 86
6.6 c-GAN adversary model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.7 c-GAN architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.8 Similarity between real and reconstructed images at different partition layers . . . . . . . . 98
6.9 Real images and reconstructed images using feature maps of different layers in VGG-16 . 99
6.10 Real images and reconstructed images using feature maps of different Inception Modules
in Inception-v3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.11 Inference Runtime offloading to GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.12 Inference Runtime offloading to CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.13 Baseline 2 runtime breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.14 Inference runtime offloading to GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.15 Inference runtime offloading to CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.1 An illustration of using centralized parameter server (left) where all the workers send their
gradient matrix G to a single parameter server. In this case the parameter server becomes
a central bottleneck (as well as a single point of failure). The figure on the right shows
a distributed parameter server with two servers and each worker node sends first half of
its gradient G₁ to the first parameter server and second half G₂ to the second parameter
server. Each parameter server computes one half of the new parameters M₁^next and M₂^next
and the workers concatenate the two halves to create the full parameter matrix M^next . . . 117
7.2 An illustration of CAPS scheme with three parameter servers. Each worker creates a
special encoded gradient matrix (G₁^i + G₂^i) to communicate with server 3. In this example
if parameter server 2 is a straggler then results from server 1 and server 3 are used to
reconstruct the next parameter matrix M at each worker node. . . . . . . . . . . . . . . . 117
7.3 The tradeoff of top-1 classification accuracy vs. . . . . . . . . . . . . . . . . . . . . . . 119
Abstract
Over the past decade, cloud computing services have increased accessibility, lowered costs, and enabled
seamless scaling. These traits are particularly useful for training and deploying machine learning (ML)
models in the cloud, as model training is a very compute-intensive task that can benefit from cloud-scale
services. In particular, distributed training of models reduces the wall clock time taken for convergence.
Similarly, performing inference in the cloud enables model providers to scale efficiently as inference de-
mand grows. However, distributed machine learning in the cloud has to contend with multiple challenges.
One challenge both distributed training and inference systems in the cloud face is the incidence of
straggler nodes during execution. Straggler nodes are too slow (or even dead) to respond to a task assign-
ment, causing the entire distributed system to slow down. The second challenge is preserving the privacy of
the data used by machine learning models in the cloud. This dissertation first develops straggler-resilience
techniques by designing efficient coded computing frameworks. It then develops techniques that leverage
secure hardware enclaves to preserve the privacy of data used in machine learning inference in the cloud.
Coded computing has been proposed to tolerate stragglers in distributed computing. However, the
coding strategies are designed for a specific worst-case straggler tolerance. The execution overheads are
proportional to the worst-case straggler counts, and the overheads of redundant computing on coded data
remain constant even when there are fewer than the worst-case stragglers. This dissertation presents a
dynamic workload distribution strategy for coded computation called Slack Squeeze Coded Computation
(S²C²). S²C² squeezes the compute slack (i.e., overhead) built into the coded computing frameworks by
efficiently assigning work for all fast and slow nodes according to their speeds and without needing to
re-distribute data. S²C² is particularly beneficial to machine learning models that rely on linear algebraic
operations such as support vector machines. S²C² enhances the benefit of coded computing of linear
algebraic operations by removing redundant computing when there are fewer than worst-case stragglers,
while allowing the worst-case straggler tolerance when needed.
Using the S²C²
design experience, the dissertation then broadens the use of coded computing to deal
with stragglers in deep neural networks, which include both linear and non-linear operations. A novel
coded redundancy model, Collage-cnn, is proposed to deal with stragglers in distributed image classifica-
tion systems. A collage-cnn model takes collage images formed by combining multiple images as input
and performs multi-image classification in one shot, albeit at slightly lower accuracy. Collage-cnn aug-
ments a collection of traditional single image classifier models with a single collage-cnn classifier, which
acts as a low-cost redundant backup. Collage-cnn provides backup classification results if any single image
classification request experiences a slowdown.
Finally, this dissertation focuses on protecting data privacy while performing inference in the cloud.
Origami inference provides privacy-preserving inference for large deep neural network (DNN) models
through a combination of secure enclave execution and cryptographic blinding, interspersed with accelerator-
based computations on the GPU. While secure enclaves provide privacy by themselves, they do not allow
the use of GPUs for inference. To tackle this challenge, Origami blinds the input data within a secure
enclave, offloads the linear computations in DNNs to execute on GPUs, and then de-blinds the data back
inside the enclave to perform non-linear operations. Rather than performing the repeated blinding and de-
blinding operations that incur high overhead, Origami partitions the DNN model into two partitions and
runs only the first partition in the secure enclave to preserve privacy, while running the second partition
entirely on a GPU. Origami relies on the empirical observation that the feature maps after the first several
layers can not be used to reveal the input data, eliminating the overhead of repeated transitions between
the enclave and GPUs.
Chapter 1
Introduction
1.1 Background
Artificial intelligence has contributed to advances across many fields, including com-
puter vision, autonomous driving, machine translation, health care, wireless communi-
cations, data center management, online advertising, recommendation systems, etc. The
availability of large public data sets, open-source learning frameworks for rapid proto-
typing, and relatively inexpensive cloud infrastructure accelerate the progress of machine
learning related technologies. Most of the recent breakthroughs in machine learning have
come from deep learning (DL). However, there are many prediction tasks where machine
learning models like Support Vector Machine (SVM) and Logistic Regression are still
widely used.
Supervised learning consists of two phases: a training phase and an inference (prediction serving)
phase. The training phase involves data collection and preprocessing, selecting
a model architecture, and training the parameters of the architecture to minimize a loss
function. The model is trained through forward and backward propagation operations.
The output is a trained model that stores the architecture and the parameter values of
the architecture. During inference, the trained model is deployed to make predictions on
different inputs. The inference phase performs only the forward propagation operation on
the models.
As the complexity of a learning problem increases, the size of the machine learn-
ing model also increases. Larger models have many model parameters that can only be
trained with a substantial amount of training data. For instance, smaller image classifi-
cation tasks such as MNIST and CIFAR-10 used only 50000 training samples to train
high-accuracy models. However, to train high-accuracy models like ResNet-50 on com-
plex images such as in the ImageNet [25] classification task, more than 1 million training
samples and a substantial amount of computing resources are needed. ML models’ enor-
mous computing demand is being met by advances in hardware acceleration of ML,
using technologies such as graphics processing units (GPUs) and tensor processing units
(TPUs) [53].
Even with this substantial hardware acceleration, training on a single machine is un-
duly long, thereby hindering model architecture exploration studies. Distributed training
with Stochastic Gradient Descent (SGD) is used to train machine learning models on
large data sets. One approach to distributed training with SGD is to use a central server
(sometimes referred to as a master node) coordinating training across the many compute
nodes (workers). The large dataset is partitioned and distributed to workers. An initial
model is distributed to the worker nodes at the start of training. During each iteration
of synchronous SGD, each worker computes gradients of model parameters using data
from its input partition and sends these gradients back to the central server. The server
aggregates the gradients from all the workers, updates the model, and sends the updated
model to workers for the next iteration. This training process continues until the model
converges. After training, the models are deployed to perform inference. This is an ex-
ample of a data-parallel training approach, which is highly scalable and can be deployed
in a cloud setting without substantial model re-design.
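To make the synchronous data-parallel loop described above concrete, the following is a minimal NumPy sketch of the parameter-server round: workers compute gradients on their own partitions, the server averages them and updates the model. The least-squares gradient and all names here are illustrative stand-ins under simplifying assumptions, not the training systems used in this dissertation.

```python
import numpy as np

# Minimal sketch of synchronous data-parallel SGD with a central parameter server.

def compute_gradient(model, X, y):
    # Gradient of 0.5 * ||X w - y||^2 / N on this worker's data partition.
    return X.T @ (X @ model - y) / len(y)

def synchronous_sgd(partitions, model, lr=0.01, iterations=100):
    for _ in range(iterations):
        # Each worker computes a gradient on its own partition (in parallel in practice).
        grads = [compute_gradient(model, X, y) for X, y in partitions]
        # The central server waits for all workers, averages the gradients, updates the
        # model, and sends the updated model back for the next iteration.
        model = model - lr * np.mean(grads, axis=0)
    return model

# Example: 4 workers, each holding one partition of a synthetic regression dataset.
rng = np.random.default_rng(0)
w_true = rng.normal(size=10)
partitions = []
for _ in range(4):
    X = rng.normal(size=(256, 10))
    partitions.append((X, X @ w_true + 0.01 * rng.normal(size=256)))
model = synchronous_sgd(partitions, model=np.zeros(10))
```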
With the increasing accessibility, lower cost, and scale provided by cloud computing
services, training and deploying machine learning models in the cloud is an attractive
option. Offloading training and inference to the cloud comes with two main challenges.
First, as we scale-out computations across many distributed nodes in the cloud, one
needs to deal with the “system noise” that is due to several factors such as heterogeneity
of hardware, hardware failures, disk IO delay, communication delay, operating system
issues, maintenance activities, and power limits [9]. System noise leads to uneven ex-
ecution latencies where different servers may take different amounts of time to execute
the same task, even if the servers have identical hardware configuration. In extreme
cases, a server may even be an order of magnitude slower than the remaining servers;
such a server is called a straggler node. Such speed variations create significant delays in task executions
and lead to major performance bottlenecks since the central server waits for the slowest
nodes to finish their tasks. In particular, if a single worker takes a substantially longer
time to compute gradients on its data, then the central server must wait to average the
gradients before the next training iteration can begin. This phenomenon results in tail la-
tency, which is defined as the high percentile completion latency of the distributed tasks.
If the number of servers within a cluster experiencing this speed variance increases, the
probability of having long tail latency increases exponentially [22]. It can slow down
each iteration in distributed training.
In section 1.3, we introduce S²C² [85], a workload distribution algorithm for miti-
gating stragglers in distributed training using coded computing. S²C² is ideally suited
for tolerating stragglers in linear algebraic dominated ML algorithms, such as linear re-
gression and support vector machines. However, the increasingly popular deep neural
networks have a substantial number of non-linear computations that must also be handled
efficiently in a distributed setting. As such, section 1.4 introduces the collage inference
algorithm [84] that significantly reduces the tail latency of distributed image classifica-
tion services in the presence of stragglers. Collage inference is an end-to-end straggler
tolerance solution for any deep neural network that may have linear and non-linear com-
putations. Collage inference is well suited for cloud based inference schemes that require
strong Quality of Service (QoS) guarantees. High QoS is essential for end-user facing
applications, and not meeting QoS guarantees can adversely affect the interactiveness of
the applications [12, 22].
The second challenge facing ML models using cloud computing is the privacy and
confidentiality of the user data shared with the cloud computing services. Users of cloud
computing services expect the confidentiality of their data to be maintained. Newer
regulations like GDPR make it necessary for the service providers to protect user data
confidentiality. While running in the cloud, deep learning models can be exposed to a
wide attack surface consisting of malicious users, compromised hypervisors, and phys-
ical snooping, leading to data leakage. It is the service providers’ responsibility not to
compromise the privacy of user data accidentally or otherwise. In section 1.5, we intro-
duce the Origami inference technique [86], which provides privacy-preserving inference
while using convolutional neural network models through an elegant combination of
hardware enclave execution and cryptographic blinding, interspersed with accelerator-based
computation.
1.2 Primary contributions of this dissertation
This dissertation considers the problems of building machine learning systems that are
resilient to stragglers and that preserve data privacy in the cloud. It develops straggler-
resilience techniques by extending and proposing new coded computing frameworks. It
develops techniques that leverage hardware enclaves to preserve privacy in the cloud.
Sections 1.3, 1.4, 1.5 expand on each of these techniques.
1.3 Mitigating stragglers in distributed training
Chapter 3 considers the problem of mitigating stragglers in distributed training. “Cod-
ing” can provide a novel approach to mitigate the tail latency caused by straggler nodes,
and a new framework named “coded computation” has been proposed [28, 64, 70, 93,
114]. The key idea of coded computation frameworks is to inject computation redun-
dancy in an unorthodox coded form (as opposed to the state-of-the-art replication ap-
proaches) in order to create robustness to stragglers. For example, it was shown in [64]
that error correcting codes (e.g., Maximum-Distance-Separable (MDS) codes¹) can be
utilized to create redundant computation tasks for linear computations (e.g., matrix mul-
tiplication).
Overview of MDS coded computing: An (n,k)-MDS coded computation first de-
composes the overall computation into k smaller tasks, for some k ≤ n. Then it encodes
them into n coded tasks using an (n,k)-MDS code, and assigns each of them to a node
to compute. From the desirable “any k of n” property of the MDS code, we can accom-
plish the overall computation once we have collected the results from the fastest k out
of n coded tasks, without worrying about the tasks still running on the slow nodes (or
stragglers).
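As a concrete illustration of the “any k of n” property, the sketch below encodes a matrix-vector multiplication with a real-valued Vandermonde matrix and decodes from the k fastest results. It is an illustrative toy under simplifying assumptions (real deployments use finite-field or numerically robust codes); the MDS-coded computation framework itself is described in Chapter 2.

```python
import numpy as np

# (n, k)-MDS-style coded matrix-vector multiply over the reals: any k of the n
# worker results suffice to decode the full product A @ x.
rng = np.random.default_rng(0)
n, k = 5, 3
A = rng.normal(size=(6, 4))                  # 6 rows, split evenly into k = 3 sub-matrices
x = rng.normal(size=4)

subs = np.split(A, k, axis=0)                # A_1, ..., A_k
V = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)   # n x k encoder
coded = [sum(V[i, j] * subs[j] for j in range(k)) for i in range(n)]    # data per worker

results = {i: coded[i] @ x for i in range(n)}
fastest = [0, 2, 4]                          # any k workers that finish first
Vk = V[fastest]                              # k x k, invertible for distinct points
Y = np.stack([results[i] for i in fastest])
decoded = np.linalg.solve(Vk, Y)             # row j recovers A_j @ x
assert np.allclose(np.concatenate(decoded), A @ x)
```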
But (n,k)-MDS coding requires the designer to select a priori the number of strag-
glers that can be tolerated. This worst case straggler count determines the size of n for
a given k. Once the code is designed for worst case straggler counts, coded comput-
ing frameworks handle worst case stragglers efficiently, but their efficiency drops if the
number of persistent stragglers during a particular execution instance is fewer than what
the code is built to support. For instance, in MDS coding as explained above, there is no
significant performance benefit if there are fewer than n-k stragglers, since the coded
computation still has to wait for k nodes to complete their execution. In cloud computing
systems, partial stragglers are encountered more often, i.e., nodes that are slower but can
do a partial amount of the work assigned to them. The existing coded computation schemes
always waste the compute capability of the n-k partial stragglers. Further, they do
¹ MDS codes are an important class of block codes since they have the greatest error correcting and detecting capabilities. For more information see [46], Chapter 16.
not take advantage of the fact that the data needed for partial computation already
exists within each worker node and that these nodes can do a partial amount of work. It is this lack of
elasticity that makes coded computing unpalatable in large scale cluster settings. What
is ideal is to allow the developer to select high redundancy coding to be conservative
(essentially assuming a reasonable worst case straggler scenario) but allow a workload
scheduler to decide how much redundant computing to perform based on observed speed
variations in a distributed or cloud computing environment.
The first contribution of this dissertation is a new dynamic workload scheduling strat-
egy for coded computing that is elastic with the speeds of nodes measured during run-
time, irrespective of how much redundancy is chosen for creating the coded data. Our
proposed S²C²
(Slack Squeeze Coded Computing) strategy adapts to the varying num-
ber of stragglers by squeezing out any computation slack that may be built into the coded
computation to tolerate the worst-case execution scenarios. The performance of S²C²
is determined by the actual speeds measured and the actual number of very slow nodes
seen rather than by the redundancy used in encoding the data. As the speeds of nodes
change, S²C²
responds by appropriately increasing or decreasing the amount of work
allocated to each node in the cluster to maximize the performance.
To predict the speeds of the nodes as they change during runtime, we use a prediction
mechanism. We model speed prediction as a time series forecasting problem and use
a Long Short-Term Memory (LSTM) based learning model to predict the speeds. These
predicted speeds are used by S²C²
to do work allocation among the nodes.
We demonstrate the performance gains of S²C²
on top of an MDS-coded dataset by
deploying on the cloud, DigitalOcean [1]. While executing algorithms such as gradient
descent, graph ranking, and graph filtering, S²C² can reduce the total compute latency by
up to 39.3% over the conventional coded computation and by up to 19% over the fine-
grained replication approach. We go beyond matrix-vector multiplication to demonstrate
the versatility of S²C²
by applying its workload distribution and scheduling strategies on
top of polynomial code [114], a coded computing strategy for polynomial computations.
One limitation of S²C² is that it supports distributed computation only for linear algebraic
operations. While these operations are extensively used in many ML models, DNNs
also use several non-linear operations that are not handled by MDS-codes. In the next
section, we discuss our contribution to tolerating stragglers in DNN inference.
1.4 Straggler-resilient distributed inference
In chapter 5 of this dissertation, we tackle stragglers in distributed inference. Given
the immense computational demands of training, many research efforts have tackled dis-
tributed training in the cloud [15,38,68,69,87]. Cloud based inference, on the other hand,
poses a different set of constraints while deploying as a service. Model deployment in
cloud for inference is generally concerned with quality of service (QoS) guarantees pro-
vided to the user. From the user perspective one critical QoS metric is the query latency.
However, cloud services are prone to straggler problems, which lead to unexpected vari-
ability in inference latency. Prior coded computing techniques have been applied to
mitigate stragglers in deep learning inference [59], but they suffer from a significant drop
in accuracy.
Prediction serving systems deploy a front-end load balancer that receives requests
from multiple users and submits them to the back-end cloud instances. In this setting, the
load balancer has the unique advantage of treating multiple requests as a single collective
and creating a more cost-effective redundancy strategy.
We propose the collage inference technique as a cost-effective redundancy strategy to
deal with variance in inference latency. Collage inference uses a unique convolutional
neural network (CNN) based coded redundancy model, referred to as a collage-cnn, that
can perform multiple predictions in one shot, albeit at some reduction in prediction ac-
curacy. Collage-cnn is like a parity model where the input encoding is the collection
of images that are spatially arranged into a collage, as depicted in figure 1.1. Its output
is decoded to recover the missing predictions for images whose models are taking too long to
complete, like model 4 in the figure, illustrated with an X on top. This coded redun-
dancy model is run concurrently as a single backup service for a collection of individual
image inference models. We present the design and implementation of the collage-cnn
model and demonstrate the effectiveness of collage inference on cloud deployments.
Collage-cnn is focused on the image classification application since it is a fundamental
component of many deep learning based image processing systems.
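The encoding and decoding idea can be illustrated with a small sketch: individual requests are tiled into a collage (the encoding), and any straggling single-image prediction is backfilled from the prediction for the corresponding collage cell (the decoding). The helper names and the cell-by-cell mapping below are illustrative simplifications; the actual collage-cnn model and its decoding scenarios are described in Chapter 5.

```python
import numpy as np

def make_collage(images, grid=2):
    # Encode: tile grid x grid equally sized images (H, W, C) into one collage image.
    h, w, c = images[0].shape
    collage = np.zeros((grid * h, grid * w, c), dtype=images[0].dtype)
    for idx, img in enumerate(images):
        row, col = divmod(idx, grid)
        collage[row * h:(row + 1) * h, col * w:(col + 1) * w] = img
    return collage

def decode(single_preds, collage_preds):
    # Decode: fill in predictions from straggling single-image models (marked None)
    # using the prediction for the corresponding collage cell.
    return [p if p is not None else collage_preds[i] for i, p in enumerate(single_preds)]

collage = make_collage([np.zeros((224, 224, 3), dtype=np.uint8)] * 4)   # a 2x2 collage
single_preds = ["cheetah", "camel", None, "frog"]     # the third request is straggling
collage_preds = ["tiger", "camel", "deer", "frog"]    # one prediction per collage cell
print(decode(single_preds, collage_preds))            # ['cheetah', 'camel', 'deer', 'frog']
```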
Figure 1.1: Collage inference
1.5 Privacy-preserving inference in the cloud
Chapter 6 considers the problem of preserving input data privacy during cloud based
inference. The proposed approach to preserve privacy uses trusted execution environ-
ments (TEE) like Intel SGX enclaves [81], ARM TrustZone [7] or Sanctum [18]. TEEs
can be leveraged to protect privacy of user data. Classic techniques like data encryption
can protect the data during its storage and communication phases. TEEs complement
these protections by protecting the data during the computation phase. They achieve this
task by using a combination of hardware and software techniques to isolate and protect
sensitive data and computations. These security features can be exploited to provide
private inference capabilities. For instance, one approach for private inference is to run
the DNN model within an Intel SGX enclave, which is invisible to the cloud service
provider or even a potential hacker with root access to the cloud server. The user then
sends encrypted data into the enclave, which is then decrypted within the enclave before
running the inference entirely within the enclave.
However, running an entire model inference within a secure enclave provides sev-
eral practical hurdles. First, accelerators such as GPUs do not support trusted execution.
Hence, applications cannot take advantage of the growing list of deep learning accelera-
tors, such as GPUs and TPUs, if they rely purely on enclaves to achieve privacy. Second,
hardware enclaves do not support the efficient execution of arbitrarily large programs.
Typically the program memory footprint is limited to a threshold. This threshold is less
than 128 MB for Intel SGX enclaves. When the program memory footprint exceeds this
threshold, frequent swapping of data in and out of SGX leads to significant perfor-
mance slowdowns. This overhead stems from the fact that moving pages in and out of
SGX enclaves requires decryption and encryption of data.
Popular deep learning models such as VGG-16 and VGG-19 [99] have a large mem-
ory footprint, larger than the SGX memory limit. The performance gap between running
deep learning models within SGX compared to running their inference on an untrusted
GPU/CPU is very high. As we will show later in the dissertation, in our experimental
setup running a VGG-19 model inference entirely within SGX is 105 times slower than
running it on a GPU, and is 6.5 times slower than running it on a CPU. These drastic
slowdowns make it unpalatable to run the entire inference models within SGX.
Model splitting: To minimize the negative performance impacts of SGX, hardware
vendors recommend that a program be carefully split into sensitive and non-sensitive
parts, and schedule only the sensitive part to be executed inside the TEE. They also
recommend minimizing the sensitive code to reduce latency. Following this guidance
specifically for deep neural network (DNN) models, we perform some of the DNN layers
within SGX and allow other layers to be executed outside of SGX. Since DNNs provide
a clean layer abstraction, it is possible to re-design the models such that only the first few
layers are executed within a secure container while the remaining layers are executed in
an untrusted container. The computations that are performed outside of SGX can take
advantage of accelerators such as GPUs. However, care must be taken that the
computations performed on a GPU, without the benefit of SGX protection, still
protect the privacy of the user input.
Given the above-discussed limitations, in this dissertation, we propose Origami In-
ference, a new inference framework that lowers the performance overhead of using a
TEE while protecting the privacy of the user data. It is built on Intel SGX, but comes
with the flexibility to run models that are much larger than SGX’s protected memory
size, and can exploit insecure accelerators, such as GPUs. Origami is built on the insight
that only the first few layers of the model contain most of the information that can be
used to reconstruct the model’s input, and the output from deeper layers of the model
can not help with input reconstruction. In Origami, a pre-trained model is split into two
partitions. The first partition consists of multiple layers of a DNN. Each layer within
the first partition is split partially between GPU and SGX enclave. We use the Slalom
approach [105] to offload computationally intensive convolutions (basically matrix mul-
tiplications) on GPUs while allowing the non-linear operations (such as ReLUs) to be
performed within an enclave. Data privacy is preserved using cryptographic blinding
of the convolution data before offloading to the GPU (more details of blinding later).
Unlike Slalom, which continues to split every DNN layer between GPU and SGX, lead-
ing to unnecessary overheads, Origami dynamically switches to offloading the second
partition of the DNN model to execute entirely on a GPU (or even a CPU).
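The following is a conceptual NumPy sketch, under simplifying assumptions, of the flow just described: Slalom-style blinding for a linear layer of the first partition, de-blinding plus the non-linear ReLU inside the enclave, and the second partition running entirely on the GPU. In the real system the enclave code runs inside SGX, the blinding pairs are precomputed over quantized values, and the models are full CNNs; all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(size=(256, 784))   # linear layer inside the protected first partition
W2 = rng.normal(size=(10, 256))    # second partition, runs entirely on the GPU
r = rng.normal(size=784)           # random blinding factor (kept inside the enclave)
W1r = W1 @ r                       # precomputed unblinding term (kept inside the enclave)

def gpu_linear(W, v):
    # Stand-in for the untrusted GPU: for partition 1 it only ever sees blinded inputs.
    return W @ v

def enclave_partition1(x):
    blinded = x + r                    # blind the private input inside the enclave
    y = gpu_linear(W1, blinded)        # GPU computes W1 @ (x + r)
    return np.maximum(y - W1r, 0.0)    # de-blind and apply the ReLU inside the enclave

def partition2_on_gpu(h):
    # The intermediate feature map h is assumed safe to expose (see Section 6.4).
    return gpu_linear(W2, h)

x = rng.normal(size=784)               # user input, decrypted inside the enclave
logits = partition2_on_gpu(enclave_partition1(x))
```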
We present and evaluate a conditional GAN adversary model to verify if a user’s input
can be reconstructed from the intermediate data sent to the GPU. We demonstrate that it
is possible to partition DNNs so that the computation of the first partition of the model
that runs using cryptographic blinding between GPU and SGX can be minimized. For
example, for the VGG-16 model we find that it is sufficient to run the first 4 convolutional
layers using blinding to protect input privacy. The remaining 9 convolutional layers and
3 fully connected layers can be executed completely in the open without the risk of input
recovery. Our evaluations demonstrate that up to 15x speedup in inference latency can
be achieved compared to running the full model inside the SGX secure enclave.
1.6 Dissertation organization
The rest of the dissertation is organized as follows. Chapter 2 provides background on
coded computing. Chapter 3 describes the Slack Squeeze Coded Computing algorithm.
Chapter 4 provides background on image processing and CNNs. Chapter 5 describes the
Collage inference technique. Chapter 6 describes the Origami inference technique. Chapter
7 concludes the dissertation and discusses some future research directions.
Chapter 2
Background on Coded Computing
In this chapter, in section 2.1, we provide background on coded computing, which forms
the foundation for the first and second parts of the dissertation.
2.1 Coded distributed computation
We explain the coded computation concept through the application of Maximum Dis-
tance Separable (MDS) codes. Let us consider a distributed matrix multiplication prob-
lem, in which, as shown in Fig. 2.1, a central server (or a master node) wants to multiply
a data matrix A with the input vector x to compute Ax. The data matrix A is dis-
tributed across 3 worker nodes on which the matrix multiplication will be executed in a
distributed manner.
One natural approach to tackle this problem is to vertically and evenly divide the data
matrix A into 3 sub-matrices, each of which is stored on one node. Then when each
node receives the input x, it simply multiplies its locally stored sub-matrix with x and
returns the results, and the master vertically concatenates the returned matrices to obtain
Figure 2.1: An illustration of MDS-coded computation scheme presented in [64]. In this case we assume worker node 2 is a straggler. In this figure, the blue color shows the portion of task executed by worker nodes.
the final result. However, we note that since the uncoded approach relies on successfully
retrieving the task results from all 3 nodes, it has a major drawback: once one of the
nodes runs slow, the computation may take a long time to finish. The coded computation framework
deals with slow or straggler nodes by optimally creating redundant computation tasks.
As shown in Fig. 2.1, a coded computing scheme vertically partitions the data matrix A
into 2 sub-matrices A₁ and A₂, and creates one redundant task by summing A₁ and A₂.
Then A₁, A₂ and A₁ + A₂ are stored on worker nodes 1, 2, and 3 respectively. In the
case of Fig. 2.1, the final result is obtained once the master receives the task results from
any 2 out of the 3 nodes, without needing to wait for the slow/straggler node. Let us
assume worker node 2 is a straggler and the master node only collects results from nodes
1 and 3. Then the master node can compute A₂x by subtracting the computed task of
node 1, i.e. A₁x, from the computed task of node 3, i.e. (A₁ + A₂)x.
As illustrated in the above example, an (n,k)-MDS coded computation is a scheme
where the master may encode and distribute the data to all n nodes but the master has to
wait only for any k nodes to return their results to decode the full computational output.
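The (3,2) example of Fig. 2.1 can be written out directly in NumPy as a short sketch: A₁, A₂, and A₁ + A₂ are stored on the three workers, worker 2 straggles and its result is dropped, yet the master still recovers Ax from the other two results.

```python
import numpy as np

# NumPy sketch of the (3,2) coded example of Fig. 2.1.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
x = rng.normal(size=4)

A1, A2 = np.split(A, 2, axis=0)              # row-wise split into two halves
partitions = [A1, A2, A1 + A2]               # data stored on workers 1, 2, 3

results = [P @ x for P in partitions]        # each worker computes its product
results[1] = None                            # worker 2 is a straggler; its result is lost

A1x = results[0]
A2x = results[2] - A1x                       # (A1 + A2)x - A1x recovers A2x
assert np.allclose(np.concatenate([A1x, A2x]), A @ x)
```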
Broader coded computing: Maximum Distance Separable (MDS) codes, described
previously inject redundancy to tolerate stragglers in linear computations. Coded com-
puting is also applicable to a wider range of compute intensive algorithms, going beyond
linear computations. Polynomial coded computing [41, 114] can tolerate stragglers in
computations that solve polynomial equations such as Hessian matrix computation. La-
grange coded computing [115] can add coded redundancy to tolerate stragglers in any
arbitrary multivariate polynomial computation, thereby extending the reach of coded
computing to a wide range of application domains such as general tensor algebraic func-
tions, inner product functions, functions computing outer products, and tensor contrac-
tions [95]. Parity Models [60] proposes a general parity models framework, ParM, for
coding-based resilient inference in tasks such as image classification, speech recognition
and object localization.
Chapter 3
Slack Squeeze Coded Computing
In this chapter, we discuss the S²C² technique. Section 3.1 provides the overheads
of existing straggler mitigation techniques. Section 3.2 describes the speed prediction,
section 3.3 describes the proposed S²C² algorithm, section 3.4 describes extensions to
non-linear coded computing, section 3.5 provides implementation and system details,
and section 3.6 shows different evaluations of the technique.
3.1 Overheads of straggler mitigation techniques
Consider an uncoded strategy with r-replication, i.e., each data partition is replicated
across r different worker nodes where r is the replication factor. Consider a node N
executing task T on data partition DP_T. If the node N is determined to be a straggler at
some future time, the master node can replicate task T on any one of the nodes which
has a replica of data partition DP_T to speed it up. However, there are three challenges.
First, when should the master determine that N is a straggler? Second, even if the master
has early knowledge of N as a straggler, it is restricted to launching the task T only on a
subset of nodes that have the required data partition DP_T. Third, in the worst case, if all
the nodes with replicas are also stragglers, i.e., if the system has r stragglers, the uncoded
replication strategy cannot speed up computation at all. An alternative is to move the
data partition DP_T to another available faster node and execute T on that node. This
option forces data transfer time into the critical path of the overall execution time.
Next let us consider the (n,k)-MDS coded computation on matrix multiplication.
The master node divides the original matrix A into k sub-matrices, encodes them into n
partitions and distributes them to workers. As we discussed before, a small k needs to
be chosen for dealing with worst case scenarios. However, this over-provisioning with a
small k comes with a price. If the original data size is S, then each of the worker nodes
must compute on a coded partition of size (S/k). If k becomes smaller, each worker node
has to execute a larger fraction of the computation independent of its actual speed.
On the other hand, with a large k the robustness of the computation decreases. This is a
difficult tradeoff since the selection of k must be done prior to creating a correct encoding
and decoding strategy, and distribution of the encoded data partitions appropriately to all
nodes, which are usually done once before executing the given workload.
One solution to deal with the straggler uncertainty with MDS-coded computation
is to store multiple encoded partitions in each worker node, such that the system can
adapt and choose the appropriate encoded partition dynamically when the number of
stragglers changes in the cluster. For example, in a cluster with 12 worker nodes, each
worker node can store a (12,9)-MDS encoded partition and a (12,10)-MDS encoded
partition at the same time. Assume the original data size is S. When it is observed that there
are three straggling nodes, (12,9)-MDS-coded computation will be performed with each
Figure 3.1: Logistic regression experiments. Normalized computation latency with 0, 1, 2, and 3 stragglers for uncoded with 3-replication, (12,10)-MDS coded computation, and (12,9)-MDS coded computation.
worker node operating on an encoded partition of size (S/9); and when it is observed
there are fewer straggling nodes, (12,10)-MDS-coded computation is performed with
each worker node operating on a partition of size (S/10). This approach is optimal only
for two scenarios, and supporting a wider range of scenarios means storing more copies
of the encoded data. This dramatically increases the storage overhead. It is possible to
encode the data at run time and redistribute the large data partitions based on measured
speeds and slow node count. However, this will dramatically increase the communication
overhead and is not practical.
Figure 3.1 shows the computation latency in a cluster of 12 nodes with three schemes:
uncoded with 3-replication, (12,10)-MDS coded computation, and (12,9)-MDS coded
computation as the number of stragglers increases. In the uncoded with 3-replication
strategy, if the number of straggler nodes is r = 3 or more, computation latency increases
significantly. The computation time of (12,10)-MDS coded computation increases expo-
nentially when there are more than two stragglers. The computation time of (12,9)-MDS
coded computation is constant with more stragglers. But there is an increase in baseline
execution latency with this strategy compared to other schemes, because (12,9)-MDS
Figure 3.2: Measured speeds
code requires each worker node to perform more work than (12,10)-MDS code, even if
the number of stragglers is fewer than 3.
In summary, although conservative MDS-coded computation can provide robust pro-
tections against stragglers, its computation overhead per node is higher and remains the
same even when all the nodes in the cluster are fast, since it does not make efficient use
of all worker nodes. These drawbacks bring us to our key idea which is to have a work-
load scheduling strategy that provides the same robustness as the (n,k)-MDS-coded
computation, but only induces a much smaller computation overhead, as if (n,s)-MDS-
coded computation is being used, when there are only n-s stragglers in the cluster with
0 ≤ (n-s) < (n-k).
3.2 Speed prediction and coded data
It is important to know the speed variations across compute nodes before adaptively
assigning work to them in the MDS-coded computing framework. To collect and ana-
lyze the execution speeds of servers, we conducted experiments on 100 compute nodes,
referred to as droplets in Digital Ocean cloud [1]. Each droplet is similar to a t2.micro
shared compute instance in Amazon AWS. For our experiments, each droplet node exe-
cutes matrix-matrix multiplication and logs its execution times after completion of every
1% of the task. The size of each matrix is 20000 by 5000. We analyzed the measured
speeds at 1% granularity intervals at all nodes. Figure 3.2 shows the speed variations in 4
of the representative nodes. The x-axis in each plot corresponds to time. The y-axis in each plot
corresponds to the speed of the node normalized by its maximum observed speed during the
experiment.
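A minimal sketch of this per-node speed logging is shown below: the matrix product is executed in 1% chunks and the throughput of each chunk is recorded, then normalized by the maximum observed speed. The matrix sizes are scaled down here so the sketch runs quickly; the actual experiment uses 20000-by-5000 matrices on DigitalOcean droplets.

```python
import time
import numpy as np

# Perform a matrix-matrix multiplication in 1% chunks and log per-chunk throughput.
rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 500))
B = rng.normal(size=(500, 500))

chunks = np.array_split(np.arange(A.shape[0]), 100)   # 100 chunks, ~1% of the rows each
speeds = []
for rows in chunks:
    start = time.perf_counter()
    _ = A[rows] @ B
    elapsed = time.perf_counter() - start
    speeds.append(len(rows) / elapsed)                # rows processed per second

speeds = np.array(speeds) / np.max(speeds)            # normalize by the max observed speed
```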
One critical observation from the figure is that while the speed of each node varies
over time, on average the speed observed at any time slot stays within 10% for about
10 samples within the neighborhood. This relatively slow changing speed provides us
an opportunity to estimate speeds of nodes in future intervals using speeds from past
intervals. The speed estimates can be reasonably accurate for most of the time intervals
except for a short time window when the speed changes drastically, but again we will
soon be able to track the new speed as the nodes stay in that speed zone.
To find a good prediction mechanism, we considered the speeds of each node as a
time series and modeled our problem as a time series forecasting problem. We evaluated
several LSTM (Long Short-Term Memory [47]) and Auto Regressive Integrated Moving
Average (ARIMA) models to predict the speeds. The details of the speed prediction
models are described in section 3.5.1. The prediction accuracy of the LSTM model
is better than that of the ARIMA models. As expected, the model prediction lags behind
only immediately after a large speed variance is observed, but it catches up with the
observed speed soon after.
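As an illustration of casting speed prediction as one-step-ahead time series forecasting, the following is a minimal PyTorch LSTM sketch. The window length, hidden size, and training settings here are arbitrary placeholders; the prediction model actually used is described in section 3.5.1.

```python
import torch
import torch.nn as nn

# Minimal one-step-ahead LSTM forecaster for a node's speed time series.
class SpeedLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, window):              # window: (batch, steps, 1) of past speeds
        out, _ = self.lstm(window)
        return self.head(out[:, -1])        # predicted speed for the next interval

def make_windows(series, steps=10):
    xs = torch.stack([series[i:i + steps] for i in range(len(series) - steps)])
    ys = series[steps:].unsqueeze(1)
    return xs.unsqueeze(-1), ys

speeds = torch.rand(200)                    # stand-in for one node's measured speed trace
xs, ys = make_windows(speeds)
model = SpeedLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for _ in range(50):                         # short training loop, for illustration only
    opt.zero_grad()
    loss = loss_fn(model(xs), ys)
    loss.backward()
    opt.step()
next_speed = model(xs[-1:]).item()          # forecast for the next interval
```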
Based on this critical observation, we hypothesize that reliably estimating the speeds
for the next computation round allows the master node to perform appropriate task assignment
to the workers such that the computations performed by all workers can be utilized to ob-
tain the final result. But this fine-granularity task assignment and utilization of all worker
nodes becomes feasible only if there is no data movement overhead between rounds of
computation. Coded computing is well suited for this fine-grained task assignment since
the input data that is distributed among workers is encoded and as a result there would be
no additional data movement needed between rounds of computation. However, this fea-
ture is not exploited in conventional MDS-coded computation. In uncoded computation,
to assign workload optimally based on the predicted speeds, either each worker node
will need to store a significant percentage of the entire data, which can impose a huge stor-
age overhead; or it requires the master to redistribute the data among nodes at run-time,
which can add huge communication overhead for iterative workloads such as gradient
descent and page rank. To measure the storage overhead of uncoded computation, we
performed experiments in our local cluster consisting of 12 worker nodes. We measure
the total data moved to each node between rounds of computation and consider it as
the effective storage needed at that node to avoid additional data movement. Figure 3.3
shows the mean effective storage needed at each node to avoid data movement during
the course of 270 gradient descent iterations for Logistic Regression. In this experiment,
the uncoded computation has accurate predictions of the speed of nodes for the next iteration. It
needs 67% of the total data to be stored at each worker node to have zero data movement
overhead. For S2C2 with (12,10)-MDS coding, the data storage remains fixed at 10% of
the total data, which is much lower than that of the uncoded computation.
Figure 3.3: Storage overhead of uncoded computation with accurate speed predictions (average storage requirement per node over the iterations, as a percentage of the full data; series: Uncoded, S2C2)
Following from these observations, we argue for S2C2, which exploits the unique feature
of coded data availability and thereby utilizes the compute capacity of all worker nodes.

3.3 S2C2 algorithm

3.3.1 Basic S2C2 algorithm
Figure 3.4: S2C2 illustration on MDS codes. (a) (4,2)-MDS: size of A_1 = A_2 = A_3 = A_4 = A/2. (b) (4,3)-MDS: size of A_1 = A_2 = A_3 = A_4 = A/3. (c) S2C2 on (4,2)-MDS: size of A_1 = A_2 = A_3 = A_4 = A/2, and the work performed at each non-straggler worker is A/3.
The major goals of the algorithm are to achieve high tolerance to stragglers and reduce
computation work assigned per worker when the number of slow nodes observed during
run time is less than the conservative estimate.

Figure 3.5: General S2C2 on polynomial codes

To achieve high straggler tolerance, the
master node encodes and distributes the large matrix using a conservative (n,k)-MDS
coding once at the beginning. To assign reduced computation work to worker nodes, the
master node then employs the S2C2 algorithm. There are two key insights, in a cluster using
conservative (n,k)-MDS coding, that underlie our algorithm.
• Each worker node stores a highly redundant, encoded matrix data partition.
• The master node can decode and construct the final product as long as it receives
any k out of n responses corresponding to each row index of the partitioned matrix.
Let there be n-s < n-k stragglers in the (n,k)-MDS coded cluster. As we explained in
the previous section, when there are (n-s) stragglers, (n,s)-MDS coding is the best suited
coding strategy. But rather than using a new (n,s)-MDS code to re-encode the data, we
use the (n,k)-MDS coded data partition as is and change the amount of work done by
each node. In particular, S2C2 allocates a decodable computational work assignment per
node equal to that in (n,s)-MDS coding instead of (n,k)-MDS coding. If D is the number
of rows in the original matrix, each node is allocated (k/s) * (D/k) = D/s rows
to be computed.
Figure 3.4 provides an illustration of the S2C2 strategy in a cluster consisting of 4
worker nodes (and 1 master node). Figure 3.4a shows the conventional (4,2)-MDS coded
computation performed when worker 4 is the only straggler node and the remaining 3
workers have same speed. Note that (4,2)-MDS coding is conservative here, since it can
support 2 stragglers but in this case there is only 1 straggler. Each worker node computes
on its full partition but the master node needs only the results from workers 1 and 2 and
can ignore the result from worker 3. Sub-matrices A_1, A_2 refer to the vertical divisions
of the matrix A. Data stored in worker 3 is a coded matrix, A_3 = A_1 + A_2. Data stored
in worker 4 is coded as A_4 = A_1 + 2A_2. These codes are generated as per MDS-coding
principles.
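To make this (4,2)-MDS encoding and decoding concrete, the following minimal NumPy sketch (with assumed matrix sizes and a hypothetical straggler pattern, not taken from the dissertation's implementation) shows how the master can recover the full product Ax from any two worker responses.

    import numpy as np

    # Minimal sketch of the (4,2)-MDS example in Figure 3.4a; sizes are illustrative.
    D, d = 8, 4
    A = np.random.randn(D, d)
    x = np.random.randn(d)

    A1, A2 = A[:D // 2], A[D // 2:]              # uncoded row partitions
    parts = [A1, A2, A1 + A2, A1 + 2 * A2]       # data stored at workers 1..4

    # Suppose workers 1 and 3 respond first; worker 4 is the straggler.
    y1, y3 = parts[0] @ x, parts[2] @ x
    y2 = y3 - y1                                 # decode A2 x from (A1 + A2) x and A1 x
    assert np.allclose(np.concatenate([y1, y2]), A @ x)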
In figure 3.4b, the conventional (4,3)-MDS coded computation when worker 4 is the
straggler node is shown. Each non-straggler node computes on its full partition, but
the size of the partition here is smaller than the partition size in the previous coding. The
master node needs the results of all workers to construct the final product. Sub-matrices
A_1, A_2, A_3 are the vertical divisions of the matrix A into three parts. Data stored in
worker 4 is coded as A_4 = A_1 + A_2 + A_3.
S2C2 with (4,2)-MDS coded computation for this scenario is shown in figure 3.4c.
If we consider the data in each worker as composed of 3 equal-size partitions, worker
1 computes only on the first and second of its partitions. Worker 2 computes only on the
first and third of its partitions. Worker 3 computes only on the second and third of its
partitions. As a result, each worker node performs less computation, and this amount
is equal to the work performed by each worker in conventional (4,3)-MDS
coded computation. Partitions to be computed at each worker are assigned to ensure that
each row index is computed by exactly two workers. This is necessary for successfully
decoding the results at the master node at the end of computation. For instance, worker
node 3 computes on the middle third of matrix A_3 (which is the coded A_1 + A_2 matrix)
and worker node 2 skips computing that portion of A_2. As such, the master has to decode
the missing A_2 from the computations performed by worker node 1 and worker node 3
to reconstruct the middle portion of A_2.
3.3.2 General S2C2 algorithm
In cloud computing services and data centers, compute nodes within a cluster can
have different speeds during run time, as described before, due to them being shared or
due to various micro-architectural events such as cache misses and other control/data bot-
tlenecks. They can also be heterogeneous. We present a general S2C2 algorithm which,
unlike basic S2C2, can consider the variation in speeds of all nodes and assign work
to them. At the beginning of execution of every application, the matrix data is partitioned,
encoded, and distributed to the worker nodes using (n,k)-MDS coding. For efficient
decoding and work allocation, the general S2C2 algorithm also decomposes and considers
each matrix partition as composed of chunks (groups) of rows, i.e., over-decomposition.
The speed predictions from the LSTM model are provided to general S2C2. Then workers
are sorted according to their speeds. Starting from the worker with the highest speed,
each worker is assigned chunks to be computed equal to the ratio of its speed to the total
available computational speed of all workers. If the assigned chunks for a worker turn
out to be more than the total chunks in the partition already stored at that worker, the
algorithm re-assigns these extra chunks to the next worker. This case occurs when one
worker is much faster than all other workers. The algorithm is summarized in Algorithm
1. Note that the general S2C2 algorithm uses relative speed predictions of the
nodes during work allocation. In scenarios where all non-straggler nodes have equal
speed, general S2C2 reduces to basic S2C2.
3.3.3 Dealing with mis-prediction or failures
The speed prediction algorithm can mis-predict when there is a sudden and significant
change in the speeds of workers. Also, a worker node can die or fail during
execution. To handle these scenarios, the S2C2 algorithm employs a timeout mechanism.
S2C2 collects results from the first k workers that complete their work and measures
their average response time. If the remaining n - k workers do not respond within
15% of the average response time, S2C2 considers this situation a mis-prediction and
reassigns the pending work among the k completed workers. We choose 15% based on
the average error from our speed prediction algorithm (16.7%).
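A minimal Python sketch of this timeout rule is shown below; the function and variable names are hypothetical, and the interpretation of the 15% slack as 1.15 times the average response time is an assumption made only for illustration.

    # Sketch: decide which pending work to reassign after the first k workers finish.
    def handle_stragglers(response_times, assignments, k, slack=0.15):
        # response_times: {worker_id: seconds, or None if the worker has not responded}
        # assignments:    {worker_id: list of pending row indices assigned to that worker}
        finished = sorted(t for t in response_times.values() if t is not None)[:k]
        avg = sum(finished) / len(finished)          # average response time of the fastest k
        budget = (1.0 + slack) * avg                 # allow 15% beyond the average
        late = [w for w, t in response_times.items() if t is None or t > budget]
        return [row for w in late for row in assignments[w]]

    # Example: worker 3 never responds, so its rows get redistributed among workers 0-2.
    rows_to_reassign = handle_stragglers({0: 1.00, 1: 1.10, 2: 1.05, 3: None},
                                         {0: [], 1: [], 2: [], 3: [30, 31, 32]}, k=3)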
3.3.4 Robustness of S2C2

Coded computing with S2C2 is robust and can tolerate the same number of stragglers
as conventional coded computing because:
Algorithm 1 General S2C2 algorithm
Lines with # are comments
Input: List (U) of speeds (u_i) of worker nodes; n, k of the coding; number of rows per node (numRowsPerNode)
Output: Computation assignment per node i: alloc_i

# over-decompose each partition into chunks of rows
maxChunksPerNode = sum(u_i)
# minimum total chunks needed for correct decoding
totalChunks = k * maxChunksPerNode
# Sort the workers by their speed U in descending
# order and assign the number of chunks to be computed
for each node i in sorted U do
    # Allocate a number of chunks to node i proportional to its speed
    chunksForNode[i] = (u_i / sum_{j=i..n} u_j) * totalChunks
    # Update total chunks left to be computed
    totalChunks = totalChunks - chunksForNode[i]
# Assign the exact chunks that will be computed
chunkBegin = 0
for each node i in sorted U do
    chunkEnd = chunkBegin + chunksForNode[i]
    chunks_node_i = [chunkBegin, chunkEnd]
    chunkBegin = chunkEnd % maxChunksPerNode
# Convert chunks to exact row indices
alloc_i = convert(chunks_node_i)
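The Python sketch below mirrors Algorithm 1 under assumed integer rounding; the function name and the conversion of chunk indices to row ranges are illustrative choices, not the dissertation's implementation.

    def general_s2c2_allocation(speeds, k, rows_per_node):
        n = len(speeds)
        max_chunks = int(round(sum(speeds)))         # chunks per stored partition (over-decomposition)
        total = k * max_chunks                       # minimum chunks needed for correct decoding
        order = sorted(range(n), key=lambda i: speeds[i], reverse=True)
        remaining_speed = float(sum(speeds))
        chunks_for = {}
        for i in order:                              # share proportional to speed, capped at the partition size
            share = min(int(round(total * speeds[i] / remaining_speed)), max_chunks)
            chunks_for[i] = share
            total -= share
            remaining_speed -= speeds[i]
        alloc, begin = {}, 0
        rows_per_chunk = rows_per_node / max_chunks  # fractional boundaries are fine for this sketch
        for i in order:                              # walk a ring so each chunk index is covered k times
            chunk_ids = [(begin + c) % max_chunks for c in range(chunks_for[i])]
            alloc[i] = [(cid * rows_per_chunk, (cid + 1) * rows_per_chunk) for cid in chunk_ids]
            begin = (begin + chunks_for[i]) % max_chunks
        return alloc                                 # worker -> list of (start_row, end_row) ranges

    # e.g. four equally fast workers under (4,2)-MDS, 16 stored rows per worker:
    alloc = general_s2c2_allocation([2, 2, 2, 2], k=2, rows_per_node=16)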
• Data distribution in S2C2 is identical to the data distribution in conventional coded
computing.
• The worst case occurs when the speed prediction for S2C2 completely
fails. In this case, general S2C2, along with the timeout mechanism described in
section 3.3.3, essentially turns into conventional coded computing.
3.4 Extension to bi-linear coded computing
S2C2, being a workload distribution strategy, can be extended to many different coded
computations. In this section we demonstrate how to apply it on top of the popular poly-
nomial codes [114]. We refer the reader to that paper for the mathematical underpinnings
of polynomial codes; we only provide a brief overview to demonstrate how they work
and how S2C2 can be applied to such generalized codes. The idea of polynomial codes
is to encode data by computing polynomial functions.
Consider computing AB on two matrices A and B in a distributed manner using a
cluster with n nodes. Matrix A is divided into a sub-partitions along rows, and matrix
B is divided into b sub-partitions along columns. Then n encoded partitions each for
A and B are computed from these sub-partitions. Let us consider the scenario where n =
5 nodes. In this scenario, a = b = 2, i.e., each matrix has 2 sub-partitions. A_0, A_1
are sub-partitions of A, and B_0, B_1 are sub-partitions of B. Computing AB is composed of
four partial computations A_0 B_0, A_0 B_1, A_1 B_0, A_1 B_1. Each encoded partition of A is of
the form ~A_i = A_0 + i A_1 and each encoded partition of B is of the form ~B_i = B_0 + i^2 B_1,
where i is the node index in {0, 1, .., n-1}. In this scenario, node 0 stores
~A_0 = A_0 + 0*A_1 and ~B_0 = B_0 + 0*B_1, node 2 stores ~A_2 = A_0 + 2A_1 and ~B_2 = B_0 + 2^2 B_1,
and so on. Each node computes the product of its two stored partitions. For instance,
node 2 computes A_0 B_0 + 2 A_1 B_0 + 2^2 A_0 B_1 + 2^3 A_1 B_1. To fully decode the four partial
computations A_0 B_0, A_0 B_1, A_1 B_0, A_1 B_1, we need coded computation results from
any 4 of the nodes. If none of the 5 nodes is a straggler, one node's computation is
wasted, similar to MDS coding.
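As an illustration of this n = 5, a = b = 2 example, the NumPy sketch below (with assumed matrix sizes) encodes the partitions, simulates one straggler, and decodes the four partial products by solving a 4 x 4 Vandermonde system, since each entry of a node's result is a degree-3 polynomial in the node index.

    import numpy as np

    m = 4                                            # illustrative partition size
    A = np.random.randn(2 * m, m)
    B = np.random.randn(m, 2 * m)
    A0, A1 = A[:m], A[m:]                            # row partitions of A
    B0, B1 = B[:, :m], B[:, m:]                      # column partitions of B

    # Encoding: node i stores A0 + i*A1 and B0 + (i^2)*B1 and multiplies them.
    results = {i: (A0 + i * A1) @ (B0 + i**2 * B1) for i in range(5)}

    # Decoding from any 4 responses (pretend node 2 is the straggler).
    nodes = [0, 1, 3, 4]
    V = np.vander(np.array(nodes, dtype=float), 4, increasing=True)
    stacked = np.stack([results[i].reshape(-1) for i in nodes])
    coeffs = np.linalg.solve(V, stacked)             # rows are A0B0, A1B0, A0B1, A1B1 (flattened)
    A0B0, A1B0, A0B1, A1B1 = (c.reshape(m, m) for c in coeffs)
    assert np.allclose(np.block([[A0B0, A0B1], [A1B0, A1B1]]), A @ B)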
In figure 3.5 we illustrate how our S2C2 framework can be applied on top of such a
polynomial coded bilinear computation. In figure 3.5, the cluster has n = 5 nodes. For
illustration purposes, each matrix partition ~A_i has 9 rows. A minimum of 4 responses
per row are needed for successful computation of AB. The relative speeds of the
nodes are {2, 2, 2, 2, 1}. Node 4 is a partial straggler. Conventional polynomial coded
computing ignores the computation from this node. However, general S2C2 does not;
it allocates partial work to it. General S2C2 allocates {8, 8, 8, 8, 4} rows to the 5 nodes,
respectively, as highlighted by the bounding rectangles in each worker node. The last
worker (speed 1) is shown to compute the last set of rows. The product of each row with
~B_i is computed by exactly 4 nodes and sent to the master node.
3.5 Implementation
At the beginning of computation, the master node encodes the matrix data and distributes
the encoded sub-matrices to the corresponding worker nodes. For MDS coding we are
dealing with just a single matrix, but with polynomial codes we have two matrices to
encode, and the two coding strategies use different encodings as described earlier. At the
start of each iteration of our applications, the master node distributes the vector x to all
worker nodes. At the end of each iteration, the master node receives the sub-products from
the worker nodes, decodes them, and constructs the result vector.
Each worker node has two separate running processes, one for computation and one
for communication. The computation process on the worker node performs the appro-
priate computation on the encoded data: either a matrix-vector operation in the MDS setting or
a Hessian matrix computation in the polynomial setting. The communication process is in
charge of receiving input data and work assignment information from the master node,
sending the partial product, and controlling the start and stop of the computation process
at the worker node.
3.5.1 LSTM based speed prediction model
We used the speed data measured from our experiments in the motivation section as the
dataset for evaluating several prediction models. The train/test dataset split is 80:20.
We evaluated several LSTM (Long Short-Term Memory [47]) and Auto Regressive Inte-
grated Moving Average (ARIMA) models to predict the speeds. Among ARIMA mod-
els, we evaluated using ARIMA(1,0,0), ARIMA(2,0,0) and ARIMA(1,1,1) models. We
found that the ARIMA(1,0,0) model, which uses just the speed from the past iteration, pro-
vided the highest prediction accuracy among all ARIMA models. Since this indicates
that using the speed from the past iteration is enough, the evaluated LSTM models have a
1-dimensional input. The dimension of the hidden state is a hyperparameter and we ex-
perimented with different values. The best performing LSTM model consists of a
single-layer LSTM with a 4-dimensional hidden state with tanh activation, and 1-di-
mensional input and output. The prediction accuracy of this LSTM model is better than
that of the ARIMA(1,0,0) model. The LSTM model predicts the speeds of the nodes within
83.3% of their actual values. In statistical terms, the Mean Absolute Percentage Error of
the model on the test set is 16.7%. This prediction error is better than ARIMA(1,0,0) by
5%. This LSTM model is used to predict the speed of the nodes once every iteration. The input to
the model is the speed of a node from the previous iteration and its output is the speed predic-
tion for the next iteration. The LSTM model computation takes 200 microseconds for
each node.
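A hedged PyTorch sketch of such a predictor is shown below: a single-layer LSTM with a 1-dimensional input, a 4-dimensional hidden state (tanh activation by default), and a 1-dimensional output. The training step, loss, and synthetic data are assumptions added only to make the sketch runnable.

    import torch
    import torch.nn as nn

    class SpeedLSTM(nn.Module):
        def __init__(self, hidden=4):
            super().__init__()
            self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)

        def forward(self, x):                        # x: (batch, seq_len, 1) past speeds
            h, _ = self.lstm(x)
            return self.out(h[:, -1])                # predicted speed for the next iteration

    model = SpeedLSTM()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    prev = torch.rand(32, 1, 1)                      # toy batch: speeds from the last iteration
    nxt = 0.95 * prev.squeeze(-1) + 0.05 * torch.rand(32, 1)   # synthetic next-iteration targets
    loss = loss_fn(model(prev), nxt)
    optimizer.zero_grad(); loss.backward(); optimizer.step()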
3.5.2 S2C2 specifics

The basic S2C2 strategy needs information on which nodes are stragglers. The general S2C2
strategy needs information on the relative execution speeds of all nodes, and it adjusts the
work assignment to the worker nodes according to their speed. To obtain this information
we rely on the iterative nature of our algorithms. Initially, the master node starts with the
assumption that all the worker nodes have the same speed, and this is provided as input to
the current S2C2 strategy. The master then distributes the work assignment calculated by
S2C2 to each worker node. Upon receiving the partial products from the worker nodes,
the master node also records the response time t_i(iter) for each worker node i corresponding
to iteration iter. If the number of rows computed at worker i is l_i(iter), then the speed
s_i(iter) of each worker node for the current iteration is computed as l_i(iter) / t_i(iter). These values
from all nodes are provided as a batch input to the trained LSTM model, which predicts
the speeds for the next iteration. The predicted speeds are fed into the general S2C2 strategy
to generate the computational work assignment at each worker node for iteration (iter +
1). Thus S2C2 automatically adapts to speed changes at the granularity of an iteration.
3.5.3 Computing applications
We evaluated S2C2 on MDS coding using the following linear algebraic algorithms: Logis-
tic Regression, Support Vector Machine, Page Rank, and Graph Filtering. Graph ranking
algorithms like Page Rank and graph signal processing algorithms employ repeated
matrix-vector multiplication. Calculating page rank involves computing the eigenvector
corresponding to the largest eigenvalue, which is done using the power iteration algorithm;
graph filtering operations such as the n-hop filtering operations employ n iterations of
matrix-vector multiplication over the combinatorial Laplacian matrix. We evaluate S2C2
on both these algorithms. We further evaluate S2C2 on polynomial coding for comput-
ing the Hessian matrix as described in [41]. The Hessian computation is of the form
A^T diagonal(x) A, where diagonal(x) refers to a matrix composed of the elements of vector
x on its diagonal.
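For reference, the two computational kernels named above reduce to the NumPy operations sketched below; the matrix sizes and data are illustrative only and are not taken from the evaluated workloads.

    import numpy as np

    n = 1000
    A = np.abs(np.random.randn(n, n))                # stand-in for a column-stochastic link matrix
    A /= A.sum(axis=0, keepdims=True)

    # Power iteration: the repeated matrix-vector product that S2C2 distributes per iteration.
    v = np.ones(n) / n
    for _ in range(50):
        v = A @ v
        v /= np.linalg.norm(v, 1)

    # Hessian-style bilinear computation A^T diagonal(x) A used with polynomial codes.
    x = np.random.rand(n)
    H = A.T @ (x[:, None] * A)                       # same as A.T @ np.diag(x) @ A, without forming diag(x)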
3.5.4 System setup
We evaluated the above computing applications in a datacenter-scale cloud setting on
the Digital Ocean cloud, where we employ 11 shared compute instances,
each with 1 virtual CPU and 2 GB of memory. We use Kubernetes to bootstrap a cluster
using these 11 nodes, with one being the master and the other 10 nodes being the worker
nodes. We then dockerize the computing applications and deploy them on the cloud.
3.5.5 Verification in a controlled cluster
For theoretical verification purposes we also evaluated all the applications and results
on our local cluster where we had the ability to precisely control the straggler behavior.
Our local cluster is composed of 13 identical servers. Each server consists of two Intel
Xeon CPU E5-2630 v3 each with 8 cores (8 threads, 2.40 GHz), 20 MB of L3 cache,
running Centos linux version 7.2.1511. Each machine has 64GB of DRAM. All the
servers have access to a shared storage node of size 1 TB. All the servers are connected
to one another through a Mellanox SX6036 FDR14 InfiniBand switch with a bandwidth of
56 Gbps. We use one of the nodes as the master node and the other 12 nodes as worker nodes.
3.6 Evaluation
3.6.1 Results from controlled cluster
Baseline strategies: We implemented and evaluated two baseline strategies in our
controlled cluster environment. Our first baseline is an enhanced Hadoop-like uncoded
approach that is similar to LATE [117]. In this baseline we used a 3-repetition strategy
with up to six tasks that are speculatively launched. The strategy provides 3 copies of
34
data at 3 randomly selected nodes in the distributed system. This enhanced Hadoop strat-
egy does not enforce strict data locality during speculation, unlike traditional Hadoop,
and allows data to be moved at runtime if a task needs to relaunched on a node that does
not have a copy of the data. We allow up to six tasks to be speculatively launched. Fur-
thermore, the speculative task assignment strategy always tries to find a node that already
has a copy of the data before moving the data, thereby allowing data communication only
when absolutely needed.
The second baseline is the MDS-coded computation proposed in prior work [64] and
described previously. The two MDS-coding schemes we evaluated in the controlled
cluster are: (12,6)-MDS as the conservative scheme, and (12,10)-MDS as the opti-
mistic scheme. No data movement is allowed in these schemes during computation. The
purpose of showing results for (12,6)-MDS coding is simply to show the robustness of
our scheme in the presence of such high redundancy. We do not expect that system de-
signers will provision 2x computation redundancy in practice. Hence, we will highlight
(12,10)-MDS results in our discussion in the next section.
Results: We evaluated the performance of S2C2 against the baseline strategies for vary-
ing straggler counts in our 12-worker-node cluster; these different cases correspond to
the X-axis in figures 3.6 and 3.7. Each bar in the plots captures the average relative
execution time spent by the application over 15 iterations, normalized by the execution
time of the uncoded strategy when there are 0 stragglers in the cluster. The execution time
includes the time worker nodes spend computing on their data partitions, the time spent
in communication between the master and worker nodes, and the time spent by the master node in
decoding the results from workers. The encoding and distribution of the matrix data are not
shown in the figures as they are a one-time cost.

Figure 3.6: LR execution time comparison (bars: uncoded with 3-replication and up to 6 speculative jobs; (12,10)-MDS coded computation; (12,6)-MDS coded computation; S2C2 with (12,6)-MDS assuming same speed; S2C2 with (12,6)-MDS knowing the exact speeds)
3.6.1.1 Logistic Regression and SVM
We evaluated gradient descent for logistic regression (LR) and SVM. The results for
both of them are very similar and hence we focus the discussion on evaluations of LR.
For our experiments we use the publicly available gisette dataset from the UC Irvine machine
learning repository [71]. The data in this dataset is duplicated to create a larger dataset.
The final size of data partition in each node is 760 MB. All the worker nodes pre-load
the assigned data partitions into memory before beginning the computation. Only one
processor thread in each worker node is used for computation.
In our controlled cluster environment, we define a straggler as a node that is at least
5x slower than the fastest performing node. Even non-straggler nodes may have
up to 20% variation between their processing speeds. We compare the three baselines
with the two versions of S2C2: basic S2C2, which does not consider this 20% variation in the
speeds of the non-straggler workers and treats all the non-straggler workers as having
equal speed, and the general S2C2 algorithm, which takes the 20% speed variation into account and
allocates different computational work to non-straggler workers accordingly. The results
are shown in Figure 3.6.
As shown in figure 3.6, when there are no stragglers, all strategies have low ex-
ecution times, with S2C2 having the lowest. The generalized S2C2 algorithm has the
lowest execution time even with zero stragglers because it takes advantage of the 20%
speed variations to assign different amounts of work to different nodes. As the number
of stragglers increases, the execution time of the uncoded strategy increases since the slower
jobs need to be detected and re-executed, whereas in coding-based strategies there is
no need for re-execution. Once the number of stragglers exceeds 2, the uncoded strat-
egy's performance starts to degrade, reaching 3x the execution time of the no-
straggler scenario. The super-linear degradation is because data partitions need to be
moved across worker nodes prior to re-execution, and communication costs start to
play a role in the overall performance loss. Note that when the number of stragglers gets
closer to the replication count, there is a higher probability that the node where the
re-execution happens does not have a replica. Hence, data movement is in the critical
path of execution.
For the (12,10)-MDS coded computation, the execution time remains steady with one
and two stragglers but grows super-linearly once the straggler count exceeds two, since it
is designed to protect against a maximum of only two stragglers. A more redundant MDS
coding strategy is the primary option to deal with a higher number of stragglers. Hence,
ideally the programmers should not be burdened with choosing an aggressive code with
less redundancy and then have to pay a significant penalty if the selected redundant code
is not enough. S2C2 solves this problem by allowing the programmers to select a more
aggressive redundancy in their codes and yet not pay the penalty when there are fewer
stragglers, as shown in the (12,6)-MDS code result.

Figure 3.7: PR execution time comparison (bars: uncoded with 3-replication and up to 6 speculative jobs; (12,10)-MDS coded computation; (12,6)-MDS coded computation; S2C2 with (12,6)-MDS assuming same speed; S2C2 with (12,6)-MDS knowing the exact speeds)
Both versions of (12,6)-MDS coded S2C2 not only are able to provide robustness
against up to two stragglers in the cluster, but also are able to reduce the computation
overhead due to the use of coding when there are fewer or no stragglers in the cluster.
By taking the various speeds of the non-straggler worker nodes into account, the general
version of the S2C2 strategy is able to outperform the conservative (12,6)-MDS coded
computation strategy even more than the basic version of S2C2. This result indicates that
even if we cannot take into account the precise variation in the processing speeds of var-
ious non-straggler nodes, the basic S2C2 algorithm provides excellent performance and
robustness. However, if the processing speed information is more accurately gathered,
the generalized S2C2 can squeeze out the hidden compute slack in the 20% speed variation
and provide further performance improvements without compromising robustness.
Figure 3.8: Execution time comparison on the cloud when S2C2 has a low mis-prediction rate (normalized execution times: over-decomposition 1.00, MDS(8,7) 1.36, MDS(9,7) 1.31, MDS(10,7) 1.39, S2C2(8,7) 1.23, S2C2(9,7) 1.09, S2C2(10,7) 1.00)
3.6.1.2 Page Rank and Graph Filtering
We evaluated page rank (PR) and graph filtering. The results for both of them are very
similar and hence we focus the discussion on Page Rank. We used the ranking dataset
available from [2]. This dataset is duplicated to create a larger dataset that is used in
evaluation.
The execution time for page rank is plotted in Figure 3.7. Similar to the logistic regression
results, the S2C2 algorithms significantly outperform the baseline strategies. The general
S2C2 algorithm reduces the execution time compared to basic S2C2 in all scenarios.
3.6.2 Results from industrial cloud deployment
In this section we discuss results from our experiments on the Digital Ocean cloud. Note
that in this setup we can no longer control the speed variations or the presence or absence
of a straggler. Instead we simply rely on the inherent speed variations of the 10 worker
nodes we used in the cloud environment to quantify the benefits of S2C2. In our experi-
ments we evaluate and compare the performance of the general S2C2 strategy against MDS
coded computation and an over-decomposition strategy based on Charm++ [6, 62]. We
evaluated S2C2 and MDS coded computation under (10,7), (9,7), and (8,7)-MDS codes.

Figure 3.9: Per-worker wasted computation effort with a low mis-prediction rate (fraction of wasted computation per worker, (10,7)-MDS vs. S2C2)
Charm++ based over-decomposition baseline: In the cloud setting, we evaluated
an over-decomposition baseline strategy inspired by Charm++ [6, 62]. In our implemen-
tation we combine over-decomposition and speed prediction. We over-decompose each
data partition by a factor of 4. The data is divided into 40 partitions, with each of the
10 workers receiving 4 partitions. The data is also replicated by a factor of 1.42, sim-
ilar to the replication in (10,7)-MDS coding. The additional partitions are distributed in a
round-robin fashion across the 10 workers. The master node uses predictions from the speed
model to do load balancing and transfer of partitions between workers during compu-
tations. This is better than the uncoded baseline strategy used in our controlled cluster
environment since it allows for finer-grained data transfer.
During the course of our experiments we observed different mis-prediction rates from
the LSTM speed prediction model. We show and discuss the performance gains from the
experimental conditions where we observe the best and worst case mis-prediction rates.
The performance results obtained across various applications are similar (as has been
shown also in the local cluster setting). We focus on the SVM results in this section.
3.6.2.1 Results in low mis-prediction rate environment
The average relative execution times for 15 iterations of SVM are shown in figure 3.8
when we observe a 0% mis-prediction rate for worker speeds. Generally this happens
when there are no significant variations in speeds between the nodes. The execution
times of all strategies are normalized by the execution time of (10,7)-S2C2. First, we
can observe that the over-decomposition approach performs better than MDS coded com-
putation. This result is expected since the over-decomposition strategy utilizes all 10
worker nodes to compute the result and each worker processes 1/10 of the data, whereas
each worker in the (10,7), (9,7), and (8,7) MDS-coded computation scenarios processes
1/7th of the data. Next, we observe that all three variations of MDS-coded computa-
tion show similar execution times. In all cases the work performed by a single worker
remains the same and only the results from the fastest 7 workers are used by the master. Over-
decomposition performs similarly to (10,7)-S2C2 in this environment since there is no
additional data movement during computations.
For all 3 data coding variations, S2C2 outperforms regular MDS coded computation.
Further, the performance of S2C2 increases as the redundancy is increased. This is because the
work done by a single worker decreases as the redundancy is increased. (10,7)-S2C2 outper-
forms the (10,7)-MDS coded computation by 39.3%. (10,7)-S2C2 performs best over
(10,7)-MDS coded computation when all 10 workers are always fast during execution;
in this scenario S2C2 uses all 10 worker nodes while MDS still relies only on 7 worker
nodes. The exact reduction would be (10 - 7)/7 = 42.8%. S2C2 with a 0% mis-prediction rate
captures this best possible reduction in execution time.

Figure 3.10: Execution time comparison on the cloud when S2C2 has a high mis-prediction rate (normalized execution times: over-decomposition 1.19, MDS(8,7) 1.34, MDS(9,7) 1.24, MDS(10,7) 1.17, S2C2(8,7) 1.18, S2C2(9,7) 1.11, S2C2(10,7) 1.00)
Figure 3.9 plots the wasted computation measured at each worker node during
execution of the conservative (10,7)-MDS coded computation and (10,7)-S2C2. Since
the mis-prediction rate is 0%, there is no wasted computation effort in S2C2. In this
execution, workers 1, 3, 7, and 8 have high wasted computation with (10,7)-MDS coded
computation. Worker 1 has close to 90% of its computation wasted. Further analysis
showed that in this experiment worker 1 is only slightly slower than the fastest 7 workers,
but the MDS-coded computation ignores the execution of the 3 remaining workers after
it receives results from the fastest 7 workers.
3.6.2.2 Results in high mis-prediction rate environment
During our experiments with shared VM instances on Digital Ocean, we observed that the
highest mis-prediction rate is 18%. Generally these mis-predictions happen when there
are significant and sudden variations in speeds over time.

Figure 3.11: Per-worker wasted computation effort with a high mis-prediction rate (fraction of wasted computation per worker, (10,7)-MDS vs. S2C2)

Under this condition, the av-
erage execution times for 15 iterations of SVM are shown in figure 3.10. (10,7)-MDS
coded computation performs better than (9,7) and (8,7)-MDS coded computation be-
cause the probability of any 7 out of 10 nodes being fast is higher than any 7 out of 9
nodes or 7 out of 8 nodes being fast. (8,7)-S2C2 outperforms (8,7)-MDS coded com-
puting by 13%, whereas (9,7)-S2C2 outperforms (9,7)-MDS coded computing by 11%
and (10,7)-S2C2 outperforms the (10,7)-MDS coded computation approach by 17%. As
expected, (10,7)-S2C2 outperforms both the (9,7) and (8,7)-S2C2 variants since the oppor-
tunities to do load balancing increase as the redundancy increases. The observed per-
formance of the over-decomposition approach is lower than the performance of (10,7)-
S2C2 owing to the extra data movement costs for load balancing during computations,
whereas in (10,7)-S2C2 there are no extra data movement costs during computations.
The wasted computation efforts measured at each worker node under (10,7)-
coding are shown in figure 3.11. Due to a relatively high mis-prediction rate, S2C2 also
incurs wasted computation among the worker nodes when the compute tasks of slow
nodes are cancelled and reassigned to other worker nodes. However, the conservative
(10,7)-MDS approach incurs higher wasted computation since it also ignores the slowest
3 nodes' computation efforts. On average, the conservative MDS scheme incurs 47%
more wasted computation effort.
3.6.2.3 Results with S2C2 on polynomial coding

We evaluate S2C2 applied on polynomial coding while performing a Hessian matrix
computation of the form A^T diagonal(x) A, as described in section 3.5.3. The dimensions
of matrix A are 6000 x 6000. The results collected under low and high mis-prediction
rates are shown in figure 3.12. In these experiments, the cluster consists of 12 nodes. The
matrices A and A^T are each partitioned into 3 sub-matrices, encoded, and the encoded parti-
tions are distributed to the 12 nodes. Each node computes on 2 encoded partitions.
Results from any 9 nodes are enough to compute the Hessian. In this setup, S2C2
reduces the overall computation time by 19% in the low mis-prediction rate environment.
The maximum possible reduction is (12 - 9)/9 = 33.3%. The part of the Hessian computation
where each node has to first compute diagonal(x) ~A_i is not influenced by S2C2. As a
result, the gains from using S2C2 are lower than expected. Under the high mis-prediction
rate environment, S2C2 reduces the overall computation time by 14%.
3.6.2.4 Scalability studies on a larger cluster
We performed experiments on a larger cluster with 50 worker nodes and one master
node. Due to resource constraints we limit this scalability experiment to the S2C2 and MDS
coded computing approaches running SVM. We compared S2C2 and MDS coded com-
putations under (50,40)-MDS codes while performing gradient descent for SVM. The
results collected under low and high mis-prediction rates are shown in figure 3.13. For
S2C2, the maximum reduction in execution time over (50,40)-MDS coded computation
would occur when all 50 workers are always fast during execution. The exact reduc-
tion would be (50 - 40)/40 = 25%. S2C2 reduces the overall computation time by 25% in the low
mis-prediction rate environment. In the high mis-prediction rate environment, S2C2
reduces the overall computation time by 12%.

Figure 3.12: S2C2 on polynomial codes (normalized execution time; low mis-prediction rate: conventional polynomial codes 1.19, polynomial codes with S2C2 1.00; high mis-prediction rate: conventional polynomial codes 1.14, polynomial codes with S2C2 1.00)

Figure 3.13: S2C2 performance on a 51-node cluster (normalized execution time; low mis-prediction rate: MDS(50,40) 1.25, S2C2(50,40) 1.00; high mis-prediction rate: MDS(50,40) 1.12, S2C2(50,40) 1.00)
The evaluation results presented in this section demonstrate the effectiveness of S2C2
across different coded computation schemes and across different scales.
3.7 Related work
Prior coded computing works such as [28, 64, 70, 93, 104, 114] provide resiliency
to stragglers and can be utilized to mitigate tail latency in distributed computing. In
particular, several of these works target distributed machine learning. There have been a
few recent works in the coded computing literature that exploit the computations of slow
nodes or partial stragglers [112, 120]. However, the key ingredient of our proposed S2C2
is that it dynamically adapts the computation load of each node to its estimated speed
from the previous rounds of computation.
Besides coded computation, different techniques have been proposed for straggler
mitigation in distributed systems. Several of these techniques in the literature are reactive,
i.e., they wait until many tasks finish their execution before detecting and mitigating
stragglers. The authors of [9] utilize real-time progress reports to detect and cancel
stragglers early. The authors of [117] use the LATE algorithm to improve straggler
detection and speculative execution in the Hadoop framework. Adrenaline [49] identifies
and selectively speeds up long queries by quick voltage boosting. The authors of [22] use
software techniques such as selective replication of straggling requests. S2C2 differs
from these techniques because it is a pro-active approach to straggler mitigation. In
[57, 67, 118], the authors explore system sources of tail latency and implement
mechanisms to eliminate these causes. The authors of [24, 66, 76, 121] focus on improving
resource efficiency while providing low latency. These works are complementary to
S2C2 and can be used along with S2C2. Using replicated tasks to improve response
times has been explored in [8, 13, 31, 65, 97, 107]. This approach involves launching
multiple copies of each task across workers, using results from the fastest copy, and
canceling the slower copies. This approach is pro-active like S2C2, but it needs multiple
replicas of all the data, resulting in large compute and storage overheads. S2C2, on the
other hand, uses efficient coded replication and has significantly lower overheads. Another
strategy used for straggler mitigation is arriving at an approximate result without waiting
on the stragglers [35, 77]. S2C2 does not do approximation and computes the precise
result.
The speed prediction algorithm in S2C2 uses an LSTM model to predict the speeds of the
nodes. Prior works associated with performance prediction include [26] and [109]. In [26],
Dinda et al. described and evaluated the Running Time Advisor (RTA), a system that can
predict the running time of compute-bound tasks. For predicting running time, linear
time series analysis predictions of host loads are used.
In [109], Wolski et al. developed the Network Weather Service (NWS) to provide fore-
casts of network performance and the available CPU percentage at each compute node.
NWS uses time series models, like ARIMA models, for forecasting. It maintains and
updates multiple models, and dynamically selects the best performing model to provide
forecasts. The speed prediction algorithm in S2C2 also takes a time series approach, similar
to these prediction algorithms, but uses an LSTM model to forecast the next running
time.
3.8 Chapter summary
In this chapter, we described S2C2, which efficiently tolerates speed variance and un-
certainty about the number of stragglers in the system. S2C2 distributes coded data to
nodes and, during runtime, adaptively adjusts the computation work per node. Thereby
it significantly reduces the total execution time of several applications. We demon-
strate a ~39.3% reduction in execution time in the best case through our evaluations using
machine learning and graph processing applications. We conclude that speed-adaptive
workload scheduling as done by S2C2 effectively reduces the overhead in coded compu-
tation frameworks and makes them more effective in real deployments.
Chapter 4
Background on CNN based image classification
The collage inference and Origami inference techniques presented next in this disserta-
tion use convolutional neural network (CNN) based image classification and object
detection approaches. This chapter provides a brief background on CNNs.
4.0.1 Image classification
Image classification is a fundamental task in computer vision. In image classification,
the goal is to predict the main object present in a given input image. There are a variety
of algorithms, large datasets, and challenges for this task. A widely known challenge is
the Imagenet Large Scale Visual Recognition Challenge (ILSVRC). It’s training dataset
consists of 1.2 million images that are distributed across 1000 object categories. Since
2012 [61], the improvements in accuracy of image classification tasks have come from
using CNNs. Some of the popular CNN architectures are: ResNet [43], Wide ResNet
[116], Inception [103], MobileNet [48], VGGNet [100].
Figure 4.1: Architecture of Alexnet [61] Convolutional Neural Network (CNN)
4.0.2 Object detection
Given an input image, the object detection task involves predicting two things: the
classes of all objects present in the image and the locations of those objects in the image. The
location information is predicted as a rectangular bounding box within the image. There
are two methods to perform object detection using CNNs: region-based detectors, which pre-
dict object locations in one stage followed by object classification in the next stage
[21, 33, 34, 94], and unified or single-shot object detection [30, 75, 91, 92, 98]. The single-shot
object detection models have lower inference latency while maintaining accuracy similar
to that of the region-based detectors.
4.0.3 Convolutional Neural Networks
CNNs are a class of neural network architectures used for processing visual data, i.e.,
images [63] [61]. A CNN is generally composed of 3 types of layers: convolution lay-
ers, pooling layers, and fully connected (dense) layers, along with non-linear activation functions
such as the Rectified Linear Unit (ReLU) and Sigmoid. The architecture of the Alexnet CNN [61] con-
taining the 3 types of layers is shown in figure 4.1. An input to the CNN is processed
using a sequence of many layers to produce the corresponding output. Each layer, along
with its activation function, performs a non-linear transformation of its input. A convolu-
tion layer consists of multiple convolutional filters. During computation, each filter is
slid across the input image to generate a corresponding activation map (or feature map)
as an output. Convolution filter operations can be expressed as matrix multiplications.
Each convolution filter learns to detect a specific pattern (or feature) in its input. After
training, it has been observed that the convolution filters in the initial layers detect sim-
pler features like edges and colors, whereas convolution filters in the deeper layers detect
complex features like noses and eyes. Pooling layers are used for down-sampling the size
of the intermediate activation maps and are placed between sequences of convolution layers.
Fully connected layers are located towards the end of the CNN, and they generate the values
that are used to classify the object in the image.
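A minimal PyTorch sketch of these three layer types (with illustrative sizes, not the Alexnet of Figure 4.1) is shown below.

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution layer with 16 filters
        nn.ReLU(),                                   # non-linear activation
        nn.MaxPool2d(2),                             # pooling layer: 32x32 -> 16x16
        nn.Conv2d(16, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),                             # 16x16 -> 8x8
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, 10),                   # fully connected layer producing class scores
    )
    logits = model(torch.randn(1, 3, 32, 32))        # e.g. one 32x32 RGB image -> 10 class scores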
Chapter 5
Collage inference
In this chapter, we describe the collage inference technique in detail. In section 5.1,
we discuss the tail latencies observed while doing image classification in the cloud. Sec-
tion 5.2 discusses the limitations of existing techniques to reduce tail latency. In section
5.3, we propose the collage inference technique to reduce tail latency. We describe the
architecture of the collage models and implementation details in section 5.4. Section 5.5
provides experimental evaluations and comparisons of collage-cnn models to various
alternatives.
5.1 Characterizing tail latency
We measured the tail latency for image classification in the cloud. For this purpose,
we designed an image classification server that uses a ResNet-34 model to provide infer-
ence and created 50 instances of this server on the Digital Ocean cloud [1]. Each instance
is running on a compute node that consists of 2 CPUs and 4 GB of memory. Clients gener-
ate requests to all these servers. We measured the inference latency across these 50 nodes
while performing single image classification.

Figure 5.1: Inference latency distribution

The probability density function of the
latency is shown in figure 5.1. The average single image inference latency was 0.15
seconds whereas the 99-th percentile latency was 0.70 seconds. The 99-th percentile
latency is significantly (4.67x) higher than the mean latency. This long tail in latency
degrades Quality of Service (QoS) and impacts Service Level Agreements (SLAs) ad-
versely [12,22]. This motivates the need for techniques that can mitigate slowdowns and
lower the variation in latency.
5.2 Existing techniques and their limitations
One option to improve QoS in the presence of high latency variance or a straggler
node is to add redundancy in the form of over-provisioning of compute nodes. Consider
a system of 10 nodes over-provisioned by 1 node. This node would be running another
replica of the s-cnn. But it is difficult to know ahead of time which one of the N = 10 nodes
will be a straggler. As a result, deciding which one of the input requests to replicate
becomes difficult. One strategy is to duplicate the inference request sent to node i only
when node i is detected as a straggler. This is a reactive approach that requires ob-
serving a slowdown before launching a redundant request. For instance, a request may be
launched speculatively after waiting for an expected latency, similar to what is adopted in
Hadoop MapReduce frameworks [3, 23]. There are practical challenges in implementing
the reactive approach. First, from our measurements shown in figure 5.1, the inference
latency can be in the 10s to 100 milliseconds. As a result, speculative relaunch techniques
must be fast enough to be useful. Second, the image must be re-distributed to a new machine
for replicated execution. As a result, the reactive approach may increase the service latency,
depending on how long it waits for a response before speculating a
job. To avoid these challenges, the system can be over-provisioned by a factor of 2. That is,
for every one of the N nodes there will be a backup node, and every input request will be
duplicated. However, this approach increases the compute and communication costs by
2x.
Another technique using coded computing to address straggler mitigation in dis-
tributed inference [59] uses learned encoding and decoding neural networks to provide
redundancy. Briefly, the technique is as follows.
In a system consisting of N = 5 compute nodes, 1 node, say O, is added for redundancy.
Each of the N nodes executes a replica of the s-cnn. The model in node O takes as input
all 5 input images. These images are passed through a convolutional encoder net-
work and the outputs are then passed on to the s-cnn model. The outputs from the N + 1
models are fed to a decoder network composed of fully-connected layers. The output
from any straggler node is represented as a vector of zeros. The final output from the
decoder network generates the missing prediction. Both the encoder and decoder net-
works are trained through back-propagation. The training data consists of images and
the expected predictions under different straggler scenarios. This technique, when
evaluated on the CIFAR-10 dataset, shows a recovery accuracy of 80.74% for N = 2 nodes,
and the recovery accuracy is 64.31% for N = 5 nodes, when any one of the N nodes is a
straggler. One reason for the significant accuracy loss is that the encoding network does
not preserve the spatial information of the individual input images.
5.3 Collage inference algorithm
Technique: A critical insight behind collage inference is that the spatial information
within an input image is critical for CNNs to achieve high accuracy, and this information
should be maintained. Hence, we use a collage image composed of all the images as the
encoding. The encoding used by the collage-cnn is a simple spatial arrangement of the images
[Image_1, .., Image_i, .., Image_N] in a grid format so as to preserve the individual image
information, albeit at a reduced resolution. The collage-cnn model is a novel multi-
object classification model. The collage-cnn provides the predictions for all the objects
in the collage along with the location of each object in the collage. The predicted
locations are in the form of rectangular bounding boxes. By encoding the individual
images into a collage grid of images and using location information from the collage-
cnn predictions, the collage inference technique can replace the missing predictions from
any slow-running or failed nodes.

Figure 5.2: Collage inference algorithm (a load balancer/encoder forms a collage from the 9 incoming requests and sends it to the collage-cnn node, while 9 s-cnn nodes each classify a single image; the load balancer/decoder combines the s-cnn predictions with the decoded collage-cnn predictions to produce the 9 responses)
Since the goal of our work is to mitigate stragglers using a single collage-cnn model,
it is imperative that the collage-cnn, which acts as the redundant classification model, be
as fast as the single-image classification task.
Encoding: The encoding of individual images into a single collage image happens
as follows. Let a collage-cnn provide backup for N s-cnn model replicas that are
each running on a compute node. To encode the N images into a collage, we first create
a square grid consisting of √N × √N boxes. Each image that is assigned to an s-cnn
model running on compute node i is placed in a predefined square box within the collage.
Figure 5.3: Collage decoding scenarios ((a) ground truth boxes G1-G4 containing classes A-D; (b) scenario 1: predicted boxes P1-P4; (c) scenario 2: three predicted boxes; (d) scenario 3: five predicted boxes, with class B at 80% and class E at 70% confidence over the same cell)
Specifically, in the collage, each compute node i is assigned the box location i. This en-
coding information is used while decoding the outputs of the collage-cnn. From the outputs
of the collage-cnn, the class prediction corresponding to each bounding box i is extracted
using the collage decoding algorithm, and this prediction serves as a backup pre-
diction for compute node i. As N grows, more images must be packed into
the collage, which reduces the resolution of each image and can lower the accuracy of
the collage-cnn predictions. Note that this square grid encoding is one of many
possible encodings. More generally, inputs to the collage-cnn can be encoded as rectangular
grids of varying height and width.
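A minimal NumPy sketch of this square-grid encoding is shown below; the cell resolution, the nearest-neighbour resizing, and the function name are assumptions made only for illustration.

    import numpy as np

    def make_collage(images, cell=138):
        # images: list of N HxWx3 uint8 arrays, N a perfect square; image i goes to box i (row-major).
        g = int(np.sqrt(len(images)))
        assert g * g == len(images)
        collage = np.zeros((g * cell, g * cell, 3), dtype=np.uint8)
        for i, img in enumerate(images):
            r, c = divmod(i, g)
            ys = np.linspace(0, img.shape[0] - 1, cell).astype(int)   # naive nearest-neighbour resize
            xs = np.linspace(0, img.shape[1] - 1, cell).astype(int)
            collage[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = img[ys][:, xs]
        return collage

    # e.g. nine 224x224 inputs -> one 414x414 collage for a 3x3 collage-cnn
    collage = make_collage([np.zeros((224, 224, 3), np.uint8) for _ in range(9)])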
Example: Figure 5.2 shows a collage inference system consisting of N = 10 nodes.
In this illustration the load balancer receives 9 concurrent requests for image classifica-
tion. The load balancer manages 10 compute nodes on its back-end. Nine of the ten
nodes run replicas of an s-cnn model for single-image classification. The load balancer
forwards one inference request to each of these 9 s-cnn models. Concurrently, the load
balancer also acts as an image encoder by creating a collage from these 9 images. For
the collage encoding, each of the nine input images is lowered in resolution and inserted into
a specific location to form the collage image. The input image to node i goes into loca-
tion i in the collage image. This collage image is provided as input to node 10, which
runs the collage-cnn model. The predictions from the collage-cnn are processed using
the collage decoding algorithm. The output predictions from the ten nodes go to the load
balancer, which processes them and provides the final 9 predictions.
Decoding: The collage decoding algorithm extracts the best possible class predictions
for the N images from all the collage-cnn predictions. First, all the predictions with
confidence values less than a detection threshold are ignored by the algorithm. In our
experiments, we use a detection threshold of 0.15. We observed an increase in collage-
cnn accuracy as the threshold value was reduced down to 0.15, and lowering it further did not
improve accuracy. The decoding algorithm calculates the Jaccard similarity coefficient,
also referred to as Intersection over Union, of each predicted bounding box with each
of the N ground truth bounding boxes that are used in creating the collages. Let the area
of a ground truth bounding box be A_gt, the area of a predicted bounding box be A_pred, and the area
of the intersection between both boxes be A_i. Then the Jaccard similarity coefficient can
be computed using the formula: A_i / (A_gt + A_pred - A_i). The ground truth bounding box with the
largest similarity coefficient is assigned the class label of the predicted bounding box.
As a result, the image present in this ground truth bounding box is predicted as having
an object belonging to this class label. This is repeated for all the object predictions.
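The sketch below captures this decoding rule in Python; the bounding-box representation (x1, y1, x2, y2) and the function names are assumptions, not the dissertation's implementation.

    def iou(a, b):
        # Jaccard similarity of two boxes given as (x1, y1, x2, y2).
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    def decode_collage(predictions, grid_boxes, threshold=0.15):
        # predictions: list of (box, label, confidence); grid_boxes: the N ground truth cells.
        # Keeps, per cell, the label of the most confident prediction that overlaps it best.
        best = [(None, 0.0)] * len(grid_boxes)
        for box, label, conf in predictions:
            if conf < threshold:
                continue
            cell = max(range(len(grid_boxes)), key=lambda g: iou(box, grid_boxes[g]))
            if conf > best[cell][1]:
                best[cell] = (label, conf)
        return [label for label, _ in best]          # one backup prediction (or None) per node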
To illustrate the algorithm, consider example scenarios shown in figure 5.3. The
ground truth input collage is a 2x2 collage that is formed from four images. It has four
ground truth bounding boxes G1, G2, G3, and G4, which contain objects belonging to
classes A, B, C, and D, respectively. Note that these ground truth bounding boxes are
created by the load balancer while encoding a collection of images. In scenario 1, the
collage model predicts four bounding boxes P1, P2, P3, and P4 with predicted image
labels A, B, C, and D, respectively. In this scenario, P1 has the largest similarity
value with G1, P2 with G2, P3 with G3, and P4 with G4. So, the decoding algorithm
predicts class A in G1, class B in G2, class C in G3, and class D in G4. In scenario 2, three
bounding boxes are predicted by the model. Predicted box P1 is spread over G1, G2, G3,
and G4. The similarity value of P1 with box G1 is 1/3, with G2 is 1/7, with G3 is 1/7, and with G4 is 1/17.
So, the algorithm predicts class A in G1, an empty prediction in G2, class C in G3, and class
D in G4. In scenario 3, the collage model predicts 5 different bounding boxes. Assigning
classes A, C, and D to boxes G1, G3, and G4, respectively, is straightforward. But both box P2
and box P3 have their highest similarity values with ground truth box G2. Since box P2 has
higher confidence (80%) than box P3 (70%), the collage decoding algorithm predicts G2 as
containing class B.
Providing final predictions: The outputs from the collage decoding algorithm, along
with the predictions from all the s-cnn models, are provided to the load balancer. The load
balancer provides the final predictions as shown in figure 5.2. If the predictions from all
the s-cnn models are available, the load balancer just provides these predictions as the
final predictions and discards the collage-cnn outputs, since there were no slowdowns.
In the case where the prediction from any of the s-cnn models is not available, i.e., there is a
slowdown, the prediction from the collage-cnn corresponding to that s-cnn model is
used. It can be observed that the outputs from the collage-cnn model can be used to tolerate
more than one request slowdown. The predictions from the collage-cnn model can be
used in place of any missing predictions from the s-cnn models. In the rare scenario
where there is a slow s-cnn node and the corresponding prediction from the collage-cnn is
empty, or the collage-cnn model is also slow, the request to the slow s-cnn model is repli-
cated. The prediction from this replicated request is used by the load balancer process.
Resource overheads: A 2 x 2 collage-cnn works on a 2 x 2 collage composed from
4 single images. It is used in a system where four individual images are sent to four
compute nodes, each running an s-cnn model, while the 2 x 2 collage is sent to the node
running the 2 x 2 collage-cnn model. In this system, the overhead of running the collage-
cnn model is 25% of the compute resources. This overhead can be reduced by using a 3 x 3
collage-cnn, where one node provides redundancy for 9 nodes, each running an s-cnn
model. This system has approximately 11% overhead. As more images are combined
into a collage, the overhead of using the collage-cnn reduces. If the size of the collage
image is fixed, as more single images are packed into a collage the resolution of each
image gets reduced. This can reduce the accuracy of the collage-cnn models. We explore
this tradeoff between resource overheads, accuracy and latency of different collage-cnns
in the evaluation section.
Next we discuss the architecture of the collage-cnn and s-cnn models, the generation of
training images, and the training of the collage-cnn.
Figure 5.4: Collage-cnn architecture (input 416x416x3; 3x3 convolution layers with 16, 32, 64, 128, 256, 512, and 1024 filters interleaved with 2x2 maxpool layers, followed by 5x5x256, 3x3x512, and 3x3x210 convolution layers, producing a 3x3x210 output)
5.4 Collage-cnn architecture and implementation
5.4.1 S-cnn architecture
We used a pre-trained ResNet-34 model as the single image s-cnn model for the ImageNet-1k dataset. Input to the model is an image of resolution 224 x 224 and the output from the model is one of the 1000 possible class labels. This model has 33 convolutional layers with a fully connected layer at the end to provide class predictions. This model is taken from the PyTorch [88] model zoo. A pre-trained ResNet-32 model is used as the s-cnn model for the CIFAR-10 dataset. Input to the model is an image of resolution 32 x 32
and the output from the model is one of the 10 possible class labels. This model has 31
convolutional layers with a fully connected layer at the end to provide class predictions.
This model is taken from Tensorflow [4] model zoo. Both the s-cnn models are out of
the box pre-trained models.
5.4.2 Collage-cnn architecture
During inference, collage-cnn acts as the backup classification model for a collec-
tion of individual inference models. This requirement places a design constraint on the
collage-cnn. The latency of the collage-cnn should be lower than or equal to the latency
of s-cnn. Since we use ResNet-34 as the s-cnn, the collage-cnn should have a latency
lower than ResNet-34. Our collage-cnn architecture is inspired by the yolov3-tiny model
architecture [92], which is a fast single shot object detection model. We used the yolov3 architecture as the reference because it is also a computationally efficient architecture, i.e., it uses relatively fewer parameters to achieve high object detection accuracy compared to
alternate models. Collage-cnn architecture consists of a total of 10 convolutional layers.
The final 3 of the 10 convolutional layers are adapted depending on the grid size of the
input collage image. The input resolution to our collage-cnn model is 416 x 416 pixels,
which is the default input resolution of the yolov3-tiny model. The final output of the
network is a K x K x 210 tensor of predictions. The value of K is the grid dimension of
the input collage. If the shape of the input collages is 3 x 3, then the output of the net-
work is 3 x 3 x 210. Each output grid cell predicts two bounding boxes and confidence
scores for each of these boxes. Each bounding box prediction is of the form [x, y, width
of the box, height of the box, object confidence score, conditional class probabilities].
The (x,y) coordinates correspond to the center of the bounding box. For 100 classes (the
total number of image classes we used in this study), the total number of predictions per
grid cell is 210. The full network architecture for the 3x3 collage-cnn model is shown in
figure 5.4. Unlike yolov3-tiny model, there are no residual connections and upsampling
layers in collage-cnn. Yolov3-tiny uses these layers to detect extremely small objects
that may be present in an input frame. But in collage-cnn the objects to be classified in
collage images are large enough and there is no need for fine-grained object detection.
The collage-cnn is trained using a loss function based on the yolo [91] loss function.
The collage-cnn loss function penalizes errors in object confidence, classification and
bounding box predictions.
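The per-cell output size quoted above follows directly from these counts; the short arithmetic sketch below, with names chosen only for illustration, makes the bookkeeping explicit.

boxes_per_cell = 2                      # two bounding boxes per grid cell
box_fields = 5                          # x, y, width, height, object confidence
num_classes = 100                       # classes used in this study
per_cell = boxes_per_cell * (box_fields + num_classes)    # 2 * (5 + 100) = 210
grid_k = 3                              # grid dimension of a 3x3 collage
output_shape = (grid_k, grid_k, per_cell)                  # (3, 3, 210)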
The collage-cnn outperforms the yolov3-tiny model while classifying collage images.
Customization of the network architecture enables the collage-cnn to have higher top-1
accuracy on collage images. Top-1 accuracy is the fraction of inputs for which the model's top prediction, the one with the highest probability, is correct. The 3x3 collage-cnn, described
in figure 5.4 is 1.4% more accurate than the yolov3-tiny model while classifying 3x3
collage images. The 4x4 collage-cnn and 5x5 collage-cnn models are also 1.4% more
accurate while classifying 4x4 and 5x5 collage images, respectively. Since the collage-
cnn has a lower number of layers than the yolov3-tiny model, the inference latency of
the collage-cnn is 20% lower.
5.4.3 Training of collage-cnn models
The datasets we used in our experiments are CIFAR-10 and ImageNet-1k (ILSVRC-
2012). CIFAR-10 dataset consists of 60000 images divided into 10 object classes. 50000
images are provided for training and 10000 images for validation. ImageNet-1k dataset
consists of 1.2 million images divided across 1000 object classes for training and 50000
images for model validation. To train the collage-cnn, collages generated from images
of the training datasets are used. To validate the collage-cnn, collages generated from
images in validation datasets are used. In our experiments with ImageNet-1k dataset,
we use all the training and validation images belonging to 100 of the 1000 classes for
evaluations. The selected 100 classes correspond to 100 different objects. For example,
the 100 classes contain only one class of dog and not many classes of dogs as present
in the 1000 classes of ImageNet. We use 100 classes since it is close to the number
of classes used in object detection datasets like Microsoft Coco etc. for training and
evaluating object detection models like Yolo. Another reason to use 100 classes is that
the collage-cnn model can be trained in a reasonable time, and to explore the design
space using limited compute resources.
For the CIFAR-10 based collage dataset we uniformly and at random pick N images from the 50000 training images to create each collage in the training dataset. For the ImageNet-1k based collage dataset we first pick all the training images from the 100 classes. Then, we uniformly and at random pick N classes from the 100 classes. One image from each of these N classes is picked and all the N images are combined into a single image. This image is resized to generate the collage image. The N classes need not all be different and some collages have multiple images belonging to the same class.
The total possible number of collage images that can be generated is much larger
than the number of training images in the raw datasets. This is because there are many
permutations to choose from while combining different images into collages. This leads
to two advantages. First, it increases the size of the collage-cnn training dataset. Since
the task being learned by the collage-cnn is more challenging than the single image s-
cnn models, a larger training dataset can help increase the model’s accuracy. Second,
by generating a larger number of collage images for training, we try to prevent the model
from learning any spurious and fictitious inter-object correlations. In collage-cnn based
classification, objects belonging to any class can be present in any location in the image,
unlike in object detection.
In our experiments, the input resolution to the collage-cnn model is set to 416 x 416 pixels. So, while forming collages, each single image resolution is set to (416/√N) x (416/√N) pixels.
For the CIFAR-10 dataset, since each image is of the resolution of 32 x 32 pixels, the single image resolution is not reduced even for large N values. For the ImageNet-1k dataset,
the resolution of single images is lowered even in the 2 x 2 collages, since each image
is of the resolution of 224 x 224 pixels. We use the Python Imaging Library to lower the resolution of each image before forming the collage.
For each collage image, the target output of the collage-cnn model consists of N x 5 values. For each of the N images in the collage there are 5 target outputs: class label, x-coordinate of the center of the bounding box, y-coordinate of the center of the bounding box, width of the bounding box, and height of the bounding box. Given a raw dataset of training images,
a python script generates the collage images by appropriately combining single images
and scaling down the collage image size. The script also generates the 5 target values for
each image in the collage.
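The generation script itself is not reproduced in this chapter; the following Python sketch, using the Python Imaging Library as mentioned above, illustrates the process under assumed data structures (the function and variable names are hypothetical).

import random
from PIL import Image

def make_collage(image_paths_by_class, grid_k, collage_size=416):
    # image_paths_by_class: {class_label: [image paths]} for the selected classes
    # grid_k: grid dimension, so N = grid_k * grid_k images per collage
    tile = collage_size // grid_k                 # per-image resolution, i.e., 416 / sqrt(N)
    collage = Image.new("RGB", (collage_size, collage_size))
    targets = []                                  # 5 target values per single image
    classes = random.choices(list(image_paths_by_class), k=grid_k * grid_k)
    for idx, cls in enumerate(classes):           # classes may repeat, as noted above
        img = Image.open(random.choice(image_paths_by_class[cls])).convert("RGB")
        img = img.resize((tile, tile))            # lower the single-image resolution
        row, col = divmod(idx, grid_k)
        collage.paste(img, (col * tile, row * tile))
        # class label, box center x, box center y, box width, box height (normalized)
        targets.append((cls, (col + 0.5) / grid_k, (row + 0.5) / grid_k,
                        1.0 / grid_k, 1.0 / grid_k))
    return collage, targets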
5.5 Evaluation
In this section we first present the accuracy of collage-cnn compared to the ResNet
models. We then discuss the end-to-end system performance using collage-cnn. We end
by providing a comparison of the collage-cnn model to alternative redundancy models.
5.5.1 Training Parameters
The models are trained for 130K iterations using Stochastic Gradient Descent (SGD) with the following hyperparameters: learning rate of 0.001, momentum of 0.9, decay of 0.0005, and batch size of 64. While training the collage-cnn on ImageNet collages of shapes 4x4 and 5x5, a learning rate of 0.005 is used since using 0.001 caused divergence in SGD. Each model training is performed on a single compute node consisting of a GeForce Titan 1080 GPU equipped with 11 GB of GDDR memory, 32 GB of DRAM, and an AMD Ryzen 3 1200 quad-core processor. The training run time is approximately 26 hours for 130K iterations.
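For concreteness, the optimizer settings listed above can be expressed as follows; this is a PyTorch-flavored sketch written purely for illustration (it interprets decay as weight decay), with a placeholder module standing in for the collage-cnn, and is not the actual training script.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)  # stand-in for the collage-cnn
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001,             # 0.005 for 4x4 and 5x5 ImageNet collages
                            momentum=0.9,
                            weight_decay=0.0005)
batch_size = 64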
5.5.2 Accuracy of collage-cnn
Effects of increasing the training data: The size of the collage-cnn training data can be increased using the different permutations possible when generating collages, as described in the previous section. We performed experiments to measure the effects of using
more collage images during training. We observe consistent improvements in validation
accuracy.
1: While training a collage-cnn model using 4 x 4 ImageNet collages, as the training
set size is doubled from 52K (52000) to 104K images, validation accuracy increased by
6.95%.
2: While training a collage-cnn model using 3 x 3 ImageNet based collages, as the train-
ing set size is doubled from 26K to 52K images, the validation accuracy increased by
1%.
3: While training a collage-cnn model using 3 x 3 CIFAR-10 based collages, as the train-
ing set size is increased from 10K to 50K images, the validation accuracy increased by
1.38%.
While training the collage-cnn models, the number of images across all collages is larger
than the number of single training images present in the corresponding dataset. For in-
stance, while training a collage-cnn with CIFAR-10 dataset we created 50,000 collages.
For ImageNet the total number of single training images in the 100 classes is 120K. For
training the collage-cnn models 208K collages are used.
CIFAR-10 Dataset: We measured the top-1 accuracy of the collage-cnn and s-cnn models using validation images from CIFAR-10. The accuracy results are plotted in figure 5.5. The baseline s-cnn model has an accuracy of 92.2% whereas the 2x2 collage-cnn model has an accuracy of 88.91%. Further, it can be seen that the accuracy of the collage-cnn models decreases gradually as the number of images per collage increases. As stated earlier, when using the CIFAR-10 dataset the collage image resolution was not lowered, since even 5 x 5 CIFAR images can be fitted into a collage. Hence the gradual loss of accuracy is because the number of objects that must be detected by the collage-cnn increases, making the learning task of the collage-cnn model more challenging.
Figure 5.5: Accuracy on 100 classes of ImageNet-1k. Top-1 accuracy (%) by collage input architecture: ImageNet collage-cnn 81.33 (2x2), 78.3 (3x3), 73.78 (4x4), 69.46 (5x5); ImageNet s-cnn 80.72 (1x1); CIFAR collage-cnn 88.91 (2x2), 88.41 (3x3), 86.55 (4x4), 83.51 (5x5); CIFAR s-cnn 92.2 (1x1).
ImageNet Dataset: Next we measured the top-1 accuracy of the collage-cnn and s-cnn
models using validation images from ImageNet. Validation collages are generated using
the validation images from the 100 ImageNet classes. The resolution of each validation
image is lowered to fit into the collage. This is because each validation image has a
resolution of 224 x 224 and the collage image resolution is 416 x 416. The top-1 accuracy
results are plotted in figure 5.5. The accuracy of collage-cnn model for 2 x 2 collages is
similar to that of the baseline s-cnn model. This is likely because resolution of a single
image is only slightly reduced while generating 2 x 2 collages. As the number of images
per collage increases further the accuracy of collage-cnn model decreases gradually. It
can be observed that the rate of decrease in accuracy of collage-cnn model on ImageNet
is higher compared to CIFAR-10. As the number of single images per ImageNet collage
is increased, the resolution of each image gets reduced significantly, unlike with CIFAR-10. Hence, the reduced image resolution compounds the complexity of detecting more objects.
5.5.3 Stand alone inference using collage-cnn
On a cloud server with 2 CPUs and 4 GB of memory, we measure the latency of the collage-cnn model. It has approximately 10% lower inference latency than the s-cnn model. The inference latency is approximately 0.14 seconds. The reason for collage-cnn inference being faster
than the s-cnn inference is that the collage-cnn uses a wider and shallower network;
wider network enables more parallelism and shallow layers reduce serial dependencies.
The latencies for encoding images into collages for 3 x 3, 4 x 4 and 5 x 5 collage-cnn
models are 0.01, 0.013, and 0.017 seconds respectively. Corresponding collage decoding
times are 0.01, 0.028, and 0.047 seconds respectively. Both encoding and decoding times
increase as the number of images per collage increases. However, they are significantly
smaller than the inference latency.
5.5.4 End-to-end system performance with collage-cnn
We implemented an online image classification system and deployed it on the Digital
Ocean cloud [1]. The system consists of a load balancer front node and multiple server nodes running the s-cnn and collage-cnn models. The load balancer front node performs
multiple tasks. It collects requests from clients and generates single image classification
requests to the s-cnn models. It also creates a collage from these single images and sends a collage classification request to the collage-cnn. It can replicate any single image request if necessary. We use one Virtual Machine (VM) to host the front node and N additional VMs to serve requests using the s-cnn and collage-cnn models. We performed experiments with N = 9, 16, 25 nodes running s-cnn models and 1 node running the collage-cnn. Inference requests are generated using the validation images from the ImageNet dataset.

Figure 5.6: Comparison using 9 s-cnn models. (a) Baseline without replication, (b) Baseline with replication, (c) 3x3 collage system.
Figure 5.7: Comparison using 16 s-cnn models. (a) Baseline without replication, (b) Baseline with replication, (c) 4x4 collage system.
Figure 5.8: Comparison using 25 s-cnn models. (a) Baseline without replication, (b) Baseline with replication, (c) 5x5 collage system.
We compare collage inference (the third method) with two baseline methods. The implementation of each baseline is as follows.
1: The first method is where the front node sends requests to the s-cnn servers and waits till all of them respond. The front node does not replicate any slow and pending requests. This is the no replication method.
2: In the second method, the front node sends requests to the s-cnn servers with a fixed
timeout on all requests. If a server is experiencing a slowdown and does not provide
prediction before the timeout, the request is replicated. This is the replication method.
Figures 5.6, 5.7, and 5.8 show the end-to-end latency distribution of the three methods under different levels of redundancy. For requests to the collage-cnn model, the end-to-end latency includes the time spent in forming the collage image. The curve lines in the plots show the estimated probability density function of the end-to-end latency calculated using Kernel Density Estimation.
9 s-cnn + 1 collage-cnn: The collage inference system has 11% replication overhead.
The mean latency of the collage inference is similar to the replication method and lower
than the no replication method. The standard deviation of latency of collage inference is
3.9x and 2.7x lower than the no replication and replication methods respectively. The 99-
th percentile latency is 2x and 1.6x lower than the no replication and replication methods
respectively.
16 s-cnn+ 1 collage-cnn: The replication overhead of this collage system is 6%. The
mean latency of the collage system is lower than both replication and no replication
methods. The standard deviation of the inference latency is 2.5x and 2.1x lower than the
no replication and replication methods respectively. The 99-th percentile latency of the
inference system is 1.6x and 1.5x lower than the no replication and replication methods
respectively.
25 s-cnn+ 1 collage-cnn: This collage inference system has 4% overhead of repli-
cation. The mean latency of the inference system is significantly lower than both the
baselines. The standard deviation of latency of the system is 1.36x and 1.68x lower than
the no replication and replication methods respectively. The 99-th percentile latency of
the collage system is 1.2x and 1.4x lower than the no replication and replication methods
respectively.
Recovery accuracy: During the deployments of the 3x3 collage-cnn model, for each
request to the collage-cnn, 9 single image classification requests are sent to the s-cnn
models. About 18% of the requests encountered straggler nodes. Of these 18% re-
quests, collage-cnn predictions were available for 87%. For the remaining 13% of the requests, the collage-cnn node was also one of the stragglers and its predictions were unavailable.
The accuracy of the available collage-cnn predictions was 78%. When 4x4 collage-cnn
Table 5.1: Usage of the 3x3 collage-cnn model
% of requests that encountered stragglers: 18%
Collage-cnn predictions are available: 87%
Accuracy of the collage-cnn predictions: 78%

Table 5.2: Usage of the 4x4 collage-cnn model
Total requests that encountered stragglers: 21%
Collage-cnn predictions are available: 80%
Accuracy of used collage-cnn predictions: 74%
models were deployed, for each request to the collage-cnn, 16 single image classifica-
tion requests are sent to the s-cnn models. 21% of these requests encountered stragglers.
Collage-cnn predictions were available for 80% of these requests and the prediction ac-
curacy was 74%. These statistics are shown in tables 5.1 and 5.2.
5.5.5 Comparison to alternate backup models
Apart from the two baselines evaluated above, we further compared the 3 x 3 collage-cnn model with three alternative redundancy models. All comparisons are performed
using the ImageNet dataset.
Multi-image batch ResNet: In this study, we use the ResNet-34 model as the re-
dundancy model but with batching capability of 9 images per single batch. Inference
latency of this batch model is 6.1x larger than using a 3x3 collage-cnn model. Accu-
racy while using the batch ResNet-34 model is 80.7% whereas the accuracy while using
the 3x3 collage-cnn is 78.3%. The collage-cnn performs significantly better in terms of inference latency while being only slightly lower in terms of accuracy.
Lower resolution CNN: In this study, the CNN used has an architecture that is similar
to a 3x3 collage-cnn i.e., same number of convolution and pooling layers with the same
filter configuration, but takes a single image with a resolution of 139 x 139 as input. 139 x 139 is the same as the final resolution of each image in a 3x3 collage. Batch inference using
this CNN, with nine 139 x 139 single images, takes 1.8x longer than the 3x3 collage-cnn.
The overall accuracy of the predictions of this CNN is identical to that of collage-cnn.
Collage-cnn performs better in terms of the inference latency.
Multi-image batch MobileNet-v2: In this study we use the lower cost MobileNet-
v2 as the redundancy model. A 3x3 collage-cnn has 8 million parameters whereas a
MobileNet-v2 has 3.5 million parameters. At the full image resolution of 224 x 224,
MobileNet-v2’s top-1 accuracy of 81% is slightly higher than the collage-cnn’s accuracy
of 78.3%. However, its batch inference latency is 6x larger than using a collage-cnn
model. When the image resolution is lowered to 139 x 139, which is the resolution of
each image in a 3x3 collage, the accuracy of MobileNet-v2 declines to 71.4%. A 3x3
collage-cnn has a much higher top-1 accuracy of 78.3%. The inference latency with 9
lower resolution images per batch, using MobileNet-v2 is 2.6x larger than the latency
of doing inference using a collage-cnn. At both resolutions, the latency of a collage-
cnn model is significantly lower than the batch MobileNet-v2 model. Despite collage-
cnn having more parameters than MobileNet-v2, collage-cnn has lower latency. This is
because collage-cnn is designed to be a wide and shallow network. MobileNet-v2, on the
other hand, partitions each convolution operation into two stages to reduce the memory
and storage consumed by the network. However, this increases the depth and the latency
of the MobileNet model [101].
Parity models: A concurrent work [60] proposes a general parity models framework,
ParM, for coding-based resilient inference in tasks such as image classification, speech
recognition and object localization. Similar to collage-cnn models, ParM proposes us-
ing parity models as backup models to reconstruct unavailable predictions from slow
nodes. The framework allows for the parity model to be different for different infer-
ence tasks. During evaluations for image classification, ParM uses parity models having
the same architecture as the models they are backing up. In contrast, collage-cnn mod-
els use a custom architecture optimized for multi-object classification. As a result, the
collage-cnn models have lower inference latency than the parity models used in ParM.
ResNet models are used as the single image processing s-cnn models in both collage
inference and ParM. As previously described in section 5.5.3, the inference latency of a
collage-cnn is 10% lower than the s-cnn models it is backing up. Designing and using
a custom architecture also leads to significant increase in the classification accuracy. A
2x2 collage-cnn working on 4 CIFAR-10 images has a classification accuracy of 88.91%
whereas the corresponding parity model in ParM (k = 4) has an accuracy of 74%. Due to
this the collage-cnn models can provide a much better accuracy using the same compute
resources as ParM, or provide a similar accuracy as ParM using much lower compute
resources.
The evaluations and comparisons presented in this section demonstrate the effectiveness
of using collage models to reduce latency variation over alternate redundancy methods.
5.6 Related work
Before concluding this chapter, we discuss additional related work from literature.
Clipper [19] is a low-latency online ML prediction serving framework. Clipper runs
an ensemble of models to increase prediction serving accuracy. The framework can ig-
nore predictions from the straggler nodes in the ensemble with a marginal loss compared
to ensemble accuracy. This approximation is used during the ensemble inference, i.e.,
when different models are predicting on the same input. In the Collage inference, the
collage models provide approximation when replicas of an s-cnn model predict on differ-
ent inputs. Shinjuku [55] is an operating system that implements preemptive scheduling
for reducing microsecond-scale tail latency. On the other hand, collage inference targets
reducing tail latency in the range of millisecond-scale and higher.
Many prior works target specific causes of tail latency and address them. Authors of
works [42, 102] propose techniques for latency prediction and adaptively selecting ap-
propriate replicas. Prior work [110] proposes prediction methods to prevent requests of
different types from being scheduled on a single VM. Techniques to mitigate tail latency
caused by network interference are discussed in [39]. Techniques to predict performance
and select optimal resource configuration are proposed in [106]. PerfIso [52] is a perfor-
mance isolation framework that co-locates batch jobs with production latency-sensitive services to increase CPU utilization without impacting tail latency. Unlike these prior
works, Collage inference is general and independent of the causes of tail latency.
5.7 Chapter summary
Cloud-based prediction serving systems are being increasingly used to serve image
classification based requests. Serving requests at low latency, high accuracy, and low
resource cost becomes very important. In this chapter, we described collage inference,
where a coded redundancy model is used to reduce the tail latency during inference while
maintaining high accuracy. Collage inference uses novel collage-cnn models to provide
recovery from slowdown during runtime. Collage-cnn models provide a good tradeoff
between accuracy, resource cost, and tail latency. Deploying the models in the cloud, we
demonstrate that the 99-th percentile latency can be reduced by up to 2x compared to replication-based approaches while maintaining high prediction accuracy. We can conclude that
collage inference is a promising approach to mitigate stragglers in distributed inference
systems.
Chapter 6
Origami inference
This chapter describes the Origami inference technique, which preserves privacy by leveraging secure hardware enclaves. First, an introduction to TEEs and Intel SGX
enclaves is provided in section 6.1. Section 6.2 describes the overheads of secure en-
claves. Section 6.3 describes two key ideas that lead to the Origami inference technique.
Section 6.4 describes the conditional GAN threat models to verify input privacy. Section
6.5 describes the implementation details. Section 6.6 presents the experimental evalua-
tion of the technique against several baselines.
Figure 6.1: Unsecure system. The user of the service sends private data (an image) to the cloud service, where the DNN runs on a non-secure CPU/GPU in the cloud and the result is returned to the user.
Figure 6.2: Secure system. The user encrypts the private image before sending it to the cloud service; the DNN runs entirely inside a secure CPU enclave in the cloud, which decrypts the input, performs inference, and returns the result to the user.
6.1 Background
6.1.1 TEEs and Intel SGX
Trusted Execution Environments (TEEs) such as Intel SGX (Software Guard Exten-
sions), ARM TrustZone [17, 51] enable execution of programs in secure hardware en-
claves. They protect the confidentiality and integrity of the code and data that are exe-
cuted inside the enclaves. In Intel's SGX a region of memory (128 MB by default) is reserved only for enclaves and memory accesses to this region are restricted by the CPU. Only the instructions executing inside an enclave can access this memory region. Non-enclave code can initialize an enclave via SGX-related CPU instructions like ECREATE, EENTER, EADD, etc. Operating systems, root users and all other applications running on the
same machine are prevented from accessing this memory region. Intel SGX provides
support for remote attestation of an initialized enclave. This can be used by a remote
party to verify the code and data inside the enclave immediately after its initialization.
While there have been some demonstrated side channel attacks on SGX [11, 14], in this
work we assume that SGX execution can be secured using some of the recently published
schemes [111, 113].
6.2 Overheads of secure execution
To motivate the need for private inference, consider a cloud service which is used by a
health care provider to classify medical images of patients. The health care provider (user
of the service) sends private data (an image related to a patient) that may be processed
using the system depicted in figure 6.1. Even if the user encrypts and sends the data, it
needs to be decrypted before processing. The user image becomes visible to the service
and it is the responsibility of the service to ensure confidentiality of the user data.
TEEs like Intel SGX can be used to provide confidentiality since they support remote
attestation and are designed to protect the confidentiality of the data running inside the
secure enclaves. Consider the service using a TEE based system shown in figure 6.2. In
this system after the cloud service initializes the secure enclave, a user can get remote
attestation that the initialized enclave loaded the proper code to process the user data.
We assume that the model provided by the cloud service is verifiable by the user before
using the service, or the user of the service may load their own trusted models for cloud
execution. Then the user encrypts the image, before sending it to the cloud service. The
service processes the encrypted image completely inside a secure enclave. However, the inference latency of this service can be one or even two orders of magnitude longer than that of the first approach.
In figure 6.3 we compare the inference runtimes of VGG-16 and VGG-19 models
executing on an unsecure CPU with two secure enclave configurations (experimental
setup details are shown later). In the first secure enclave configuration the model data is
loaded Just-In-Time (JIT), only when needed, and in the second enclave configuration
all the model data is pre-loaded. When compared to executing on a CPU with no privacy,
the VGG-16 model runs 18.3x slower on a secure enclave with pre-loading of data and 6.4x slower on a secure enclave with JIT loading of data. Along the same lines, the VGG-19 model runs 16.7x slower on a secure enclave with pre-loading of data and 6.5x slower on a secure enclave with JIT loading of data. The slowdowns are more drastic,
Figure 6.3: Comparison of runtimes (seconds). VGG-16: 0.0069 (full GPU), 0.121 (full CPU), 0.77 (secure enclave with JIT loading), 2.21 (secure enclave with pre-loading). VGG-19: 0.0087, 0.141, 0.92, and 2.36 respectively.
when compared to executing the model entirely on an untrusted GPU – up to 321 times
slower. Eliminating or reducing this runtime slowdown on secure enclaves motivates
Origami inference to consider model partitioning.
6.3 Origami inference and two key ideas
6.3.1 Key idea 1: Model partitioning and offloading
Consider a service using a TEE based system with model partitioning as shown in
figure 6.4a. For example, with VGG-16, we propose that the 16 layers be split into two tiers of L and 16 - L layers. Prior to deploying the model in the cloud, the first tier with L layers is embedded within the SGX enclave container. The second tier of 16 - L layers
Figure 6.4(a): Secure system with partitioning. The user encrypts the private image and sends it to the cloud; the first part of the DNN runs inside a secure enclave in a CPU, which decrypts the input, and the second part of the DNN runs on a non-secure CPU/GPU before the result is returned to the user. The intermediate feature maps from the partition layer onward are not secure.
is created as a separate container that can be executed in an open compute environment
inside the CPU or can be offloaded to an accelerator.
During operation, each inference request is encrypted by the user of the service and
sent to the cloud. The encrypted input is then received by the first tier in the SGX en-
clave which securely decrypts the input. The input is then processed in the first L layers. The output from the first tier is an intermediate feature map. The intermediate representation is then fed to the second tier which may be executed on a GPU/CPU. As we show in the next section, by appropriately selecting the L layers for the first tier, our model
partitioning approach provides strong guarantees that prevent input reconstruction from
the intermediate representation generated by the first tier. Thus model partitioning can
protect input privacy while also reducing the amount of computation performed within
the SGX enclave.
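A minimal Keras sketch of this two-tier split is shown below; the partition index and the use of keras.applications.VGG16 are illustrative assumptions, and the packaging of tier 1 inside the SGX enclave is not shown.

from tensorflow import keras

L = 6   # example Keras layer index for the partition point (the chapter counts convolutional layers)
vgg = keras.applications.VGG16(weights="imagenet")

# Tier 1: input through layer L, intended to run inside the secure enclave.
tier1 = keras.Model(inputs=vgg.input, outputs=vgg.layers[L].output)

# Tier 2: the remaining layers, rebuilt on a fresh input so they can be offloaded to an
# untrusted CPU/GPU and fed the intermediate feature map produced by tier 1.
x = tier2_in = keras.Input(shape=tier1.output_shape[1:])
for layer in vgg.layers[L + 1:]:
    x = layer(x)
tier2 = keras.Model(inputs=tier2_in, outputs=x)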
Figure 6.5: Runtime variation with different partitioning points. Runtimes (seconds) of VGG-16 and VGG-19 for the full model in CPU, partitioning at the 4th, 6th, or 8th convolutional layer with offload to CPU, the full model in the enclave with JIT loading, the full model in GPU, and partitioning at the 4th, 6th, or 8th convolutional layer with offload to GPU.
We measure the inference runtime of the VGG-16 and VGG-19 models partitioned at
different layers and plot these in figure 6.5. The first partition of the model is run inside
SGX enclave and the second partition is run on a CPU outside of SGX. For VGG-16, as the partitioning point moves from the 4th to the 6th to the 8th convolutional layer, the inference slowdowns are 2.5x, 3x, and 3.3x respectively. For the VGG-19 model the slowdowns are 2.3x, 2.7x, and 3.2x respectively. More critically, if the second partition is offloaded to a GPU, the slowdown relative to GPU-only execution drops to about 50x, compared to the up to 321x slowdown seen when the full model runs inside the enclave. Although model partitioning considerably improves performance
there is still a significant penalty that must be paid for private inference.
6.3.2 Key idea 2: Reducing SGX execution further with Slalom
While model partitioning reduces the amount of computing done within SGX there is
still a significant opportunity to lower the overheads. Slalom [105] proposed an approach
to enable compute intensive operations to be offloaded to a GPU from SGX while still
preserving privacy. Each layer of the DNN is split partially between GPU and SGX
enclave. The compute intensive convolutions (matrix multiplications) within a DNN
layer are offloaded to a GPU while allowing the non-linear operations (such as RELUs)
to be performed within an enclave. However, offloading the convolution layers exposes
the user input to an adversary. For instance, the first layer in a CNN model convolves the
input image with a sliding window of the feature matrix. To prevent private data exposure, Slalom uses a cryptographic technique to blind the data before sending it to the GPU. This technique adds noise, i.e., random elements that are kept private within the SGX enclave, to the input matrices before releasing the data to the GPU. These random elements are referred to as the blinding factors. This noisy input data, with privately held blinding factors, creates the privacy guarantees by preventing an adversary from observing (or even reconstructing) the inputs.
The GPU sends the results of its computation on the noisy data back to the SGX enclave. Because the GPU is performing only linear operations (matrix multiplications, in particular) one can decode the GPU results by subtracting the precomputed noisy components. For this purpose, Slalom uses the privately stored unblinding factors to extract the computation result, sans noise, before applying non-linear functions within the enclave. Slalom's reliance on the decodability of computed data requires the approach to run only linear operations on the GPU while requiring SGX to perform all non-linear operations. Non-linear operations on noisy data would essentially render the results undecodable. Thus Slalom must pay the cost of blinding and unblinding every layer within a CNN model.
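The decodability property described above can be illustrated with a toy numpy example for a single linear layer; the real Slalom protocol operates over a finite field with encrypted, precomputed unblinding factors, so the snippet below is only a schematic sketch.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1, 512))     # private intermediate features, held inside SGX
W = rng.standard_normal((512, 256))   # layer weights, known to the GPU
R = rng.standard_normal(X.shape)      # blinding factors, kept private inside SGX
U = R @ W                             # unblinding factors, precomputed offline

blinded = X + R                       # noisy input released to the untrusted GPU
gpu_result = blinded @ W              # linear computation performed on the GPU

Z = gpu_result - U                    # unblind inside the enclave: Z equals X @ W
assert np.allclose(Z, X @ W)
# Non-linear operations (e.g., ReLU) are then applied to Z inside the enclave.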
We analyzed the performance of Slalom compared to an execution on a GPU without any privacy. As we show in our results section (see Figure 6.14), Slalom is about 10x slower. Hence, even though Slalom offloads most of its computations to a GPU, it still pays non-trivial overheads. We analyzed the reasons for this slowdown. Our experiments showed that blinding or unblinding 6 MB of features takes roughly 4 milliseconds and there are roughly 47 MB and 51 MB of intermediate features to process per inference in VGG-16 and VGG-19 respectively. Hence, a significant fraction of the total execution time is consumed by the encoding and decoding of data.
6.3.3 Origami: Combining model splitting with blinding
Using cryptographic blinding can eliminate the cost of SGX compute overheads but leads to an increase in blinding and unblinding overheads. These blinding overheads can be eliminated if one can avoid blinding the data once input reconstruction is no longer a concern. By avoiding blinding from the earliest possible DNN layer onward, it is possible to execute even the non-linear operations within a GPU, thereby allowing an uninterrupted execution of multiple CNN layers within a GPU. The Origami framework combines both techniques. It first outsources only the linear operations of the layers within the first partition while executing the non-linear operations inside the enclave. It then completely outsources the second partition of the DNN for execution on a CPU or GPU.
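Putting the two ideas together, the toy numpy sketch below shows the shape of the Origami flow: blinded linear offload for a first-tier layer, followed by a fully offloaded second tier with no blinding. All of it is illustrative; the real system runs the first tier inside SGX and the second tier as a separate CPU/GPU container.

import numpy as np

rng = np.random.default_rng(1)

def tier1_layer(x, W, R, U):
    y = (x + R) @ W                   # blinded matrix multiply, offloaded to the GPU
    return np.maximum(y - U, 0)       # unblind and apply ReLU inside the enclave

def tier2(x, W2):
    return np.maximum(x @ W2, 0)      # runs entirely on untrusted hardware, no blinding

x = rng.standard_normal((1, 64))      # decrypted user input (inside the enclave)
W1 = rng.standard_normal((64, 32))
W2 = rng.standard_normal((32, 10))
R = rng.standard_normal((1, 64))
U = R @ W1                            # precomputed unblinding factors
out = tier2(tier1_layer(x, W1, R, U), W2)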
6.4 Conditional GAN threat models
Offloading the second partition of the model to an unsecure CPU or GPU makes
all the intermediate data of these layers available to adversaries. An adversary in the
Origami framework is an agent that tries to use the observed intermediate data of the
model to reconstruct the input image sent by the user to the service. The intermediate
data of layers after the first partition are available to an adversary. Formally, let the function computed by the hidden layers inside the enclave be φ. For some input X, an adversary can observe φ(X) and then use it to reconstruct X [78].
6.4.1 Formal privacy definitions
Let a server S and a client C be the participants in the Origami inference. Let the DNN model be M. We assume that the model M is known or can be known to the server S.
Input privacy: In Origami inference, an adversary S cannot learn useful structural information about the input image, x, from the client C. Structural information in an image classification task includes the information about what objects are present in the image.
The first partition of the model running inside the enclave does not leak any information about the input image, similar to Slalom [105]. For the second partition that runs outside the enclave, Origami uses the Structural Similarity (SSIM) metric to quantify the
Figure 6.6: c-GAN adversary model. The generator takes the intermediate feature maps φ(X) and produces a reconstructed image X'; the discriminator, conditioned on the same feature maps φ(X), distinguishes the private image X from the reconstructed image X'.
structural information that the adversary can learn. The SSIM value indicates the similarity between the input image x and the reconstructed image x'. We use the SSIM metric instead of metrics like mean square error or mutual information since it specifically compares structural information between two images. When choosing where to partition the model, we use the SSIM value and visually verify the similarity between x and x'.
In the next couple of subsections, we expand on the adversaries and the algorithm to
partition the models used in Origami inference.
6.4.2 C-GAN adversary
Origami inference uses an adversary model, shown in figure 6.6, which is a conditional Generative Adversarial Network (c-GAN) [83] trained to learn the inverse of φ. A GAN [36] is a network that consists of two DNN models that evolve together, a generator model G and a discriminator model D. The generator model G learns the training data distribution by striving to generate "fake" samples that are difficult for the discriminator D to differentiate from the real samples; at the same time the discriminator model D learns to distinguish if a given sample is coming from the training data or synthesized by the generator G. The training process for both models resembles a two-player min-max game to optimize the following objective function, where x is from the training dataset distribution and z is noise sampled from a multi-variate Gaussian distribution:

\min_G \max_D \{ \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \}    (6.1)

Conditional GAN [83] builds on top of GAN and uses additional data c to condition both the generator and discriminator models, yielding G(z; c) and D(x; c). This extension enables G to generate samples conditioned on some variable c.
Recall that in Origami the first tier uses blinding factors to preserve privacy, while the
second tier of computing is done entirely in the open. Thus in the context of Origami, the
adversary trains a c-GAN on the intermediate data generated from the first tier. The gen-
erator network G is trained to produce the private image inside the enclave as its output.
It takes φ(X) as input and is trained to generate an X' that is as real as possible. A discriminator network is trained to classify between a real image X and a fake image X' produced by the generator, given φ(X). After the c-GAN is trained, the adversary can use the
generator to take in the intermediate data between the partitions and generate the private
input image.
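A sketch of one adversary training step, written in TensorFlow/Keras terms with the common non-saturating generator loss rather than the literal min-max form of equation (6.1), is given below; the assumption that D takes the pair [image, feature_maps] as input, and the helper names, are illustrative.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=False)

def train_step(G, D, g_opt, d_opt, real_images, feature_maps):
    with tf.GradientTape() as d_tape, tf.GradientTape() as g_tape:
        fake_images = G(feature_maps, training=True)                 # X' generated from phi(X)
        real_score = D([real_images, feature_maps], training=True)
        fake_score = D([fake_images, feature_maps], training=True)
        # discriminator: label real images 1 and generated images 0, conditioned on phi(X)
        d_loss = bce(tf.ones_like(real_score), real_score) + \
                 bce(tf.zeros_like(fake_score), fake_score)
        # generator: try to make the discriminator score the fakes as real
        g_loss = bce(tf.ones_like(fake_score), fake_score)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss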
6.4.3 Training the c-GAN adversary
The ability of an adversary to train the c-GAN model depends on two things:
1. The information about the input image that is contained in the intermediate data.
2. The quantity of the training data.
Collecting a lot of training data requires the adversary to send a huge number of queries to the service and observe all the corresponding intermediate data to collect enough [φ(X), X] pairs. In this work we make the assumption that the adversary has a significant resource advantage to pay for the cloud service and that the cloud service will not limit the number of queries.
Given these strong adversary assumptions, the intermediate data should not provide
information to enable reconstruction of the input images. Then even collecting a large
amount of training data will not help the c-GAN model reach high accuracy. Consider a
limiting example of trying to reconstruct an input image using just the probabilities of the
last layer of the CNN model. In this example all the layers are running inside the enclave
and only the final softmax probabilities are available outside. It has been empirically
shown that using the softmax probabilities cannot give a good reconstruction of the input
image [27]. The softmax probabilities can contain information like the color of the
image but lack the information on reconstructing the exact objects in the image. Using
earlier layers as partitioning points increases the probability of an adversary successfully
reconstructing the user input whereas using deeper layers as partitioning points makes it
infeasible for an adversary to successfully reconstruct the user input.
6.4.4 Model partitioning algorithm
The Origami framework finds the partitioning layer where splitting the model makes it empirically infeasible for the adversary to perform a good reconstruction of the image. The procedure followed is described in Algorithm 2. Beginning from the first layer of the model, the procedure empirically verifies, by training a c-GAN, whether input images can be reconstructed from the intermediate feature maps that are outputs of the current layer. The metric used to measure reconstruct-ability is the widely used Structural Similarity (SSIM) [119] between the real images, X, and the reconstructed images, X'.
SSIM metric measures the similarity of structural information between two images. A
value of 0 indicates no structural similarity, whereas a value of 1 indicates complete
structural similarity. We use the SSIM metric instead of metrics like mean square error
or mutual information since it is specifically designed to compare structural information
between two images.
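The SSIM check itself can be computed with an off-the-shelf implementation; the runnable illustration below uses scikit-image's structural_similarity purely for illustration, with random noise standing in for a poor c-GAN reconstruction.

import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(0)
real = rng.random((224, 224, 3)).astype(np.float32)
good_recon = np.clip(real + 0.01 * rng.standard_normal(real.shape), 0, 1).astype(np.float32)
bad_recon = rng.random((224, 224, 3)).astype(np.float32)

print(ssim(real, good_recon, channel_axis=-1, data_range=1.0))  # close to 1: high structural similarity
print(ssim(real, bad_recon, channel_axis=-1, data_range=1.0))   # close to 0: reconstruction failed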
One observation we make during model partitioning is that even if the feature maps from a layer p cannot be used to reconstruct the inputs, the c-GAN may still be able to reconstruct the input from a deeper layer. Thus it becomes necessary to verify c-GAN capabilities for deeper layers as well. So, the procedure also verifies reconstruct-ability at layers p + 1 and p + 2. For example,
in a VGG-16 model we observed that it is infeasible for the c-GAN to reconstruct the
input images by using feature maps from the first max pool layer. However, using the
feature maps from the convolutional layers that follow the max pool layer it is feasible
to reconstruct the input images. In our understanding this happens because the feature
maps from the max pool layer lack the spatial relationships. However, feature maps
in the convolutional layers that follow the first max pool layer seem to recover enough
spatial relationships to get a good reconstruction of the input images. We provide more
details on the reconstruct-ability in the evaluation section.
Algorithm 2 Model partitioning algorithm
Input: Model M with L layers, c-GAN architecture, image dataset
Output: Partitioning layer p
for each layer l, starting from the first layer in L, do
    Collect the intermediate feature maps maps_i that are the outputs of layer l for every image i in the dataset
    Train a c-GAN using all the (i, maps_i) pairs
    Check for reconstruct-ability using the SSIM metric
Let p be the layer such that its feature maps maps_p cannot be used by the c-GAN to reconstruct the corresponding input images
Verify reconstruct-ability using the maps of the next two layers p + 1, p + 2
if reconstruct-ability is infeasible for p + 1 and p + 2 then
    select p as the partitioning point for the model M
else go back to the for loop and start from layer p + 1
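As a companion to Algorithm 2, the search loop can be sketched in Python as below; avg_ssim stands in for the expensive step of training a c-GAN and measuring reconstruction quality at a layer, and the 0.2 threshold and toy SSIM values are illustrative assumptions rather than values from the actual procedure.

def find_partition_layer(num_layers, avg_ssim, threshold=0.2):
    # avg_ssim(l): average SSIM of c-GAN reconstructions from layer l's feature maps
    l = 0
    while l < num_layers:
        if avg_ssim(l) <= threshold:                      # candidate partition layer p
            p = l
            deeper = [q for q in (p + 1, p + 2) if q < num_layers]
            if all(avg_ssim(q) <= threshold for q in deeper):
                return p                                  # reconstruction also infeasible at p+1, p+2
            l = p + 1                                     # a deeper layer leaked; resume the scan
        else:
            l += 1
    return None

# Toy usage: layer 2 looks safe but layer 3 leaks, so the search settles on layer 5.
toy_ssim = [0.9, 0.95, 0.2, 0.85, 0.5, 0.1, 0.12, 0.08, 0.05]
print(find_partition_layer(len(toy_ssim), lambda l: toy_ssim[l]))   # prints 5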
6.5 Implementation
6.5.1 c-GAN architecture and implementation
We design the Generator G of the GAN model as an encoder-decoder network with a series of residual blocks [44]. As shown in figure 6.7, the intermediate feature map output taken from the partition layer p is first fed into the encoder, which is composed of 3 convolutional layers that down-sample it until it has a spatial dimension of 14 x 14. Then this encoded feature map is fed into 4 residual blocks, before going through a series of 4 up-sampling blocks acting as a decoder to generate a fake image of dimension 224 x 224 x 3. The residual block consists of a 3 x 3 stride 1 convolution, batch normalization and ReLU. The up-sampling block consists of nearest-neighbor upsampling followed by a 3 x 3 stride 1 convolution. Batch normalization and ReLU are applied after each convolutional layer.

Figure 6.7: c-GAN architecture

On the other hand, the Discriminator network D first takes in an input image of dimension 224 x 224 x 3 and lets it go through 2 convolutional layers that down-sample it until it has the same spatial dimension as the intermediate feature map c produced from the partition layer p in VGG-16. Then it is concatenated together with the intermediate feature map c as a condition and fed into a series of 5 consecutive convolutional layers, and then a fully connected layer with Sigmoid activation, to predict whether the input image is "true", i.e., it is from the training dataset, or "fake", i.e., generated by the Generator network G. Here the convolutional layers are 4 x 4 with stride 2 and, except for the last layer, they are all followed by batch normalization and LeakyReLU with a 0.2 negative slope.
The architectures of the Generator and Discriminator in the c-GAN are tuned, as
needed, to work with the different shapes of the intermediate feature maps from different
partitioning layers. The c-GAN models, the code to perform model partitioning, and the code to collect the intermediate feature maps are written in Python using the Keras library.
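For reference, a hedged Keras sketch of a generator with the structure described above (encoder, residual blocks, up-sampling decoder) is given below; the filter counts, the assumed input feature-map shape, and the final tanh output layer are illustrative choices, since they depend on the partition layer, and the discriminator is omitted for brevity.

from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters=256):
    y = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    return layers.Add()([x, y])

def upsample_block(x, filters):
    x = layers.UpSampling2D(interpolation="nearest")(x)       # nearest-neighbor upsampling
    x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def build_generator(feature_map_shape=(112, 112, 64)):        # assumed shape of phi(X)
    inp = layers.Input(shape=feature_map_shape)
    x = inp
    for f in (128, 256, 256):                                 # encoder: 3 convs down to 14 x 14
        x = layers.Conv2D(f, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    for _ in range(4):                                        # 4 residual blocks
        x = residual_block(x, 256)
    for f in (256, 128, 64, 32):                              # 4 up-sampling blocks to 224 x 224
        x = upsample_block(x, f)
    out = layers.Conv2D(3, 3, strides=1, padding="same", activation="tanh")(x)
    return keras.Model(inp, out)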
We train the c-GAN adversary models using images from the Imagenet ILSVRC 2012
validation dataset [25]. This dataset comprises 50,000 images belonging to 1000 classes.
To check reconstruction at each partitioning layer, the c-GAN is trained for 150 epochs
on a single NVIDIA GTX 1080 Ti GPU with a learning rate of 0.0002. The training time
varies depending on the size of the feature maps of the partitioning layer. Training for
150 epochs for the 2nd layer took nearly seven days, and for the 7th layer it took nearly one and a half days.
6.5.2 Implementation of models in SGX
The code to execute the VGG-16 and VGG-19 models inside SGX enclaves is written
in Python and C++ using the SGXDNN library from Slalom. The code to execute the
models on CPU or GPU is written using Keras library in Python.
6.6 Evaluation
Here is the outline of our evaluation presentation. As a first step, we experimentally
evaluate the privacy guarantees of Origami inference, using c-GANs to reconstruct orig-
inal images from intermediate features from the two-tier implementation of VGG and
InceptionV3. Second, we demonstrate the performance of Origami inference against baseline strategies implemented using Slalom's SGXDNN library [105]. We compare our framework's
performance with that on untrusted CPUs and GPUs to show how much overhead we still
pay to protect privacy.
6.6.1 Hardware configuration
We performed all our evaluations on a server consisting of an Intel Xeon E-2174G CPU equipped with SGX capability and an NVIDIA GeForce GTX 1080 Ti GPU attached as an accelerator. The CPU has 8 threads and 64 GB of memory. Its operating system is Ubuntu 18.04. The GPU has 11 GB of GDDR memory.
6.6.2 Partitioning and input privacy
We train the c-GAN adversary models and evaluate the Origami framework using images from the ImageNet ILSVRC 2012 validation dataset [25]. Image reconstructions
by the c-GAN using feature maps at different layers of the VGG-16 and Inception-v3,
while running the model partitioning algorithm 2, are shown in figures 6.9, 6.10.
Figure 6.8: Similarity between real and reconstructed images at different partition layers. Average SSIM values: layer 1: 0.91, layer 2: 0.94, layer 3: 0.22, layer 4: 0.83, layer 6: 0.11, layer 7: 0.21, layer 9: 0.1.
6.6.2.1 VGG-16 c-GAN reconstruction results
Figures 6.9(a) and (b) show a sample of real images sent for private inference. In
(c) and (d) we show the reconstructed images generated from the intermediate outputs
from a trained c-GAN. We show the reconstruction capability of c-GAN after the first
and second convolution layers of VGG-16. In (e) and (f) we show another set of sample real images sent for private inference. In (g) and (h) we demonstrate the surprising result that the c-GAN is unable to reconstruct the input after layer 3, but it is able to reconstruct the input image from layer 4. As discussed earlier, feature maps in the convolutional layers
that follow the first max pool layer seem to recover enough spatial relationships to get
a good reconstruction of the input images. In (i) and (j) we show real images, and the
reconstruction output from c-GAN using the intermediate outputs from layer 6 and 7 is
shown in (k) and (l). Clearly after layer 6, in VGG-16 the c-GAN is unable to reconstruct
the input, even when the c-GAN is trained with all input images.
Figure 6.9: Real images and reconstructed images using feature maps of different layers in VGG-16. (a), (b) Real images; (c) reconstructed images from layer 1 (conv) feature maps; (d) reconstructed images from layer 2 (conv) feature maps; (e), (f) real images; (g) reconstructed images from layer 3 (max pool) feature maps; (h) reconstructed images from layer 4 (conv) feature maps; (i), (j) real images; (k) reconstructed images from layer 6 (max pool) feature maps; (l) reconstructed images from layer 7 (conv) feature maps.
Figure 6.10: Real images and reconstructed images using feature maps of different Inception modules in Inception-v3. (a), (b) Real images; (c) reconstructed images from the 4th Inception module output; (d) reconstructed images from the 5th Inception module output; (e), (f) real images; (g) reconstructed images from the 6th Inception module output; (h) reconstructed images from the 7th Inception module output; (i) real images; (j) reconstructed images from the 8th Inception module output.
We show the average SSIM metric values between real and reconstructed images at
different layers of the VGG-16 model in figure 6.8. The variation of these values clearly
indicates the visual similarity (or its lack thereof) between real and reconstructed images
already shown in figure 6.9. The visual similarity is high for the first 2 layers, drops
significantly for the third layer, is again high for the fourth layer, then decreases and
stays below 0.2 for all layers past layer 7.
6.6.2.2 Inception-v3 c-GAN reconstruction results
Similarly, figure 6.10 (a) and (b) show samples of real images. In (c) and (d), we
show the reconstruction capability of c-GAN using the output features of the fourth and
the fifth inception modules. Images (e) and (f) are other sample real images. Images
(g) and (h) show the corresponding reconstructed images using the sixth and the seventh
inception module outputs. In (i) and (j) we show the real images and the reconstructed
images using the eighth inception module output. Visually, the reconstructed images
become blurrier and have less information about the original images.
6.6.3 Performance evaluation
We use three metrics to measure the performance of Origami: (1) Inference runtime:
Total time taken to perform a single inference. (2) SGX enclave memory requirement.
SGX enclave memory is a precious resource and Intel SGX tool chain requires enclave
writers to specify enclave memory usage statically. During application runtime, allocation of more memory than specified will trigger an exception. (3) Power event recovery time: Intel's SGX capable processors will destroy the memory encryption keys inside them when a power event, such as hibernation, happens [51]. Thus, SGX applications need to recreate enclaves after power events. Power event recovery time is collected to measure Origami inference's ability to recover from unexpected power events.
Baseline strategies: We compare Origami with several baselines. The baseline strategy
(referred to as Baseline2 in the figures) performs lazy loading of model parameters into
SGX when loading fully connected layers that require more than 8 MB of memory. Parameters for such layers are loaded into the enclave on demand. Although this strategy induces a small inference performance penalty by folding part of the model loading cost into the inference time, it can reduce the memory usage inside the SGX enclave, which avoids destructive page swapping. Baseline 2 is also the baseline the original Slalom paper used.
Note that we also evaluated another baseline (baseline 1) where all the model parameters
are fully loaded before starting the inference but that baseline was much worse due to
page swapping costs and hence was discarded.
The Slalom/Privacy [105] model implements the Slalom approach where all the layers are interspersed between SGX and CPU/GPU. All linear computations are offloaded to the GPU after applying blinding, and all blinding, unblinding and non-linear computations are performed within SGX. Unblinding factors are pre-computed and are not part of the inference time. Blinding factors are generated on demand using the same Pseudo Random Number Generator seed, while unblinding factors are encrypted and stored outside the SGX enclave. When removing noise from intermediate features, Slalom/Privacy will only fetch the parts of the unblinding factors needed for a given layer into the SGX enclave. This mode of operation is identical to the approach suggested in the original Slalom paper.
We also evaluate three model-splitting-only strategies, where the first x layers are
executed within SGX and the remaining layers are offloaded to the CPU/GPU. We refer
to them as Split/x, where x is the index of the layer after which all layers are offloaded
to untrusted hardware.
6.6.3.1 SGX enclave memory usage
Table 6.1 shows the required SGX enclave memory, which is a major limiting factor
for using SGX. Baseline 2 uses about 86MB of memory, even though the VGG-16 model
size is more than 500 MB, due to the lazy loading process. However, it may pay an
on-demand data loading penalty during inference. Origami (and Slalom) have about 2x
lower memory overhead than this baseline. Slalom/Privacy requires 39MB of SGX enclave
memory, 12MB of which is used to temporarily store blinding/unblinding factors. Origami
requires the same amount of SGX enclave memory as Slalom/Privacy. Both of them have
to accommodate enough blinding factors to blind the largest intermediate feature map,
which is about 12MB for both Slalom/Privacy and Origami. Also, compared to the simple
model splitting strategies, Origami pays a slight memory penalty due to the use of blinding
factors.
There are two main benefits of the reduced memory requirement. First, our framework
can retain its performance even with significantly less SGX enclave memory; thus, future
models that are much more complex and need much larger model sizes may be better
accommodated in our framework. Second, the reduced memory requirement allows more
enclave applications to run simultaneously.
Figure 6.11: Inference runtime improvements when offloading to GPU, shown as speedups over Baseline 2 (VGG16 / VGG19): Baseline 2 1.0x / 1.0x, Split/10 2.3x / 2.4x, Split/8 2.6x / 3.1x, Split/6 3.4x / 4.1x, Slalom/Privacy 10.1x / 11.0x, Origami 12.7x / 15.1x.
Currently, the maximum available SGX memory on an Intel chip is 128MB. For Origami, there is still about 90MB of free
physical memory that can be used for other SGX enclave applications or to even co-run
multiple private inference models.
6.6.3.2 Inference runtime
Figure 6.11 shows the average runtime comparison of Origami inference with the
various baselines. In this figure, all the offloaded computations are executed on the GPU.
Compared to executing the entire model within SGX (Baseline 2), Slalom achieves a 10x
speedup on VGG16 (and 11x on VGG19), while Origami achieves 12.7x and 15.1x speedups.
Model splitting at layer 6 (Split/6), which is the minimum split needed to protect privacy,
achieves only around a 4x speedup, because the first six layers run inside SGX and are
essentially unable to take advantage of the vast GPU acceleration capability.
To understand the vast slowdown seen in the baseline, figure 6.13 presents the runtime
breakdown for Baseline 2. The last three fully connected dense layers (Dense Layers 1, 2,
and 3) account for about 40% of Baseline 2's runtime. We also analyzed how much time
each layer spends on data processing and on data movement in and out of SGX memory.
Around 50% of the execution time in the fully connected dense layers is spent on data
movement. While Baseline 2's active memory footprint can fit within SGX's physically
protected memory, its dense layers are loaded on demand to prevent overflowing the SGX
memory limit. Thus the baseline has to pay the penalty of fetching parts of the parameters
on demand. Note that pre-loading the entire model (our original Baseline 1) performs
worse because it exceeds the SGX memory limit and pays a significantly higher data
movement penalty.
The bottleneck of Slalom/Privacy results from its need to blind and unblind intermediate
features to guarantee privacy across each and every VGG layer. Processing intermediate
features dominates the Slalom/Privacy runtime. As mentioned earlier, unblinding or
blinding 6MB of features takes roughly 4 milliseconds, and there are roughly 47MB and
51MB of intermediate features to process in total for VGG-16 and VGG-19, respectively.
Blinding and unblinding the intermediate features therefore takes roughly 62 and 68
milliseconds.
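A back-of-the-envelope check is consistent with these numbers: one blinding or unblinding pass over 6MB of features costs about 4 ms, and both passes are needed for every feature map.

per_mb_ms = 4.0 / 6.0                       # ~0.67 ms per MB for one pass
for model, total_mb in [("VGG-16", 47), ("VGG-19", 51)]:
    total_ms = 2 * total_mb * per_mb_ms     # blinding pass + unblinding pass
    print(model, round(total_ms, 1), "ms")  # ~62.7 ms and ~68.0 ms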
Origami achieves speedup by limiting the blinding overheads to just a few layers
while enabling many layers to be offloaded to GPU.
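The sketch below illustrates this structure with NumPy, assuming fully connected layers with ReLU activations for simplicity: the first few layers use Slalom-style blinding with the linear work done on untrusted hardware, and every layer past the partition point runs entirely outside the enclave. The helper names and the inline computation of the unblinding term (precomputed in the real system) are illustrative, not the actual Origami implementation.

import numpy as np

rng = np.random.default_rng(0)  # stands in for the shared PRNG seed used for blinding

def blinded_linear(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    r = rng.standard_normal(x.shape)   # blinding factor, generated inside the enclave
    y = W @ (x + r)                    # linear computation offloaded to untrusted hardware
    return y - W @ r                   # unblinding term (precomputed in practice)

def origami_infer(x, enclave_weights, offloaded_weights):
    h = x
    for W in enclave_weights:          # privacy-critical prefix: blinded linear ops + ReLU
        h = np.maximum(blinded_linear(W, h), 0.0)
    for W in offloaded_weights:        # past the partition point: fully offloaded, no blinding
        h = np.maximum(W @ h, 0.0)
    return h

# Toy usage: two blinded layers followed by two fully offloaded layers.
x = rng.standard_normal(8)
print(origami_infer(x, [rng.standard_normal((8, 8)) for _ in range(2)],
                    [rng.standard_normal((8, 8)) for _ in range(2)]))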
Figure 6.12 shows the performance when all the offloaded computations are executed
on the CPU (no GPU usage). The highly parallel convolution operations are limited by
the available CPU parallelism. Hence, Slalom achieves only about a 2.9x speedup, while
Origami achieves about a 3.9x speedup compared with Baseline 2 on VGG-19. Note that
Slalom/Privacy has similar performance to Split/6 when offloading to the CPU.
Figure 6.12: Inference runtime improvements when offloading to CPU, shown as speedups over Baseline 2 (VGG16 / VGG19): Baseline 2 1.0x / 1.0x, Split/10 2.0x / 2.1x, Split/8 2.2x / 2.4x, Split/6 2.6x / 2.9x, Slalom/Privacy 2.8x / 2.9x, Origami 3.6x / 3.9x.
Figure 6.13: Baseline 2 runtime breakdown for VGG-16 and VGG-19 (Dense Layer 1, Dense Layer 2, Dense Layer 3, and the rest of the layers).
In this case,
the cost of blinding/unblinding is roughly the same as the computational overhead of
executing the first six layers directly inside SGX. Slalom achieves smaller improvements
from offloading linear layers when the intermediate feature maps are large and the
untrusted hardware is not significantly faster.
Table 6.1: Enclave memory requirements for VGG16
Model Type Required Size
Baseline 2 86MB
Split/6 29MB
Split/8 33MB
Split/10 35MB
Slalom/Privacy 39MB
Origami 39MB
6.6.3.3 Power event recovery
After a power event, SGX enclave applications have to be reinitialized. Table 6.2
shows the time to recover from a power event. Split models recover much faster than
Baseline 2, which needs to reinitialize all of the model parameters before restarting the
service. Since Split models need less memory inside the SGX enclave, fewer pages are
encrypted during enclave initialization, and less data is copied into the SGX enclave
during model creation. Origami and Slalom/Privacy have similar recovery times because
they have the same memory requirement.
Table 6.2: Recovery time from power events for VGG16
Model Type Required Time
Baseline 2 201ms
Split/6 51ms
Split/8 54ms
Split/10 59ms
Figure 6.14: Inference runtime comparison with non-private GPU execution, shown as slowdowns relative to Full GPU (VGG16 / VGG19): Full GPU 1.00x / 1.00x, Baseline 2 112.13x / 105.81x, Split/6 43.21x / 36.28x, Slalom/Privacy 11.08x / 9.63x, Origami 8.81x / 7.03x.
6.6.3.4 Comparison with non-private inference
Figure 6.14 and figure 6.15 present the relative inference runtime compared to a baseline
where the entire model is executed on fast hardware without any privacy guarantees.
Compared with the CPU, Origami takes at most 1.7x longer. Origami takes about 8x
longer when compared to running the entire model within a GPU.
The slowdown results from the use of the SGX enclave. Origami and Slalom pay about
47 and 62 milliseconds of penalty, respectively, for SGX operations related to blinding,
unblinding, and non-linear operations for VGG-16. Since a GPU is well suited for highly
parallel inference operations, running the entire model within a GPU is still faster when
no privacy guarantees are needed. These results show that supporting private execution
environments within accelerators has significant usage potential in the future. When
compared with untrusted CPUs, Origami inference has a reasonable performance overhead
of 0.7x (i.e., at most 1.7x the non-private CPU runtime) while providing strong privacy
guarantees.
Figure 6.15: Inference runtime comparison with non-private CPU execution, shown as slowdowns relative to Full CPU (VGG16 / VGG19): Full CPU 1.00x / 1.00x, Baseline 2 6.36x / 6.55x, Split/6 2.45x / 2.25x, Slalom/Privacy 2.27x / 2.26x, Origami 1.76x / 1.67x.
6.6.3.5 Performance evaluation summary
When compared with the baselines, our experiments show that the Origami framework,
while providing strong confidentiality guarantees, benefits from offloading computations
to fast hardware. Offloading greatly reduces the computation and memory requirements
inside the SGX enclave. The reduction in memory requirements enables Origami inference
to recover faster from unexpected power events and also allows more enclave applications
to run simultaneously. When compared with Slalom/Privacy, Origami inference benefits
from the reduced amount of computation assigned to the SGX enclave and has significantly
lower inference latency.
6.7 Related work
Origami inference uses a combination of model partitioning, computational offloading,
and data blinding to provide fast and private inference. We discuss the prior related work
for each technique here.
Model partitioning and offloading: Previous efforts like [16, 20, 37, 90] focus on improving
performance on mobile devices by partitioning and offloading computation. Neurosurgeon
[56] partitions DNNs and onloads them to mobile devices to reduce latency and energy
consumption. Although Origami performs model partitioning and offloading, it differs
from all these works in its focus on protecting the privacy of the data being processed.
Also, Origami only offloads computation to co-located CPUs and GPUs. The authors of
[40] propose a ternary DNN model partitioning approach to reduce the overhead of using
an enclave. Origami differs from this in multiple ways. First, Origami uses Slalom's
blinding and unblinding and never performs linear operations inside the enclave; as a
result, Origami has lower inference latency. Second, Origami protects privacy against a
powerful c-GAN adversary [83].
Image reconstruction: A common adversarial objective on image classification networks
is image reconstruction. The authors of [27] showed that intermediate feature maps from
early layers in CNNs leak information and that even non-adversarial models can be
optimized to reconstruct input images based on these feature maps. Making such
informative feature maps available to a powerful adversary can compromise the privacy
of the input. In our approach, we choose the partition point of CNN models such that
the most informative feature maps are blinded inside SGX and only feature maps whose
information content is significantly curtailed are completely offloaded. We also choose to
protect against reconstruction by evaluating against a conditional GAN adversary, since
GAN-based models are the current state-of-the-art adversaries. The potency of a GAN
trained with conditional information has been demonstrated in works such as [72, 74].
Homomorphic encryption: Another paradigm for providing privacy guarantees when
performing inference on machine learning models is the use of homomorphic encryption.
Recent works [10, 45, 54, 89, 96] have explored using this technique to provide secure
inference in the setting of adversaries trying to learn information about the data.
Encryption and processing on encrypted data incur prohibitively large overheads, and
Origami inference is orders of magnitude faster than these methods. The only overhead
Origami incurs is the cryptographic blinding and unblinding operations on the intermediate
feature maps inside the enclave.
Adding noise for privacy: The authors of [108] train auxiliary models to inject noise and
perform data nullification before offloading data to the cloud for inference. They retrain
the DNN used in the cloud, whereas in Origami the DNN model is not modified. Shredder
[82] partitions DNN models between the edge and the cloud and adds learned noise to
the intermediate data at the edge before offloading to the cloud. Shredder does not make
use of enclaves and performs computation on the edge device. Unlike Shredder, where the
noise is learned, in Origami the blinding factors are not learned.
Side channel attacks: Origami relies on the security guarantees provided by Intel SGX.
However, SGX has been shown to be prone to side channel attacks [11, 14] based on
speculative execution bugs like Spectre and Meltdown [58, 73]. Intel is making updates
to its hardware and SGX implementation to increase robustness against these attacks.
Techniques proposed in [111, 113] can be used to defend against bugs in speculative
execution. Finally, Origami is a general framework and can be applied to other enclave
architectures like Sanctum [18].
6.8 Chapter summary
In cloud-based inference services, protecting the privacy of user data is very important.
In this chapter, we proposed Origami inference, which leverages hardware enclaves to
protect user data privacy. With Origami inference, we bring the performance of using an
enclave close to the performance of executing an inference request outside of the enclave
on a CPU or a GPU. Origami inference achieves this by combining model partitioning
with the blinding of data inside the hardware enclaves. We demonstrate Origami's privacy
against a strong conditional GAN adversary and show significant performance improvements
over strong baselines in our evaluations.
Chapter 7
Conclusion
To conclude the dissertation, we summarize the contributions and outline avenues for
future work.
7.1 Conclusion
Machine learning systems in the cloud face two main challenges: the incidence of
stragglers and protecting the privacy of user data. This dissertation proposes techniques
to build straggler-resilient and privacy-preserving machine learning systems in the cloud.
It builds on existing ideas from coded computing and secure enclave execution and sig-
nificantly extends them. The S²C² technique distributes the coded data to compute nodes,
and during runtime, adaptively adjusts the computation work per node. Thereby it signif-
icantly reduces the total execution time of several applications. While running machine
learning training and graph processing applications, we demonstrate up to 39% reduction
in execution time.
Collage inference, the second technique proposed in this dissertation, uses novel
collage-cnn models to recover from compute nodes that slow down or stop during runtime.
Collage-cnn models provide a good tradeoff between accuracy, resource cost, and tail
latency. When the collage-cnn models are deployed in the cloud, we demonstrate that the
99th percentile latency can be reduced by 2x. Collage inference is one of the first techniques
that extends coded computing to non-linear computations like deep learning.
The third technique, Origami inference, leverages hardware enclaves to protect user
data privacy. Applying Origami inference, we bring the performance of running image
classification using an enclave close to executing it outside of the enclave in a CPU
or a GPU. Origami inference achieves this by combining model partitioning with the
blinding of data inside the hardware enclaves. We demonstrate the privacy of Origami
inference with a conditional GAN adversary using the structural similarity index as a
metric. Overall, Origami can increase performance by 1.4x over strong baselines.
7.2 Future work
There are many avenues for future work, a few of which are described below.
7.2.1 Coded computation at parameter servers in distributed training
As discussed before, a widely used approach to accelerate distributed training is to
distribute the gradient computations on multiple machines using data-parallel stochastic
gradient descent (SGD) [32, 50, 68, 69, 79, 80, 122, 123]. Each worker node computes the
gradients df/dx for the set of inputs it is training on and then sends these gradients to a
parameter server that aggregates the gradients and broadcasts the new parameters to each
worker node before they start the next iteration. However, this gradient aggregation can
dramatically hinder the scalability of such systems, for it incurs a significant communication
bottleneck: a single parameter server may need to receive the gradient matrix G from
multiple workers and send the matrix M^{t+1} of model parameters to all workers (100s
of MB of data) every iteration.
Parallel aggregation aims at minimizing communication congestion by shifting from
centralized aggregation at one parameter server to parallel aggregation at multiple pa-
rameter servers. One natural approach is to vertically and evenly divide the data matrix
M into N sub-matrices, each of which is stored on one parameter server; a simple example
is one row of the matrix at each parameter server. Each worker node may split the gradient
matrix G into N chunks (for example, one row of the matrix per chunk) and transmit
chunk G_i to parameter server S_i. Then, when server i receives the gradient chunk G_i
from all worker nodes, it computes the partial set of parameters M_i^{t+1} = M_i^t - η G_i^t
(where η is the learning rate),
which it then broadcasts to all worker nodes. Each worker then
vertically concatenates the returned matrices from all the parameter servers to obtain the
final result. The difference between central and distributed parameter servers is illus-
trated in Figure 7.1. Since each worker relies on successfully retrieving the task results
from all N servers, the distributed parameter server approach has a major drawback that
once any of the parameter servers runs slow, the next iteration at every worker gets stuck.
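A minimal sketch of this uncoded distributed aggregation, assuming N row-wise chunks and a single learning rate; communication is elided and all names are illustrative.

import numpy as np

def distributed_ps_step(M_chunks, worker_grads, lr):
    # M_chunks[i] lives on parameter server i; worker_grads[w][i] is chunk i of worker w's G.
    N = len(M_chunks)
    new_chunks = []
    for i in range(N):
        G_i = sum(g[i] for g in worker_grads)       # aggregation at server i
        new_chunks.append(M_chunks[i] - lr * G_i)   # M_i^{t+1} = M_i^t - lr * G_i^t
    return np.vstack(new_chunks)                    # concatenation done at each worker

# Toy usage with N = 2 servers and 3 workers.
M_chunks = np.array_split(np.ones((4, 3)), 2)
worker_grads = [np.array_split(np.full((4, 3), 0.1), 2) for _ in range(3)]
print(distributed_ps_step(M_chunks, worker_grads, lr=0.01))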
CAPS: One future research topic could be exploring coded aggregation through pa-
rameter servers (CAPS). CAPS can deal with slow or straggler parameter servers by
optimally creating redundant computation tasks. As shown in Fig. 7.2, the coded com-
puting scheme vertically partitions the data matrix M into 2 sub-matrices M_1 and M_2,
and creates one redundant task by summing M_1 and M_2. Then M_1, M_2, and M_1 + M_2
are stored on parameter servers 1, 2, and 3, respectively. Each worker node also splits the
G matrix at the end of each mini-batch into the pieces G_1, G_2, and G_1 + G_2 and sends
these sub-matrices to the three respective parameter servers. In the case of Fig. 7.2, the
final result at each worker is obtained once it receives the task results from any 2 out of
the three parameter servers, without needing to wait for the slow/straggler parameter
server. For example, assume parameter server 2 is a straggler, and the worker node only
collects results from servers 1 and 3. Then the worker node can recover M_2^{next} by
subtracting the computed M_1^{next} from the computed (M_1 + M_2)^{next} returned by
parameter server 3.
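The sketch below mirrors the example in Fig. 7.2, assuming a single aggregated gradient per chunk and a learning rate lr; names and structure are illustrative rather than an actual CAPS implementation.

import numpy as np

def caps_recover(M1, M2, G1, G2, lr):
    s1_result = M1 - lr * G1                  # returned by parameter server 1
    s3_result = (M1 + M2) - lr * (G1 + G2)    # returned by parameter server 3 (coded chunk)
    M1_next = s1_result
    M2_next = s3_result - s1_result           # equals M2 - lr*G2, without waiting on server 2
    return np.vstack([M1_next, M2_next])      # full next parameter matrix at the worker

# Toy usage: server 2 is the straggler and its chunk is reconstructed exactly.
M1, M2 = np.ones((2, 3)), 2 * np.ones((2, 3))
G1, G2 = 0.1 * np.ones((2, 3)), 0.2 * np.ones((2, 3))
print(caps_recover(M1, M2, G1, G2, lr=0.5))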
While CAPS was described in the context of a particular parameter update mechanism
that uses a single learning rate, it can also be equally applied to other parameter update
mechanisms, such as Nesterov momentum and other momentum-based updates. As an
example, consider the update rule for Nesterov momentum: M^{t+1} = M^t + μ v^{t-1} - (1 + μ)(η G^t),
where μ is the momentum coefficient, v^{t-1} is the velocity from the previous iteration,
and η is the learning rate. This employs matrix-matrix subtraction, and extending CAPS
to Nesterov momentum is straightforward.
Figure 7.1: An illustration of using a centralized parameter server (left), where all the workers send their gradient matrix G to a single parameter server. In this case the parameter server becomes a central bottleneck (as well as a single point of failure). The figure on the right shows a distributed parameter server with two servers: each worker node sends the first half of its gradient, G_1, to the first parameter server and the second half, G_2, to the second parameter server. Each parameter server computes one half of the new parameters, M_1^{next} and M_2^{next}, and the workers concatenate the two halves to create the full parameter matrix M^{next}.
Figure 7.2: An illustration of the CAPS scheme with three parameter servers. Each worker creates a special encoded gradient matrix (G_1^i + G_2^i) to communicate with server 3. In this example, if parameter server 2 is a straggler, then the results from server 1 and server 3 are used to reconstruct the next parameter matrix M at each worker node.
7.2.2 Coded computation and hardware enclaves
In Origami inference, described in chapter 6, the overhead of running models in-
side enclaves is reduced by minimizing the number of computations that run inside the
enclaves. It is achieved by performing cryptographic blinding and unblinding on the
intermediate feature maps inside the enclaves. An alternative to cryptographic blinding
and unblinding is to use coded computing. Coded computing can further reduce the
number of computations that run inside the enclaves. An example use case would be to
apply MDS-coding on each of the intermediate feature maps before sending them out of
the enclaves to hardware accelerators. Exploring coded computing inside enclaves is an
exciting avenue for future work.
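As a concrete illustration of this idea, the sketch below applies a tiny (n = 3, k = 2) MDS code over the reals to a flattened feature map, so the original chunks can be recovered from any 2 of the 3 coded chunks. It is only a sketch of the encoding and decoding under the stated assumptions, not a full enclave integration.

import numpy as np

COEFFS = (0.0, 1.0, 2.0)   # distinct evaluation points => any 2 of the 3 chunks suffice

def mds_encode(feature_map: np.ndarray):
    # Assumes the flattened feature map splits into two equal halves F1 and F2.
    F1, F2 = np.split(feature_map.ravel(), 2)
    return [F1 + a * F2 for a in COEFFS]           # coded chunk i = F1 + a_i * F2

def mds_decode(chunks, coeffs):
    # Recover F1 and F2 from any two coded chunks by solving a 2x2 linear system.
    A = np.array([[1.0, coeffs[0]], [1.0, coeffs[1]]])
    F = np.linalg.solve(A, np.vstack(chunks))      # rows are F1 and F2
    return np.concatenate([F[0], F[1]])

# Toy usage: decode from chunks 0 and 2 (chunk 1 is assumed lost or slow).
fmap = np.arange(8.0)
coded = mds_encode(fmap)
print(np.allclose(mds_decode([coded[0], coded[2]], (COEFFS[0], COEFFS[2])), fmap))  # True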
7.2.3 Adding theoretical guarantees to Origami inference
Origami inference, described in chapter 6, uses the empirical observation that c-GAN
adversary models cannot reconstruct the input images. One way to strengthen this
empirical privacy guarantee is by adding differential privacy mechanisms to Origami
inference. Differential privacy [29] aims at providing provable privacy guarantees during
the analysis of sensitive data. ε, referred to as the privacy budget, is the parameter that
controls the privacy guarantee of a differentially private mechanism; ε can be any value >= 0.
In general, a smaller value of ε corresponds to a stronger privacy guarantee and a lower
utility value. In differential privacy, the tradeoff is between the privacy of the data and
its utility.
Figure 7.3: The tradeoff of top-1 classification accuracy (%) vs. the privacy budget ε (ε ranging from 0 to 180; accuracy ranges from 68.02% to 72.35%).
Recent works have proposed differential privacy mechanisms for deep learning training
and inference [5, 108]. Utility in deep learning refers to the prediction accuracy of the
models. We adapted Algorithm 1 of ARDEN [108] to measure the privacy guarantees
provided by Origami inference. Our differentially private mechanism performs the
following operations on the intermediate feature maps at the partition layer of the CNN
model before they are outsourced from the enclave (a minimal sketch follows the list
below).
• Norm bound the intermediate feature maps using the global sensitivity measured
during training.
• Add random noise sampled from a Laplacian distribution on top of the bounded data.
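A minimal sketch of this mechanism, assuming L2-norm clipping to the measured global sensitivity and the standard Laplace mechanism with scale sensitivity/ε; the exact bounding and noise calibration used in ARDEN [108] may differ.

import numpy as np

def dp_release(feature_map: np.ndarray, global_sensitivity: float, epsilon: float,
               rng=np.random.default_rng(0)) -> np.ndarray:
    # 1) Norm-bound the partition-layer feature map to the global sensitivity.
    norm = np.linalg.norm(feature_map)
    clipped = feature_map * min(1.0, global_sensitivity / (norm + 1e-12))
    # 2) Add element-wise Laplacian noise; a smaller epsilon means a larger noise scale.
    noise = rng.laplace(0.0, global_sensitivity / epsilon, size=feature_map.shape)
    return clipped + noise   # this noisy feature map is what leaves the enclave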
A smaller value of ε corresponds to a larger amount of added noise. A larger amount of
noise, in turn, can reduce the classification accuracy of the CNN models.
We implemented this mechanism as a preliminary study on top of Origami inference
to measure the tradeoff between added noise and the classification accuracy of VGG-16.
The tradeoff is plotted in figure 7.3. There is no loss in accuracy despite the added noise
until ε = 105. Reducing ε further adds more noise to the intermediate features, and the
classification accuracy of the model decreases. Hence, Origami inference is differentially
private without any loss in accuracy (utility) for ε = 105. Retraining the VGG-16 model
can help further reduce ε without losing accuracy. Instead of retraining the whole model,
retraining only the top layers of the model that run outside the enclave should be
sufficient. This direction is another avenue for future work.
Reference List
[1] Digital ocean. https://www.digitalocean.com.
[2] CS Toronto datasets, 2000. http://www.cs.toronto.edu/~tsap/experiments/datasets/index.html.
[3] Apache Hadoop, 2014. http://hadoop.apache.org/.
[4] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey
Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al.
Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium
on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[5] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov,
Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings
of the 2016 ACM SIGSAC Conference on Computer and Communications Security,
CCS ’16, pages 308–318, New York, NY, USA, 2016. ACM.
[6] Bilge Acun, Abhishek Gupta, Nikhil Jain, Akhil Langer, Harshitha Menon,
Eric Mikida, Xiang Ni, Michael Robson, Yanhua Sun, Ehsan Totoni, Lukasz
Wesolowski, and Laxmikant Kale. Parallel programming with migratable objects:
Charm++ in practice. SC, 2014.
[7] Tiago Alves. Trustzone : Integrated hardware and software security. 2004.
[8] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective
straggler mitigation: Attack of the clones. In Proceedings of the 10th USENIX
Conference on Networked Systems Design and Implementation, nsdi’13, pages
185–198, Berkeley, CA, USA, 2013. USENIX Association.
[9] Ganesh Ananthanarayanan, Srikanth Kandula, Albert G Greenberg, Ion Stoica,
Yi Lu, Bikas Saha, and Edward Harris. Reining in the outliers in map-reduce
clusters using mantri. In OSDI, volume 10, page 24, 2010.
[10] Ahmad Al Badawi, Jin Chao, Jie Lin, Chan Fook Mun, Sim Jun Jie, Benjamin
Hong Meng Tan, Xiao Nan, Khin Mi Mi Aung, and Vijay Ramaseshan Chan-
drasekhar. The alexnet moment for homomorphic encryption: Hcnn, the first
homomorphic CNN on encrypted data with gpus. CoRR, abs/1811.00778, 2018.
[11] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank
Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx.
Foreshadow: Extracting the keys to the intel SGX kingdom with transient out-
of-order execution. In 27th USENIX Security Symposium (USENIX Security 18),
page 991–1008, Baltimore, MD, August 2018. USENIX Association.
[12] Stuart K. Card, George G. Robertson, and Jock D. Mackinlay. The information
visualizer, an information workspace. In Proceedings of the SIGCHI Conference
on Human Factors in Computing Systems, CHI ’91, pages 181–186, New York,
NY , USA, 1991. ACM.
[13] Manmohan Chaubey and Erik Saule. Replicated data placement for uncertain
scheduling. In Proceedings of the 2015 IEEE International Parallel and Dis-
tributed Processing Symposium Workshop, IPDPSW ’15, pages 464–472, Wash-
ington, DC, USA, 2015. IEEE Computer Society.
[14] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and
Ten H. Lai. Sgxpectre: Stealing intel secrets from sgx enclaves via speculative
execution. 2019 IEEE European Symposium on Security and Privacy, Jun 2019.
[15] Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz.
Revisiting distributed synchronous sgd. arXiv preprint arXiv:1604.00981, 2016.
[16] Byung-Gon Chun, Sunghwan Ihm, Petros Maniatis, Mayur Naik, and Ashwin
Patti. Clonecloud: Elastic execution between mobile device and cloud. In Pro-
ceedings of the Sixth Conference on Computer Systems, EuroSys ’11, pages 301–
314, New York, NY , USA, 2011. ACM.
[17] Victor Costan and Srinivas Devadas. Intel sgx explained. IACR Cryptology ePrint
Archive, 2016:86, 2016.
[18] Victor Costan, Ilia Lebedev, and Srinivas Devadas. Sanctum: Minimal hardware
extensions for strong software isolation. In Proceedings of the 25th USENIX Con-
ference on Security Symposium, SEC’16, pages 857–874, Berkeley, CA, USA,
2016. USENIX Association.
[19] Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J. Franklin, Joseph E. Gon-
zalez, and Ion Stoica. Clipper: A low-latency online prediction serving system.
In 14th USENIX Symposium on Networked Systems Design and Implementation
(NSDI17), pages 613–627, Boston, MA, 2017. USENIX Association.
[20] Eduardo Cuervo, Aruna Balasubramanian, Dae-ki Cho, Alec Wolman, Stefan
Saroiu, Ranveer Chandra, and Paramvir Bahl. Maui: Making smartphones last
longer with code offload. In Proceedings of the 8th International Conference
on Mobile Systems, Applications, and Services, MobiSys ’10, pages 49–62, New
York, NY , USA, 2010. ACM.
[21] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-
based fully convolutional networks. In Advances in neural information processing
systems, pages 379–387, 2016.
[22] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the
ACM, 56:74–80, 2013.
[23] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on
large clusters. Communications of the ACM, 51(1):107–113, 2008.
[24] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-efficient and qos-
aware cluster management. In Proceedings of the 19th International Conference
on Architectural Support for Programming Languages and Operating Systems,
ASPLOS ’14, pages 127–144, New York, NY , USA, 2014. ACM.
[25] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet:
A large-scale hierarchical image database. In Proc. IEEE conf. Computer Vision
and Pattern Recognition (CVPR), pages 248–255, 2009.
[26] Peter A. Dinda. Online prediction of the running time of tasks. In Proceedings
of the 2001 ACM SIGMETRICS International Conference on Measurement and
Modeling of Computer Systems, SIGMETRICS ’01, pages 336–337, New York,
NY , USA, 2001. ACM.
[27] Alexey Dosovitskiy and Thomas Brox. Inverting convolutional networks with
convolutional networks. CoRR, abs/1506.02753, 2015.
[28] Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. Short-dot: Computing
large linear transforms distributedly using coded short dot products. In Advances
In Neural Information Processing Systems, pages 2092–2100, 2016.
[29] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential
privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–
407, 2014.
[30] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg.
Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659,
2017.
[31] Kristen Gardner, Samuel Zbarsky, Sherwin Doroudi, Mor Harchol-Balter, and Esa
Hyytia. Reducing latency via redundant requests: Exact analysis. In Proceedings
of the 2015 ACM SIGMETRICS International Conference on Measurement and
Modeling of Computer Systems, SIGMETRICS ’15, pages 347–360, New York,
NY , USA, 2015. ACM.
[32] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-scale
matrix factorization with distributed stochastic gradient descent. In Proceedings
of the 17th ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 69–77. ACM, 2011.
[33] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference
on computer vision, pages 1440–1448, 2015.
[34] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature
hierarchies for accurate object detection and semantic segmentation. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition, pages
580–587, 2014.
[35] Inigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen. Ap-
proxhadoop: Bringing approximations to mapreduce frameworks. In Proceed-
ings of the Twentieth International Conference on Architectural Support for Pro-
gramming Languages and Operating Systems, ASPLOS ’15, pages 383–397, New
York, NY , USA, 2015. ACM.
[36] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adver-
sarial networks. ArXiv, abs/1406.2661, 2014.
[37] Mark S. Gordon, D. Anoushe, Jamshidi Scott, Mahlke Z. Morley, and Mao Xu
Chen. Comet: Code offload by migrating execution transparently. In Proc.
USENIX OSDI, 2012.
[38] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski,
Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large
minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677,
2017.
[39] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert N. M. Watson, An-
drew W. Moore, Steven Hand, and Jon Crowcroft. Queues don’t matter when you
can jump them! In Proceedings of the 12th USENIX Conference on Networked
Systems Design and Implementation, NSDI’15, page 1–14, USA, 2015. USENIX
Association.
[40] Zhongshu Gu, Heqing Huang, Jialong Zhang, Dong Su, Ankita Lamba, Dimitrios
Pendarakis, and Ian Molloy. Securing input data of deep learning inference sys-
tems via partitioned enclave execution. CoRR, abs/1807.00969, 2018.
[41] Vipul Gupta, Shusen Wang, Thomas Courtade, and Kannan Ramchandran.
Oversketch: Approximate matrix multiplication for the cloud. arXiv preprint
arXiv:1811.02653, 2018.
[42] Mingzhe Hao, Huaicheng Li, Michael Hao Tong, Chrisma Pakha, Riza O. Sum-
into, Cesar A. Stuardo, Andrew A. Chien, and Haryadi S. Gunawi. Mittos: Sup-
porting millisecond tail tolerance with fast rejecting slo-aware os interface. In
Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17,
page 168–183, New York, NY , USA, 2017. Association for Computing Machin-
ery.
[43] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 770–778, 2016.
[44] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2016.
[45] Ehsan Hesamifard, Hassan Takabi, and Mehdi Ghasemi. Cryptodl: Deep neural
networks over encrypted data. CoRR, abs/1711.05189, 2017.
[46] Raymond Hill. A First Course in Coding Theory. Oxford Applied Linguistics.
Clarendon Press, 1986.
[47] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural
Comput., 9(8):1735–1780, November 1997.
[48] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun
Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Effi-
cient convolutional neural networks for mobile vision applications. arXiv preprint
arXiv:1704.04861, 2017.
[49] Chang-Hong Hsu, Yunqi Zhang, Michael A. Laurenzano, David Meisner, Thomas
Wenisch, Ronald G. Dreslinski, Jason Mars, and Lingjia Tang. Reining in long
tails in warehouse-scale computers with quick voltage boosting using adrenaline.
ACM Trans. Comput. Syst., 35(1):2:1–2:33, March 2017.
[50] F. N. Iandola, K. Ashraf, M. W. Moskewicz, and K. Keutzer. Firecaffe: near-linear
acceleration of deep neural network training on compute clusters. In CVPR, 2016.
[51] Intel. Intel software guard extensions sdk for linux. 2019.
[52] Calin Iorgulescu, Reza Azimi, Youngjin Kwon, Sameh Elnikety, Manoj Sya-
mala, Vivek Narasayya, Herodotos Herodotou, Paulo Tomita, Alex Chen, Jack
Zhang, and Junhua Wang. Perfiso: Performance isolation for commercial latency-
sensitive services. In 2018 USENIX Annual Technical Conference (USENIX ATC
18), pages 519–532, Boston, MA, July 2018. USENIX Association.
[53] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal,
Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick
Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Da-
ley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra
Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg,
John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski,
Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Ku-
mar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan
Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Ma-
hony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix,
Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan
Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov,
Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gre-
gory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Wal-
ter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance
analysis of a tensor processing unit. In Proceedings of the 44th Annual Interna-
tional Symposium on Computer Architecture, ISCA ’17, page 1–12, New York,
NY , USA, 2017. Association for Computing Machinery.
[54] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. Gazelle: A
low latency framework for secure neural network inference. In Proceedings of
the 27th USENIX Conference on Security Symposium, SEC’18, pages 1651–1668,
Berkeley, CA, USA, 2018. USENIX Association.
[55] Kostis Kaffes, Timothy Chong, Jack Tigar Humphries, Adam Belay, David
Mazières, and Christos Kozyrakis. Shinjuku: Preemptive scheduling for second-
scale tail latency. In 16th USENIX Symposium on Networked Systems Design
and Implementation (NSDI 19), pages 345–360, Boston, MA, February 2019.
USENIX Association.
[56] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Ja-
son Mars, and Lingjia Tang. Neurosurgeon: Collaborative intelligence between
the cloud and mobile edge. In Proceedings of the Twenty-Second International
Conference on Architectural Support for Programming Languages and Operating
Systems, ASPLOS ’17, pages 615–629, New York, NY , USA, 2017. ACM.
[57] Harshad Kasture and Daniel Sanchez. Ubik: Efficient cache sharing with strict
qos for latency-critical workloads. In Proceedings of the 19th International Con-
ference on Architectural Support for Programming Languages and Operating Sys-
tems, ASPLOS ’14, pages 729–742, New York, NY , USA, 2014. ACM.
[58] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner
Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, and et al.
Spectre attacks: Exploiting speculative execution. 2019 IEEE Symposium on Se-
curity and Privacy (SP), May 2019.
[59] J. Kosaian, K. V. Rashmi, and S. Venkataraman. Learning a Code: Machine
Learning for Approximate Non-Linear Coded Computation. ArXiv e-prints, June
2018.
[60] Jack Kosaian, K. V. Rashmi, and Shivaram Venkataraman. Parity models:
A general framework for coding-based resilience in ML inference. CoRR,
abs/1905.00863, 2019.
[61] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classifica-
tion with deep convolutional neural networks. In Advances in neural information
processing systems, pages 1097–1105, 2012.
[62] Sanjeev Krishnan and Laxmikant Kale. Charm++: A portable concurrent object ori-
ented system based on c++. In Proceedings of OOPSLA ’93, pages 91–108. ACM
Press, 1993.
[63] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based
learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–
2324, 1998.
[64] Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and
Kannan Ramchandran. Speeding up distributed machine learning using codes. In
2016 IEEE International Symposium on Information Theory (ISIT), pages 1143–
1147, July 2016.
[65] Kangwook Lee, Ramtin Pedarsani, and Kannan Ramchandran. On schedul-
ing redundant requests with cancellation overheads. IEEE/ACM Trans. Netw.,
25(2):1279–1290, April 2017.
[66] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and
sub-millisecond quality-of-service. In Proceedings of the Ninth European Confer-
ence on Computer Systems, EuroSys ’14, pages 4:1–4:14, New York, NY , USA,
2014. ACM.
[67] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. Tales of the
tail: Hardware, os, and application-level sources of tail latency. In Proceedings
of the ACM Symposium on Cloud Computing, SOCC ’14, pages 9:1–9:14, New
York, NY , USA, 2014. ACM.
[68] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed,
Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling
distributed machine learning with the parameter server. In OSDI, 2014.
[69] Mu Li, David G. Andersen, Alexander J. Smola, and Kai Yu. Communication
efficient distributed machine learning with the parameter server. In NIPS, 2014.
[70] Songze Li, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. A uni-
fied coding framework for distributed computing with straggling servers. e-print
arXiv:1609.01690, Sept. 2016. A shorter version to appear in IEEE NetCod 2016.
[71] M. Lichman. UCI machine learning repository, 2013.
[72] Zinan Lin, Ashish Khetan, Giulia C. Fanti, and Sewoong Oh. Pacgan: The power
of two samples in generative adversarial networks. CoRR, abs/1712.04086, 2017.
[73] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas,
Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg.
Meltdown, 2018.
[74] Longfei Liu, Sheng Li, Yisong Chen, and Guoping Wang. X-gans: Image recon-
struction made easy for extreme cases. CoRR, abs/1808.04432, 2018.
[75] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed,
Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In
European conference on computer vision, pages 21–37. Springer, 2016.
[76] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos
Kozyrakis. Towards energy proportionality for large-scale latency-critical work-
loads. In Proceeding of the 41st Annual International Symposium on Computer
Architecuture, ISCA ’14, pages 301–312, Piscataway, NJ, USA, 2014. IEEE Press.
[77] Qinyi Luo, Jinkun Lin, Youwei Zhuo, and Xuehai Qian. Hop: Heterogeneity-
aware decentralized training. In Proceedings of the Twenty-Fourth International
Conference on Architectural Support for Programming Languages and Operating
Systems, pages 893–907. ACM, 2019.
[78] A. Mahendran and A. Vedaldi. Understanding deep image representations by in-
verting them. In 2015 IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 5188–5196, June 2015.
[79] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies
for the structured perceptron. In NAACL, pages 456–464. Association for Com-
putational Linguistics, 2010.
[80] Ryan Mcdonald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon S
Mann. Efficient large-scale distributed training of conditional maximum entropy
models. In NIPS, pages 1231–1239, 2009.
[81] Frank McKeen, Ilya Alexandrovich, Alex Berenzon, Carlos V . Rozas, Hisham
Shafi, Vedvyas Shanbhogue, and Uday R. Savagaonkar. Innovative instructions
and software model for isolated execution. In Proceedings of the 2nd Interna-
tional Workshop on Hardware and Architectural Support for Security and Privacy,
HASP ’13, pages 10:1–10:1, New York, NY , USA, 2013. ACM.
[82] Fatemehsadat Mireshghallah, Mohammadkazem Taram, Prakash Ramrakhyani,
Dean Tullsen, and Hadi Esmaeilzadeh. Shredder: Learning noise distributions to
protect inference privacy, 2019.
[83] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets.
CoRR, abs/1411.1784, 2014.
[84] Krishna Narra, Zhifeng Lin, Ganesh Ananthnarayanan, Salman Avestimehr, and
Murali Annavaram. Collage inference: Using coded redundancy for lowering
latency variation in distributed image classification systems. To appear at the
International Conference on Distributed Computing Systems (ICDCS), 2020.
[85] Krishna Giri Narra, Zhifeng Lin, Mehrdad Kiamari, Salman Avestimehr, and Mu-
rali Annavaram. Slack squeeze coded computing for adaptive straggler mitigation.
In Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis, pages 1–16, 2019.
[86] Krishna Giri Narra, Zhifeng Lin, Yongqin Wang, Keshav Balasubramaniam, and
Murali Annavaram. Privacy-preserving inference in machine learning services
using trusted execution environments. arXiv preprint arXiv:1912.03485, 2019.
[87] Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. Hogwild!:
A lock-free approach to parallelizing stochastic gradient descent. In Proceedings
of the 24th International Conference on Neural Information Processing Systems,
NIPS’11, pages 693–701, USA, 2011. Curran Associates Inc.
[88] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. Pytorch,
2017.
[89] Le Phong, Yoshinori Aono, Takuya Hayashi, Lihua Wang, and Shiho Moriai.
Privacy-preserving deep learning via additively homomorphic encryption. IEEE
Transactions on Information Forensics and Security, PP:1–1, 12 2017.
[90] Moo-Ryong Ra, Anmol Sheth, Lily Mummert, Padmanabhan Pillai, David
Wetherall, and Ramesh Govindan. Odessa: Enabling interactive perception ap-
plications on mobile devices. In Proceedings of the 9th International Conference
on Mobile Systems, Applications, and Services, MobiSys ’11, pages 43–56, New
York, NY , USA, 2011. ACM.
[91] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look
once: Unified, real-time object detection. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 779–788, 2016.
[92] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv
preprint arXiv:1804.02767, 2018.
[93] Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, and Amir Salman
Avestimehr. Coded computation over heterogeneous clusters. In 2017 IEEE In-
ternational Symposium on Information Theory (ISIT), July 2017.
[94] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards
real-time object detection with region proposal networks. In Advances in neural
information processing systems, pages 91–99, 2015.
[95] Paul Renteln. Manifolds, Tensors, and Forms: An Introduction for Mathemati-
cians and Physicists. Cambridge University Press, 2013.
[96] Theo Ryffel, Edouard Dufour Sans, Romain Gay, Francis Bach, and David
Pointcheval. Partially encrypted machine learning using functional encryption.
CoRR, abs/1905.10214, 2019.
[97] Nihar B. Shah, Kangwook Lee, and Kannan Ramchandran. When do redundant
requests reduce latency? In 2013 51st Annual Allerton Conference on Communi-
cation, Control, and Computing (Allerton), pages 731–738, Oct 2013.
[98] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xi-
angyang Xue. Dsod: Learning deeply supervised object detectors from scratch.
In The IEEE International Conference on Computer Vision (ICCV), volume 3,
page 7, 2017.
[99] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. CoRR, abs/1409.1556, 2014.
[100] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[101] Pravendra Singh, Vinay Verma, Piyush Rai, and Vinay Namboodiri. Hetconv:
Heterogeneous kernel-based convolutions for deep cnns. 03 2019.
[102] Lalith Suresh, Marco Canini, Stefan Schmid, and Anja Feldmann. C3: Cutting
tail latency in cloud data stores via adaptive replica selection. In 12th USENIX
Symposium on Networked Systems Design and Implementation (NSDI 15), pages
513–527, Oakland, CA, May 2015. USENIX Association.
[103] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going
deeper with convolutions. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9, 2015.
[104] Rashish Tandon, Qi Lei, Alexandros G Dimakis, and Nikos Karampatziakis. Gra-
dient coding. arXiv preprint arXiv:1612.03301, 2016.
[105] Florian Tramèr and Dan Boneh. Slalom: Fast, verifiable and private execution of
neural networks in trusted hardware. ArXiv, abs/1806.03287, 2018.
[106] Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and
Ion Stoica. Ernest: Efficient performance prediction for large-scale advanced an-
alytics. In 13th USENIX Symposium on Networked Systems Design and Imple-
mentation (NSDI 16), pages 363–378, Santa Clara, CA, March 2016. USENIX
Association.
[107] Da Wang, Gauri Joshi, and Gregory Wornell. Efficient task replication for
fast response times in parallel computation. SIGMETRICS Perform. Eval. Rev.,
42(1):599–600, June 2014.
[108] Ji Wang, Jianguo Zhang, Weidong Bao, Xiaomin Zhu, Bokai Cao, and Philip S.
Yu. Not just privacy: Improving performance of private deep learning in mobile
cloud. CoRR, abs/1809.03428, 2018.
[109] Rich Wolski, Neil Spring, and Chris Peterson. Implementing a performance fore-
casting system for metacomputing: The network weather service. In Proceedings
of the 1997 ACM/IEEE Conference on Supercomputing, SC ’97, pages 1–19, New
York, NY , USA, 1997. ACM.
[110] Yunjing Xu, Zachary Musgrave, Brian Noble, and Michael Bailey. Bobtail:
Avoiding long tails in the cloud. In Proceedings of the 10th USENIX Confer-
ence on Networked Systems Design and Implementation, nsdi’13, page 329–342,
USA, 2013. USENIX Association.
[111] M. Yan, J. Choi, D. Skarlatos, A. Morrison, C. Fletcher, and J. Torrellas. In-
visispec: Making speculative execution invisible in the cache hierarchy. In 2018
51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO),
pages 428–441, Oct 2018.
[112] Yaoqing Yang, Pulkit Grover, and Soummya Kar. Coding method for parallel iter-
ative linear solver. To appear in Advances In Neural Information Processing Systems
(NIPS), 2017.
[113] Jiyong Yu, Mengjia Yan, Artem Khyzha, Adam Morrison, Josep Torrellas, and
Christopher W. Fletcher. Speculative taint tracking (stt): A comprehensive pro-
tection for speculatively accessed data. In Proceedings of the 52Nd Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO ’52, pages
954–968, New York, NY , USA, 2019. ACM.
[114] Qian Yu, Mohammad Maddah-Ali, and A. Salman Avestimehr. Polynomial codes:
an optimal design for high-dimensional coded matrix multiplication. To appear in
Advances In Neural Information Processing Systems (NIPS), 2017.
[115] Qian Yu, Netanel Raviv, Jinhyun So, and A Salman Avestimehr. Lagrange coded
computing: Optimal design for resiliency, security and privacy. arXiv preprint
arXiv:1806.00939, 2018.
[116] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint
arXiv:1605.07146, 2016.
[117] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica.
Improving mapreduce performance in heterogeneous environments. In Proceed-
ings of the 8th USENIX Conference on Operating Systems Design and Implemen-
tation, OSDI’08, pages 29–42, Berkeley, CA, USA, 2008. USENIX Association.
[118] Yunqi Zhang, David Meisner, Jason Mars, and Lingjia Tang. Treadmill: Attribut-
ing the source of tail latency through precise load testing and statistical inference.
In Proceedings of the 43rd International Symposium on Computer Architecture,
ISCA ’16, pages 456–468, Piscataway, NJ, USA, 2016. IEEE Press.
[119] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality
assessment: from error visibility to structural similarity. IEEE Transactions on
Image Processing, 13(4):600–612, April 2004.
[120] Jingge Zhu, Ye Pu, Vipul Gupta, Claire Tomlin, and Kannan Ramchandran. A
sequential approximation framework for coded distributed optimization. CoRR,
abs/1710.09001, 2017.
[121] Timothy Zhu, Michael A. Kozuch, and Mor Harchol-Balter. Workloadcompactor:
Reducing datacenter cost while providing tail latency slo guarantees. In Proceed-
ings of the 2017 Symposium on Cloud Computing, SoCC ’17, pages 598–610,
New York, NY , USA, 2017. ACM.
[122] Yong Zhuang, Wei-Sheng Chin, Yu-Chin Juan, and Chih-Jen Lin. A fast parallel
sgd for matrix factorization in shared memory systems. In Proceedings of the 7th
ACM conference on Recommender systems, pages 249–256. ACM, 2013.
[123] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized
stochastic gradient descent. In NIPS, pages 2595–2603, 2010.