Edge-Cloud Collaboration for Enhanced Artificial Intelligence
by
Amir Erfan Eshratifar
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical and Computer Engineering)
December 2021
Copyright 2021 Amir Erfan Eshratifar
Acknowledgements
I would like to thank the University of Southern California for providing me a space to discover my research interests. I am deeply grateful for the valuable support of my defense chair, Prof. Richard Leahy, and the helpful comments of my qualification exam committee members: Prof. Pierluigi Nuzzo, Prof. Leana Golubchik, Prof. Paul Bogdan, and Prof. Keith Chugg.
I would also like to thank David Eigen, Michael Gormish, and Matthew Zeiler for hosting me at Clarifai for very fruitful internship experiences.
Thanks also to all the colleagues I worked with: Amirhossein Esmaili, Mohammad Saeed Abrishami, and Mahdi Nazemi.
Finally, I would like to thank my parents, family, and friends in both the USA and Iran.
Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction

Chapter 2: JointDNN: An Efficient Training and Inference Engine for Intelligent Mobile Cloud Computing Services
2.1 Introduction
2.2 Problem definition and modeling
2.2.1 Energy and Latency Profiling
2.2.2 JointDNN Graph Model
2.2.3 ILP Setup
2.2.3.1 Performance Efficient Computation Offloading ILP Setup for Inference
2.2.3.2 Energy Efficient Computation Offloading ILP Setup for Inference
2.2.3.3 Performance Efficient Computation Offloading ILP Setup for Training
2.2.3.4 Energy Efficient Computation Offloading ILP Setup for Training
2.2.3.5 Scenarios
2.3 Evaluation
2.3.1 Deep Architecture Benchmarks
2.3.2 Mobile and Server Setup
2.3.3 Communication Parameters
2.4 Results
2.4.1 Communication Dominance
2.4.2 Layer Compression
2.5 Related work and comparison

Chapter 3: Towards Collaborative Intelligence Friendly Architectures for Deep Learning
3.1 Introduction
3.2 Proposed Method
3.2.1 Butterfly Unit
3.2.2 Partitioning Algorithm
3.3 Evaluation
3.3.1 Experimental Setup
3.3.2 Latency and Energy Improvements
3.3.3 Server Load Variations
3.3.4 Comparison to Other Feature Compression Techniques
3.4 Conclusion and Future Work

Chapter 4: BottleNet: A Deep Learning Architecture for Intelligent Mobile Cloud Computing Services
4.1 Introduction
4.2 Proposed Method
4.2.1 Bottleneck Unit
4.2.2 Non-differentiable Lossy Compression Aware Training
4.2.3 Architecture Selection
4.3 Evaluation
4.3.1 Experimental Setup
4.3.2 Latency and Energy Improvements
4.3.3 The Efficacy of Compression-aware Training
4.3.4 Comparison to Other Feature Compression Techniques
4.4 Conclusion

Chapter 5: Runtime Deep Model Multiplexing for Reduced Latency and Energy Consumption Inference
5.1 Introduction
5.2 Methodology
5.2.1 Contrastive Loss Function
5.2.2 Learning the Model Multiplexer
5.2.3 Multiplexing process
5.3 Experiments and Results
5.3.1 Experimental Setup
5.3.2 Results
5.4 Conclusion and Future Work

Chapter 6: Information Obfuscation for Privacy-Preserving Data Valuation
6.1 Introduction
6.1.1 Related work
6.1.1.1 Data Valuation
6.1.1.2 Data Obfuscation
6.2 Methodology
6.2.1 Framework Overview
6.2.1.1 Training
6.2.1.2 Inference
6.2.1.3 The adversary model
6.2.2 Data Valuation Training Overview
6.2.3 Data Valuation Training Formulations
6.2.4 Information Obfuscation Problem Formulation
6.2.4.1 Obfuscation for Linear Layers
6.3 Experiments
6.3.1 Removing High/Low Quality Samples
6.3.2 Corrupted Sample Discovery
6.3.3 Comparison to other obfuscation methods
6.4 Conclusions and Future Work

Chapter 7: Conclusion and Future Works

References
List of Tables

2.1 Parameter Definition of Graph Model
2.2 Benchmark Specifications
2.3 Mobile networks specifications in the U.S.
2.4 Workload reduction of the cloud server in different mobile networks
3.1 Mobile device specifications
3.2 Server platform specifications
3.3 Wireless networks parameters
3.4 The end-to-end latency, mobile energy consumption, and offloaded data size for different partition points in ResNet-50 using the proposed method
3.5 Comparison of the proposed method with mobile-only and cloud-only approaches
4.1 Wireless networks parameters
4.2 Comparison of mobile-only and cloud-only approaches with the proposed method (BottleNet)
5.1 The latency, percentage of local inference, and accuracy of mobile-only, cloud-only, and hybrid (multiplexing) methods. mobilenet_v2 and resnext101_32x8d are used as the mobile and cloud deep models, respectively.
5.2 The FLOPs, latency, and accuracy of six state-of-the-art CNN models. The Called column shows the percentage of inputs which are decided to be predicted by the corresponding model.
List of Figures

2.1 Different computation partitioning methods. (a) Mobile only: computation is completely done on the mobile device. (b) Cloud only: raw input data is sent to the cloud server, computations are done on the cloud server, and results are sent back to the mobile device. (c) JointDNN: DNN architecture is partitioned at the granularity of layers; each layer can be computed either on cloud or mobile.
2.2 Typical layer size in (a) Discriminative (b) Autoencoder (c) Generative models.
2.3 Latency of grouped and separated execution of convolution operator.
2.4 Computation model in linear topology.
2.5 Graph representation of mobile cloud computing optimal scheduling problem for linear topology.
2.6 JointDNN graph model. The shortest path from S to F determines the schedule of executing the layers on mobile or cloud.
2.7 (a) A residual building block in DNNs. (b) Transformation of a residual building block to be able to be used in JointDNN's shortest-path-based scheduler.
2.8 Latency and energy improvements for different batch sizes during inference over the base case of mobile-only and cloud-only approaches.
2.9 Latency and energy improvements for different batch sizes during training over the base case of mobile-only and cloud-only approaches.
2.10 (a) Latency of one epoch of online training using the JointDNN algorithm vs. percentage of updated weights. (b) Latency of mobile-only inference vs. batch size.
2.11 Interesting schedules of execution for three types of DNN architectures while mobile/cloud are allowed to use up to half of their computing resources.
2.12 (a) Execution time of AlexNet optimized for performance. (b) Mobile energy consumption of AlexNet optimized for energy. (c) Data size of the layers in AlexNet and the scheduled computation, where the first nine layers are computed on the mobile and the rest on the cloud, which is the optimal solution w.r.t. both performance and energy.
2.13 Layer output after passing the input image through convolution, normalization, and ReLU [39] layers. Channels preserve the general structure of the input image and a large ratio of the output data is black (zero) due to the existence of ReLU. Tiling is used to put all 96 channels together.
2.14 Compression Ratio (CR) and ratio of zero-valued neurons (ZR) for different layers of (a) AlexNet and (b) VGG16.
3.1 The butterfly unit. It takes a tensor with D channels and shrinks it into a tensor with D_r channels using the reduction unit, where D_r < D. It outputs a tensor with the same dimension as the input using the restoration unit.
3.2 The butterfly unit architecture. It consists of the reduction unit on the mobile side, and the restoration unit on the cloud side.
3.3 The proposed method overview. A shallow model and the reduction unit on the mobile device extract a dense representation of the input, which is uploaded to the cloud. Then, on the cloud, after applying the restoration function on the dense representation, the rest of the inference procedure is followed.
3.4 ResNet-50 architecture and its 16 residual blocks. The solid lines represent identity shortcuts, and the dashed lines represent projection shortcuts.
3.5 Input image size of the model and the size of the output feature tensor of each residual block in ResNet-50.
3.6 Residual block architecture with (a) identity and (b) projection shortcut.
3.7 Accuracy levels when choosing different D_r values in the butterfly unit for all of its 16 possible locations in ResNet-50 (i.e., after RB1 to RB16). In this figure, only the results corresponding to D_r values of 1-5 are presented. However, RB14, RB15, and RB16 require a minimum D_r of 10 to maintain the accuracy of the proposed method at or above 74% (less than 2% accuracy loss).
4.1 Overview of the proposed method.
4.2 Learnable dimension reduction and restoration units along the (a) channel and (b) spatial dimension of features.
4.3 The bottleneck unit embedded with a non-differentiable lossy compression (e.g., JPEG).
4.4 Embedding non-differentiable compression (e.g., JPEG) in the DNN architecture. We approximate the pair of JPEG compressor and decompressor units by the identity function to make the model differentiable in backpropagation.
4.5 Input image size and the size of the output feature tensor of the convolutional layers in (a) VGG-19 and (b) ResNet-50 models.
4.6 The comparison between the accuracy loss of the proposed compression-aware training and a naively trained model for different compressed feature size values, when the bottleneck unit is placed after RB1 in ResNet-50.
5.1 The percentage of ImageNet's [78] validation set images that can be predicted correctly by a certain model but cannot be correctly predicted by another model. As an example, alexnet, our worst performing model, can correctly predict 2.8% of the inputs that the largest model, resnext101_32x8d, cannot.
5.2 Deep learning-powered mobile application deployment options. (a) and (b) show the status quo approaches of cloud-only and mobile-only deployment. In (c) a model multiplexer is called on the input, which decides whether the input can be classified correctly on-device or should be offloaded to the cloud due to its complexity. (d) demonstrates multiplexing among a set of models (more than two) in cloud intelligent service providers.
5.3 The t-SNE visualization of the feature space of our benchmark models on the validation set of the ImageNet dataset. The feature spaces of correct and incorrect predictions are highly overlapped. This overlap shows that predicting whether the prediction of a certain model will be correct is a hard task.
5.4 The target embedding space. The feature maps of the inputs are distributed in the space such that when a group of models can all predict the label of an input correctly, their embeddings are close to each other. Also, when a group of models can predict the label correctly while another group of models cannot, the distance between their embeddings is increased. This will lead to a feature space similar to a Venn diagram. For instance, the red region on top shows the samples which can only be predicted correctly by model 1.
5.5 Model multiplexer training procedure and its architecture. In the first step, the models we are multiplexing from are trained using the contrastive loss. The contrastive loss allows the learned embeddings to be grouped into regions where each region determines the expertise domain of a subset of models. In the second step, we distill the learned embeddings from the first step into the multiplexer by adding a distillation loss function. The multiplexer outputs a set of weights where each weight determines the confidence of its corresponding model about the prediction correctness. We also show where each loss function is applied in the figure.
5.6 The t-SNE visualization of the feature space of the validation set of the ImageNet dataset for the benchmark models trained using the proposed loss function. Left: mobile-cloud collaborative inference using mobilenet_v2 on the mobile side and resnext101_32x8d on the cloud side. Right: ensemble of six benchmark CNNs, which is suitable for cloud-based intelligent services that host replicas of the most accurate model. For instance, instead of replicating resnext101_32x8d on six different servers, one can host these six CNNs plus the multiplexer, which achieves less compute resource usage and higher accuracy.
6.1 Data valuation for machine learning. The data providers send their valuable datasets to the model provider and receive monetary profits. However, the data providers prefer to learn the value of their dataset without sharing the raw information with the model provider.
6.2 Privacy-preserving data valuation using information obfuscation. The data provider clients extract the feature vector of the input data and obfuscate it, which is then sent to the model owner for computing the data values. The data values are then sent back to the clients, which enables the clients to decide whether they want to sell their raw data given the valuation.
6.3 The training pipeline for jointly learning the data value estimator and obfuscation network.
6.4 Accuracy of the predictor model trained with the curated training set, in which the most/least valued training samples are removed according to the estimated data values. (a)-(f) represent the scenario in which the training set labels are not noisy. (g)-(l) represent the scenario in which 20% of the training set labels are noisy.
6.5 The performance of discovering corrupted samples in six different datasets where 20% of the samples have noisy labels. Random represents the case where we have no prior knowledge of the clean and corrupted samples, thus the fraction of the discovered corrupted samples is equal to the fraction of inspected samples. Optimal represents the case where we inspect only the corrupted samples with optimal accuracy, thus it saturates after inspecting 20% of the samples.
6.6 The comparison of different obfuscation methods on the least valued sample removal task.
Abstract
The big data era and Artificial Intelligence (AI) technologies have created enormous opportuni-
ties for driving intelligent insights from data. Internet-of-Things (IoT) is a significant source of
generating data. However, the compute capability and privacy concerns of IoT devices are still
challenging issues. Machine Learning (ML) models, the core element of AI applications, are ei-
ther entirely hosted on the edge (small models) or a cloud server (large models) in the status-quo
approaches. However, when a device with limited compute capability wants to perform a highly
complex task, the cloud-only approach is the only choice. On the other hand, the downside of the cloud-only approach is the requirement to transfer significant amounts of data to the cloud over a network, which compromises the user's privacy and puts computational pressure on the cloud servers. Moreover, edge devices are becoming increasingly powerful and energy-efficient, and not utilizing their compute resources is wasteful. This research proposes methods for collaborative computation of machine learning models between edge devices and cloud servers to improve users' latency, energy consumption, and privacy. Collaboration in computation translates to sharing computing resources and
services, while collaboration in privacy translates to sharing partial information about the raw pri-
vate data. Thus, the collaborative computation methods benefit the edge devices by providing
lower latency and mobile energy consumption than the best case of mobile-only or cloud-only
approaches.
This thesis introduces two frameworks (JointDNN, BottleNet) for splitting the computation be-
tween a mobile device and a cloud server. The next problem studied in this thesis is conditioning
the inference tasks based on the task complexity. With this approach, the complexity of a given
inference task is estimated locally on the device. For instance, in the image classification appli-
cations, if the background is a simple monotone color, the task is easy, and if the background is
cluttered and the object is occluded, then the task is complex. As a result, the tasks which are
determined to be easy do not leave the user’s device. The other problem which is addressed in
this thesis is data valuation. A data valuation algorithm distributes a monetary budget to every
data point relative to a specific machine learning model. However, data providers do not want to reveal their raw data to a model provider for data valuation. Furthermore, many real-world applications that support queries and machine learning (ML) rely on massive crowdsourcing, where multiple individuals contribute the datasets. An example is a voice recognition system
whose training data are gathered from many users. There are data marketplaces providing access
to data, e.g., IOTA, DAWEX, Xignite; however, the critical challenge in these marketplaces is how
to allocate fair revenue securely. We propose a privacy-preserving data valuation framework based
on optimizing mutual information between obfuscated and non-obfuscated vectors.
Chapter 1
Introduction
DNN architectures are promising solutions in achieving remarkable results in a wide range of
machine learning applications, including, but not limited to computer vision, speech recognition,
language modeling, and autonomous cars. Currently, there is a major growing trend in introducing
more advanced DNN architectures and employing them in end-user applications. The consider-
able improvements in DNNs are usually achieved by increasing computational complexity which
requires more resources for both training and inference [1]. Recent research directions to make this
progress sustainable are: development of Graphical Processing Units (GPUs) as the vital hardware
component of both servers and mobile devices [2], design of efficient algorithms for large-scale
distributed training [3] and efficient inference [4], compression and approximation of models [5],
and most recently introducing collaborative computation of cloud and fog, also known as dew computing [6].
Deployment of cloud servers for computation and storage is becoming extensively favorable
due to technical advancements and improved accessibility. Scalability, low cost, and satisfactory
Quality of Service (QoS) made offloading to cloud a typical choice for computing-intensive tasks.
On the other side, mobile devices are being equipped with more powerful general-purpose CPUs
and GPUs. Very recently there is a new trend in hardware companies to design dedicated chips
to better tackle machine-learning tasks. For example, Apple’s A11 Bionic chip [7] used in iPhone
X uses a neural engine in its GPU to speed up the DNN queries of applications such as face
identification and facial motion capture [8].
In the status-quo approaches, there are two methods for DNN inference: mobile-only and
cloud-only. For simple models, a mobile device is sufficient for performing all the computations. For complex models, the raw input data (image, video stream, voice, etc.) is uploaded to the cloud server, where the required computations are performed. The results of the task are later downloaded to the device. The effects of raw input and feature compression are studied in [9]
and [10].
Despite the recent improvements of the mobile devices mentioned earlier, the computational
power of mobile devices is still significantly weaker than the cloud ones. Therefore, the mobile-
only approach can cause large inference latency and failure in meeting QoS. Moreover, embed-
ded devices undergo major energy consumption constraints due to battery limits. On the other
hand, cloud-only suffers communication overhead for uploading the raw data and downloading
the outputs. Moreover, slowdowns caused by service congestion, subscription costs, and network
dependency should be considered as downsides of this approach [11].
The superiority and persistent improvement of DNNs depend heavily on providing a huge
amount of training data. Typically, this data is collected from different resources and later fed into
a network for training. The final model can then be delivered to different devices for inference
functions. However, there is a trend of applications requiring adaptive learning in online environ-
ments, such as self-driving cars and security drones [12][13]. Model parameters in these smart
devices are constantly being changed based on their continuous interaction with their environ-
ment. The complexity of these architectures with an increasing number of parameters and current
cloud-only methods for DNN training implies a constant communication cost and the burden of
increased energy consumption for mobile devices. The main difference between collaborative training and cloud-only training is that the data transferred in the cloud-only approach is the input data and model parameters, whereas in the collaborative approach it is the output of intermediate layer(s) and a portion of the model parameters. Therefore, the amount of data communicated can potentially be decreased [14].
Automatic partitioning of computationally extensive tasks over the cloud for optimization of
performance and energy consumption has been already well-studied [15]. Most recently, scal-
able distributed hierarchy structures between the end-user device, edge, and cloud have been sug-
gested [16] which are specialized for DNN applications. However, exploiting the layer granularity
of DNN architectures for run-time partitioning has not been studied thoroughly yet.
In this thesis, we propose two algorithms for efficient splitting of a deep neural network be-
tween a mobile device and a cloud server, which are addressed in Chapters 2 and 4. Moreover, we propose a multiplexing algorithm for selecting the proper DNN given an input and resource constraints, which is covered in Chapter 5. Finally, in Chapter 6, a privacy-preserving frame-
work for private data valuation is proposed which can be used by data providers (mobile devices)
to learn the value of their data without revealing the raw information to the model provider (cloud
server).
Chapter 2
JointDNN: An Efficient Training and Inference Engine for
Intelligent Mobile Cloud Computing Services
Deep learning models are being deployed in many mobile intelligent applications. End-side ser-
vices, such as intelligent personal assistants, autonomous cars, and smart home services often
employ either simple local models on the mobile or complex remote models on the cloud. How-
ever, recent studies have shown that partitioning the DNN computations between the mobile and
cloud can increase the latency and energy efficiencies. In this paper, we propose an efficient, adap-
tive, and practical engine, JointDNN, for collaborative computation between a mobile device and
cloud for DNNs in both inference and training phase. JointDNN not only provides an energy and
performance efficient method of querying DNNs for the mobile side but also benefits the cloud
server by reducing the amount of its workload and communications compared to the cloud-only
approach. Given the DNN architecture, we investigate the efficiency of processing some layers
on the mobile device and some layers on the cloud server. We provide optimization formulations
at layer granularity for forward- and backward-propagations in DNNs, which can adapt to mobile
battery limitations and cloud server load constraints and quality of service. JointDNN achieves up
to 18 and 32 times reductions in the latency and mobile energy consumption of querying DNNs
compared to the status-quo approaches, respectively.
2.1 Introduction
DNN architectures are promising solutions in achieving remarkable results in a wide range of
machine learning applications, including, but not limited to computer vision, speech recognition,
language modeling, and autonomous cars. Currently, there is a major growing trend in introducing
more advanced DNN architectures and employing them in end-user applications. The consider-
able improvements in DNNs are usually achieved by increasing computational complexity which
requires more resources for both training and inference [1]. Recent research directions to make this
progress sustainable are: development of Graphical Processing Units (GPUs) as the vital hardware
component of both servers and mobile devices [2], design of efficient algorithms for large-scale
distributed training [3] and efficient inference [4], compression and approximation of models [5],
and most recently introducing collaborative computation of cloud and fog, also known as dew computing [6].
Deployment of cloud servers for computation and storage is becoming extensively favorable
due to technical advancements and improved accessibility. Scalability, low cost, and satisfactory
Quality of Service (QoS) made offloading to cloud a typical choice for computing-intensive tasks.
On the other side, mobile devices are being equipped with more powerful general-purpose CPUs
and GPUs. Very recently there is a new trend in hardware companies to design dedicated chips
to better tackle machine-learning tasks. For example, Apple’s A11 Bionic chip [7] used in iPhone
X uses a neural engine in its GPU to speed up the DNN queries of applications such as face
identification and facial motion capture [8].
In the status-quo approaches, there are two methods for DNN inference: mobile-only and
cloud-only. For simple models, a mobile device is sufficient for performing all the computations. For complex models, the raw input data (image, video stream, voice, etc.) is uploaded to the cloud server, where the required computations are performed. The results of the task are later downloaded to the device. The effects of raw input and feature compression are studied in [9]
and [10].
Despite the recent improvements of the mobile devices mentioned earlier, the computational
power of mobile devices is still significantly weaker than the cloud ones. Therefore, the mobile-
only approach can cause large inference latency and failure in meeting QoS. Moreover, embed-
ded devices undergo major energy consumption constraints due to battery limits. On the other
hand, cloud-only suffers communication overhead for uploading the raw data and downloading
the outputs. Moreover, slowdowns caused by service congestion, subscription costs, and network
dependency should be considered as downsides of this approach [11].
The superiority and persistent improvement of DNNs depend heavily on providing a huge
amount of training data. Typically, this data is collected from different resources and later fed into
a network for training. The final model can then be delivered to different devices for inference
functions. However, there is a trend of applications requiring adaptive learning in online environ-
ments, such as self-driving cars and security drones [12][13]. Model parameters in these smart
devices are constantly being changed based on their continuous interaction with their environ-
ment. The complexity of these architectures with an increasing number of parameters and current
cloud-only methods for DNN training implies a constant communication cost and the burden of
increased energy consumption for mobile devices. The main difference between collaborative training and cloud-only training is that the data transferred in the cloud-only approach is the input data and model parameters, whereas in the collaborative approach it is the output of intermediate layer(s) and a portion of the model parameters. Therefore, the amount of data communicated can potentially be decreased [14].
Automatic partitioning of computationally extensive tasks over the cloud for optimization of
performance and energy consumption has been already well-studied [15]. Most recently, scal-
able distributed hierarchy structures between the end-user device, edge, and cloud have been sug-
gested [16] which are specialized for DNN applications. However, exploiting the layer granularity
of DNN architectures for run-time partitioning has not been studied thoroughly yet.
In this work, we are investigating the inference and training of DNNs in a joint platform of
mobile and cloud as an alternative to the current single-platform methods as illustrated in Fig-
ure 2.1. Considering DNN architectures as an ordered sequence of layers, and the possibility of
Figure 2.1: Different computation partitioning methods. (a) Mobile only: computation is com-
pletely done on the mobile device. (b) Cloud only: raw input data is sent to the cloud server, com-
putations is done on the cloud server and results are sent back to the mobile device. (c) JointDNN:
DNN architecture is partitioned at the granularity of layers, each layer can be computed either on
cloud or mobile.
computation of every layer either on mobile or cloud, we can model the DNN structure as a Di-
rected Acyclic Graph (DAG). The parameters of our real-time adaptive model are dependent on
the following factors: mobile/cloud hardware and software resources, battery capacity, network
specifications, and QoS. Based on this modeling, we show that the problem of finding the optimal
computation schedule for different scenarios, i.e. best performance or energy consumption, can be
reduced to the polynomial-time shortest path problem.
To present realistic results, we performed experiments with representative hardware for the mobile device and the cloud. To model the communication costs between the platforms, we used various mobile network technologies and the most recent reports on their specifications in the U.S.
DNN architectures can be categorized based on functionality. These differences enforce spe-
cific type and order of layers in architecture, directly affecting the partitioning result in the collab-
orative method. For discriminative models, used in recognition applications, the layer size gradu-
ally decreases going from input toward output as shown in Figure 2.2. This sequence suggests the
computation of the first few layers on the mobile device to avoid excessive communication cost of
uploading large raw input data. On the other hand, the growth of the layer output size from input
to output in generative models which are used for synthesizing new data, implies the possibility
of uploading a small input vector to the cloud and later downloading one of the last layers and
performing the rest of the computations on the mobile device for better efficiency. Interesting mobile applications like image-to-image translation are implemented with autoencoder architectures whose middle layer sizes are smaller than their input and output. Consequently, to avoid
huge communication costs, we expect the first and last layers to be computed on the mobile device
in our collaborative approach. We examined eight well-known DNN benchmarks selected from
these categories to illustrate their differences in the collaborative computation approach.
As we will see in Section 2.4, the communication between the mobile and cloud is the main
bottleneck for both performance and energy in the collaborative approach. We investigated the
specific characteristics of CNN layer outputs and introduced a lossless compression approach to
reduce the communication costs while preserving the model accuracy.
State-of-the-art work on collaborative computation of DNNs [14] considers only one offloading point, assigning the computation of the layers before and after that point to the mobile and cloud platforms, respectively. We show that this approach is non-generic and fails to be optimal, and we introduce a new method granting the possibility of computation on either platform for each layer, independent of the other layers. Our evaluations show that JointDNN significantly improves the latency and energy by up to 3x and 7x, respectively, compared to the status-quo single-platform approaches without any compression. The main contributions of this paper can be listed as:
• Introducing a new approach for the collaborative computation of DNNs between the mobile
and cloud
• Formulating the problem of optimal computation scheduling of DNNs at layer granularity
in the mobile cloud computing environment as the shortest path problem and integer linear
programming (ILP)
• Examining the effect of compression on the outputs of DNN layers to improve communica-
tion costs
• Demonstrating the significant improvements in performance, mobile energy consumption,
and cloud workload achieved by using JointDNN
Figure 2.2: Typical layer size in (a) Discriminative (b) Autoencoder (c) Generative models.
2.2 Problem definition and modeling
In this section, we explain the general architecture of DNN layers and our profiling method. More-
over, we elaborate on how cost optimization can be reduced to the shortest path problem by intro-
ducing the JointDNN graph model. Finally, we show how the constrained problem is formulated
by setting up ILP.
2.2.1 Energy and Latency Profiling
There are three methods for measuring the latency and energy consumption of each layer in neural
networks [17]:
Statistical Modeling: In this method, a regression model over the configurable parameters of
operators (e.g. filter size in the convolution) can be used to estimate the associated latency and
energy. This method is prone to large errors because of the inter-layer optimizations performed by
DNN software packages. Therefore, it is necessary to consider the execution of several consecutive
operators grouped during profiling. Many of these software packages are proprietary, making
access to inter-layer optimization techniques impossible.
In order to illustrate this issue, we designed two experiments with 25 consecutive convolutions on an NVIDIA Pascal™ GPU using the cuDNN® library [18]. In the first experiment, we measure the latency of each convolution operator separately and set the total latency as the sum of them. In the second experiment, we execute the grouped convolutions together in a single kernel and measure the total latency. All parameters are located in the GPU's memory in both experiments, avoiding
Figure 2.3: Latency of grouped and separated execution of convolution operator.
any data transfer from the main memory to make sure results are exactly representing the actual
computation latency.
As shown in Figure 2.3, there is a large error gap between the separated and grouped execution experiments, which grows as the number of convolutions increases. This observation confirms that we need to profile grouped operators to obtain more accurate estimations. Considering the various consecutive combinations of operators and different input sizes, this method requires a very large number of measurements, not to mention a complex regression model.
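The grouped-versus-separate comparison above can be reproduced with a few lines of framework code. The sketch below is an illustrative approximation only: it uses PyTorch (which dispatches to cuDNN) rather than the cuDNN C API used in our measurements, and the layer shapes, iteration count, and warm-up policy are arbitrary assumptions rather than the settings of this experiment.

import torch
import torch.nn as nn

# 25 consecutive 3x3 convolutions with matching channel counts so outputs chain directly.
convs = nn.Sequential(*[nn.Conv2d(64, 64, 3, padding=1) for _ in range(25)]).cuda().eval()
x = torch.randn(1, 64, 56, 56, device="cuda")

def time_ms(fn, iters=100):
    # Average GPU time of fn() over several iterations using CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

with torch.no_grad():
    # Separate execution: time each operator on its own (already materialized) input, then sum.
    feats = [x]
    for conv in convs:
        feats.append(conv(feats[-1]))
    separate = sum(time_ms(lambda c=conv, t=t: c(t))
                   for conv, t in zip(convs, feats[:-1]))
    # Grouped execution: run all 25 convolutions back to back inside one timed region.
    grouped = time_ms(lambda: convs(x))

print(f"sum of separate: {separate:.2f} ms, grouped: {grouped:.2f} ms")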
Analytical Modeling: To derive analytical formulations for estimating the latency and energy consumption, it is required to obtain the exact hardware and software specifications. However, the state-of-the-art in latency modeling of DNNs [19] fails to estimate layer-level delay within an acceptable error bound, for instance, underestimating the latency of a fully connected layer with 4096 neurons by around 900%. Industrial developers do not reveal the detailed hardware architecture specifications or the proprietary parallel computing architectures such as CUDA®; therefore, the analytical approach can be quite challenging [20].
Application-specific Profiling: In this method, the DNN architecture of the application being used is profiled at run time. The number of applications on a mobile device using neural networks is generally limited. Consequently, this method is more feasible and promises more accurate estimations. We have chosen this method for the estimation of energies and latencies in the experiments of this paper.
Figure 2.4: Computation model in linear topology.
Figure 2.5: Graph representation of mobile cloud computing optimal scheduling problem for linear
topology.
2.2.2 JointDNN Graph Model
First, we assume that a DNN is presented by a sequence of distinct layers with a linear topology as
depicted in Figure 2.4. Layers are executed sequentially, with the output data generated by one layer feeding into the input of the next one. We denote the input and output data sizes of the $k$-th layer as $a_k$ and $b_k$, respectively. Denoting the latency (energy) of layer $k$ as $\omega_k$, where $k = 1, 2, \ldots, n$, the total latency (energy) of querying the DNN is $\sum_{k=1}^{n} \omega_k$.
The mobile cloud computing optimal scheduling problem can be reduced to the shortest path problem, from node S to F, in the graph of Figure 2.5. The Mobile Execution cost of the $k$-th layer, $C(ME_k)$, is the cost of executing the $k$-th layer on the mobile device while the cloud server is idle. The Cloud Execution cost of the $k$-th layer, $C(CE_k)$, is the cost of executing the $k$-th layer on the cloud server while the mobile device is idle. The Uploading Input Data cost of the $k$-th layer, $UID_k$, is the cost of uploading the output data of the $(k-1)$-th layer to the cloud server. The Downloading Input Data cost of the $k$-th layer, $DOD_k$, is the cost of downloading the output data of the $(k-1)$-th layer to the mobile device. The costs can refer to either latency or energy. However, as we showed in Section 2.2.1, the assumption of linear topology in DNNs does not hold, and we need to consider all consecutive groupings of the layers in the network. This fact suggests replacing the linear topology with a tournament graph, as depicted in Figure 2.6. We define the parameters of this new graph, the JointDNN graph model, in Table 2.1.
Figure 2.6: JointDNN graph model. The shortest path from S to F
determines the schedule of executing the layers on mobile or cloud.
In this graph, node $C_{i:j}$ represents that layers $i$ to $j$ are computed on the cloud server, while node $M_{i:j}$ represents that layers $i$ to $j$ are computed on the mobile device. An edge between two adjacent nodes in the JointDNN graph model is associated with four possible cases: 1) a transition from mobile to mobile, which only includes the mobile computation cost ($ME_{i,j}$); 2) a transition from cloud to cloud, which only includes the cloud computation cost ($CE_{i,j}$); 3) a transition from mobile to cloud, which includes the mobile computation cost and the cost of uploading the inputs of the next node ($EU_{i,j} = ME_{i,j} + UID_{j+1}$); and 4) a transition from cloud to mobile, which includes the cloud computation cost and the cost of downloading the inputs of the next node ($ED_{i,j} = CE_{i,j} + DOD_{j+1}$). Under this formulation, we can transform the computation scheduling problem into finding the shortest path from S to F.

Residual networks are a class of powerful and easy-to-train DNN architectures [21]. In residual networks, as depicted in Figure 2.7(a), the output of one layer is fed into another layer at a distance of at least two. Thus, we need to keep track of the source layer (node 2 in Figure 2.7) to know whether that layer was computed on the mobile or the cloud. Our standard graph model has a memory of one, namely the immediately preceding layer. We provide a method to transform the computation graph of this type of network into our standard model, the JointDNN graph. In this regard, we add two additional chains of size $k-1$, where $k$ is the number of nodes in the residual block (3 in Figure 2.7). One chain represents the case of computing layer 2 on the mobile device and the other represents the case of computing layer 2 on the cloud. In Figure 2.7, we show only the weights that need to be modified, where $D_2$ and $U_2$ are the costs of downloading and uploading the output of layer 2, respectively.
By solving the shortest path problem in the JointDNN graph model, we can obtain the optimal scheduling of inference in DNNs. The online training consists of one inference and one back-propagation step. The total number of layers is denoted by N consistently throughout this paper, so there are 2N layers for modeling training, where the second N layers are the mirrored version of the first N layers and their associated operations are the gradients of the error function with respect to the DNN's weights. The main difference between the mobile cloud computing graph of inference and that of online training is the need to update the model by downloading the new weights from the cloud. We assume that the cloud server performs the whole back-propagation step separately, even if it is scheduled to be done on the mobile; therefore, there is no need for the mobile device to upload the weights that it updates itself, which saves mobile energy consumption. The modification in the JointDNN graph model is adding the costs of downloading the weights of the layers that are updated in the cloud to $ED_{i,j}$. The shortest path problem can be solved efficiently in polynomial time. However, the shortest path problem subject to constraints is NP-Complete [22]. For instance, assume our standard graph is constructed for energy and we need to find the shortest path subject to the constraint that the total latency of that path is less than a time deadline (QoS). Although there is an approximation algorithm for this problem, the LARAC algorithm [23], the nature of our application does not require solving this optimization problem frequently; therefore, we aim to obtain the optimal solution. We can constitute a small look-up table of optimization results for different sets of parameters (e.g., network bandwidth, cloud server load). We provide the ILP formulations of DNN partitioning in the following sections.
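To make the graph construction concrete, the following sketch (an assumed illustration, not the authors' implementation) builds a JointDNN-style graph for an n-layer linear DNN with networkx and recovers the optimal schedule as a shortest path. The tables mob[i][j] and cld[i][j] stand for profiled costs (latency or energy) of executing layers i..j on the mobile and cloud, up[j] and down[j] for the costs of transferring the output of layer j, and input_up for uploading the raw input; all of these names are hypothetical placeholders.

import networkx as nx

def optimal_schedule(n, mob, cld, up, down, input_up):
    G = nx.DiGraph()
    blocks = [(i, j) for i in range(n) for j in range(i, n)]
    for (i, j) in blocks:
        if i == 0:
            # First block: start on the mobile, or upload the raw input and start on the cloud.
            G.add_edge("S", ("M", i, j), weight=mob[i][j])
            G.add_edge("S", ("C", i, j), weight=input_up + cld[i][j])
        if j == n - 1:
            # Last block: results computed on the cloud must be downloaded back to the mobile.
            G.add_edge(("M", i, j), "F", weight=0.0)
            G.add_edge(("C", i, j), "F", weight=down[j])
        for k in range(j + 1, n):
            # The next block covers layers j+1..k; pay a transfer cost when the platform switches.
            G.add_edge(("M", i, j), ("M", j + 1, k), weight=mob[j + 1][k])
            G.add_edge(("C", i, j), ("C", j + 1, k), weight=cld[j + 1][k])
            G.add_edge(("M", i, j), ("C", j + 1, k), weight=up[j] + cld[j + 1][k])
            G.add_edge(("C", i, j), ("M", j + 1, k), weight=down[j] + mob[j + 1][k])
    return nx.shortest_path(G, "S", "F", weight="weight")

Running this once with latency tables and once with mobile-energy tables corresponds to the two unconstrained scenarios of Section 2.2.3.5.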
2.2.3 ILP Setup
2.2.3.1 Performance Efficient Computation Offloading ILP Setup for Inference
We formulate the scheduling of inference in DNNs as an ILP with a tractable number of variables. In our method, we first profile the delay and energy consumption of consecutive layers of size $m \in \{1, 2, \ldots, N\}$. Thus, we will have

$$N + (N-1) + \ldots + 1 = N(N+1)/2 \qquad (2.1)$$

different profiling values for delay and energy. Considering layers $i$ to $j$ to be computed either on the mobile device or on the cloud server, we assign two binary variables $m_{i,j}$ and $c_{i,j}$, respectively. Download and upload communication delays need to be added to the execution time when switching from cloud to mobile or from mobile to cloud, respectively.
$$T_{computation} = \sum_{i=1}^{n} \sum_{j=i}^{n} \left( m_{i,j} \cdot T^{mobile}_{L_{i,j}} + c_{i,j} \cdot T^{cloud}_{L_{i,j}} \right) \qquad (2.2)$$

$$T_{communication} = \sum_{i=1}^{n} \sum_{j=i}^{n} \sum_{k=j+1}^{n} m_{i,j} \cdot c_{j+1,k} \cdot T^{upload}_{L_j} + \sum_{i=1}^{n} \sum_{j=i}^{n} \sum_{k=j+1}^{n} c_{i,j} \cdot m_{j+1,k} \cdot T^{download}_{L_j} + \sum_{i=1}^{n} c_{1,i} \cdot T^{upload}_{L_1} + \sum_{i=1}^{n} c_{i,n} \cdot T^{download}_{L_n} \qquad (2.3)$$

$$T_{total} = T_{computation} + T_{communication} \qquad (2.4)$$
$T^{mobile}_{L_{i,j}}$ and $T^{cloud}_{L_{i,j}}$ represent the execution time of the $i$-th through $j$-th layers on the mobile and cloud, respectively. $T^{download}_{L_i}$ and $T^{upload}_{L_i}$ represent the latency of downloading and uploading the output of the $i$-th layer, respectively. Considering each set of consecutive layers, whenever $m_{i,j}$ and one of $\{c_{j+1,k}\}_{k=j+1:n}$ are equal to one, the output of the $j$-th layer is uploaded to the cloud. The same argument applies to downloading. We also note that the last two terms in Eq. 2.3 represent the conditions in which the first layer is computed on the cloud, so the input must be uploaded to the cloud, and the last layer is computed on the cloud, so the output must be downloaded to the mobile device. To support residual architectures, we need to add a pair of download and upload terms similar to the first two terms in Eq. 2.3 for the starting and ending layers of each residual block. In order to guarantee that all layers are computed exactly once, we need to add the following set of constraints:

$$\forall l \in 1:n : \quad \sum_{i=1}^{l} \sum_{j=l}^{n} (m_{i,j} + c_{i,j}) = 1 \qquad (2.5)$$
Because of the non-linearity of multiplication, an additional step is needed to transform Eq. 2.3
to the standard form of ILP. We define two sets of new variables:
$$u_{i,j} = m_{i,j} \cdot \sum_{k=j+1}^{n} c_{j+1,k} \qquad d_{i,j} = c_{i,j} \cdot \sum_{k=j+1}^{n} m_{j+1,k} \qquad (2.6)$$
with the following constraints:
$$u_{i,j} \leq m_{i,j}, \qquad u_{i,j} \leq \sum_{k=j+1}^{n} c_{j+1,k}, \qquad u_{i,j} \geq m_{i,j} + \sum_{k=j+1}^{n} c_{j+1,k} - 1$$
$$d_{i,j} \leq c_{i,j}, \qquad d_{i,j} \leq \sum_{k=j+1}^{n} m_{j+1,k}, \qquad d_{i,j} \geq c_{i,j} + \sum_{k=j+1}^{n} m_{j+1,k} - 1 \qquad (2.7)$$
The first two constraints ensure that $u_{i,j}$ will be zero if either $m_{i,j}$ or $\sum_{l=j+1}^{n} c_{j+1,l}$ is zero. The third inequality guarantees that $u_{i,j}$ will take the value one if both $m_{i,j}$ and $\sum_{l=j+1}^{n} c_{j+1,l}$ are set to one. The same reasoning works for $d_{i,j}$. In summary, the total number of variables in our ILP formulation is $4N(N+1)/2$, where $N$ is the total number of layers in the network.
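A compact way to check this formulation end to end is to express it in an off-the-shelf ILP modeler. The sketch below is an assumed illustration, not the setup used in our experiments: it writes the performance-efficient inference ILP of Eqs. 2.2-2.7 in PuLP, with t_mob, t_cld, t_up, t_dn, t_in_up, and t_out_dn as hypothetical profiled latency tables and indices running over 0..n-1 instead of 1..n.

import pulp

def schedule_ilp(n, t_mob, t_cld, t_up, t_dn, t_in_up, t_out_dn):
    pairs = [(i, j) for i in range(n) for j in range(i, n)]
    prob = pulp.LpProblem("jointdnn_latency", pulp.LpMinimize)
    m = pulp.LpVariable.dicts("m", pairs, cat="Binary")
    c = pulp.LpVariable.dicts("c", pairs, cat="Binary")
    u = pulp.LpVariable.dicts("u", pairs, cat="Binary")  # mobile block followed by a cloud block
    d = pulp.LpVariable.dicts("d", pairs, cat="Binary")  # cloud block followed by a mobile block
    # Each layer is computed exactly once (Eq. 2.5).
    for l in range(n):
        prob += pulp.lpSum(m[i, j] + c[i, j] for (i, j) in pairs if i <= l <= j) == 1
    # Linearization of the products appearing in Eq. 2.3 via Eqs. 2.6-2.7.
    for (i, j) in pairs:
        next_c = pulp.lpSum(c[j + 1, k] for k in range(j + 1, n)) if j + 1 < n else 0
        next_m = pulp.lpSum(m[j + 1, k] for k in range(j + 1, n)) if j + 1 < n else 0
        prob += u[i, j] <= m[i, j]
        prob += u[i, j] <= next_c
        prob += u[i, j] >= m[i, j] + next_c - 1
        prob += d[i, j] <= c[i, j]
        prob += d[i, j] <= next_m
        prob += d[i, j] >= c[i, j] + next_m - 1
    # Objective: computation plus communication latency (Eqs. 2.2-2.4).
    prob += (pulp.lpSum(m[i, j] * t_mob[i][j] + c[i, j] * t_cld[i][j] for (i, j) in pairs)
             + pulp.lpSum(u[i, j] * t_up[j] + d[i, j] * t_dn[j] for (i, j) in pairs)
             + pulp.lpSum(c[0, i] * t_in_up for i in range(n))
             + pulp.lpSum(c[i, n - 1] * t_out_dn for i in range(n)))
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [(i, j, "mobile" if m[i, j].value() == 1 else "cloud")
            for (i, j) in pairs if m[i, j].value() == 1 or c[i, j].value() == 1]

Constraints such as Eq. 2.18 or Eq. 2.19 can be appended to the same model with additional prob += ... statements before solving.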
2.2.3.2 Energy Efficient Computation Offloading ILP Setup for Inference
Because of the nature of the application, we only care about the energy consumption on the mobile side. We formulate the ILP as follows:

$$E_{computation} = \sum_{i=1}^{n} \sum_{j=i}^{n} m_{i,j} \cdot E^{mobile}_{L_{i,j}} \qquad (2.8)$$

$$E_{communication} = \sum_{i=2}^{n} \sum_{j=i}^{n} m_{i,j} \cdot E^{download}_{L_i} + \sum_{i=1}^{n} \sum_{j=i}^{n-1} m_{i,j} \cdot E^{upload}_{L_j} + \left( \sum_{i=1}^{n} (1 - m_{1,i}) - (n-1) \right) \cdot E^{upload}_{L_1} + \left( \sum_{i=1}^{n} (1 - m_{i,n}) - (n-1) \right) \cdot E^{download}_{L_n} \qquad (2.9)$$

$$E_{total} = E_{computation} + E_{communication} \qquad (2.10)$$
$E^{mobile}_{L_{i,j}}$ and $E^{cloud}_{L_{i,j}}$ represent the amount of energy required to compute the $i$-th through $j$-th layers on the mobile and cloud, respectively. $E^{download}_{L_i}$ and $E^{upload}_{L_i}$ represent the energy required to download and upload the output of the $i$-th layer, respectively. Similar to the performance efficient ILP constraints, each layer should be executed exactly once:

$$\forall m \in 1:n : \quad \sum_{i=1}^{m} \sum_{j=m}^{n} m_{i,j} \leq 1 \qquad (2.11)$$
The ILP problem can be solved for different sets of parameters (e.g., different uplink and downlink speeds), and the scheduling results can then be stored as a look-up table in the mobile device. Moreover, because the number of variables in this setup is tractable, solving the ILP is quick. For instance, solving the ILP for AlexNet takes around 0.045 seconds on an Intel(R) Core(TM) i7-3770 CPU with MATLAB®'s intlinprog() function using the primal simplex algorithm.
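The look-up table mentioned above can be as simple as a dictionary keyed by the coarse network and server conditions under which the ILP was solved offline. The entries below are hypothetical placeholders, shown only to illustrate the run-time usage pattern.

# Hypothetical precomputed schedules: (platform, first layer, last layer) blocks per condition.
SCHEDULES = {
    ("wifi", "low_load"): [("mobile", 0, 8), ("cloud", 9, 20)],
    ("lte", "low_load"): [("mobile", 0, 20)],
    ("3g", "high_load"): [("mobile", 0, 20)],
}

def pick_schedule(network, server_load):
    # Fall back to mobile-only execution when no precomputed entry matches.
    return SCHEDULES.get((network, server_load), [("mobile", 0, 20)])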
2.2.3.3 Performance Efficient Computation Offloading ILP Setup for Training
The ILP formulation of the online training phase is very similar to that of inference. In online training, we have 2N layers instead of N, obtained by mirroring the DNN, where the second N layers perform backward propagation. Moreover, we need to download the weights that are updated in the cloud to the mobile device. We assume that the cloud server always has the most updated version of the weights and does not require the mobile device to upload the updated weights. The following terms need to be added to the ILP setup of training:
$$T_{computation} = \sum_{i=1}^{2n} \sum_{j=i}^{2n} \left( m_{i,j} \cdot T^{mobile}_{L_{i,j}} + c_{i,j} \cdot T^{cloud}_{L_{i,j}} \right) \qquad (2.12)$$

$$T_{communication} = \sum_{i=1}^{2n} \sum_{j=i}^{2n} \sum_{k=j+1}^{2n} m_{i,j} \cdot c_{j+1,k} \cdot T^{upload}_{L_j} + \sum_{i=1}^{2n} \sum_{j=i}^{2n} \sum_{k=j+1}^{2n} c_{i,j} \cdot m_{j+1,k} \cdot T^{download}_{L_j} + \sum_{i=1}^{n} c_{1,i} \cdot T^{upload}_{L_1} + \sum_{i=n+1}^{2n} \sum_{j=i}^{2n} c_{i,j} \cdot T^{download}_{W_i} \qquad (2.13)$$

$$T_{total} = T_{computation} + T_{communication} \qquad (2.14)$$
2.2.3.4 Energy Efficient Computation Offloading ILP Setup for Training
$$E_{computation} = \sum_{i=1}^{2n} \sum_{j=i}^{2n} m_{i,j} \cdot E^{mobile}_{L_{i,j}} \qquad (2.15)$$

$$E_{communication} = \sum_{i=2}^{2n} \sum_{j=i}^{2n} m_{i,j} \cdot E^{download}_{L_i} + \sum_{i=1}^{2n} \sum_{j=i}^{2n-1} m_{i,j} \cdot E^{upload}_{L_j} + \left( \sum_{i=1}^{2n} (1 - m_{1,i}) - (2n-1) \right) \cdot E^{upload}_{L_1} + \left( \sum_{i=n+1}^{2n} \sum_{j=i}^{2n} (1 - m_{i,j}) - (n-1) \right) \cdot E^{download}_{W_i} \qquad (2.16)$$

$$E_{total} = E_{computation} + E_{communication} \qquad (2.17)$$
2.2.3.5 Scenarios
There can be different optimization scenarios defined for ILP as listed below:
• Performance efficient computation: In this case, it is sufficient to solve the ILP formulation
for performance efficient computation offloading.
• Energy efficient computation: In this case, it is sufficient to solve the ILP formulation for
energy efficient computation offloading.
• Battery budget limitation: In this case, based on the available battery, the operating system
can decide to dedicate a specific amount of energy consumption to each application. By
adding the following constraint to the performance efficient ILP formulation, our framework
would adapt to battery limitations:
$$E_{computation} + E_{communication} \leq E_{ubound} \qquad (2.18)$$
• Cloud limited resources: In the presence of cloud server congestion or limitations on user’s
subscription, we can apply execution time constraints to each application to alleviate the
server load:
$$\sum_{i=1}^{n} \sum_{j=i}^{n} c_{i,j} \cdot T^{cloud}_{L_{i,j}} \leq T_{ubound} \qquad (2.19)$$
• QoS: In this scenario, we minimize the required energy consumption while meeting a spec-
ified deadline:
$$\min \{ E_{computation} + E_{communication} \} \quad \text{s.t.} \quad T_{computation} + T_{communication} \leq T_{QoS} \qquad (2.20)$$
This constraint could be applied to both energy and performance efficient ILP formulations.
2.3 Evaluation
2.3.1 Deep Architecture Benchmarks
Since the architecture of neural networks depends on the type of application, we have chosen three
common application types of DNNs as shown in Table 2.2:
1. Discriminative neural networks are a class of models in machine learning for modeling the
conditional probability distribution P(y|x). This class is generally used in classification and
regression tasks. AlexNet[24], OverFeat[25], VGG16[26], Deep Speech[27], ResNet[21],
and NiN[28] are well-known discriminative models we use as benchmarks in this experi-
ment. Except for Deep Speech, used for speech recognition, all other benchmarks are used
in image classification tasks.
Algorithm 1 JointDNN engine optimal scheduling of DNNs

function JointDNN(N, L_i, D_i, NB, NP)
Input:
  N: number of layers in the DNN
  L_i | i = 1:N: layers in the DNN
  D_i | i = 1:N: data size at each layer
  NB: mobile network bandwidth
  NP: mobile network uplink and downlink power consumption
Output: optimal schedule of the DNN

  for i = 0; i < N; i = i + 1 do
    for j = 0; j < N; j = j + 1 do
      Latency_{i,j}, Energy_{i,j} = ProfileGroupedLayers(i, j)
    end
  end
  G, S, F = ConstructShortestPathGraph(N, L_i, D_i, NB, NP)
  // S and F are the start and finish nodes and G is the JointDNN graph model
  if no constraints then
    schedule = ShortestPath(G, S, F)
  else
    if Battery Limited Constraint then
      E_comm + E_comp <= E_ubound
      schedule = PerformanceEfficientILP(N, L_i, D_i, NB, NP)
    end
    if Cloud Server Constraint then
      sum_{i=1}^{n} sum_{j=i}^{n} c_{i,j} * T^{cloud}_{L_{i,j}} <= T_ubound
      schedule = PerformanceEfficientILP(N, L_i, D_i, NB, NP)
    end
    if QoS then
      T_comm + T_comp <= T_QoS
      schedule = EnergyEfficientILP(N, L_i, D_i, NB, NP)
    end
  end
  return schedule
21
2. Generative neural networks model the joint probability distribution P(x, y), allowing generation of new samples. These networks have applications in Computer Vision [29] and Robotics [30] and can be deployed on a mobile device. Chair [31] is the generative model we use as a benchmark in this work.
3. Autoencoders are another class of neural networks used to learn a representation of a data set. Their applications include image reconstruction, image-to-image translation, and denoising, to name a few. Mobile robots can be equipped with autoencoders for their computer vision tasks. We use Pix2Pix [32] as a benchmark from this class.
2.3.2 Mobile and Server Setup
We used the NVIDIA® Jetson TX2 module [33], a fair representation of mobile computation power, as our mobile device. This module enables efficient implementation of the DNN applications used in products such as robots, drones, and smart cameras. It is equipped with an NVIDIA Pascal® GPU with 256 CUDA cores and 8 GB of 128-bit LPDDR4 memory shared between the GPU and CPU. To measure the power consumption of the mobile platform, we used the INA226 power sensor [34].
An NVIDIA® Tesla® K40C [35] with 12 GB of memory serves as our server GPU. The computation capability of this device is more than one order of magnitude higher than that of our mobile device.
2.3.3 Communication Parameters
To model the communication between platforms, we used the average download and upload speed
of mobile Internet [36, 37] for different networks (3G, 4G and Wi-Fi) as shown in Table 2.3.
The communication power for download (P_d) and upload (P_u) depends on the network throughput (t_d and t_u). Comprehensive measurements in [38] indicate that uplink and downlink power can be modeled with the linear equations in Eq. 2.21 fairly accurately, with an error rate of less than 6%. Table 2.3 shows the parameter values of this equation for different networks.
P_u = \alpha_u t_u + \beta
P_d = \alpha_d t_d + \beta    (2.21)
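As a small illustration of Eq. 2.21 and the Table 2.3 parameters, the helper below computes the link power and the energy of one transfer; it is a sketch of ours, not the profiling code used in the JointDNN engine.

def link_power_mw(throughput_mbps, alpha_mw_per_mbps, beta_mw):
    """Eq. 2.21: link power as a linear function of throughput."""
    return alpha_mw_per_mbps * throughput_mbps + beta_mw

def transfer_energy_mj(data_bytes, throughput_mbps, alpha_mw_per_mbps, beta_mw):
    """Energy of one transfer: power (mW) times transfer time (s) gives mJ."""
    transfer_time_s = data_bytes * 8 / (throughput_mbps * 1e6)
    return link_power_mw(throughput_mbps, alpha_mw_per_mbps, beta_mw) * transfer_time_s

# Example: uploading a 150,528-byte tensor over the Wi-Fi uplink of Table 2.3.
print(transfer_energy_mj(150528, 18.88, 283.17, 132.86))   # roughly 350 mJ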
2.4 Results
The latency and energy improvements of inference and online training with our engine for eight different benchmarks are shown in Figures 2.8 and 2.9, respectively. We considered the better of the mobile-only and cloud-only approaches as our baseline. JointDNN achieves up to 66% and 86% improvements in latency and energy consumption, respectively, during inference. Communication cost increases linearly with batch size, whereas computation cost grows at a much lower rate, as depicted in Figure 2.10(b). Therefore, a key observation is that as we increase the batch size, the mobile-only approach becomes more preferable.
During online training, the huge communication overhead of transmitting the updated weights is added to the total cost. Therefore, to avoid downloading this large amount of data, only a few backpropagation steps are computed on the cloud server. We performed a simulation by varying the percentage of updated weights. As the percentage of updated weights increases, the latency and energy consumption become constant, as shown in Figure 2.10. This is because all the backpropagation steps are then performed on the mobile device and no weights are transferred from the cloud to the mobile. JointDNN achieves improvements of up to 73% in latency and 56% in energy consumption during training.
Different patterns of scheduling are demonstrated in Figure 2.11. They represent the optimal solutions in the Wi-Fi network when optimizing for latency, while the mobile and cloud are each allowed to use up to half of their computing resources. They show how the computations in a DNN are divided between the mobile and the cloud. As can be seen, for discriminative models (e.g., AlexNet), inference follows a mobile-cloud pattern and training follows a mobile-cloud-mobile pattern. The intuition is that the last layers (fully connected layers) are computationally intensive but have small data sizes, which require a low communication cost; therefore, the last layers tend to be computed on the cloud. For generative models (e.g., Chair), the execution schedule of inference is the opposite of discriminative networks: the last layers are generally huge, and in the optimal solution they are computed on the mobile. The reason there is no improvement over the base case of mobile-only is that the amount of transferred data is large. Besides, cloud-only becomes the best solution when the amount of transferred data is small (e.g., generative models). Lastly, for autoencoders, where both the input and output data sizes are large, the first and last layers are computed on the mobile.
JointDNN pushes parts of the computation toward the mobile device, which reduces the workload on the cloud server. As shown in Table 2.4, we can reduce the cloud server's workload by up to 84%, and by 53% on average, which enables the cloud provider to serve more users while obtaining higher performance and lower energy consumption compared to single-platform approaches.
2.4.1 Communication Dominance
Execution time and energy breakdowns for AlexNet, a representative of the state-of-the-art architectures deployed on cloud servers, are depicted in Figure 2.12. The cloud-only approach is dominated by the communication costs. As demonstrated in Figure 2.12, 99%, 93%, and 81% of the total execution time is spent on communication in the case of 3G, 4G, and Wi-Fi, respectively. This relative portion also applies to energy consumption. Comparing the latency and energy of the communication to those of the mobile-only approach, we notice that the mobile-only approach for AlexNet is better than the cloud-only approach in all the mobile networks. We apply loss-less compression methods to reduce the overheads of communication, which is covered in the next section.
2.4.2 Layer Compression
The preliminary results of our experiments show that more than 75% of the total energy and delay cost of DNNs in the collaborative approach is caused by communication. This cost is directly proportional to the size of the layer being downloaded to or uploaded from the mobile device. Because of the complex feature extraction process of DNNs, the size of some of the intermediate layers is even larger than the network's input data; for example, this ratio can go as high as 10x in VGG16. To address this bottleneck, we investigated compressing the feature data before any communication. This process can be applied to different DNN architecture types; however, we only considered CNNs due to their specific characteristics, explained below in detail.
CNN architectures are mostly used for image and video recognition applications. Because convolution layers preserve spatial locality, we can assume that the outputs of the first convolution layers follow the same structure as the input image, as shown in Figure 2.13. Moreover, a large fraction of the layer outputs is expected to be zero due to the presence of the ReLU layer. Our observations show that the ratio of neurons equal to zero (ZR) varies from 50% to 90% after ReLU in CNNs. These two characteristics, layers being similar to the input image and a large proportion of their data being a single value, suggest that we can apply existing image compression techniques to their outputs.
There are two general categories of compression techniques: lossy and loss-less [40]. In loss-less techniques, the exact original information is reconstructed. On the contrary, lossy techniques use approximations, and the original data cannot be reconstructed. In our experiments, we examined the impact of compressing layer outputs using PNG, a loss-less technique based on the encoding of frequent sequences in an image.
Even though the data type of DNN parameters in typical implementations is 32-bit floating point, most image formats are based on 3-byte RGB color triples. Therefore, to compress a layer in the same way as 2D pictures, the floating-point data should be quantized into 8-bit fixed point. Recent studies show that representing the parameters of DNNs with only 4 bits affects the accuracy by no more than 1% [5]. In this work, we implemented our architectures with 8-bit fixed point and presented our baseline without any compression or quantization. The layers of a CNN contain numerous channels of 2D matrices, each similar to an image. A simple method is to compress each channel separately. However, in addition to the extra overhead of a file header for each channel, this method does not take full advantage of PNG's frequent-sequence encoding. One alternative is locating different channels side by side, referred to as tiling, to form a large 2D matrix representing one layer, as shown in Figure 2.13. It should be noted that 1D fully connected layers are very small, and we did not apply compression to them.
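The sketch below illustrates the quantize-tile-compress pipeline just described, using NumPy and Pillow; the tile geometry and the min-max quantizer are illustrative choices of ours rather than the exact implementation used in the experiments.

import io
import math
import numpy as np
from PIL import Image

def compress_layer(feature):
    """feature: float32 array of shape (channels, height, width), e.g. a post-ReLU output."""
    c, h, w = feature.shape
    # 8-bit fixed-point quantization of the floating-point feature map.
    lo, hi = float(feature.min()), float(feature.max())
    q = np.round((feature - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)
    # Tile the channels side by side into one large 2D matrix.
    cols = int(math.ceil(math.sqrt(c)))
    rows = int(math.ceil(c / cols))
    canvas = np.zeros((rows * h, cols * w), dtype=np.uint8)
    for k in range(c):
        r, col = divmod(k, cols)
        canvas[r * h:(r + 1) * h, col * w:(col + 1) * w] = q[k]
    # Loss-less PNG encoding of the tiled layer.
    buf = io.BytesIO()
    Image.fromarray(canvas).save(buf, format="PNG")
    png_bytes = buf.getvalue()
    return {"png": png_bytes,
            "CR": q.size / len(png_bytes),        # 8-bit raw size / PNG size
            "ZR": float((q == 0).mean())}         # ratio of zero-valued activations

# Example: a sparse, ReLU-like feature map compresses well.
feat = np.maximum(np.random.randn(96, 55, 55).astype(np.float32), 0.0)
stats = compress_layer(feat)
print(f"CR = {stats['CR']:.2f}, ZR = {stats['ZR']:.2f}")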
The Compression Ratio (CR) is defined as the ratio of the size of the layer (in 8-bit form) to the size of the compressed 2D matrix in PNG. Looking at the compression results for two different CNN architectures in Figure 2.14, we observe a high correlation between the ratio of pixels being zero (ZR) and CR. PNG can compress the layer data by up to 5.8x, and by 3.5x on average, so the communication costs can be reduced drastically. By replacing the layer outputs with their compressed versions and adding the cost of the compression process itself, which is negligible compared to DNN operators, to the JointDNN formulations, we achieve an extra 4.9x and 4.6x improvement in energy and latency on average, respectively.
2.5 Related work and comparison
General Task Offloading Frameworks. Several prior works focus on offloading computation from the mobile to the cloud [15, 41–46]. However, all these frameworks share a limiting feature that makes them impractical for computation partitioning of DNN applications.
These frameworks depend on programmer annotations, as they make decisions about pre-specified functions, whereas JointDNN makes scheduling decisions at run-time based on the model topology and the mobile network specifications. Offloading at the function level cannot lead to efficient partition decisions, because layers of a given type within one architecture can have significantly different computation and data characteristics. For instance, in the optimal solution, the same convolution layer structure may be computed on the mobile in one model and on the cloud in another.
Neurosurgeon [14] is the only prior art exploring a similar computation offloading idea for DNNs between the mobile device and the cloud server at layer granularity. Neurosurgeon assumes that there is only one data transfer point and that the execution schedule of the efficient solution starts on the mobile and then switches to the cloud, which performs all of the remaining computations. Our results show that this is not the case, especially for online training, where the optimal schedule of execution often follows the mobile-cloud-mobile pattern. Moreover, generative and autoencoder models follow a pattern with multiple transfer points. Also, the execution schedule can start on the cloud, especially in the case of generative models, where the input data size is small. Furthermore, inter-layer optimizations performed by DNN libraries are not considered in Neurosurgeon. Moreover, Neurosurgeon only schedules for optimal latency and energy, while JointDNN adapts to different scenarios including battery limitations, cloud server congestion, and QoS. Lastly, Neurosurgeon only targets simple CNN and ANN models, while JointDNN utilizes a graph-based approach to handle more complex DNN architectures like ResNet and RNNs.
Table 2.1: Parameter Definition of Graph Model
Param.    Description of Cost
CE_{i:j}  Executing layers i to j on the cloud
ME_{i:j}  Executing layers i to j on the mobile
ED_{i,j}  CE_{i:j} + DOD_j
EU_{i,j}  ME_{i:j} + UID_j
φ_k       All the following edges: ∀i = 1:k-1, ED_{i,k-1}
Ω_k       All the following edges: ∀i = 1:k-1, ME_{i,k-1}
Ψ_k       All the following edges: ∀i = 1:k-1, EU_{i,k-1}
Γ_k       All the following edges: ∀i = 1:k-1, CE_{i,k-1}
Π_m       All the following edges: ∀i = 1:n, ME_{i,n}
Π_c       All the following edges: ∀i = 1:n, ED_{i,n}
U_1       Uploading the input of the first layer
Figure 2.7: (a) A residual building block in DNNs (b) Transformation of a residual building block to be usable in JointDNN's shortest-path-based scheduler.
Table 2.2: Benchmark Specifications
Type            Model        Layers
Discriminative  AlexNet      21
                OverFeat     14
                Deep Speech  10
                ResNet       70
                VGG16        37
                NiN          29
Generative      Chair        10
Autoencoder     Pix2Pix      32
Table 2.3: Mobile networks specifications in the U.S.
Param.                 3G       4G       Wi-Fi
Download speed (Mbps)  2.0275   13.76    54.97
Upload speed (Mbps)    1.1      5.85     18.88
α_u (mW/Mbps)          868.98   438.39   283.17
α_d (mW/Mbps)          122.12   51.97    137.01
β (mW)                 817.88   1288.04  132.86
Figure 2.8: Latency and energy improvements (%) for batch sizes 1-10 during inference over the base case of mobile-only and cloud-only approaches, shown for Pix2Pix, Deep Speech, VGG16, NiN, Chair, OverFeat, AlexNet, and ResNet under 3G, 4G, and Wi-Fi.
Table 2.4: Workload reduction of the cloud server in different mobile networks
Optimization Target 3G (%) 4G (%) Wi-Fi (%)
Latency 84 49 12
Energy 73 49 51
Figure 2.9: Latency and energy improvements (%) for batch sizes 1-10 during training over the base case of mobile-only and cloud-only approaches, shown for the same benchmarks and networks as in Figure 2.8.
Figure 2.10: (a) Latency of one epoch of online training using the JointDNN algorithm vs. the percentage of updated weights (b) Latency of mobile-only inference vs. batch size, for Pix2Pix, Deep Speech, VGG16, NiN, Chair, OverFeat, AlexNet, and ResNet.
Figure 2.11: Interesting schedules of execution (mobile vs. cloud) for three types of DNN architectures (AlexNet, Chair, and Pix2Pix) under Wi-Fi, latency-efficient inference and training, while the mobile and cloud are allowed to use up to half of their computing resources.
Figure 2.12: (a) Execution time of AlexNet optimized for performance (b) Mobile energy consumption of AlexNet optimized for energy, comparing mobile-only, cloud-only, and JointDNN under 3G, 4G, and Wi-Fi (c) Data size of the layers in AlexNet and the scheduled computation, where the first nine layers are computed on the mobile and the rest on the cloud, which is the optimal solution w.r.t. both performance and energy.
Figure 2.13: Layer output after passing the input image through convolution, normalization, and ReLU [39] layers. The channels preserve the general structure of the input image, and a large portion of the output data is black (zero) due to the ReLU. Tiling is used to put all 96 channels together.
Figure 2.14: Compression Ratio (CR) and ratio of zero-valued neurons (ZR) for different layers of (a) AlexNet and (b) VGG16.
Chapter 3
Towards Collaborative Intelligence Friendly Architectures for
Deep Learning
Modern mobile devices are equipped with high-performance hardware resources such as graphics processing units (GPUs), making end-side intelligent services more feasible. More recently, specialized silicon in the form of neural engines has been introduced for mobile devices. However, most mobile devices are still not capable of performing real-time inference using very deep models. Computations associated with deep models for today's intelligent applications are typically performed
solely on the cloud. This cloud-only approach requires significant amounts of raw data to be up-
loaded to the cloud over the mobile wireless network and imposes considerable computational and
communication load on the cloud server. Recent studies have shown that the latency and energy
consumption of deep neural networks in mobile applications can be notably reduced by splitting
the workload between the mobile device and the cloud. In this approach, referred to as collabora-
tive intelligence, intermediate features computed on the mobile device are offloaded to the cloud
instead of the raw input data of the network, reducing the size of the data needed to be sent to the
cloud. In this paper, we design a new collaborative intelligence friendly architecture by introduc-
ing a unit responsible for reducing the size of the feature data needed to be offloaded to the cloud
to a greater extent, where this unit is placed after a selected layer of a deep model. This unit is
referred to as the butterfly unit. The butterfly unit consists of the reduction unit and the restoration
unit. The output of the reduction unit is offloaded to the cloud server, on which the computations associated with the restoration unit and the rest of the inference network are performed. Both the reduction and restoration units use a convolutional layer as their main component. The inference outcomes are sent back to the mobile device. The new network architecture, including the introduced butterfly unit after a selected layer of the underlying deep model, is trained end-to-end. Our proposed method, across different wireless networks, achieves on average a 53x improvement in end-to-end latency and a 68x improvement in mobile energy consumption compared to the status quo cloud-only approach for ResNet-50, while the accuracy loss is less than 2%.
3.1 Introduction
Recent advances in deep neural networks (DNNs) have contributed to the state-of-the-art perfor-
mance in various artificial intelligence (AI)-based applications such as image classification [47,
48], object detection [49], speech recognition [50], natural language processing [51], and so forth.
Consequently, mobile and Internet of Things (IoT) devices are increasingly relying on these DNNs to improve their performance in such AI-based applications. However, the storage and computation requirements of most state-of-the-art deep models limit the full deployment of the inference network on mobile devices. Therefore, in the most common deployment approach for DNN-based applications on mobile devices, the input data of the DNN is sent to cloud servers, and the computations associated with the inference network are performed fully on the cloud side [52].
One of the downsides of the cloud-only approach is the fact that it requires the mobile edge
devices to send considerable amounts of data, which can be images, audio, and video, over the
wireless network to the cloud. This leads to notable latency and energy overheads on the mobile
device. Furthermore, in a scenario where a large number of mobile devices send a vast amount of
simultaneous bit streams to the cloud server, the imposed computation workload of simultaneously
executing numerous deep models could become a bottleneck on the cloud server.
Recently, inspired by the progress in the computation power and energy efficiency of mobile
devices, there has been a body of research studies investigating the strategy of pushing a portion of
the workload from cloud servers to mobile edge devices, where both the mobile and cloud execute
the inference network collaboratively. As a result, a concept named collaborative intelligence has
been introduced [52–58]. In collaborative intelligence, the deep network is split at an intermediate
layer between the mobile and cloud. In other words, instead of sending the original raw data
from the mobile device to the cloud and executing the inference network fully on the cloud side,
the computations associated with the initial layers are performed on the mobile side. Then, the
computed feature tensor of the last assigned layer on the mobile side could be sent to the cloud
for executing the remained computation layers of the inference network. By allocating a portion
of the inference network to the mobile side, the imposed workload on the cloud reduces, where
this results in the increased throughput on the cloud. Furthermore, in some deep models which
are based on convolutional neural networks (CNNs), e.g. AlexNet [47], the feature data volume
generally shrinks as we go deeper in the model, and it might become even less than the model input
size after a number of layers [52, 53, 55]. Therefore, by computing a few layers on the mobile,
the amount of data needed to be sent to the cloud in the collaborative intelligence framework
can decrease compared to the cloud-only approach. This can lead to reduced energy and latency
overheads on the mobile side.
According to a recent study done in [55] for different hardware and wireless connectivity con-
figurations, the optimal operating point for the inference network in terms of latency and/or energy
consumption is associated with dividing the network between the mobile and cloud, and not the
common cloud-only approach, or the mobile-only approach (in case the deep model is able to be
executed fully on the mobile device). The optimal point of split depends on the computational and
data characterization of DNN layers and is usually at a point deep in the inference network. The
approach [53] has extended the work of [55] and included model training and additional network
architectures. The network is again split between the mobile and cloud, but the data can flow in
both ways in order to optimize the efficiency of both the inference and training. In summary, in the
research works studying the collaborative intelligence framework, a given deep network is split be-
tween the mobile device and the cloud without any modification to the network architecture itself
[52–58].
In this paper, we investigate the problem of altering a given deep network architecture before partitioning it between the mobile and cloud. For this purpose, on the mobile side, we
introduce a reduction unit right before uploading the feature tensor to the cloud. The reduction
unit is stacked to the end of the computation layers assigned to the mobile side. The computation
associated with the reduction unit is also done on the mobile side. The purpose of this unit is
reducing the feature data volume needed to be sent to the cloud via the wireless network to a greater
extent, since the latency and energy overheads associated with the wireless upload link in state-of-
the-art approaches for collaborative intelligence still contribute to the major portion of the energy
consumption of the inference network on the mobile side and end-to-end latency [53]. Accordingly,
on the cloud side, we introduce a restoration unit which is stacked before the computation layers
assigned to the cloud. Both the reduction and restoration units use a convolutional layer as their
main component. The dimensions of the convolution layers used in the reduction and restoration units are set so that the input tensor of the reduction unit and the output tensor of the restoration unit have the same dimensions. We refer to the combination of the reduction
and restoration units as the butterfly unit (see Fig. 3.1). The new network architecture, including
the introduced butterfly unit after a selected layer of the underlying deep model, is trained end-to-
end, whereas other works that consider compression for reducing the feature data volume sent to the cloud add non-learnable compression techniques (e.g., JPEG) to an already trained model [52, 57, 58].
The rest of the paper is structured as follows: Section 3.2 elaborates more on the details of
the proposed butterfly unit and the proposed DNN partitioning algorithm. Section 3.3 provides
the obtained improvements in terms of end-to-end latency and the mobile energy consumption. It
also discusses the flexibility of network partitioning point based on the load level of the cloud and
mobile, and the wireless network conditions. Finally, Section 3.4 concludes the paper.
Figure 3.1: The butterfly unit. It takes a tensor with D channels and shrinks it into a tensor with D_r channels using the reduction unit, where D_r ≪ D. It then outputs a tensor with the same dimensions as the input using the restoration unit.
3.2 Proposed Method
3.2.1 Butterfly Unit
The butterfly unit, as shown in Fig. 3.2, consists of two components: 1) the reduction unit, and 2)
the restoration unit. The input to the reduction unit is a tensor of size (batch size, width, height, D). A convolution filter of size (1, 1, D, D_r) is applied to the input, producing a tensor of size (batch size, width, height, D_r) as the output of the reduction unit. The output tensor of the reduction unit is the shrunk representation of its input along the channel axis (D_r ≪ D), and it is the tensor that is uploaded to the cloud server. On the cloud side, in the restoration unit, by applying a convolution filter of size (1, 1, D_r, D), we restore the dimensions of the original input of the butterfly unit to proceed with the rest of the inference. The butterfly unit is placed after one of the layers in a DNN. The intuition behind decreasing the tensor dimension along the channel axis in the reduction unit is that each channel typically preserves the visual structure of the input; therefore, we can expect this inexpensive 1x1 convolution to keep enough information about the feature data. In addition, depending on the architecture of the underlying deep model, the feature tensor size varies layer by layer, typically increasing in channel size. Therefore, as we go deeper in the model, more channels are required in the output of the reduction unit, D_r, to maintain the accuracy of the model.
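As an illustration, a hedged PyTorch sketch of such a butterfly unit is given below; the module and argument names are ours, PyTorch's channels-first layout replaces the (batch, width, height, channels) layout used above, and the batch-normalization/ReLU/1x1-convolution ordering follows Fig. 3.2.

import torch.nn as nn

def bn_relu_conv1x1(in_ch, out_ch):
    # BatchNorm -> ReLU -> 1x1 convolution, following the ordering sketched in Fig. 3.2.
    return nn.Sequential(nn.BatchNorm2d(in_ch),
                         nn.ReLU(inplace=True),
                         nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False))

class ButterflyUnit(nn.Module):
    def __init__(self, d, d_r):
        super().__init__()
        self.reduction = bn_relu_conv1x1(d, d_r)     # computed on the mobile device
        self.restoration = bn_relu_conv1x1(d_r, d)   # computed on the cloud server

    def forward(self, x):                            # x: (batch, D, H, W)
        z = self.reduction(x)                        # (batch, D_r, H, W): the tensor uploaded to the cloud
        return self.restoration(z)                   # (batch, D, H, W): same shape as the input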
From the perspective of the mobile device, the location of the butterfly unit is desired to be
closer to the input layer so that the mobile device computes fewer layers. However, from the
perspective of the cloud server, we want to push more computations towards the mobile device in
order to reduce the data center workloads. Particularly, when the cloud server and/or the wireless
network are congested, pushing computations towards the mobile device is advantageous. As a
result, there is a trade-off in choosing the location of the butterfly unit in the inference network.
3.2.2 Partitioning Algorithm
The proposed algorithm for choosing the location of the butterfly unit and the proper value of D_r comprises three main steps: 1) training, 2) profiling, and 3) selection. In the training phase, we train M models, where each model is associated with placing the butterfly unit after a different arbitrary layer among the total N layers of the inference network (M ≤ N). For each model, via linear search, we find and choose the minimum D_r for the butterfly unit that reaches a pre-defined
acceptable accuracy. During the profiling phase, for each of M models, we measure the latency
values corresponding to the computation of layers assigned to the mobile side, the reduction unit,
the wireless up-link of the shrunk feature data to the cloud, the restoration unit, and the computa-
tions of layers assigned to the cloud side. Furthermore, for energy consumption, we measure the
values associated with the computation of layers assigned to the mobile side, the reduction unit,
and the wireless up-link of the shrunk feature data. These measurements vary for each of M mod-
els, and the current load level of the mobile, cloud, or wireless network conditions. In the end, in
the selection phase, depending on whether the target is minimizing the end-to-end latency or the
mobile energy consumption, we select the best partitioning among M available options.
The full procedure for choosing the location of the butterfly unit and the proper value of D_r is shown in Algorithm 2.
Figure 3.2: The butterfly unit architecture. Each unit consists of batch normalization, ReLU, and a 1x1 convolution: the reduction unit (D -> D_r channels, D_r ≪ D) on the mobile side, and the restoration unit (D_r -> D channels) on the cloud side.
Figure 3.3: The proposed method overview. A shallow model and the reduction unit on the mobile device extract a dense representation of the input, which is uploaded to the cloud. Then, on the cloud, after applying the restoration function to the dense representation, the rest of the inference procedure is followed.
Figure 3.4: ResNet-50 architecture and its 16 residual blocks (RB1-RB16), from the initial 7x7 convolution and 3x3 max pooling on the 224x224x3 input, through the 1x1/3x3/1x1 bottleneck convolutions of the residual stages, to the 7x7 average pooling and the 100-way fully connected softmax layer. The solid lines represent identity shortcuts, and the dashed lines represent projection shortcuts.
Algorithm 2 The proposed DNN partitioning algorithm
Inputs:
    N: number of layers in the DNN
    M: number of partitioning points in the DNN (M ≤ N)
    {P_j | j = 1..M}: location of each partition point
    {F_i | i = 1..N}: feature data size at each layer
    {C_i | i = 1..N}: output channel size of each layer
    K_mobile: current load level of the mobile
    K_cloud: current load level of the cloud
    t_mobile(j, K_mobile), p_mobile(j, K_mobile) | j = 1..M: latency and power on the mobile corresponding to partition j and load K_mobile
    t_cloud(j, K_cloud) | j = 1..M: latency on the cloud corresponding to partition j and load K_cloud
    NB: wireless network bandwidth
    PU: wireless network up-link power consumption
Output: Best partitioned model
// Training phase
for j = 1; j ≤ M; j = j + 1 do
    for k = 1; k ≤ C_{P_j}; k = k + 1 do
        Place a butterfly unit with D_r = k after P_j
        Train()
        if accuracy is acceptable then
            Store as the j-th partitioned model
            break
        end
    end
end
// Profiling phase
for j = 1; j ≤ M; j = j + 1 do
    TM_j = t_mobile(j, K_mobile)
    PM_j = p_mobile(j, K_mobile)
    TC_j = t_cloud(j, K_cloud)
    TU_j = F_{P_j} / NB
end
// Selection phase
if target is min latency then
    return argmin_{j=1..M} (TM_j + TU_j + TC_j)
end
if target is min energy then
    return argmin_{j=1..M} (TM_j · PM_j + TU_j · PU)
end
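For illustration, the selection phase above can be rendered in a few lines of Python. The function below is a sketch of ours that assumes the per-partition profiles have already been measured and stored as plain lists, with latencies in milliseconds and power in milliwatts (so the energy term is meaningful only as a relative cost for comparison).

def select_partition(t_mobile, p_mobile, t_cloud, feature_bytes, nb_mbps, pu_mw, target="latency"):
    """Return the index of the best of the M candidate partition points."""
    best_j, best_cost = None, float("inf")
    for j in range(len(t_mobile)):
        t_up = feature_bytes[j] * 8 / (nb_mbps * 1e6) * 1e3          # TU_j: upload latency in ms
        if target == "latency":
            cost = t_mobile[j] + t_up + t_cloud[j]                   # TM_j + TU_j + TC_j
        else:
            cost = t_mobile[j] * p_mobile[j] + t_up * pu_mw          # TM_j*PM_j + TU_j*PU
        if cost < best_cost:
            best_j, best_cost = j, cost
    return best_j, best_cost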
Figure 3.5: Input image size of the model and the size of the output feature tensor of each residual block (RB1-RB16) in ResNet-50, in bytes.
Figure 3.6: Residual block architecture with (a) an identity shortcut, y = F(x) + x, and (b) a projection shortcut, y = F(x) + W_s x.
3.3 Evaluation
3.3.1 Experimental Setup
We evaluate our proposed method on the NVIDIA Jetson TX2 board [33], which is equipped with an NVIDIA Pascal™ GPU and fairly represents the computing power of mobile devices. Our server platform is equipped with an NVIDIA GeForce® GTX 1080 Ti GPU, which has almost 30x more computing power than our mobile platform. The detailed specifications of our mobile and server platforms are presented in Table 3.1 and Table 3.2, respectively. We measure the GPU power consumption on our mobile platform using the INA226 power monitoring sensor with a sampling rate of 500 kHz [34]. For our wireless network settings, the average upload speeds of different wireless networks (3G, 4G, and Wi-Fi) in the U.S. are used in our experiments [36, 37]. We use the transmission power models of [38] for wireless networks, which have estimation errors of less than 6%. The power level for the up-link is P_u = α_u t_u + β, where t_u is the up-link throughput, and α_u and β are regression coefficients of the power model. The values for our power model parameters are presented in Table 3.3.
We prototype the proposed method by implementing the inference networks, both for the
mobile device and cloud server, using NVIDIA TensorRT™ [59], which is a platform for high-
performance deep learning inference. It includes a deep learning inference optimizer and run-time
that delivers low latency and high-throughput for deep learning inference applications. TensorRT
is equipped with cuDNN[18], a GPU-accelerated library of primitives for DNNs. TensorRT sup-
ports three precision modes for creating the inference graph: FP32 (single precision), FP16 (half
precision), and INT8. However, our mobile device does not support INT8 operations on its GPU.
Therefore, we use FP16 mode for creating the inference graph from the trained model graph, while single-precision (32-bit) mode is used for training itself. As demonstrated in [60], 8-bit quantiza-
tion would be enough for even challenging tasks like ImageNet [61] classification. Therefore, we
quantize FP16 data types to 8 bits only for uploading the feature tensor to the cloud. We implement
our client-server interface using Thrift [62], an open source flexible RPC interface for inter-process
communication. To allow for flexibility in the dynamic selection of partition points, both the mo-
bile and cloud host all possible M partitioned models. For each of M models, the mobile and cloud
store only their assigned layers. At run-time, depending on the load of the mobile and cloud, wire-
less network conditions, and the optimization goal (minimizing for latency or energy), only one of
M partitioned models is selected. Given a partition decision, execution begins on the mobile device
and cascades through the layers of the DNN leading up to that partition point. Upon completion of
that layer and the reduction unit, the mobile sends the output of the reduction unit from the mobile
device to the cloud. The cloud server then executes the computations associated with the restoration
unit and remaining DNN layers. Upon the completion of the execution of the last DNN layer on
the cloud, the final result is sent back to the mobile device.
We evaluate the proposed method on one of the most promising and widely used DNN architectures, ResNet [48]. DNNs are hard to train because of the notorious vanishing/exploding gradient issue,
which hampers the convergence of the model. As a result, as the network goes deeper, its perfor-
mance gets saturated or even starts degrading rapidly [63]. The core idea of ResNet is introducing
a so-called “identity shortcut connection” that skips one or more layers. The output of a residual
block (RB) with identity mapping will be y = F(x; W) + x instead of the traditional y = F(x; W). The
argument behind ResNet’s good performance is that stacking layers should not degrade the net-
work performance, because we could simply stack identity mappings (layers that do nothing, i.e.,
y= x) upon the current model, and the resulting architecture would perform the same. It indicates
that the deeper model should not produce a training error higher than its shallower counterparts.
They hypothesize that letting the stacked layers fit a residual mapping is easier than letting them
directly fit the desired underlying mapping. If the dimensions change, there are two cases: 1) increasing dimensions: the shortcut still performs identity mapping, with extra zero entries padded for the increased dimension; 2) decreasing dimensions: a projection shortcut is used to match the dimensions of x and y using y = F(x; W) + W_s x, as shown in Fig. 3.6.
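A compact PyTorch sketch of these two residual-block variants is shown below; the batch-normalization placement follows common ResNet implementations rather than the simplified diagram of Fig. 3.6, and the layer sizes are illustrative.

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.f = nn.Sequential(                                        # F(x; W)
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        # Identity shortcut when shapes match, projection shortcut (W_s) otherwise.
        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))                 # y = F(x; W) + x  (or + W_s x)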
The ResNet architecture comes with a flexible number of layers (e.g., 34, 50, 101). In our experiments, we use ResNet-50, which contains 16 residual blocks [48]. Using Algorithm 2, we obtain 16 models, where each model corresponds to placing the butterfly unit after one of the 16 residual blocks. The detailed architecture and the data size of layer outputs of ResNet-50 are demonstrated in Fig. 3.4 and Fig. 3.5, respectively. As indicated in Fig. 3.5, the sizes of intermediate feature tensors in ResNet-50 are larger than the input size up until RB14, which is relatively deep in the model. Therefore, merely splitting this network between the mobile and
cloud for collaborative intelligence may not perform better than the cloud-only approach in terms
of latency and mobile energy consumption, since a large portion of the workload is pushed toward
the mobile.
We evaluate the proposed method on the miniImageNet [64] dataset, a subset of the ImageNet dataset, which includes 100 classes with 600 examples per class. We use 85% of the dataset examples as the training set and the rest as the test set. We randomly crop a 224x224 region out of each sample for data augmentation and train each of the models for 90 epochs.
Table 3.1: Mobile device specifications
Component Specification
System NVIDIA Jetson TX2 Developer Kit
GPU NVIDIA Pascal™, 256 CUDA cores
CPU HMP Dual Denver + Quad ARM® A57/2 MB L2
Memory 8 GB 128 bit LPDDR4 59.7 GB/s
Table 3.2: Server platform specifications
Component Specification
GPU NVIDIA Geforce® GTX 1080 Ti, 12GB GDDR5
CPU Intel® Xeon® CPU E7- 8837 @ 2.67GHz
Memory 64 GB DDR4
Table 3.3: Wireless networks parameters
Param.         3G      4G       Wi-Fi
t_u (Mbps)     1.1     5.85     18.88
α_u (mW/Mbps)  868.98  438.39   283.17
β (mW)         817.88  1288.04  132.86
Figure 3.7: Accuracy levels when choosing different D_r values in the butterfly unit for all of its 16 possible locations in ResNet-50 (i.e., after RB1 to RB16). Only the results corresponding to D_r values of 1-5 are presented; RB14, RB15, and RB16 require a minimum D_r of 10 to maintain the accuracy of the proposed method at or above 74% (less than 2% accuracy loss).
3.3.2 Latency and Energy Improvements
The accuracy of the ResNet-50 model on the miniImageNet dataset without the butterfly unit is 76%. We refer to this accuracy as the target accuracy. The accuracy results of the proposed method when placing the butterfly unit after each of the 16 residual blocks are demonstrated in Fig. 3.7. According to Fig. 3.7, as we increase the number of channels in the reduction unit, accuracy improves, but larger feature tensors need to be transferred to the cloud. Assuming an acceptable error of 2% compared to the target accuracy, placing the butterfly unit after residual blocks 1-3, 4-7, 8-13, and 14-16 requires a D_r of 1, 2, 5, and 10, respectively, in order to maintain the accuracy of the proposed method at or above 74% (less than 2% accuracy loss).
Table 3.4: The End-to-End Latency, mobile energy consumption, and offloaded data size for dif-
ferent partition points in ResNet-50 using the proposed method
Layer RB1 RB2 RB3 RB4 RB5 RB6 RB7 RB8 RB9 RB10 RB11 RB12 RB13 RB14 RB15 RB16
Offloaded Data (KB) 3.1 3.1 3.1 1.6 1.6 1.6 1.6 1.0 1.0 1.0 1.0 1.0 1.0 0.5 0.5 0.5
Latency 3G (ms) 23.7 24.7 25.6 15.0 15.9 16.8 17.7 14.3 15.4 16.2 17.1 17.9 18.8 16.1 17.1 17.9
Energy 3G (mJ) 21.6 22.4 23.3 13.7 14.4 15.4 16.2 13.1 13.9 14.7 15.5 16.4 17.2 14.8 15.7 16.6
Latency 4G (ms) 5.2 6.1 6.9 5.8 6.7 7.6 8.5 8.6 9.6 10.5 11.2 12.1 13.1 13.1 14.2 15.1
Energy 4G (mJ) 9.8 11.6 13.2 10.9 12.7 14.3 15.9 12.6 13.1 14.3 15.2 16.3 17.0 14.4 16.8 17.2
Latency Wi-Fi (ms) 2.4 3.3 4.1 4.3 5.2 6.1 7.0 7.7 8.6 9.4 10.7 11.1 12.2 12.9 13.8 14.7
Energy Wi-Fi (mJ) 4.8 6.8 8.7 9.1 11.2 13.1 14.9 12.1 12.7 13.9 14.8 15.5 16.3 14.1 16.1 16.6
Table 3.4 presents the latency and mobile energy consumption associated with placing the butterfly unit with the proper D_r size (with an accuracy loss of less than 2%) after each residual block, for different wireless networks when there is no congestion in the mobile, cloud, or wireless network.
Table 3.5: Comparison of the proposed method with mobile-only and cloud-only approaches
Setup Latency (ms) Energy (mJ) Butterfly Unit Location Offloaded Data (B) Accuracy
Mobile-only - 15.7 20.5 - 0 76.1
Cloud-only
3G 1101 1047.4 - 150528 76.1
4G 208.4 528.3 - 150528 76.1
Wi-Fi 98.1 342.1 - 150528 76.1
Collaborative
3G 14.3 13.1 After RB8 980 74.0
4G 5.2 9.8 After RB1 3136 74.1
Wi-Fi 2.4 4.8 After RB1 3136 74.1
Table 3.5 shows the selected partition points by our algorithm for the goal of minimum end-to-end
latency and mobile energy consumption, while the acceptable 2% accuracy loss is reached, across
three different wireless configurations (3G, 4G, and Wi-Fi) and when there is no congestion on the
mobile, cloud, and wireless network (These selected partitions are also highlighted in Table 3.4).
Note that the best partitioning for the goal of minimum end-to-end latency is the same as the best partitioning for the goal of minimum mobile energy consumption in each wireless network setting. This is mainly because end-to-end latency and mobile energy consumption are proportional to each other, since the dominant portion of both is associated with the wireless transfer overhead of the intermediate feature tensor. Obtained results for the cloud-only and
mobile-only approaches are also provided in Table 3.5.
Latency Improvement - As demonstrated in Table 3.5, using our proposed method, the end-to-end latency achieves 77x, 40x, and 41x improvements over the cloud-only approach in 3G, 4G, and Wi-Fi networks, respectively.
Energy Improvement - As demonstrated in Table 3.5, using our proposed method, the mobile energy consumption achieves 80x, 54x, and 71x improvements over the cloud-only approach in 3G, 4G, and Wi-Fi networks, respectively.
In the case of 4G and Wi-Fi, the mobile device is only required to compute the DNN layers up to the end of RB1, followed by the reduction unit. In the case of 3G, the mobile device computes the DNN layers up to the end of RB8, followed by the reduction unit.
3.3.3 Server Load Variations
Data centers typically experience fluctuating load patterns. High server utilization leads to in-
creased service times for DNN queries. Using our proposed method, by training multiple DNNs
split on different layers and storing corresponding partitions in the mobile and cloud, the best
model can be selected at run-time by the mobile, based on the current server load level, by peri-
odically pinging the server during the mobile idle period. This leads to avoiding long latencies of
DNN queries caused by high user demand. If the server is congested, we can move the partition point to deeper layers, which pushes more of the workload towards the mobile device. As a result, the computation load of the mobile device will increase. In summary, depending on the
server load, the partition point can be changed while preserving the accuracy and still offloading
less data than raw input.
Consequently, this new compute paradigm not only reduces the end-to-end latency and mobile
energy consumption but also reduces the workload required on the data center, leading to the
shorter query service time and higher query throughput.
3.3.4 Comparison to Other Feature Compression Techniques
In the collaborative intelligence works that have considered compressing intermediate features before uploading them to the cloud, the obtained compression ratios are significantly lower than in our work. For instance, the maximum compression ratio reported in [52] is 3.3x. In contrast, with the proposed trainable butterfly unit, we achieve up to a 256x compression ratio when the butterfly unit is placed after RB1 (in which the channel size is reduced from 256 to 1). This shows that, in the collaborative intelligence framework, compression using the proposed learnable butterfly unit can perform significantly better than traditional compressors.
3.4 Conclusion and Future Work
As the core component of today’s intelligent services, DNNs have been traditionally executed in
the cloud. Recent studies have shown that the latency and energy consumption of DNNs in mobile
applications can be considerably reduced using collaborative intelligence framework, where the
inference network is divided between the mobile and cloud and intermediate features computed
on the mobile device are offloaded to the cloud instead of the raw input data of the network,
reducing the size of the data needed to be sent to the cloud. With these insights, in this work,
we develop a new partitioning scheme that creates a bottleneck in a neural network using the
proposed butterfly unit, which alleviates the communication costs of feature transfer between the
mobile and cloud to a greater extent. It can adapt to any DNN architecture, hardware platform,
wireless connections, and mobile and server load levels, and selects the best partition point for the
minimum end-to-end latency and/or mobile energy consumption at run-time. The new network
architecture, including the introduced butterfly unit after a selected layer of the underlying deep
model, is trained end-to-end. Our proposed method, across different wireless networks, achieves on average a 53x improvement in end-to-end latency and a 68x improvement in mobile energy consumption compared to the status quo cloud-only approach for ResNet-50, while the accuracy loss is less than 2%.
As a future work, the extent of reduction in the feature data size which is transferred between
the mobile and cloud can be explored. Furthermore, the efficacy of the proposed method can
be investigated for different DNN architectures and mobile/server load variations. Additionally,
collaborative intelligence frameworks by considering the advent of revolutionary 5G technology
can be studied.
Chapter 4
BottleNet: A Deep Learning Architecture for Intelligent Mobile
Cloud Computing Services
Recent studies have shown the latency and energy consumption of deep neural networks can be
significantly improved by splitting the network between the mobile device and cloud. This paper
introduces a new deep learning architecture, called BottleNet, for reducing the feature size needed
to be sent to the cloud. Furthermore, we propose a training method for compensating for the poten-
tial accuracy loss due to the lossy compression of features before transmitting them to the cloud.
BottleNet achieves on average a 5.1x improvement in end-to-end latency and a 6.9x improvement in mobile energy consumption compared to the cloud-only approach, with no accuracy loss.
4.1 Introduction
Mobile and Internet of Things (IoT) devices are increasingly relying on deep neural networks
(DNNs), as these networks provide state-of-the-art performance in various intelligent applications
[47–51]. Due to limited computational and storage resources of mobile devices, which prohibits
full deployment of advanced deep models on these devices (the mobile-only method), the most
common deployment approach of most of the DNN-based applications on mobile devices relies
on using the cloud. In this approach, which is referred to as the cloud-only approach, the deep
network is fully placed on the cloud, and thus the input is sent from the mobile to the cloud for
performing the computations associated with the inference network, and the output is sent back to
the mobile device.
The cloud-only approach requires mobile devices to send vast amounts of data (e.g. images,
audios and videos) over the wireless network to the cloud. This can give rise to considerable energy
and latency overheads on the mobile device. Furthermore, pushing all computations toward the
cloud can lead to congestion in a scenario where many mobile devices simultaneously send data
to the cloud. As a compromise between the mobile-only and the cloud-only approach, recently,
a body of research work has been investigating the idea of splitting a deep inference network
between the mobile and cloud [52–58]. In this approach, which is referred to as collaborative
intelligence, the computations associated with initial layers of the inference network are performed
on the mobile device, and the feature tensor (activations) of the last computed layer is sent to the
cloud for the remainder of computations. The main motivation for collaborative intelligence is the
fact that in many applications which are based on deep convolutional neural networks (CNNs), the
feature volume of layers will shrink in size as we go deeper in the model [52, 53, 55]. Therefore,
computing a few layers on the mobile and then sending the last computed feature tensor to the
cloud can reduce the latency and energy overheads of wireless transfer of the data to the cloud
compared to sending the input of the model directly to the cloud. Furthermore, pushing a portion
of computations onto the mobile devices can reduce the congestion on the cloud and hence increase
its throughput.
Based on a recent study done for various hardware and wireless connectivity configurations
[55], the optimal operating point for the inference network in terms of latency and/or mobile energy
consumption is associated with dividing the network between the mobile and cloud, and not the
common cloud-only approach, or the mobile-only approach (in case the deep model could be
deployed fully on the mobile device). The optimal point of split depends on the computational and
data characterization of DNN layers and is usually at a point deep in the inference network.
In research studies investigating collaborative intelligence, a given deep network is split be-
tween the mobile device and the cloud without any modification to the network architecture itself
Figure 4.1: Overview of the proposed method: a shallow model and the reduction unit on the mobile device, followed by a compressor; a decompressor, the restoration unit, and the deep model on the cloud.
[52, 53, 55–58]. In this paper, we investigate altering the underlying deep model architecture to
make it collaborative intelligence friendly. For this purpose, we mainly focus on altering the un-
derlying deep model in a way that the feature data size needed to be transmitted to the cloud is
reduced. This is because, in the studies investigating collaborative intelligence, the latency and energy overheads of the wireless data transfer to the cloud still play a major role in the total mobile energy consumption and the end-to-end latency [53]. Therefore, reducing the transmitted data size
to the cloud is generally beneficial. For this purpose, we add an inexpensive learnable reduction unit after the layers assigned to be computed on the mobile device, and the output of this unit is then compressed using conventional compressors (e.g., JPEG) and sent to the cloud. Correspondingly, a decompressor and a learnable restoration unit are added before the layers assigned to the cloud. The main components of the reduction and restoration units are convolutional layers whose dimensions are determined such that the input of the reduction unit and the output of the restoration unit have the same dimensionality.
An overview of the proposed method is shown in Fig. 4.1. The insertion location and size of the reduction and restoration units in the underlying DNN are determined as explained in Section 4.2.3. Since inserting the reduction unit creates a data bottleneck in the model,
the combination of the learnable reduction unit, compressor and decompressor, and the learnable
restoration unit is referred to as the bottleneck unit, and the new network architecture including
the bottleneck unit is referred to as BottleNet, which is trained end-to-end. For the reduction unit,
we evaluate and compare dimension reductions along both the channel and spatial dimensions of
intermediate feature tensors as explained in Section 4.2.1.
As we see in Section 4.3, an obvious benefit of using the proposed bottleneck unit is in deep
models where feature tensor sizes are relatively high, such as ResNet [48] and VGG [65]. In
such networks, a layer whose feature size is less than the input size is either not present or lies very deep inside the network. Therefore, if we want to merely split the network and send the intermediate feature tensor to the cloud as in [53, 55], we need to compute a considerable number of layers on the mobile. This pushes a major workload to the mobile, which will likely
result in higher latency and energy consumption compared to the cloud-only approach. This could
be the main reason that previous work on collaborative intelligence usually has focused on deep
architectures where the intermediate feature size is relatively small compared to the input size after
computing only a few layers, such as AlexNet [47] and DeepFace [66].
Furthermore, since the features of an intermediate layer in a deep model tend to exhibit statistical
characteristics such as redundancy and lower entropy compared to the model input,
compressing the feature tensor before sending it to the cloud can considerably reduce
the data size sent over the wireless network. Therefore, a major portion
of the works studying collaborative intelligence also consider feature compression instead of direct
transfer of the feature tensor to the cloud [52, 53, 57, 58]. In this work, we consider lossy compression
of the feature tensor. Lossy compression methods can lead to higher bit savings than
lossless approaches; however, they may adversely affect the achieved accuracy,
which is why lossy compression is less studied in works that use feature compression in
the collaborative intelligence framework, as they mostly apply lossless compression to the
features before transmitting them to the cloud. In order to compensate for the accuracy reduction due
to the lossy compressor, we propose a novel training method that approximates
the behavior of the lossy compressor as an identity function in backpropagation. The proposed
training method is explained in detail in Section 4.2.2.
In summary, the contributions of this paper are as follows:
• We propose the bottleneck unit, in which a learnable reduction unit followed by a lossy
compressor significantly reduces the feature tensor size that must be transmitted to the cloud.
• For training our model, we propose a lossy compression-aware training method in order to
compensate for the accuracy loss.
• Using the proposed BottleNet architecture, compared to the cloud-only approach, we achieve
on average 5.1x improvement in end-to-end latency and 6.9x improvement in mobile energy
consumption with no accuracy loss.
The remainder of the paper is structured as follows. Section 4.2 provides a more detailed
explanation of the bottleneck unit and the training method. Section 4.3 presents the energy and latency
improvements of the proposed bottleneck unit, discusses the efficacy of our training approach,
and elaborates on the flexibility of changing the partition point depending on cloud server
congestion and wireless network conditions. Finally, Section 4.4 concludes the paper.
4.2 Proposed Method
In this section, first, we describe details of the bottleneck unit. Then, we explain our proposed
training method when a non-differentiable lossy compression is applied to the intermediate feature
tensor before transmitting it to the cloud. Finally, we explain our approach for finding the best
insertion location and size of the bottleneck unit in the underlying deep model to achieve the
lowest end-to-end latency and/or mobile energy consumption in different wireless settings.
4.2.1 Bottleneck Unit
For dimension reduction of the feature tensor in the bottleneck unit, we evaluate and compare
reductions along either the channel or the spatial dimensions. The bottleneck unit, referred
to as an autoencoder in the deep learning context, is responsible for learning a dense representation of
the features in an intermediate layer. As depicted in Fig. 4.2, channel-wise reduction shrinks the
number of channels of the features, and spatial reduction shrinks the spatial dimensions (width
and height) of the features. More specifically, the channel-wise reduction unit takes a tensor of size
(batch size, w, h, c) as input and outputs a tensor of size (batch size, w, h, c') by applying an
inexpensive convolution filter of size (1, 1, c, c') followed by normalization and non-linearity layers.
The output tensor of the channel-wise reduction unit is the reduced-order representation of its input
in terms of channel size (c' < c). The spatial reduction unit takes a tensor of size (batch size, w, h, c)
as input and outputs a tensor of size (batch size, w', h', c) by applying a convolution filter of size
(w_f, h_f, c, c) followed by normalization and non-linearity layers. The output tensor of the spatial
reduction unit is the reduced-order representation of its input in terms of width and height (w' < w
and h' < h). In both the channel-wise and spatial reduction units, we use ReLU as the non-linearity
function. For reduction in the spatial dimension, the stride of the convolution must be greater
than one. It should be noted that, to cover each neuron at least once during the convolution, the
filter must be at least as large as the stride, i.e., w_f >= w/w' and h_f >= h/h'. In this paper, we
use the same reduction factor for both width and height, referred to as the spatial reduction factor
s, i.e., w/w' = h/h' = s.
The bottleneck unit architecture uses both spatial and channel-wise reduction units followed by
a compressor unit on the mobile device to create a compressed representation of the feature tensor,
which is the tensor transmitted to the cloud. On the cloud, the bottleneck unit uses a decompressor
followed by channel-wise and spatial restoration units to restore the dimensions of the feature tensor.
The detailed architecture of the bottleneck unit is depicted in Fig. 4.3. The choice of ReLU in the
reduction units can potentially lead to higher zero rates, resulting in higher compression ratios. The
bottleneck unit is inserted between two consecutive layers of the underlying deep model,
which are selected by the algorithm explained in Section 4.2.3.
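To make the unit shapes concrete, the following PyTorch sketch shows one possible realization of the channel-wise and spatial reduction/restoration units under the assumptions above; the module names, the 3x3 spatial kernel, and the example values c' = 5 and s = 2 are illustrative, not the exact implementation used in this chapter.

```python
# A minimal PyTorch sketch of the reduction/restoration units described above.
# Module names, the 3x3 spatial kernel, and the example sizes are assumptions.
import torch
import torch.nn as nn

class ChannelReduction(nn.Module):
    """1x1 convolution that shrinks c channels to c_prime channels."""
    def __init__(self, c, c_prime):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c_prime, 1),
                                  nn.BatchNorm2d(c_prime), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.body(x)

class ChannelRestoration(nn.Module):
    """1x1 convolution that restores c_prime channels back to c channels."""
    def __init__(self, c_prime, c):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c_prime, c, 1),
                                  nn.BatchNorm2d(c), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.body(x)

class SpatialReduction(nn.Module):
    """Strided convolution that shrinks width and height by a factor of s."""
    def __init__(self, c, s, k=3):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, k, stride=s, padding=k // 2),
                                  nn.BatchNorm2d(c), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.body(x)

class SpatialRestoration(nn.Module):
    """Transposed convolution that restores the original width and height."""
    def __init__(self, c, s, k=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.ConvTranspose2d(c, c, k, stride=s, padding=k // 2, output_padding=s - 1),
            nn.BatchNorm2d(c), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.body(x)

# Example: a (1, 256, 56, 56) RB1 feature tensor reduced to (1, 5, 28, 28), as in Table 4.2.
x = torch.randn(1, 256, 56, 56)
reduced = ChannelReduction(256, 5)(SpatialReduction(256, s=2)(x))
restored = SpatialRestoration(256, s=2)(ChannelRestoration(5, 256)(reduced))
```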
Figure 4.2: Learnable dimension reduction and restoration units along the (a) channel and (b)
spatial dimensions of the features. The channel-wise pair uses 1x1 convolutions, and the spatial
pair uses a strided w_f x h_f convolution and a w_f x h_f deconvolution, each followed by batch
normalization and ReLU.
Figure 4.3: The bottleneck unit embedded with a non-differentiable lossy compression (e.g.,
JPEG). On the mobile device, the spatial and channel-wise reduction units are followed by the
JPEG compressor; on the cloud, the JPEG decompressor is followed by the channel-wise and
spatial restoration units.
4.2.2 Non-differentiable Lossy Compression Aware Training
Lossy compression is inherently a non-differentiable function due to the quantization that is an
integral part of the compression. Introducing a non-differentiable function in a neural network
breaks back-propagation because gradients are not propagated to the layers before the
non-differentiable function, so the model is no longer end-to-end trainable. To solve this issue, we
introduce a new training method that makes the model end-to-end differentiable by defining a
gradient for the pair of compressor and decompressor during backpropagation. In other words,
as depicted in Fig. 4.4, the pair of lossy compressor and decompressor is used as is during
forward propagation, while we treat this pair as an identity function during backpropagation (i.e.,
gradients are passed without any change to the layers before the compressor). Therefore, the whole
model becomes end-to-end differentiable.
The intuition behind approximating the pair of compressor and decompressor with an identity
function during backpropagation is that the input to the compressor is equal to the output
of the decompressor in a lossless scenario. Because we are incorporating a lossy compression
method, assuming this equality during backpropagation serves as an approximation. However, by
still applying the lossy compression during forward propagation, the introduced error is given a
chance to be compensated for by the subsequent layers. The effectiveness of this training method
compared to simply training the model without considering the compression is shown in
Section 4.3.3.
The input to the compressor is typically quantized to unsigned n-bit numbers by a uniform
quantizer. Similar to [58], to quantize the features F we use the following quantizer:

F̃ = round( (F − min(F)) / (max(F) − min(F)) × (2^n − 1) )    (4.1)
In addition, the input to the compressor should be reshaped into a 2-D tensor. The features F,
with C channels, are rearranged into a tiled image with a width of 2^ceil(log2(C)/2) and a height of
2^floor(log2(C)/2) to keep the aspect ratio as square as possible and achieve the maximum compression
ratio.

Figure 4.4: Embedding non-differentiable compression (e.g., JPEG) in the DNN architecture. We
approximate the pair of JPEG compressor and decompressor units by an identity function to make the
model differentiable in backpropagation.
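The following PyTorch sketch illustrates one way to realize this training scheme (a straight-through treatment of the JPEG codec) together with the uniform quantizer of Eq. 4.1; for brevity it compresses each channel as a separate JPEG tile rather than building the single tiled image described above, and all names are assumptions rather than the chapter's actual code.

```python
# A minimal sketch of compression-aware training: JPEG in the forward pass,
# identity in the backward pass. Per-channel tiles are used here for brevity;
# the chapter instead tiles all channels into one near-square image.
import io
import numpy as np
import torch
from PIL import Image

def quantize(f, n=8):
    """Uniform quantization of Eq. 4.1: map features to unsigned n-bit integers."""
    f_min, f_max = f.min(), f.max()
    q = torch.round((f - f_min) / (f_max - f_min) * (2 ** n - 1))
    return q, f_min, f_max

def jpeg_roundtrip(tile_uint8, quality):
    """Compress and decompress one 2-D uint8 array with JPEG."""
    buf = io.BytesIO()
    Image.fromarray(tile_uint8).save(buf, format="JPEG", quality=quality)
    return np.array(Image.open(io.BytesIO(buf.getvalue())))

class JPEGBottleneck(torch.autograd.Function):
    @staticmethod
    def forward(ctx, features, quality):
        q, f_min, f_max = quantize(features)          # n = 8 bits, as in the chapter
        out = torch.empty_like(q)
        for b in range(q.shape[0]):
            for c in range(q.shape[1]):
                tile = q[b, c].detach().cpu().numpy().astype(np.uint8)
                out[b, c] = torch.from_numpy(
                    jpeg_roundtrip(tile, quality).astype(np.float32))
        # De-quantize back to the original dynamic range.
        return out / (2 ** 8 - 1) * (f_max - f_min) + f_min

    @staticmethod
    def backward(ctx, grad_output):
        # The compressor/decompressor pair is treated as an identity function,
        # so gradients pass through unchanged (None is for `quality`).
        return grad_output, None

features = torch.randn(1, 5, 28, 28, requires_grad=True)
restored = JPEGBottleneck.apply(features, 20)
restored.sum().backward()  # gradients reach `features` despite the JPEG step
```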
4.2.3 Architecture Selection
The proposed algorithm for choosing the location of the bottleneck unit and the proper values of
c' (reflecting the degree of reduction along the channel dimension) and s (reflecting the degree
of reduction along the spatial dimension) comprises three main steps: 1) training, 2) profiling,
and 3) selection. We consider placing the bottleneck unit after an arbitrary selected
layer of the underlying network, for a total of M different locations in the network, where M is
less than or equal to the total number of layers N. At each of the M locations, we train different
architectures associated with different degrees of dimension reduction along the channel or spatial
dimensions, and among those which result in acceptable accuracy levels, we select the one with the
minimum bit requirement. We repeat this process for all M locations. At the end, among the M
selected models associated with M different partitioning solutions of the network, depending on
our optimization target, we choose the best partitioning in terms of minimizing mobile energy
consumption and/or end-to-end latency. We measure the latency and mobile energy consumption of
the computations assigned to the mobile device (including the reduction and compressor units), the
wireless transfer of the dense compressed feature tensor to the cloud, and the computations assigned
to the cloud (including the decompressor and restoration units). The detailed algorithm for choosing
the location of the bottleneck unit and the proper amount of reduction in the channel and spatial
dimensions is presented in Algorithm 3.
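As a concrete illustration of the profiling and selection steps, the sketch below picks the best of the M candidate partitions from per-partition latencies and power draws; the variable names mirror Algorithm 3 but are otherwise assumed, and the profiled values would come from measurement.

```python
# A minimal sketch of the selection phase of the partitioning algorithm.
# t_mobile, p_mobile, t_cloud, and D are per-partition profiles (length M);
# NB is the up-link bandwidth in bytes/s and PU the up-link power in watts.
def select_partition(t_mobile, p_mobile, t_cloud, D, NB, PU, target="latency"):
    best_j, best_cost = None, float("inf")
    for j in range(len(D)):
        t_upload = D[j] / NB  # time to upload the compressed feature tensor
        if target == "latency":
            cost = t_mobile[j] + t_upload + t_cloud[j]        # end-to-end latency
        else:
            cost = t_mobile[j] * p_mobile[j] + t_upload * PU  # mobile energy
        if cost < best_cost:
            best_j, best_cost = j, cost
    return best_j
```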
4.3 Evaluation
4.3.1 Experimental Setup
For evaluating our proposed method, we use the NVIDIA Jetson TX2 board [33], equipped with
an NVIDIA Pascal™ GPU with 256 CUDA cores, as our mobile platform, while our server platform
is equipped with an NVIDIA GeForce® GTX 1080 Ti GPU, which has almost 30x more computing
power than our mobile platform. We measure the GPU power consumption on our
mobile platform using the INA226 power monitoring sensor with a sampling rate of 500 kHz [34].
For the wireless network settings, the average upload speeds of different wireless networks (3G,
4G, and Wi-Fi) in the U.S. are used in our experiments [36, 37]. We use the transmission power
models of [38] for the wireless networks, which have an estimation error rate of less than 6%. The
up-link power level is estimated by P_u = a_u * t_u + b, where t_u is the up-link throughput, and
a_u and b are the regression coefficients of the power model. The values of our power model
parameters are presented in Table 4.1.
Table 4.1: Wireless network parameters

Param.           3G       4G       Wi-Fi
t_u (Mbps)       1.1      5.85     18.88
a_u (mW/Mbps)    868.98   438.39   283.17
b (mW)           817.88   1288.04  132.86
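For illustration, the short sketch below (an assumed helper, not from the dissertation) turns the Table 4.1 parameters and the power model P_u = a_u * t_u + b into an upload latency and radio energy estimate for a given payload size.

```python
# Estimate up-link transfer time and mobile radio energy from Table 4.1.
WIRELESS = {  # t_u in Mbps, a_u in mW/Mbps, b in mW
    "3G":    {"t_u": 1.1,   "a_u": 868.98, "b": 817.88},
    "4G":    {"t_u": 5.85,  "a_u": 438.39, "b": 1288.04},
    "Wi-Fi": {"t_u": 18.88, "a_u": 283.17, "b": 132.86},
}

def upload_cost(size_bytes, network):
    p = WIRELESS[network]
    power_mw = p["a_u"] * p["t_u"] + p["b"]        # P_u = a_u * t_u + b, in mW
    latency_s = size_bytes * 8 / (p["t_u"] * 1e6)  # transfer time in seconds
    return latency_s, power_mw * latency_s         # (seconds, millijoules)

# e.g., uploading the 1580-byte compressed RB1 features of ResNet-50 over 4G
latency, energy = upload_cost(1580, "4G")
```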
We prototype the proposed method by implementing the inference networks for both the mobile
device and cloud server using NVIDIA TensorRT™ [59], which is the state-of-the-art platform for
high-performance deep learning inference.
Algorithm 3 The partitioning algorithm for BottleNet
Inputs:
  N: number of layers in the DNN
  M: number of partitioning points in the DNN (M <= N)
  bottleneck(s, c'): a bottleneck unit with spatial reduction factor s and reduced channel size c'
  S_max: maximum allowable spatial reduction factor
  C'_max: maximum allowable reduced channel size
  K_mobile: current load level of the mobile device
  K_cloud: current load level of the cloud
  t_mobile(j, K_mobile), p_mobile(j, K_mobile), j = 1..M: latency and power on the mobile device for partition j and load K_mobile
  t_cloud(j, K_cloud), j = 1..M: latency on the cloud for partition j and load K_cloud
  NB: wireless network bandwidth
  PU: wireless network up-link power consumption
Outputs:
  Best partitioned model
Variables:
  {D_j | j = 1..M}: compressed feature size in each of the M partitioning solutions

// Training phase
for j = 1 to M do
  for c' = 1 to C'_max do
    for s = 1 to S_max do
      Place bottleneck(s, c') after the j-th layer
      Train()
      Store the corresponding model and its accuracy
    end
  end
end
for j = 1 to M do
  Among the models with the bottleneck unit placed after the j-th layer that reach acceptable
  accuracy, keep the one with the minimum compressed feature size and store its compressed
  feature size as D_j
end

// Profiling phase
for j = 1 to M do
  TM_j = t_mobile(j, K_mobile)
  PM_j = p_mobile(j, K_mobile)
  TC_j = t_cloud(j, K_cloud)
  TU_j = D_j / NB   // time to upload data to the cloud
end

// Selection phase
if target is min latency then
  return argmin_{j=1..M} (TM_j + TU_j + TC_j)
end
if target is min energy then
  return argmin_{j=1..M} (TM_j * PM_j + TU_j * PU)
end
It includes a deep learning inference optimizer and run-time that delivers low latency and high throughput for deep learning inference applications.
TensorRT is equipped with cuDNN[18], a GPU-accelerated library of primitives for deep neural
networks. TensorRT supports three precision modes for creating the inference graph, namely FP32
(single precision), FP16 (half precision), and INT8 (8-bit integer). However, our mobile device
does not support INT8 operations on its GPU for inference. Therefore, we use FP16 mode for
creating the inference graph from the trained model graph, where for the training itself single
precision mode is used. As demonstrated in [60], 8-bit quantization would be enough for even
challenging tasks like ImageNet [61] classification. Therefore, we apply 8-bit quantization on
FP16 data types, using the uniform quantizer presented in Section 4.2.2, before applying the lossy
compression on them. Given a partition decision, execution begins on the mobile device and
cascades through the DNN layers leading up to the partition point. Upon completion of that
layer, the reduction unit, and the lossy compressor, the mobile device sends the reduced dense
feature tensor to the cloud. The cloud server then executes the computations associated with
the decompressor, the restoration unit, and the remaining DNN layers. Upon completion of the
last DNN layer on the cloud, the inference result is sent back to the mobile device.
For our lossy compressor, we use JPEG compression.
For evaluating our proposed method, we use ResNet-50 and VGG-19, two widely used networks,
as our underlying deep models. The ResNet architecture comes with a flexible number of layers,
such as 34, 50, and 101; there are 16 residual blocks (RBs) in ResNet-50. We also
use the 19-layer version of VGG, VGG-19, in which the i-th convolutional layer is referred to as Ci in
this paper. In our experiments, we place the bottleneck unit after each residual block in ResNet-50
and after each convolutional layer in VGG-19. Using Algorithm 3, as explained in Section 4.2.3, we
obtain different models, each associated with placing the bottleneck unit after a selected layer, for
both ResNet-50 and VGG-19. As presented in Algorithm 3, the obtained c' and s of the bottleneck
unit may differ across insertion locations, where c' and s correspond to the degrees of reduction in
the channel and spatial dimensions, respectively.
The input image size of the models and the size of the output feature tensor of each layer are
presented in Fig. 4.5. As indicated in Fig. 4.5, the sizes of the intermediate feature tensors in ResNet-50
are larger than the input size up until RB14, which is relatively deep in the model. Therefore,
merely splitting this network between the mobile device and the cloud for collaborative intelligence
may not perform better than the cloud-only approach in terms of latency and mobile energy
consumption, since a large portion of the workload is pushed toward the mobile device. The same
holds for VGG-19, where the first layer whose feature size is less than the input size is the 12th
convolutional layer, which is quite deep in the model.
We use miniImageNet [64], a subset of ImageNet, as our dataset; it includes 100 classes and
600 examples per class. 85% of the dataset is used as the training set, and the
rest as the test set. We randomly crop a 224x224 region from each sample for data augmentation.
We train the models for 90 epochs.
4.3.2 Latency and Energy Improvements
Table 4.2: Comparison of the mobile-only and cloud-only approaches with the proposed method (BottleNet)

                        Latency (ms)        Energy (mJ)         Bottleneck Location   Offloaded Data (B)
Approach     Network   ResNet-50  VGG-19   ResNet-50  VGG-19   ResNet-50   VGG-19    ResNet-50  VGG-19
Mobile-only  -           15.7      45.3      20.5      59.6       -          -           0         0
Cloud-only   3G         196.2     198.7     310.1     311.9       -          -         26766     26766
Cloud-only   4G          37.9      39.6     168.3     169.7       -          -         26766     26766
Cloud-only   Wi-Fi       13.1      14.9     110.7     112.1       -          -         26766     26766
BottleNet    3G          15.5      23        33.0      44.5      RB1         C2         1580      1720
BottleNet    4G           9.0      15.5      20.5      35.5      RB1         C2         1580      1720
BottleNet    Wi-Fi        8.0      14.5      17.5      20.5      RB1         C2         1580      1720
The accuracies of the ResNet-50 and VGG-19 models on the miniImageNet dataset without the
bottleneck unit are 76.1% and 68.3%, respectively, which serve as our acceptable accuracy values in
Algorithm 3. Table 4.2 compares the selected partitions with the mobile-only and cloud-only
approaches in terms of latency and energy.
Figure 4.5: Input image size and the size of the output feature tensor of the convolutional layers in (a)
VGG-19 and (b) ResNet-50.
For the cloud-only approach, before transmitting the input to the cloud, we apply JPEG compression
on the input images, which are stored in 8-bit RGB format. Note that the best partitioning for the
goal of minimum end-to-end latency is the same as the best partitioning for the goal of minimum
mobile energy consumption in each wireless network setting. This is mainly because end-to-end
latency and mobile energy consumption are proportional to each other, since the dominant portion
of both is associated with the wireless transfer overhead of the intermediate feature tensor. Our
proposed method can achieve significant dimension reductions and thus bit savings. For instance,
when the bottleneck unit is placed after RB1 in ResNet-50, a feature tensor of size (56, 56, 256) is
reduced to (28, 28, 5) using the channel-wise and spatial reduction units.
Latency Improvement - As demonstrated in Table 4.2, using our proposed method, the end-
to-end latency achieves on average 10.6x, 3.4x, and 1.3x improvements over the cloud-only approach
in 3G, 4G, and Wi-Fi networks, respectively.
Energy Improvement - As demonstrated in Table 4.2, using our proposed method, the mobile
energy consumption achieves on average 8.2x, 6.5x, and 5.9x improvements over the cloud-only
approach in 3G, 4G, and Wi-Fi networks, respectively.
As observed in Table 4.2, the best partition across all wireless network settings is associated
with placing the bottleneck unit after RB1 and C2 for ResNet-50 and VGG-19, respectively. This is
an important result since, as explained before and according to Fig. 4.5, due to the relatively large
sizes of the intermediate feature tensors compared to the input image size for these models, merely
splitting the network between the mobile device and the cloud and transmitting the intermediate
feature tensor to the cloud may not perform better than the cloud-only approach in terms of latency
and mobile energy consumption. However, using our proposed method, the mobile device only
computes the first few layers and sends the reduced dense output feature tensor to the cloud,
achieving the minimum latency and mobile energy consumption among all possible partitions and
significant improvements compared to the cloud-only approach, with no accuracy loss.
It is worthwhile to mention that the evaluations are done in a no-congestion setting (K_mobile
and K_cloud in Algorithm 3 are zero). In this scenario, Algorithm 3 is expected to select a best
insertion location of the bottleneck unit corresponding to an early split in the deep network (RB1 in
ResNet-50 and C2 in VGG-19 are selected by Algorithm 3 in no-congestion settings). However, in
a scenario with high utilization on the cloud server, since Algorithm 3 takes the effect of K_cloud
into consideration during its profiling phase, the best insertion location of the bottleneck unit given
by Algorithm 3 is expected to be deeper in the network (a bigger portion of the computations is
pushed toward the mobile device). The effect of the server load is discussed in our prior work [67].

Figure 4.6: The comparison between the accuracy loss of the proposed compression-aware training
and a naively trained model for different compressed feature sizes, when the bottleneck unit is placed
after RB1 in ResNet-50.
4.3.3 The Efficacy of Compression-aware Training
Incorporating the pair of JPEG compressor and decompressor as a new computational unit in a
neural network can be performed in two ways: 1) Placing the compression unit in a given trained
model (Naive), 2) Training the model from scratch using the proposed compression-aware training
method as explained in Section 4.2.2. In the JPEG compressor, which is our choice of lossy compressor,
a parameter named quality controls the number of output bits of the compressor. Fig. 4.6 presents
the accuracy loss obtained for ResNet-50 when the bottleneck unit is placed after RB1, versus the
number of output bits of the compressor (corresponding to different values of the JPEG quality
parameter, ranging from 1 to 100). As depicted in Fig. 4.6, the accuracy loss of the
compression-aware training becomes almost zero for JPEG quality levels higher than 20, while the
accuracy loss of the naive approach is close to 18% at a quality level of 20. In our experiments, we
use a quality level of 20 for JPEG compression in order to achieve maximum bit savings with no
accuracy loss.
4.3.4 Comparison to Other Feature Compression Techniques
In comparison with other works in the collaborative intelligence framework that have considered
compressing the intermediate features before uploading them to the cloud, our proposed method
achieves significantly higher bit savings relative to the cloud-only approach. For instance, as
reported in [52], which is one of the few works in the collaborative intelligence literature that
considers lossy compression of features before transmitting them to the cloud, the communication
overhead can be reduced by up to 70% compared to the cloud-only approach. In our work, with
the proposed trainable reduction units for the spatial and channel dimensions, and the proposed lossy
compression-aware training method, as presented in Table 4.2, we achieve on average 16.3x
bit savings compared to the cloud-only approach with no accuracy loss. This shows that, in the
collaborative intelligence framework, adding a learnable reduction in the channel and spatial dimensions
together with a compression-aware training method can perform significantly better than merely
splitting a network with fixed weights and compressing the intermediate feature tensor before
uploading it to the cloud.
4.4 Conclusion
Recent studies have shown that the latency and energy consumption of deep neural networks in
mobile applications can be reduced by splitting the network between the mobile device and the cloud
in a collaborative intelligence framework. In this work, we develop a new partitioning scheme that
creates a bottleneck in a neural network using the proposed bottleneck unit, which considerably
reduces the communication cost of feature transfer between the mobile device and the cloud. Our
proposed method can adapt to any DNN architecture, hardware platform, wireless network setting,
and mobile and server load levels, and selects the best partition point for the minimum end-to-end
latency and/or mobile energy consumption at run-time. The new network architecture is trained
end-to-end using our proposed compression-aware training method, which allows significant bit
savings. Across different wireless network settings, our proposed method achieves on average 5.1x
improvement in end-to-end latency and 6.9x improvement in mobile energy consumption compared
to the cloud-only approach, with no accuracy loss.
Chapter 5
Runtime Deep Model Multiplexing for Reduced Latency and
Energy Consumption Inference
We propose a learning algorithm to design a light-weight neural multiplexer that, given the input
and the computational resource requirements, calls the model that will consume the minimum compute
resources for a successful inference. Mobile devices can use the proposed algorithm to offload the
hard inputs to the cloud while inferring the easy ones locally. Besides, in large-scale cloud-based
intelligent applications, instead of replicating the most-accurate model, a range of small and
large models can be multiplexed depending on the input's complexity, which saves the
cloud's computational resources. The input complexity, or hardness, is determined by the number
of models that can predict the correct label; for example, if no model can predict the label correctly,
then the input is considered the hardest. The proposed algorithm allows the mobile device to
detect the inputs that can be processed locally and the ones that require a larger model and should
be sent to a cloud server. Therefore, the mobile user benefits not only from local processing
but also from an accurate model hosted on a cloud server. Our experimental results show that
the proposed algorithm improves the mobile model's accuracy by 8.52%, owing to the
inputs that are properly selected and offloaded to the cloud server. In addition, it saves the cloud
providers' compute resources by a factor of 2.85, as small models are chosen for the easier inputs.
Figure 5.1: The percentage of ImageNet's [78] validation set images that can be predicted correctly
by a certain model but cannot be correctly predicted by another model. As an example, alexnet,
our worst-performing model, can correctly predict 2.8% of the inputs that the largest model,
resnext101_32x8d, cannot.
5.1 Introduction
Deep learning is the rocket fuel of recent advances in artificial intelligence and is gaining popularity
in intelligent mobile applications, solving complex problems like object recognition [48, 68],
facial recognition [69, 70], speech processing [71], and machine translation [72]. Although many of
these tasks are important on mobile and embedded devices, especially for sensing and mission-critical
applications such as health care and video surveillance, existing deep learning solutions
often require powerful computational resources to run. Running these models on mobile devices
can lead to long run-times and the consumption of abundant resources, including
CPU, memory, and power, even for simple tasks [73–75]. Besides the enhancements achieved in
optimizing the computation graph, efficient storage access such as computational storage devices
has shown promising results in further accelerating deep learning models by reducing data
movement from the storage device [76, 77].
The training process of deep neural networks (DNNs) is often offloaded to the cloud, as it
requires a huge amount of computation on large data. Once the model is trained, it is used
for inference on new, unseen inputs. The inference process can be hosted privately on the local
device or as a public service in the cloud, which we call mobile-only and cloud-only inference,
respectively. In cloud-only inference, the cloud providers grant access to the pre-trained models
using an Application Programming Interface (API), which receives the input from the user and
returns the inference results (predictions). Cloud-only inference is easy to deploy and scale up
but compromises data privacy and needs a reliable network connection. The communication
cost of cloud-based inference can also be larger than the computation cost of running a small
model locally. On the other hand, mobile-only inference enables the mobile application to
function without network access but is limited to small models due to the lack of sufficient computing
resources.
Recent promising advances in mobile-friendly deep architectures, such as mobilenet_v2 [79], are
closing the accuracy gap between mobile- and cloud-level inference. For instance, the accuracies
of mobilenet_v2 as a mobile-scale model and resnext101_32x8d as a cloud-scale model are 73% and
79%, respectively. This essentially means that the mobile-level model can predict 73% of the inputs
locally while the cloud-level model can be called for the rest. As a result, a model multiplexer can
be designed to call either the local model or the cloud model. However, the cost of this multiplexer
should be kept small. We provide the definition of input complexity, or easiness/hardness, that we
use throughout this paper (a small sketch follows this list):
• Given a pair of small (mobile-side) and large (cloud-side) models, an input is easy if its label
can be predicted correctly by the small model. An input is hard if its label can be predicted
correctly only by the large model.
• Given an ensemble of N models, the complexity of an input lies in a range between 0 and
N, representing the number of models that cannot predict the input's label correctly. In the
extreme case, the input complexity is 0 if all models can predict correctly. On the other hand, the
input complexity is N if no model can make a correct prediction on it.
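As a tiny illustration of this definition, the helper below (hypothetical, not part of the paper's code) counts how many models in an ensemble fail on a given labeled input.

```python
# Input complexity: the number of models that cannot predict the label correctly
# (0 = easiest, N = hardest).
def input_complexity(models, x, y):
    return sum(1 for model in models if model(x) != y)
```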
In cloud inference services, the best-performing model is replicated across the servers and
an API routes the users' inputs to one of the hosting servers. However, as we discussed earlier,
a large portion of the inputs can be predicted correctly by worse-performing models with fewer
computations.

Figure 5.2: Deep learning-powered mobile application deployment options. (a) and (b) show the
status quo cloud-only and mobile-only approaches. In (c), a model multiplexer is called on the
input and decides whether the input can be classified correctly on-device or should be offloaded
to the cloud due to its complexity. (d) demonstrates multiplexing among a set of models (more
than two) for cloud intelligent service providers.

Also, a surprising fact is that the small model can predict some inputs
correctly that the largest model cannot. For example, as demonstrated in Figure 5.1, the
worst-performing model, alexnet [80], correctly predicts 2.8% of the images that the best-performing
model, resnext101_32x8d [81], is not capable of. This suggests that if the multiplexing is performed
well, the accuracy can be even higher than that of the most accurate model.
The proper selection of a model for inference can lead to considerable resource usage savings
and higher accuracy. In this paper, we present a model multiplexer that receives the raw input (e.g.,
an image) and outputs a binary vector that shows which models are capable of performing the inference.
This multiplexer can be used on both mobile devices and cloud hosts. In a mobile application,
the output of the multiplexer is a single binary value that decides whether the input should be
processed locally or on the cloud. In a cloud service provider, instead of replicating the best-performing
models, we can host a wide range of different models on servers with different computing
requirements and choose among them depending on the complexity of the input. The multiplexer is a
light-weight neural network extracting the required meta-features to speculate on the correctness of
the predictions of a set of models. We discuss the related works in the following.
Model compression techniques have been proposed to reduce the computational demand, often
by trading off prediction accuracy. These techniques include quantization [82, 83], pruning [84],
optimized convolution operations [79, 85, 86], and knowledge distillation for training small models
using the knowledge of a teacher model [87]. Hardware-aware neural architecture search is also a
recent and promising research area [88]. These approaches require the user to be expert
enough to come up with a specific model that satisfies the prediction accuracy requirements. Our
proposed method for model multiplexing enables the user to automatically select the
model that requires the least resources.
Neurosurgeon [89] and JointDNN [90, 91] decide to offload some or all layers of a DNN
from the mobile device to the cloud server for reduced latency and mobile energy consumption.
Unlike JointDNN, our granularity level is a complete DNN, not a group of DNN layers. We seek to
minimize the mobile inference latency by running small models on the mobile side and large
models on the cloud side depending on the hardness of the input. Offloading the inference task to the
cloud adds the additional cost of communication over a network, which can be even larger than
the computation cost. Besides, cloud-based inference compromises user privacy. However, if the
mobile device can determine the input's complexity, it can run the inference locally, since easy inputs
can be solved by a small mobile-friendly DNN. Offloading the DNN inference computations
to the cloud can reduce the inference time [92]; however, this is not always applicable because
of privacy, communication latency, or connectivity issues. Another similar work [93] uses hand-crafted
features such as brightness or edge length in vision applications to choose the best model
among a group of models, which is highly dependent on the application domain. Furthermore,
feature compression techniques have also been proposed in prior art to reduce the costs of uploading the
inputs to the cloud server [94–96].
Because the level of granularity in model multiplexing is a whole DNN, all acceleration techniques
inside a DNN are complementary to our approach. Techniques such as convolutional kernel
optimization [84, 97], task parallelism [98], and trading precision for time [99], to name but a few, are
used to accelerate inference. Since a single DNN is not likely to meet all the constraints,
such as accuracy, latency, and energy consumption, across all inputs, a strategy to dynamically select
the appropriate model appears to be a prudent option.
Our approach is also related to ensemble learning, where multiple models are used to solve
an optimization problem. This technique has been shown to be useful on many cognition tasks [100].
However, in ensemble learning a voting mechanism (e.g., weighted mean) is used on all the models'
predictions, while our approach only calls a single model.
Figure 5.2 illustrates the summary of four different scenarios that we addressed: (a) cloud-only
inference where the input is always offloaded to the cloud, (b) mobile-only inference where the
input is always processed locally, (c) mobile-cloud collaborative inference in which we choose
between the mobile and cloud using the proposed multiplexer, (d) as the multiplexing can be done
for more than two models, cloud API providers can also use the proposed algorithm to call smaller
models instead of always calling the best-performing models. The paper makes the following
contributions:
• We present a deep learning-based approach to automatically learn how to multiplex DNN
models depending on the input complexity and computational resource requirements. We
leverage multiple DNN models and their expertise domain to improve the prediction accu-
racy and reduce the floating-point operations (FLOPs) and latency.
• The proposed method has little overhead for the multiplexing, as we use a small DNN acting
as a pre-processor on the inputs. In return, it avoids calling the expensive large models while
achieving higher accuracy.
• In the mobile inference, the proposed method enables the mobile devices to perform the easy
inference tasks locally and offload the hard ones to the cloud server. Therefore, it preserves
the privacy of users for the inputs that are detected as easy.
• In the large scale cloud intelligent services, instead of replicating the best-performing model,
one can host a range of small and large models and select from them at run-time depending
on the input’s complexity which will save the cloud resources by a factor of 2.85.
Figure 5.3: The t-SNE visualization of feature space of our benchmark models on the validation set
of ImageNet dataset. The feature space of correct and incorrect predictions are highly overlapped.
This overlap shows that predicting whether the prediction of a certain model will be correct is a
hard task.
5.2 Methodology
In this section, we explain our proposed algorithm for model multiplexer design. Assume we are
given N models to multiplex from. We use a very light-weight, mobile-friendly Convolutional
Neural Network (CNN), consisting of 4 convolutional layers, which outputs N values in the range
[0, 1]. The closer the ith value is to one, the more likely it is that the ith model can correctly predict
the label. In the following, we explain our proposed method for learning the model multiplexer.
The output of the layer before the final classification layer in a deep neural network is a vector
referred to as an embedding. The embedding is the essential feature vector of the input learned by a
neural network; therefore, we expect the embeddings of different classes to be arranged in the space
such that they are linearly separable. In Figure 5.3, we depict the projected embeddings of the
inputs which are predicted correctly or incorrectly by six different deep model benchmarks. The
projection from the high-dimensional space of embeddings into two-dimensional vectors is carried
out using the t-SNE [101] dimensionality reduction algorithm. Figure 5.3 shows that there is no
separation between the inputs which are predicted correctly or incorrectly by a certain model. As a
result, using a pre-trained deep model for model multiplexing without any further supervision
would be ineffective.

Figure 5.4: The target embedding space. The feature maps of the inputs are distributed in the space
such that when a group of models can all predict the label of an input correctly, their embeddings are
close to each other. Also, when a group of models can predict the label correctly while another
group cannot, the distance between their embeddings is increased. This leads to a
feature space similar to a Venn diagram. For instance, the red region on top shows the samples
which can be predicted correctly only by model 1.

We propose a loss function, referred to as the contrastive loss, for jointly training
all the models we are multiplexing from. The intuition behind the contrastive loss is that, given two
groups of models, if one group can predict the label of an input correctly and the other group cannot,
the distance between their embeddings is increased. Also, when a group of models can all
predict an input correctly, the distance between their embeddings is decreased. This loss
function shapes the embedding space of the models like a Venn diagram. As depicted in Figure
5.4, for example, the red region on top contains the samples which can be predicted correctly only
by Model 1, whereas the gray region in the center is the embedding space of the samples which are
predicted correctly by all models. The proposed loss is inspired by the Pairwise Ranking Loss [102],
in which the distance between the representations of the samples is determined by the pairwise
similarity of the samples.
Once the models are trained using the contrastive loss, we need to train the model multiplexer
using our trained models. As we discussed earlier, given N models, the model multiplexer will
have N outputs where the ith output shows the probability that ith model can predict the input
correctly. One advantage of using multiple models is that we can also leverage ensemble techniques.
In an ensemble model, a subset of models is selected for the inference and the mean of the
selected models' outputs is the final prediction. Our training procedure for the model multiplexer
allows selecting more than one model for ensembling purposes so as to increase the accuracy.
The training procedures of both the CNN models with the contrastive loss and the model multiplexer
are discussed in the following sections.

Figure 5.5: Model multiplexer training procedure and its architecture. In the first step, the models
we are multiplexing from are trained using the contrastive loss. The contrastive loss allows the
learned embeddings to be grouped into regions where each region determines the expertise domain
of a subset of models. In the second step, we distill the learned embeddings from the first step
into the multiplexer by adding a distillation loss function. The multiplexer outputs a set of weights
where each weight determines the confidence of its corresponding model about the prediction
correctness. We also show where each loss function is applied in the figure.
5.2.1 Contrastive Loss Function
We seek to learn features which are useful for extracting the domain expertise of a group of
models. By expertise, we specifically mean the set of inputs that can be predicted correctly by a
certain model. In practice, since the embedding vector sizes of the models can differ, we define
h_i, which linearly transforms the embedding space of the ith model into a common dimension; the
linearly transformed embeddings are further normalized by the L2 norm. We call this transformed
space the projected embeddings. An embedding and a projected embedding of the ith model are
denoted by g_i and e_i, respectively:

e_i = normalize(h_i^T g_i)    (5.1)

Given a pair of models, three cases can happen regarding their capability of correct prediction:
1) both can predict correctly, in which case we decrease the distance between the projected embedding
vectors; 2) one can predict correctly whereas the other cannot, in which case we increase the
distance between the projected embedding vectors; 3) neither can predict correctly, in which case
we do not apply the contrastive loss and let the cross-entropy loss enable the models to learn the
correct prediction without any interference from the contrastive loss. With this explanation, the
contrastive loss function, L_cnt, is of the form:

L_cnt(ŷ, y) = − Σ_{i=1..N} Σ_{j=1..N, j≠i} log(d(e_i, e_j)) × ((ŷ_i == y & ŷ_j == y) − (ŷ_i != y & ŷ_j == y) − (ŷ_i == y & ŷ_j != y))    (5.2)
where y is the true label, ŷ_i is the prediction of the ith model, and d is a distance function. We may
choose d as any family of functions satisfying d: {E_1, E_2} → [0, 1], where E_1 and E_2 are the
embedding space domains. We use the cosine distance as the distance function:

d(e_1, e_2) = e_1^T e_2 / (sqrt(e_1^T e_1) × sqrt(e_2^T e_2))    (5.3)
Other distance functions whose output range is normalized to [0, 1] can be used in this
formulation; however, we performed the experiments using the cosine distance. We train all the
models that we are multiplexing from by adding the contrastive loss to their main loss function,
which is cross-entropy in our case. Step 1 of Figure 5.5 demonstrates the learning procedure with
the contrastive loss, which is applied to all models in the ensemble.
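A minimal PyTorch sketch of this pairwise loss for one mini-batch is given below; it assumes `proj` holds the N projected (L2-normalized) embedding batches and `correct` is an N x B boolean matrix of per-model correctness, and it follows the reconstruction of Eq. 5.2 above rather than the authors' actual implementation.

```python
# Pull projected embeddings together when both models are correct, and push
# them apart when exactly one of the pair is correct (Eq. 5.2, simplified).
import torch
import torch.nn.functional as F

def contrastive_loss(proj, correct, eps=1e-6):
    N, loss = len(proj), 0.0
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            # cosine similarity of the projected embeddings, clamped to (0, 1]
            d = F.cosine_similarity(proj[i], proj[j], dim=1).clamp(min=eps, max=1.0)
            both = (correct[i] & correct[j]).float()   # both correct: pull together
            one = (correct[i] ^ correct[j]).float()    # exactly one correct: push apart
            loss = loss + (-both * torch.log(d) + one * torch.log(d)).sum()
    return loss
```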
5.2.2 Learning the Model Multiplexer
Let f_i: X → y, ∀i ∈ {1, .., N}, denote the learned prediction functions of the N deep learning
models, where X and y are the input space and the target predictions, respectively. Similar to standard
stacking [103], we seek to determine a weighted prediction function of the form:

y_ENS = Σ_{i=1..N} w_i(x) f_i(x), ∀x ∈ X    (5.4)

where w_i(x) ∈ R is the ith model's contribution to the final prediction. Let m_i represent the
meta-feature extraction function for predicting the prediction correctness of the ith model, and let c_i
denote the computing cost of the ith model. The meta-features are supposed to capture the features
necessary for determining the weights that correspond to the likelihood that a certain model can make
a correct prediction on the given input. We model w_i(x) as a linear function of the meta-features
weighted by the inverse of the computing cost, which is FLOPs in our case:

w_i = Σ_{j=1..M} v_ij m_j(x) / c_i, ∀x ∈ X    (5.5)

where v_ij ∈ R and M is the number of meta-features. To squash the w_i into the range [0, 1], we
normalize them using the Softmax function. Under these assumptions, Equation 5.4 can be rewritten as:

y_ENS = Σ_{i=1..N} [ exp(Σ_{j=1..M} v_ij m_j(x) / c_i) / Σ_{k=1..N} exp(Σ_{j=1..M} v_kj m_j(x) / c_k) ] f_i(x), ∀x ∈ X    (5.6)
We parameterize all m_i with a convolutional neural network and denote its parameters by Θ.
As a result, the learnable parameters are Θ and v_ij. This formulation leads to the following
optimization problem:

min_{Θ, v} L_mux(y_ENS, y) = − Σ_{x∈X} y(x) log(y_ENS(x))    (5.7)
where X is the training set. We also add a distillation loss for distilling the projected embeddings
of all the models learned with the contrastive loss into the multiplexer. We denote the projected
embedding learned by the ith model as e_i and the ith meta-feature of the model multiplexer as g_i:

L_distill(g, e) = Σ_{i=1..N} d(g_i, e_i)    (5.8)

where d is the same function as in Equation 5.3.
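A short sketch of Eqs. 5.5-5.6 is shown below: the multiplexer's meta-features are mapped to one score per model, scaled by the inverse FLOPs cost, and softmax-normalized into the weights w_i(x). The shapes and names are assumptions for illustration.

```python
# Cost-weighted stacking: scores V @ m(x) / c_i, softmax over models, then a
# weighted sum of the per-model outputs (Eqs. 5.5-5.6).
import torch

def ensemble_predict(meta_features, V, costs, model_outputs):
    """meta_features: (B, M); V: (N, M); costs: (N,) FLOPs; model_outputs: (B, N, C)."""
    scores = meta_features @ V.t() / costs                       # (B, N)
    weights = torch.softmax(scores, dim=1)                       # w_i(x) in [0, 1]
    return (weights.unsqueeze(-1) * model_outputs).sum(dim=1)    # y_ENS: (B, C)
```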
Figure 5.5’s Step 2 demonstrates the proposed learning algorithm for training the model multi-
plexer. The complete training process of the models and the multiplexer is demonstrated in Algo-
rithm 4.
5.2.3 Multiplexing process
Having explained how to train the model multiplexer, the multiplexing itself can be performed in two
ways: 1) we find the maximum weight and call the corresponding model to perform the inference, or
2) we select all models whose corresponding weight is greater than a threshold and take the average of
their outputs. The whole multiplexing process is shown in Algorithm 5.
Algorithm 4 Model multiplexer learning
1: Initialize all N models' parameters, θ_i
2: Initialize the model multiplexer parameters, Θ, v
3: // Learning the models we are multiplexing from
   for iteration = 1, 2, ... do
4:     Sample a batch of inputs x with labels y
       for all models i do
5:         ŷ_i = f_{θ_i}(x)
       end
       for all models i do
6:         L_i = L_cnt(ŷ, y) + L_ce(ŷ_i, y)
7:         θ_i = θ_i − α ∇L_i
       end
   end
8: // Learning the model multiplexer
   for iteration = 1, 2, ... do
9:     Sample a batch of inputs x with labels y
       for all models i do
10:        ŷ_i, e_i = f_{θ_i}(x)
       end
11:    ŵ_1, ê_1, ..., ŵ_N, ê_N = f_Θ(x)
12:    ŷ = Σ_{i=1..N} ŵ_i(x) ŷ_i
13:    L = L_ce(ŷ, y) + Σ_{i=1..N} L_distill(ê_i, e_i)
14:    Θ = Θ − α ∇L
   end
15: return Θ

Algorithm 5 Multiplexing process
1: Inputs: x is the model input, T is the weight threshold
2: w = f_Θ(x)
3: S = argmax(w) or S = NonZeroElements(w > T)
4: ŷ = avg(f_s(x)), ∀s ∈ S
5: return ŷ
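The sketch below is a plain-Python rendering of Algorithm 5 under assumed interfaces (a `multiplexer` callable returning the N weights and a list of `models`); it is illustrative only.

```python
# Call the argmax model, or average all models whose weight exceeds T.
import torch

def multiplex(multiplexer, models, x, T=None):
    w = multiplexer(x)                                    # (N,) weights in [0, 1]
    if T is None:
        selected = [int(torch.argmax(w))]                 # single best model
    else:
        selected = [i for i, wi in enumerate(w) if wi > T]
    return torch.stack([models[i](x) for i in selected]).mean(dim=0)
```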
Table 5.1: The latency, percentage of local inference, and accuracy of the mobile-only, cloud-only, and
hybrid (multiplexing) methods. mobilenet_v2 and resnext101_32x8d are used as the mobile and
cloud deep models, respectively.

Setup         FLOPs   Latency   Mobile Energy   Local   Acc.
Mobile-only   299M    3.53ms    12mJ            100%    71.88%
Cloud-only    16.4G   13.1ms    110mJ           0%      79.39%
Hybrid        5.75G   10.12ms   55.36mJ         68%     80.4%
5.3 Experiments and Results
5.3.1 Experimental Setup
Hardware. We evaluate our approach on the NVIDIA Jetson TX2 embedded deep learning platform
as our mobile device. The system has a 64-bit dual-core Denver2 and a 64-bit quad-core
ARM Cortex-A57 running at 2.0 GHz, and a 256-core NVIDIA Pascal Graphics Processing Unit
(GPU) running at 1.3 GHz. The board has 8 GB of LPDDR4 RAM and 96 GB of storage (32 GB
eMMC plus 64 GB SD card). We use an NVIDIA GTX 1080 Ti as our server-side hosting GPU. We
measure the energy consumption of each component on the board using the INA226 power sensor.
We use the average Wi-Fi uplink and downlink speeds in the United States [104] for the
communication latency.
System Software. Our evaluation platform runs Ubuntu 16.04 with Linux kernel v4.4.15. We use
PyTorch [105], cuDNN (v7.0), and CUDA (v10.1).
Deep Learning Models. We consider six of the state-of-the-art CNN models for image recogni-
tion. The models are built using PyTorch and trained on the ImageNet ILSVRC 2012[78] training
set. The total number of floating-point operations required for a single inference is used as the com-
putation cost of the model in Equation 5.5. We train all the benchmark models and the multiplexer
model for 200 epochs on the training set of ImageNet.
5.3.2 Results
Mobile-cloud collaborative inference. In this scenario, one light-weight model is hosted on the
mobile side (mobilenet_v2) and the best-performing model (resnext101_32x8d) on the cloud side.
The multiplexer is a 4-layer light-weight CNN adding negligible computation cost compared to
the mobile-hosted model. Our neural multiplexer outputs a single value between zero and one:
zero means the input should be classified on the mobile device and one means the input should be
classified on the cloud server. We use a threshold function at 0.5 to binarize the output. We call the
multiplexer to decide whether to perform the inference on the mobile device or the cloud server.
Although a negligible extra computation is added to the mobile inference, it benefits the user with
about a 10% improvement in accuracy, owing to the inputs which can be classified
correctly only by the cloud's large and accurate model. In order to have a clear understanding of
the components of the latency and energy consumption, we provide their formulations. The latency
and energy consumption of a single inference using the mobile-only approach is only due to the
computations required for the inference using the mobile-side model (mobilenet_v2). We refer to
both latency and energy consumption as the cost, represented by C:

C_mobile-only = C_mobile-compute-inference    (5.9)
The latency and energy consumption of a single inference using the cloud-only approach consist of
the communication costs and the cloud compute cost:

C_cloud-only = C_upload + C_cloud-compute-model + C_download    (5.10)
The latency and energy consumption of a single inference using the hybrid approach has two
possible cases. 1) The multiplexer decides to perform the inference locally, in which case:

C_hybrid-m = C_mobile-mux + C_mobile-compute-inference    (5.11)

2) The multiplexer decides to perform the inference on the cloud, in which case:

C_hybrid-c = C_mobile-mux + C_upload + C_cloud-compute + C_download    (5.12)
Therefore, the cost of the hybrid approach is the weighted average of the two previous equations,
where the weights are determined by the percentages of inferences performed on the mobile device
and on the cloud:

C_hybrid = (%local) × C_hybrid-m + (%cloud) × C_hybrid-c    (5.13)
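The sketch below combines Eqs. 5.9-5.13 into a single helper (assumed cost values, e.g., in ms or mJ) so that the hybrid cost can be computed from the fraction of inputs kept on-device.

```python
# Hybrid cost as the local/cloud mix of Eqs. 5.11-5.13.
def hybrid_cost(c_mux, c_mobile_infer, c_upload, c_cloud_compute, c_download, local_frac):
    c_hybrid_m = c_mux + c_mobile_infer                             # Eq. 5.11
    c_hybrid_c = c_mux + c_upload + c_cloud_compute + c_download    # Eq. 5.12
    return local_frac * c_hybrid_m + (1 - local_frac) * c_hybrid_c  # Eq. 5.13
```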
Detailed results for the collaborative inference between the mobile device and the cloud server are
shown in Table 5.1. As it shows, 68% of the inputs are decided by the multiplexer to be processed
locally on the mobile device while the other 32% are offloaded to the cloud. Our algorithm also
improves the accuracy of the mobile-only approach by 8.5%, owing to the correct predictions
on those inputs that are offloaded to the cloud. The accuracy of the hybrid approach is
even higher than that of the cloud model, because the small model can make correct
predictions on some inputs that the large model cannot. The True Negative Rate of the multiplexer,
i.e., the detection rate of the inputs that can be classified correctly by the mobile device, is 0.966
in our case. This means we miss (1 − 0.966) × 0.7188 = 2.4% of the inputs that could have been
predicted correctly by the mobile device, which is compensated by the powerful cloud model. The latency
and energy of the hybrid approach in Table 5.1 are worse than those of the mobile-only approach,
but this comparison is not fair: the extra latency and energy cost we pay goes directly toward
increasing the accuracy. Neglecting the cost of multiplexing, the extra latency and energy come from
two sources: 1) inputs that could be predicted correctly on the mobile device but are offloaded to
the cloud, which is the case for only 2.4% of the inputs; and 2) inputs that could not be predicted
correctly on the mobile device and are offloaded to the cloud, which is the case for 32% − 2.4% = 29.6% of the
inputs and is the dominant component.
Table 5.2: The FLOPs, latency, and accuracy of six state-of-the-art CNN models. The Called column
shows the percentage of inputs which are decided to be predicted by the corresponding model.

Model                   FLOPs   Latency   Accuracy   Called
alexnet [80]            655M    6.8ms     56.55%     10.56%
mobilenet_v2 [79]       299M    3.0ms     71.88%     18.80%
mnasnet1_0 [88]         313M    5.5ms     73.45%     21.80%
resnet50 [48]           4.08G   8.9ms     76.15%     14.80%
resnet152 [48]          11.5G   11.3ms    78.31%     15.80%
resnext101_32x8d [81]   16.4G   11.8ms    79.31%     18.24%
Hybrid-single           5.75G   7.73ms    83.86%     100%
Hybrid-ensemble         7.12G   8.15ms    85.54%     100%
Cloud-based API inference. In cloud-hosted inference services, the best-performing
model is replicated on the servers, while many inputs are easy and can be processed with small
models. The proposed algorithm helps distribute the easy and hard inputs to the model that will
consume the minimum resources. Table 5.2 demonstrates the improvements we can achieve for the
cloud providers. Hybrid-single represents the scenario in which we multiplex a single model
from a group of models, while hybrid-ensemble represents the scenario in which we multiplex more
than one model from a group of models. The models whose associated weight in Equation 5.6
is greater than a threshold are selected to perform the inference. We sweep over all possible values
of the threshold and find 0.288 to be the best value, giving the maximum accuracy. Similarly, we
also show the cost equation for the hybrid approach of cloud-based inference:

C_hybrid = Σ_i (%called_i) × C^i_cloud-compute-model    (5.14)
where called
i
is the percentage of the times that the ith model is called, and C
i
cloudcomputemodel
represents the cost of running ith model on the cloud. In the hybrid-single case, the FLOPs count
is reduced from 16.4G (i.e. the largest model FLOPs) to 5.75G which essentially results in saving
GPU resources by a factor of 2.85. The latency is reduced by 34.5% and the over accuracy is
improved by 4.55%. In addition, if we use more than model after the multiplex. i.e. ensembling
the models, we can further improve the accuracy. Although ensembling increases the FLOPs, we
85
Figure 5.6: The t-SNE visualization of the feature space of the ImageNet validation set for the benchmark models trained using the proposed loss function. Left: mobile-cloud collaborative inference using mobilenet v2 on the mobile side and resnext101 32x8d on the cloud side. Right: ensemble of six benchmark CNNs, which is suitable for cloud-based intelligent services that host replicas of the most accurate model. For instance, instead of replicating resnext101 32x8d on six different servers, one can host these six CNNs plus the multiplexer, which achieves lower compute resource usage and higher accuracy.
Although ensembling increases the FLOPs, we exploit the fact that model ensembles can be parallelized on GPUs; as a result, the percentage increase in the latency of hybrid-ensemble is less than the percentage increase in its FLOPs.
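For illustration, Equation 5.14 can be evaluated directly from the per-model FLOPs and call fractions in Table 5.2. The minimal Python sketch below does so; the numbers are taken from the table, and small deviations from the reported 5.75G figure are expected, as the sketch abstracts away details such as the cost of the multiplexer itself.

```python
# Minimal sketch of Equation 5.14: expected cost of the hybrid scheme as the
# call-fraction-weighted sum of per-model costs. FLOPs and call percentages
# are taken from Table 5.2 (hybrid-single setting); values are illustrative.
models = {
    "alexnet":          {"flops": 0.655e9, "called": 0.1056},
    "mobilenet_v2":     {"flops": 0.299e9, "called": 0.1880},
    "mnasnet1_0":       {"flops": 0.313e9, "called": 0.2180},
    "resnet50":         {"flops": 4.08e9,  "called": 0.1480},
    "resnet152":        {"flops": 11.5e9,  "called": 0.1580},
    "resnext101_32x8d": {"flops": 16.4e9,  "called": 0.1824},
}

def hybrid_cost(models):
    """C_hybrid = sum_i (%called_i) * C_i, with C_i taken here to be FLOPs."""
    return sum(m["called"] * m["flops"] for m in models.values())

c_hybrid = hybrid_cost(models)
largest = max(m["flops"] for m in models.values())
print(f"expected hybrid FLOPs per input: {c_hybrid / 1e9:.2f} G")
print(f"saving vs. always running the largest model: {largest / c_hybrid:.2f}x")
```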
We demonstrate the effectiveness of the contrastive loss in Figure 5.6. The learned embedding
space is similar to our target Venn diagram style depicted in Figure 5.4. The inputs which are
only in the expertise domain of a certain model are pushed to the boundaries and the inputs which
can be predicted correctly by multiple models are closer to the center. The separable embedding
space that we create enables a light-weight neural multiplexer to effectively learn the multiplexing
function.
5.4 Conclusion and Future Work
In this paper, we present an algorithm that multiplexes among deep learning models depending on the input complexity and the resource budget. With the proposed algorithm, mobile devices can host a small, mobile-friendly model and detect the inputs that are likely to be predicted correctly by local inference. Mobile devices offload the inputs that they find hard to the cloud servers, where they are inferred by the larger models hosted in the cloud. The communication cost of cloud-based inference dominates the local inference computation cost; as a result, it is desirable to offload as little as possible to the cloud while still meeting the accuracy requirements. Our results show that a user only needs to offload 32% of the inputs to the cloud while achieving an accuracy even higher than that of the cloud-hosted model. Furthermore, cloud providers offering APIs for cognitive tasks currently replicate their best-performing model on their servers to be called for any input regardless of its level of complexity. With the proposed approach, they can instead host a wide range of small and large models and choose one depending on the input. This saves 2.85x of the cloud provider's compute resources while improving the accuracy by 4.55% compared to deploying only the most accurate model.
Chapter 6
Information Obfuscation for Privacy-Preserving Data Valuation
In the Big Data era, 2.5 quintillion bytes of data are generated by machines and humans daily.
The large-scale generated data are then structured, cleaned, and labeled by humans to create high-
quality datasets, which one can use in training the machine learning models. However, recent
research shows that not all data points are equally helpful for machine learning models; therefore,
finding and choosing the most valuable data for a given target task is of great importance. Data
valuation aims to accurately quantify each data point’s value, which helps identify the incorrect
labels, data quality, and relatedness to the target machine learning task. Model providers aim to
improve their models’ performance by gathering more data from data providers; however, data
providers prefer to privately find the value of their data before offloading their raw data to the
model provider. This paper proposes a framework in which the data providers extract the features
locally by a feature extraction network given by the model provider and then obfuscate the features
and send the obfuscated feature data to the model provider. The information obfuscation method
removes the information which is irrelevant to the target data valuation task. Then, the model
provider calls the data valuation function on the obfuscated features and sends the corresponding
data values to the data providers. Afterward, the data providers can decide on offloading their raw
data samples if they are satisfied with the valuation results. Our experimental results show that the
proposed approach can reduce the signal content by up to four orders of magnitude while losing
less than 1% of the valuation accuracy and outperforms the existing obfuscation methods for the
data valuation task.
6.1 Introduction
In many real-world applications, the datasets that support queries and machine learning (ML) are contributed by many individuals and rely on massive crowdsourcing. An example is a voice recognition system whose training data are gathered from many users. There are data marketplaces providing access to data, e.g., IOTA [106], DAWEX [107], and Xignite [108]; however, the critical challenge in these marketplaces is how to allocate revenue fairly and securely. A data provider would like to know the value of her data without exposing the raw data to the valuation algorithm, which has not been considered in prior data valuation algorithms. Figure 6.1 illustrates data valuation for machine learning. The most critical ingredient of a machine learning model is large-scale and high-quality data [109, 110]. As shown in [111], each training sample contributes differently to the target model's accuracy; therefore, assigning a correct value to each training sample can be of great importance in data marketplaces. Furthermore, some training samples are noisy and have a degrading effect on the accuracy, so detecting and removing those samples is useful [112, 113]. The domain mismatch between the training and test environments leads to low-performing models in deployment; as a result, choosing the samples most relevant to the target domain can improve the accuracy [114, 115].
The main contribution of this work is a privacy-preserving framework which attaches a value to every single data point in relative terms, w.r.t. a specific ML model, while protecting the data provider's privacy. We use an obfuscation algorithm with a theoretical foundation and demonstrate its effectiveness in reducing the information that can be eavesdropped by an adversary. This is the first research work studying the application of privacy-preserving computation to the data valuation problem.
Figure 6.1: Data valuation for machine learning. The data providers send their valuable datasets to
the model provider and receive monetary profits. However, the data providers prefer to learn the
value of their dataset without sharing the raw information with the model provider.
Figure 6.2: Privacy-preserving data valuation using information obfuscation. The data provider
clients will extract the feature vector of the input data and obfuscate them, which is then sent to
the model owner for computing the data values. The data values are then sent back to the clients,
which enables the clients to decide whether they want to sell their raw data given the valuation.
6.1.1 Related work
6.1.1.1 Data Valuation
The Data Shapley [116] value is used to assess the worth of data because it uniquely possesses attractive qualities such as fairness, rationality, and decentralizability. In Data Shapley, all feasible subsets of the dataset are utilized to assess the model's accuracy improvement, and the marginal improvement is considered the individual data value. Because all potential subsets must be considered, the computational cost of computing the Shapley value is exponential in the number of data points. As a result, approximation approaches such as Monte Carlo sampling and gradient-based estimation are utilized to reduce the computational complexity. However, the computational complexity remains significant since the model must be retrained for each sampled subset of the data. Another disadvantage of Data Shapley is that the data valuation function is disconnected from predictor model training, resulting in less accurate valuation results. By jointly optimizing the predictor model and the data valuation function, data valuation using reinforcement learning (DVRL) [117] has been suggested to overcome the computational complexity and enhance valuation accuracy. We use DVRL throughout this work as our data valuation method.
6.1.1.2 Data Obfuscation
The current state-of-the-art data obfuscation methods lie under the following categories:
Cryptography-based solutions. A range of public-key encryption systems have been proposed to protect the data in transmission and to perform the computation over the encrypted data [118]. However, the downside of cryptography-based solutions is the large communication and computation overhead. As an example, it requires 32 GB of data transmission to perform inference on a single ImageNet image [118].
Noise Injection. In this class of data obfuscation methods, the client injects noise into the information before sending it to the cloud. Differential privacy (DP) is a typical method in this class, which guarantees that for any two different inputs the distribution of the obfuscated data is almost the same [119, 120]. However, the accuracy loss of using DP is still large [121–123]. While noise addition enhances privacy in general, it has been proven to dramatically impair accuracy [124, 125].
Information Bottleneck (IB). This approach has been proposed to obfuscate the information related to a known set of sensitive attributes, $y_{hid}$. However, the downside of this approach is that one must explicitly define the hidden sensitive attributes, $y_{hid}$, that one does not want to reveal to the servers.
Adversarial Training (AT). This approach has been developed for obscuring the information of a known set of sensitive attributes while retaining the accuracy on the target label. In AT, there is a client, a server, and an adversary, and a min-max optimization problem is solved to minimize the adversary's information gain while maximizing the client's and server's performance on the target prediction task [124].
6.2 Methodology
6.2.1 Framework Overview
6.2.1.1 Training
In the training phase, we assume a pre-trained feature extraction network already exists to extract the essential information from the raw input data. This pre-trained model helps reduce the dimensionality of the input data and discard unrelated information. The model provider then trains the data valuation estimator using reinforcement learning. The output of the data valuation is a set of weights between 0 and 1 that sum to 1 and represent a probability distribution over the relative importance of each data point. These weights multiply the corresponding loss values of the target predictive task. The amount of improvement of the target predictive model (e.g., in accuracy) is used as the reinforcement signal to improve the data valuation function. Once the training of the data valuation function is completed, the data obfuscation function is trained so that the output of the data valuation function is almost the same for both obfuscated and non-obfuscated inputs.

Figure 6.3: The training pipeline for jointly learning the data value estimator and obfuscation network.
6.2.1.2 Inference
In the proposed framework, the data providers will host the feature extraction network and the data
obfuscation function and the model provider will host the data valuation function. The clients will
send their obfuscated data to the model provider to receive the corresponding relative data values
from the model provider. Once they receive the data values from the model provider, they can
decide whether to sell their raw data or not. With this approach, the data providers do not risk
offloading their raw inputs to the data marketplaces before knowing their data values.
6.2.1.3 The adversary model
The attacker attempts to deduce sensitive information from the obfuscated feature vector. We de-
fine a strong adversary to be someone who is fully aware of the feature extraction and data valua-
tion models, the training data and training method, and the data provider’s obfuscation scheme and
settings. The model owner and data suppliers, on the other hand, are unaware of the adversary’s
technique and the sensitive information that the attacker attempts to deduce.
6.2.2 Data Valuation Training Overview
In the training process, we use a pre-trained feature extraction network to extract the input features, which are fed into the data value estimator; we freeze the feature extraction network weights to accelerate training. A batch of features extracted from the training samples is used as the input to the obfuscation network. The data value estimator takes the obfuscated features and outputs the corresponding selection probability $w_i$ for each training sample. The sampler outputs a binary selection vector $s = (s_1, s_2, \ldots, s_B)$, $s_i \in \{0, 1\}$, which is sampled from a multinomial distribution where $P(s_i = 1) = w_i$. The target predictor model uses the selection vector to select a subset of the training samples and is trained on them with a conventional gradient-descent optimization algorithm. The selection probabilities (or data values) $w_i$ rank the samples according to their relative importance for improving the accuracy of the underlying predictor model. To calculate the reward, the prediction model's loss is evaluated on a small validation set and compared to a moving average of prior losses. Finally, the data value estimator is updated by the reinforcement signal derived from this reward. The whole training pipeline is demonstrated in Figure 6.3.
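As an illustration of the selection step described above, the following PyTorch-style sketch performs one predictor update given a batch of extracted features. Here `value_estimator`, `predictor`, and `optimizer` are placeholders for the corresponding networks and optimizer, independent Bernoulli sampling with probabilities $w_i$ is used as a stand-in for the multinomial selection described above, and the snippet is a simplified sketch rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def predictor_step(value_estimator, predictor, optimizer, feats, labels):
    """One predictor update following the pipeline of Figure 6.3:
    sample a binary selection vector with P(s_i = 1) = w_i and train the
    predictor on the selected subset with a selection-weighted loss."""
    with torch.no_grad():
        # Selection probabilities (data values) in [0, 1], one per sample.
        w = value_estimator(feats, labels).squeeze(-1)
    s = torch.bernoulli(w)                                # binary selection vector
    per_sample_loss = F.cross_entropy(predictor(feats), labels, reduction="none")
    loss = (s * per_sample_loss).sum() / s.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return s, loss.detach()
```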
6.2.3 Data Valuation Training Formulations
We denote the training set as $D = \{(x_i, y_i)\}_{i=1}^{N}$, which is used as the input to the feature extraction network, from which we obtain the training feature set $\{(f_i, y_i)\}_{i=1}^{N}$ that we operate on for training the data valuation network. Here $f_i$ is a $d$-dimensional vector in $\mathbb{R}^d$ and $y_i$ is the corresponding label in the label space. We also consider a testing set $\{(x_i^{test}, y_i^{test})\}$ and a validation set $\{(x_i^{val}, y_i^{val})\}$, which are sampled from the training set distribution and are mutually exclusive to each other. The learning algorithm is based on reinforcement learning and consists of two learnable functions: 1) the target task predictor model $f_\theta$, and 2) the data value estimator model $h_\phi$. The predictor model's parameters $\theta$ are optimized according to a weighted loss (e.g., mean squared or absolute error, MSE/MAE, for regression, or cross-entropy for classification) over the training data set:

\[ f_\theta = \arg\min_{\hat{f}} \frac{1}{N} \sum_{i=1}^{N} h_\phi(f_i, y_i)\, \mathcal{L}_f\big(\hat{f}(f_i), y_i\big) \tag{6.1} \]

In Equation 6.1, $f_\theta$ can be any trainable function parameterized by $\theta$, such as a neural network. The data value estimator model $h_\phi : f \rightarrow [0, 1]$ is trained to output weights that determine the probability of selecting each sample, and the selected samples are used to train the target predictor model $f_\theta$. The corresponding optimization objective is formulated as:

\[ \min_{h_\phi} \; \mathbb{E}_{(x^{val}, y^{val})}\big[ \mathcal{L}\big(f_\theta(x^{val}), y^{val}\big) \big] \quad \text{s.t.} \quad f_\theta = \arg\min_{\hat{f}} \frac{1}{N}\, \mathbb{E}_{(f, y)}\big[ h_\phi(f, y)\, \mathcal{L}_f\big(\hat{f}(f), y\big) \big] \tag{6.2} \]

where $h_\phi(f, y)$ represents the value of the training sample $(f, y)$. Similar to the target predictor model, the data value estimator is a trainable function parameterized by $\phi$, such as a neural network.

In the training process, the data value network may collapse into exploiting its initial state rather than exploring solutions closer to the optimum. Therefore, we treat the training sample selection as a stochastic process in order to increase the level of exploration of our RL agent, which is the data value estimator. We denote the probability of selecting $(x_i, y_i)$ by $w_i = h_\phi(x_i, y_i)$, and $s_i$ is a binary value representing the selection of the $i$-th sample for training the predictor model. As we sample data batches of size $B$ during training, the probability of a specific selection vector is $p_\phi(s) = \prod_{i=1}^{B} \big[ h_\phi(x_i, y_i)^{s_i} (1 - h_\phi(x_i, y_i))^{1 - s_i} \big]$. The $w_i$'s, which are the output of the data value network, are treated as the data values. Data values can also be used for robust learning by putting less weight on the non-valuable training samples. As all the models used in this work are differentiable neural networks, we use stochastic gradient descent to jointly train the obfuscation network, the data value estimation network, and the target predictor model. Since there is a sampling process in the training pipeline, this operation must be made differentiable; Gumbel-softmax [126] and stochastic back-propagation [127] are examples of tricks for using non-differentiable units in the optimization. However, as we are in a reinforcement learning setup, we use the REINFORCE [128] algorithm to directly encourage policy exploration. The rewards are the accuracy gain on a small validation set, which is supposed to approximate the accuracy on the target task. The loss function for training the data value estimation network is:
\[ \hat{l}(\phi) = \mathbb{E}_{(x^{val}, y^{val}) \sim P_t}\Big[ \mathbb{E}_{s \sim p_\phi(D, \cdot)}\big[ \mathcal{L}_h\big(f_\theta(x^{val}), y^{val}\big) \big] \Big] = \int P_t(x^{val}) \Big[ \sum_{s \in \{0,1\}^B} p_\phi(D, s)\, \mathcal{L}_h\big(f_\theta(x^{val}), y^{val}\big) \Big]\, dx^{val} \tag{6.3} \]

As we use gradient-based optimization, the gradient of the loss function will be:

\[ \begin{aligned} \nabla_\phi \hat{l}(\phi) &= \int P_t(x^{val}) \Big[ \sum_{s \in \{0,1\}^B} \nabla_\phi\, p_\phi(D, s)\, \mathcal{L}_h\big(f_\theta(x^{val}), y^{val}\big) \Big]\, dx^{val} \\ &= \int P_t(x^{val}) \Big[ \sum_{s \in \{0,1\}^B} \nabla_\phi \log\big(p_\phi(D, s)\big)\, p_\phi(D, s)\, \mathcal{L}_h\big(f_\theta(x^{val}), y^{val}\big) \Big]\, dx^{val} \\ &= \mathbb{E}_{(x^{val}, y^{val}) \sim P_t}\Big[ \mathbb{E}_{s \sim p_\phi(D, \cdot)}\big[ \mathcal{L}_h\big(f_\theta(x^{val}), y^{val}\big)\, \nabla_\phi \log\big(p_\phi(D, s)\big) \big] \Big] \end{aligned} \tag{6.4} \]

where $\nabla_\phi \log\big(p_\phi(D, s)\big)$ is:

\[ \begin{aligned} \nabla_\phi \log\big(p_\phi(D, s)\big) &= \nabla_\phi \sum_{i=1}^{B} \log\big[ h_\phi(x_i, y_i)^{s_i} (1 - h_\phi(x_i, y_i))^{1 - s_i} \big] \\ &= \sum_{i=1}^{B} s_i\, \nabla_\phi \log\big[ h_\phi(x_i, y_i) \big] + (1 - s_i)\, \nabla_\phi \log\big[ 1 - h_\phi(x_i, y_i) \big] \end{aligned} \tag{6.5} \]
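As a concrete illustration of how Equations 6.3-6.5 are used, the following sketch performs one REINFORCE update of the data value estimator, with a moving average of previous validation losses as the baseline (the reward signal of Figure 6.3). The function names and signatures are placeholders, and the snippet is a simplified sketch rather than our full training code.

```python
import torch
import torch.nn.functional as F

def value_estimator_step(value_estimator, predictor, v_optimizer,
                         feats, labels, val_feats, val_labels, baseline,
                         baseline_momentum=0.9):
    """REINFORCE update (Eq. 6.4-6.5): the log-probability of the sampled
    selection vector is weighted by the validation loss minus a moving-average
    baseline, so selections that lower the validation loss are reinforced."""
    w = value_estimator(feats, labels).squeeze(-1).clamp(1e-6, 1 - 1e-6)
    s = torch.bernoulli(w).detach()                       # sampled selection vector
    # log p_phi(D, s) = sum_i [ s_i log w_i + (1 - s_i) log(1 - w_i) ]  (Eq. 6.5)
    log_prob = (s * torch.log(w) + (1 - s) * torch.log(1 - w)).sum()

    with torch.no_grad():                                 # reward from a small validation set
        val_loss = F.cross_entropy(predictor(val_feats), val_labels)
    advantage = val_loss - baseline                       # lower validation loss is better

    v_loss = advantage * log_prob                         # score-function estimator (Eq. 6.4)
    v_optimizer.zero_grad()
    v_loss.backward()
    v_optimizer.step()

    new_baseline = baseline_momentum * baseline + (1 - baseline_momentum) * val_loss.item()
    return new_baseline
```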
6.2.4 Information Obfuscation Problem Formulation
We denote the feature matrix by $z \in \mathbb{R}^{N \times d}$, where $N$ is the batch dimension and $d$ is the feature dimension, and $w_{tar} \in [0, 1]^N$ is the corresponding target selection probability assigned to each feature vector. $w_{tar}$ is the output of the data valuation function, which we write as $w_{tar} = V(z)$. We aim to design an obfuscation function, $Obf$, such that $z' = Obf(z)$ provides the necessary and sufficient information to predict $w_{tar}$. Formally, we aim to minimize the mutual information between $z$ and $z'$, $I(z; z')$, while maintaining the relative importance of the data samples in the batch as much as possible for correctly predicting $w_{tar}$:

\[ \min_{z'} \; I(z; z') \quad \text{s.t.} \quad \| V(z) - V(z') \| \le \epsilon \tag{6.6} \]

The objective function in 6.6 minimizes the mutual information between the original features and the obfuscated features while bounding the distortion to the output of the data valuation function. If we denote the entropy function by $H(\cdot)$, we can rewrite the mutual information as:

\[ I(z; z') = H(z') - H(z' \mid z) = H(z'), \tag{6.7} \]

since the obfuscation function $Obf(\cdot)$ is deterministic, so $H(z' \mid z) = 0$ because we can compute $z'$ from $z$ deterministically. As a result, the objective function in 6.6 can be simplified to:

\[ \min_{z'} \; H(z') \quad \text{s.t.} \quad \| V(z) - V(z') \| \le \epsilon \tag{6.8} \]

From the perspective of the model owner sitting on the cloud server, a low $H(z')$ means receiving similar feature vectors that are obfuscated in a way that they only contain the necessary and sufficient information for the relative importance ranking of the data samples.
6.2.4.1 Obfuscation for Linear Layers
We aim to design a low-complexity obfuscation function, $Obf$, with theoretical distortion bounds, that solves Equation 6.6 with respect to the data valuation network on the cloud server. The function $Obf$ could be realized as an auto-encoder network optimized to satisfy Equation 6.6; however, such auto-encoders are often computationally expensive to run on the data-generating devices (typically at the scale of IoT devices) and are not supported by theoretical distortion bounds. To alleviate these issues, we propose an obfuscation function which is parameterized based on the first linear layer of the data valuation function. The linear layer can be either a fully connected or a convolutional layer, which are the widely used linear layers in neural networks. The objective in Equation 6.8 for linear models can be written as:

\[ \min_{z'} \; H(z') \quad \text{s.t.} \quad \| Wz - Wz' \| \le \epsilon \tag{6.9} \]

where $W \in \mathbb{R}^{m \times n}$ is the parameter matrix of the first linear layer in the neural network.
Definition 1. We denote the singular value decomposition (SVD) of $W$ as $W_{m \times n} = U_{m \times m}\, S_{m \times n}\, V^{T}_{n \times n}$. The columns of $V$, denoted $\{v_k\}_{k=1}^{n}$, provide an orthonormal basis in which we can represent $z$ as follows:

\[ z = \sum_{k=1}^{n} a_k v_k = \sum_{k=1}^{n} z_k, \qquad a_k = v_k^T z \tag{6.10} \]
Lemma 1. If $z \sim \mathcal{N}(0, \sigma^2 I)$, then the $a_k$'s are independent random variables with $a_k \sim \mathcal{N}(0, \sigma^2)$.

Proof. Since $a_k = v_k^T z$, it is Gaussian with the following mean and variance:

\[ \mathbb{E}[a_k] = v_k^T\, \mathbb{E}[z] = 0, \qquad \mathbb{E}[a_k^2] = v_k^T\, \mathbb{E}[z z^T]\, v_k = \sigma^2\, v_k^T v_k = \sigma^2 \tag{6.11} \]

Assume $i \ne j$; then we have:

\[ \mathrm{cov}(a_i, a_j) = \mathbb{E}[a_i a_j] - \mathbb{E}[a_i]\, \mathbb{E}[a_j] = v_i^T\, \mathbb{E}[z z^T]\, v_j = 0 \tag{6.12} \]

Also, $\beta = c_1 a_i + c_2 a_j = (c_1 v_i^T + c_2 v_j^T) z = v'^T z$ is Gaussian. Since $\mathrm{cov}(a_i, a_j) = 0$ and $\beta$ is Gaussian for any combination of $c_1$ and $c_2$, the $a_k$'s are independent random variables.
Lemma 2. For $z \sim \mathcal{N}(0, \sigma^2 I)$, we have $H(z_i \mid z_j) = H(z_i),\ \forall i \ne j$.

Proof. Assume $i \ne j$; then we have:

\[ \mathrm{cov}(z_i, z_j) = \mathbb{E}[z_i z_j^T] - \mathbb{E}[z_i]\, \mathbb{E}[z_j]^T = \mathbb{E}[a_i a_j]\, v_i v_j^T - \mathbb{E}[a_i]\, \mathbb{E}[a_j]\, v_i v_j^T = 0 \tag{6.13} \]

Also, $x = c_1 z_i + c_2 z_j = c_1 a_i v_i + c_2 a_j v_j$. According to Lemma 1, the $a_k$'s are independent Gaussian random variables; as a result, $x$ is a multivariate Gaussian for any $c_1$ and $c_2$. Thus, the $z_k$'s are independent random vectors and $H(z_i \mid z_j) = H(z_i),\ \forall i \ne j$.
Lemma 3. For $z \sim \mathcal{N}(0, \sigma^2 I)$, we have $H(z) = \sum_{k=1}^{n} H(a_k)$.

Proof. Since $z_k = a_k v_k = (v_k^T z) v_k$, we have $H(z_k \mid z) = 0$. Lemma 2 also demonstrates that $H(z_i \mid z_j) = H(z_i),\ \forall i \ne j$. Therefore:

\[ H(z, z_1, \ldots, z_n) = H(z) + \sum_{k=1}^{n} H(z_k \mid z, z_1, \ldots, z_{k-1}) = H(z) \tag{6.14} \]

The left-hand side can be rewritten as:

\[ \begin{aligned} H(z, z_1, \ldots, z_n) &= H(z_1) + H(z \mid z_1) + \sum_{k=2}^{n} H(z_k \mid z, z_1, \ldots, z_{k-1}) \\ &= H(z_1) + H(z \mid z_1) \\ &= H(z_1) + H(z_2, \ldots, z_n) \\ &= H(z_1) + \sum_{k=2}^{n} H(z_k \mid z_2, \ldots, z_{k-1}) \\ &= \sum_{k=1}^{n} H(z_k) \end{aligned} \tag{6.15} \]

Thus, $H(z) = \sum_{k=1}^{n} H(z_k)$. We also have:

\[ H(z_k, a_k) = H(z_k) + H(a_k \mid z_k) = H(z_k) = H(a_k) + H(z_k \mid a_k) = H(a_k) \tag{6.16} \]

As a result, $H(z_k) = H(a_k)$, and $H(z) = \sum_{k=1}^{n} H(a_k)$.
Lemma 4. Let $z = \sum_{k=1}^{n} a_k v_k$ and $z' = \sum_{k=1}^{n} a'_k v_k$. Then $\| W(z - z') \| = \sqrt{ \sum_{k=1}^{n} (a_k - a'_k)^2 s_k^2 }$, where $s_k$ is the $k$-th singular value of $W$.

Proof.

\[ \begin{aligned} W(z - z') = U S V^T (z - z') &= \sum_{k=1}^{n} U S V^T \big( (a_k - a'_k) v_k \big) \\ &= \sum_{k=1}^{n} U S \big( (a_k - a'_k) \delta_k \big) \\ &= \sum_{k=1}^{n} U \big( s_k (a_k - a'_k) \delta_k \big) \\ &= \sum_{k=1}^{n} s_k (a_k - a'_k) U_k \end{aligned} \tag{6.17} \]

where $\delta_k$ is a one-hot vector with its $k$-th element set to 1 and $U_k$ is the $k$-th column of $U$. Since the $U_k$'s are orthonormal, we have $\| W(z - z') \| = \sqrt{ \sum_{k=1}^{n} (a_k - a'_k)^2 s_k^2 }$.
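Lemma 4 is straightforward to check numerically. The short NumPy sketch below compares $\|W(z - z')\|$ against the closed-form expression for a random $W$ and a random perturbation of the coefficients; it is only a sanity check and not part of the proposed pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 16
W = rng.standard_normal((m, n))
U, S, Vt = np.linalg.svd(W, full_matrices=False)   # W = U diag(S) Vt, with m singular values

z  = rng.standard_normal(n)
a  = Vt @ z                                        # coefficients a_k = v_k^T z kept by W
a2 = a + rng.standard_normal(m) * 0.1              # perturbed coefficients a'_k
z2 = z + Vt.T @ (a2 - a)                           # z' differs from z only in the signal subspace

lhs = np.linalg.norm(W @ (z - z2))
rhs = np.sqrt(np.sum((a - a2) ** 2 * S ** 2))      # sqrt( sum_k (a_k - a'_k)^2 s_k^2 )
assert np.allclose(lhs, rhs)
```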
The following theorem provides a closed-form solution to the information obfuscation objec-
tive in Equation 6.9.
Theorem 1. Let $W = U S V^T$ as defined in Definition 1. Let $z = \sum_{k=1}^{n} a_k v_k$, where $z \sim \mathcal{N}(0, \sigma^2 I)$ and the $a_k$'s are sorted according to the singular values of $W$. The global minimum of the information obfuscation objective in Equation 6.9 is $z' = \sum_{k=1}^{m} a'_k v_k$, where:

\[ a'_k = \begin{cases} a_k, & k < m' \\ a_k - \gamma\, \mathrm{sign}(a_k), & k = m' \\ 0, & k > m' \end{cases} \]

with $m' = \min \{ k : e_k \le \epsilon \}$, $e_k = \sqrt{ \sum_{i=k+1}^{m} a_i^2 s_i^2 }$, and $\gamma = \sqrt{ \epsilon^2 - e_{m'}^2 } \big/ s_{m'}$.

Proof. Using Lemmas 1 and 3, we have $H(z) = \sum_{k=1}^{n} H(a_k)$, where $a_k \sim \mathcal{N}(0, \sigma^2)$. We have $H(a_k) = \frac{1}{2} \ln(2 \pi e \sigma^2)$, which is monotonically increasing in $\mathrm{var}(a_k) = \sigma^2$. Hence, $H(a_k)$ can be reduced by suppressing the variance of $a_k$, i.e., making the $a_k$'s closer to zero.

We aim to find the modified values of the $a_k$'s given a distortion budget $\epsilon$. From the entropy perspective, based on the assumptions above and Lemma 3, reducing the variance of each $a_k$ reduces the entropy by the same amount for all $k$. Lemma 4, however, states that modifying $a_k$ by $\gamma$ causes a distortion $|\gamma| s_k$, where $s_k$ is the $k$-th singular value of $W$. Since smaller $s_k$'s cause smaller distortion, the solution is obtained by sorting the singular values and then moving the $a_k$'s corresponding to the smaller singular values towards zero, one at a time, until the budget $\epsilon$ is exhausted.

The following makes the solution more specific. If $m < n$ in the weight matrix, the last $n - m$ coefficients do not contribute to $Wz$ and thus can be set to zero without causing any distortion. Now, assume the coefficients in the range $m' + 1$ to $m$ are set to zero. The total distortion is $e_{m'} = \sqrt{ \sum_{i=m'+1}^{m} s_i^2 a_i^2 }$. Also, the distortion caused by modifying $a_{m'}$ by $\gamma$ is $s_{m'} |\gamma|$, which we set equal to the remaining distortion, $\sqrt{ \epsilon^2 - e_{m'}^2 }$, i.e., $\gamma = \sqrt{ \epsilon^2 - e_{m'}^2 } \big/ s_{m'}$. This completes the proof.
Definition 2. The signal content of $z$ with respect to a matrix $W$, or simply the signal content of $z$, is the solution to Equation 6.9 with $\epsilon = 0$. It is denoted by $z_S$ and defined as follows:

\[ z_S = \sum_{k=1}^{m} a_k v_k \tag{6.18} \]

The remaining $n - m$ components of $z$ are called the null content, defined as follows:

\[ z_N = z - z_S = \sum_{k=m+1}^{n} a_k v_k \tag{6.19} \]

The signal content is the information that is kept after multiplying $z$ by $W$, and the null content is the discarded information. By setting $z' = z_S$, the data owner reduces the entropy without introducing any distortion to the output of the data valuation function. We call this method distortion-free obfuscation herein. The entropy can be further reduced by removing components from the signal content as well, for which the optimal way for a desired $\epsilon$ is determined by Theorem 1. We call this method distortion-bounded obfuscation in the rest of the paper.
6.3 Experiments
We evaluate our obfuscated data valuation algorithm on six datasets which include image and tabular data types. The baseline for our work is the non-obfuscated data valuation, shown as DVRL, compared against the obfuscated version, which is the actual contribution of this paper. We compare the accuracy loss and the cumulative signal content between the obfuscated and non-obfuscated data valuations.
Measuring obfuscation with cumulative signal content (CSC). Multiple methods have been proposed for measuring the information leakage of feature vectors, such as computing the mutual information between the obfuscated $z'$ and non-obfuscated $z$ vectors [129]. As computing the mutual information for high-dimensional data is intractable, certain assumptions need to be made about the distribution of the underlying random variables.
Figure 6.4: Accuracy of the predictor model trained with the curated training set, in which the most/the least valued training samples are removed according to the estimated data values. (a)-(f) represent the scenario in which the training set labels are not noisy. (g)-(l) represent the scenario in which 20% of the training set labels are noisy.

Figure 6.5: The performance of discovering corrupted samples in six different datasets where 20% of the samples have noisy labels. Random represents the case where we have no prior knowledge of the clean and corrupted samples, so the fraction of discovered corrupted samples is equal to the fraction of inspected samples. Optimal represents the case where we inspect only the corrupted samples with optimal accuracy, so it saturates after inspecting 20% of the samples.
A practical approach has been proposed by [130], which computes the reconstruction error $\| \hat{z} - z \|$, where $\hat{z}$ is the estimate reconstructed from the feature vector. The null/signal content represents the irrelevant/relevant parts of the information with respect to the target task. We denote the cumulative signal content by $C_S(z) = \log\big( \|z_S\|_2^2 \,/\, \|z\|_2^2 \big)$, where $z$ is the feature vector which is passed to the data valuation algorithm. For example, $C_S(z) = -1$ means the obfuscated vector contains 10 times less information compared to the original feature vector.
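For completeness, a small sketch of how the CSC metric can be computed is shown below. The definition above leaves some freedom of interpretation; the sketch assumes a base-10 logarithm (consistent with the example of $C_S(z) = -1$ meaning a 10x reduction) and compares the signal content of the vector under evaluation against the squared norm of the original, non-obfuscated feature vector.

```python
import numpy as np

def cumulative_signal_content(z_eval, z_orig, W):
    """CSC = log10(||signal content of z_eval||^2 / ||z_orig||^2) (Definition 2).
    A value of -1 corresponds to an order-of-magnitude reduction in signal content."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)   # right singular vectors of W
    signal = Vt.T @ (Vt @ z_eval)                      # projection onto the signal subspace
    return np.log10(np.sum(signal ** 2) / np.sum(z_orig ** 2))
```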
We use four public image datasets including HAM-10000 [131], Fashion-MNIST [132], CIFAR-
10 [133], and Flower [134] and two public tabular datasets including Blog [135] and Adult [136].
We use the tabular datasets with 1000 training and 400 validation samples and the image datasets
with 2000 training and 800 validation samples.
Feature extraction networks. As the proposed method is model-agnostic, we may use different prediction models for each dataset. For the tabular datasets, we use a multi-layer perceptron on top of XGBoost features due to its superior performance. For the Flower, HAM 10000, and CIFAR-10 datasets, we apply transfer learning to an Inception-V3 model pre-trained on ImageNet [61] by re-initializing the prediction head and further fine-tuning the weights. We use multinomial logistic regression for Fashion-MNIST and a ResNet-32 [21] pre-trained on ImageNet for CIFAR-10. The data valuation network is composed of multi-layer perceptrons with ReLU activations, in which the number of layers and hidden units are determined with a cross-validation optimization process.
6.3.1 Removing High/Low Quality Samples
Taking low-value samples out of the training dataset can improve prediction model performance, especially if the training dataset contains corrupted samples. Removing high-value samples, on the other hand, can have a significant negative impact on performance, particularly if the dataset is small. Overall, the performance after removing high/low value samples is a strong indicator of data valuation quality.
We consider two scenarios in our experiments: 1) We use training, validation, and testing datasets which are sampled from the same distribution, i.e., there are no noisy samples and no distribution shift. For evaluation, using the trained data value estimator, we remove the most- and least-valued data samples from the training set, train the corresponding predictor model, and report the accuracy on a disjoint testing set. 2) The same as the previous scenario, except that we use a noisy labeled training dataset in which 20% of the samples have incorrect labels. Training a model with noisy labels deteriorates the accuracy, and we expect the data valuation algorithm to assign low values to the noisy samples. Removing those noisy least-valued samples from the training set (DVRL Least) should either improve the accuracy or at least degrade it at a slower rate than removing the most-valued samples with correct labels (DVRL Most). The results of this experiment are shown in Figure 6.4 and follow the aforementioned expectations. The average accuracy loss and CSC of using our obfuscation method are 0.00987 and -4.67 for this experiment.
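The evaluation protocol of this subsection can be summarized by the sketch below: sort the training samples by their estimated values, drop a growing fraction from one end, retrain the predictor, and record its test accuracy. The `train_predictor` and `evaluate` callables are placeholders supplied by the caller; the sketch only illustrates the protocol, not the exact experiment code.

```python
import numpy as np

def removal_curve(values, train_set, test_set, train_predictor, evaluate,
                  fractions=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5), remove="least"):
    """Accuracy after removing a growing fraction of the least- or most-valued
    samples (as in Figure 6.4). values[i] is the estimated value of sample i."""
    order = np.argsort(values)                 # ascending: least valued first
    if remove == "most":
        order = order[::-1]                    # most valued first
    accs = []
    for frac in fractions:
        n_drop = int(frac * len(order))
        keep = order[n_drop:]                  # drop the first n_drop samples in this order
        model = train_predictor([train_set[i] for i in keep])
        accs.append(evaluate(model, test_set))
    return accs
```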
6.3.2 Corrupted Sample Discovery
In certain cases, training sets may contain corrupted samples, for example, as a result of low-cost label collection methods. An automated corrupted sample discovery technique would be extremely useful for differentiating samples with clean labels from samples with noisy labels. In this case, data valuation can be employed by using a small clean validation set to assign low data values to samples with potentially noisy labels. If an optimal data value estimator were used, all noisy labels would receive the lowest data values.
We assume the same experimental setting as the second scenario of the previous subsection, in which 20% of the training set labels are noisy and incorrect. The results of this experiment are demonstrated in Figure 6.5.
6.3.3 Comparison to other obfuscation methods
We assume the adversary is interested in extracting sensitive hidden features of the data providers' data. In our experiments, we randomly assign 10% of the features as sensitive and the rest as non-sensitive. The attacker tries to infer the sensitive information from the obfuscated information. For example, if the embedding vector size of the feature extraction network is 100, we randomly choose 10 features as the sensitive ones. We consider the task of removing the least-valued half of the whole training set. In this experiment, during training of the data valuation network, we also train an adversary model to predict the sensitive features from the obfuscated ones. We compare the attacker's L1 loss and the target task accuracy of the proposed obfuscation method to the adversarial training [124] and feature pruning [137] methods. As demonstrated in Figure 6.6, the obfuscation method of this work outperforms the prior art in terms of limiting the attacker's utility.
6.4 Conclusions and Future Work
In this paper, we propose an obfuscation method for the data valuation problem, where data owners would like to sell their valuable data without revealing the raw information in their data. We use data valuation using reinforcement learning (DVRL) as our underlying valuation algorithm, which uses the obfuscated feature vectors for training the data value estimation network. Our experimental results demonstrate that the proposed approach outperforms existing techniques and achieves about four orders of magnitude reduction in the signal content while keeping the accuracy loss below 1%. In future work, we plan to study the information in the feature space that carries the actual data value; this would enable information obfuscation algorithms to properly capture the features that need to be preserved for accurate data valuation. In addition, future work can propose an obfuscation algorithm specifically designed for the data valuation task.
Figure 6.6: The comparison of different obfuscation methods on the least valued sample removal
task.
Chapter 7
Conclusion and Future Works
In this dissertation, we provided solutions for efficiently distributing the computations of a deep neural network between a mobile device and a cloud server. In addition, we addressed privacy concerns about sending raw information to cloud servers by proposing a privacy-preserving framework for private data valuation.
As future work, we suggest the following:
• Edge-cloud collaboration for video analytics. Processing videos locally on the device is challenging because of the lack of sufficient compute resources. Therefore, developing a hybrid mobile-cloud system for properly offloading video data is of great importance. The streaming nature of video data and its temporal correlation can be exploited for further reductions in communication and computation.
• Obfuscation methods designed for the data valuation task. Studying the information in the feature space that carries the actual data value would enable information obfuscation algorithms to properly capture the features necessary to preserve for accurate data valuation.
• Extension to hierarchical computing resources. In our formulations of mobile-cloud collaborative approaches, we solved the optimization problem for only two computing agents. However, this can be extended to scenarios with a hierarchy of computation nodes, which is especially the case in data centers.
References
1. Pouyanfar, S., Sadiq, S., Yan, Y ., Tian, H., Tao, Y ., Reyes, M. P., et al. A Survey on Deep
Learning: Algorithms, Techniques, and Applications. ACM Comput. Surv. 51, 92:1–92:36.
doi:10.1145/3234150 (2018).
2. Oh, K.-S. & Jung, K. GPU implementation of neural networks. 37, 1311–1314 (2004).
3. Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V ., et al. Large Scale
Distributed Deep Networks. NIPS’12 1223–1231 (2012).
4. Razlighi, M. S., Imani, M., Koushanfar, F. & Rosing, T. LookNN: Neural network with no
multiplication, 1775–1780. doi:10.23919/DATE.2017.7927280 (2017).
5. Sze, V ., Chen, Y ., Yang, T. & Emer, J. S. Efficient Processing of Deep Neural Networks:
A Tutorial and Survey. Proceedings of the IEEE 105, 2295–2329. doi:10.1109/JPROC.
2017.2761740 (2017).
6. Skala, K., Davidovic, D., Afgan, E., Sovic, I. & Sojat, Z. Scalable Distributed Computing
Hierarchy: Cloud, Fog and Dew Computing. Open Journal of Cloud Computing (OJCC) 2,
16–24 (2015).
7. Newsroom, A. The future is here: iPhone X. [Online; accessed 15-January-2018] (2017).
8. Li, H., Yu, J., Ye, Y . & Bregler, C. Realtime Facial Animation with On-the-fly Correctives.
ACM Trans. Graph. 32, 42:1–42:10 (2013).
9. Eshratifar, A. E., Esmaili, A. & Pedram, M. BottleNet: A Deep Learning Architecture for
Intelligent Mobile Cloud Computing Services, 1–6. doi:10.1109/ISLPED.2019.8824955
(2019).
10. Eshratifar, A. E., Esmaili, A. & Pedram, M. Towards Collaborative Intelligence Friendly
Architectures for Deep Learning, 14–19. doi:10.1109/ISQED.2019.8697647 (2019).
11. Eshratifar, A. E. & Pedram, M. Energy and Performance Efficient Computation Offloading
for Deep Neural Networks in a Mobile Cloud Computing Environment. GLSVLSI ’18 111–
116. doi:10.1145/3194554.3194565 (2018).
12. Pan, Y ., Cheng, C.-A., Saigol, K., Lee, K., Yan, X., Theodorou, E., et al. Agile Autonomous
Driving using End-to-End Deep Imitation Learning (2018).
13. Nazemi, M., Eshratifar, A. E. & Pedram, M. A Hardware-Friendly Algorithm for Scalable
Training and Deployment of Dimensionality Reduction Models on FPGA (2018).
14. Kang, Y ., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., et al. Neurosurgeon:
Collaborative Intelligence Between the Cloud and Mobile Edge. ASPLOS ’17 615–629.
doi:10.1145/3037697.3037698 (2017).
15. Chun, B.-G., Ihm, S., Maniatis, P., Naik, M. & Patti, A. CloneCloud: Elastic Execution
Between Mobile Device and Cloud. EuroSys ’11 301–314 (2011).
16. Teerapittayanon, S., McDanel, B. & Kung, H. T. Distributed Deep Neural Networks Over
the Cloud, the Edge and End Devices. 2017 IEEE 37th International Conference on Dis-
tributed Computing Systems (ICDCS), 328–339 (2017).
17. Ahmad, R. W., Gani, A., Hamid, S. H. A., Xia, F. & Shiraz, M. A Review on mobile ap-
plication energy profiling: Taxonomy, state-of-the-art, and open research issues. Journal of
Network and Computer Applications 58, 42–59. doi:https://doi.org/10.1016/j.
jnca.2015.09.002 (2015).
18. Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., et al. cuDNN:
Efficient Primitives for Deep Learning. CoRR abs/1410.0759 (2014).
19. Qi, H., Sparks, E. R. & Talwalkar, A. Paleo: A Performance Model for Deep Neural Net-
works (2017).
20. Hong, S. & Kim, H. An Integrated GPU Power and Performance Model. SIGARCH Comput.
Archit. News 38, 280–289. doi:10.1145/1816038.1815998 (2010).
21. He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778 (2016).
22. Wang, Z. & Crowcroft, J. Quality-of-service routing for supporting multimedia applications.
IEEE Journal on Selected Areas in Communications 14, 1228–1234. doi:10.1109/49.
536364 (1996).
23. Juttner, A., Szviatovski, B., Mecs, I. & Rajko, Z. Lagrange relaxation based method for the
QoS routing problem. 2, 859–868 vol.2 (2001).
24. Krizhevsky, A., Sutskever, I. & Hinton, G. E. in Advances in Neural Information Processing
Systems 25 (eds Pereira, F., Burges, C. J. C., Bottou, L. & Weinberger, K. Q.) 1097–1105
(Curran Associates, Inc., 2012).
25. (eds Bengio, Y . & LeCun, Y .) 2nd International Conference on Learning Representations,
ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings (2014).
26. (eds Bengio, Y . & LeCun, Y .) 3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015).
27. Hannun, A. Y ., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., et al. Deep
Speech: Scaling up end-to-end speech recognition. CoRR abs/1412.5567 (2014).
28. (eds Bengio, Y . & LeCun, Y .) 2nd International Conference on Learning Representations,
ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings (2014).
29. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al.
in Advances in Neural Information Processing Systems 27 (eds Ghahramani, Z., Welling,
M., Cortes, C., Lawrence, N. D. & Weinberger, K. Q.) 2672–2680 (Curran Associates, Inc.,
2014).
30. Finn, C. & Levine, S. Deep visual foresight for planning robot motion. 2017 IEEE Interna-
tional Conference on Robotics and Automation (ICRA), 2786–2793 (2016).
31. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA,
USA, June 7-12, 2015 (IEEE Computer Society, 2015).
32. 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Hon-
olulu, HI, USA, July 21-26, 2017 (IEEE Computer Society, 2017).
33. Corporation, N. Jetson TX2 Module. [Online; accessed 15-January-2018] (2018).
34. Incorporated, T. I. INA Current/Power Monitor. [Online; accessed 15-January-2018] (2018).
35. Corporation, N. TESLA DATA CENTER GPUS FOR SERVERS. [Online; accessed 15-
January-2018] (2018).
36. OpenSignal.com. State of Mobile Networks: USA. [Online; accessed 15-January-2018]
(2017).
37. OpenSignal.com. United States Speedtest Market Report. [Online; accessed 15-January-
2018] (2017).
38. Huang, J., Qian, F., Gerber, A., Mao, Z. M., Sen, S. & Spatscheck, O. A Close Examination
of Performance and Power Characteristics of 4G LTE Networks. MobiSys ’12 225–238
(2012).
39. Glorot, X., Bordes, A. & Bengio, Y . Deep Sparse Rectifier Neural Networks. Proceedings
of Machine Learning Research 15 (eds Gordon, G., Dunson, D. & Dudík, M.) 315–323
(2011).
40. Cover, T. M. & Thomas, J. A. Elements of Information Theory (Wiley Series in Telecommu-
nications and Signal Processing) (Wiley-Interscience, 2006).
41. Ra, M.-R., Sheth, A., Mummert, L., Pillai, P., Wetherall, D. & Govindan, R. Odessa: En-
abling Interactive Perception Applications on Mobile Devices. MobiSys ’11 43–56. doi:10.
1145/1999995.2000000 (2011).
42. Gordon, M. S., Jamshidi, D. A., Mahlke, S., Mao, Z. M. & Chen, X. COMET: Code Offload
by Migrating Execution Transparently. OSDI’12 93–106 (2012).
43. Cuervo, E., Balasubramanian, A., Cho, D.-k., Wolman, A., Saroiu, S., Chandra, R., et al.
MAUI: Making Smartphones Last Longer with Code Offload. MobiSys ’10 49–62. doi:10.
1145/1814433.1814441 (2010).
44. Wang, X., Liu, X., Zhang, Y . & Huang, G. Migration and Execution of JavaScript Appli-
cations Between Mobile Devices and Cloud. SPLASH ’12 83–84. doi:10.1145/2384716.
2384750 (2012).
45. Zhang, Y ., Huang, G., Liu, X., Zhang, W., Mei, H. & Yang, S. Refactoring Android Java
Code for On-demand Computation Offloading. SIGPLAN Not. 47, 233–248. doi:10.1145/
2398857.2384634 (2012).
46. Kumar, K., Liu, J., Lu, Y .-H. & Bhargava, B. A Survey of Computation Offloading for
Mobile Systems. Mobile Networks and Applications 18, 129–140. doi:10.1007/s11036-
012-0368-0 (2013).
47. Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolu-
tional neural networks, 1097–1105 (2012).
48. He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778 (2016).
49. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Region-based convolutional networks
for accurate object detection and segmentation. IEEE transactions on pattern analysis and
machine intelligence 38, 142–158 (2016).
50. Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., et al. Deep neural
networks for acoustic modeling in speech recognition: The shared views of four research
groups. IEEE Signal processing magazine 29, 82–97 (2012).
51. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations
of words and phrases and their compositionality, 3111–3119 (2013).
52. Choi, H. & Bajić, I. V. Deep Feature Compression for Collaborative Object Detection,
3743–3747. doi:10.1109/ICIP.2018.8451100 (2018).
53. Eshratifar, A. E., Abrishami, M. S. & Pedram, M. JointDNN: an efficient training and infer-
ence engine for intelligent mobile cloud computing services. arXiv preprint arXiv:1801.08618
(2018).
54. Eshratifar, A. E., Esmaili, A. & Pedram, M. Energy and Performance Efficient Compu-
tation Offloading for Deep Neural Networks in a Mobile Cloud Computing Environment.
GLSVLSI ’18 111–116. doi:10.1145/3194554.3194565 (2018).
55. Kang, Y ., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., et al. Neurosurgeon:
Collaborative intelligence between the cloud and mobile edge. ACM SIGPLAN Notices 52,
615–629 (2017).
56. Grulich, P. M. & Nawab, F. Collaborative edge and cloud neural networks for real-time
video processing. Proceedings of the VLDB Endowment 11, 2046–2049 (2018).
57. Chen, Z., Lin, W., Wang, S., Duan, L. & Kot, A. C. Intermediate Deep Feature Compression:
the Next Battlefield of Intelligent Sensing. arXiv preprint arXiv:1809.06196 (2018).
58. Choi, H. & Bajić, I. V. Near-Lossless Deep Feature Compression for Collaborative Intelli-
gence, 1–6. doi:10.1109/MMSP.2018.8547134 (2018).
59. NVIDIA TensorRT (2018).
60. Han, S., Mao, H. & Dally, W. J. Deep Compression: Compressing Deep Neural Network
with Pruning, Trained Quantization and Huffman Coding. CoRR abs/1510.00149 (2015).
61. Szegedy, C., Vanhoucke, V ., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the Inception Ar-
chitecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2818–2826 (2016).
62. Slee, M., Agarwal, A. & Kwiatkowski, M. Thrift: Scalable cross-language services imple-
mentation. Facebook White Paper 5 (2007).
63. Glorot, X. & Bengio, Y . Understanding the difficulty of training deep feedforward neural
networks. Proceedings of Machine Learning Research 9 (eds Teh, Y . W. & Titterington, M.)
249–256 (2010).
64. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K. & Wierstra, D. Matching Networks
for One Shot Learning. NIPS’16 3637–3645 (2016).
65. Simonyan, K. & Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image
Recognition. CoRR abs/1409.1556 (2014).
66. Taigman, Y ., Yang, M., Ranzato, M. & Wolf, L. Deepface: Closing the gap to human-level
performance in face verification, 1701–1708 (2014).
67. Eshratifar, A. E., Esmaili, A. & Pedram, M. Towards Collaborative Intelligence Friendly
Architectures for Deep Learning, 14–19. doi:10.1109/ISQED.2019.8697647 (2019).
68. Donahue, J., Jia, Y ., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., et al. DeCAF: A Deep
Convolutional Activation Feature for Generic Visual Recognition. ArXiv abs/1310.1531
(2013).
69. Parkhi, O. M., Vedaldi, A. & Zisserman, A. Deep Face Recognition (2015).
70. Sun, Y ., Chen, Y ., Wang, X. & Tang, X. Deep Learning Face Representation by Joint
Identification-Verification (2014).
71. Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., et al.
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (2015).
72. Bahdanau, D., Cho, K. & Bengio, Y . Neural Machine Translation by Jointly Learning to
Align and Translate. CoRR abs/1409.0473 (2014).
73. Canziani, A., Paszke, A. & Culurciello, E. An Analysis of Deep Neural Network Models
for Practical Applications. ArXiv abs/1605.07678 (2017).
74. Samragh, M., Javaheripi, M. & Koushanfar, F. EncoDeep: Realizing Bit-Flexible Encoding
for Deep Neural Networks. ACM Trans. Embed. Comput. Syst.
75. Abrishami, M. S., Eshratifar, A. E., Eigen, D., Wang, Y ., Nazarian, S. & Pedram, M. Ef-
ficient Training of Deep Convolutional Neural Networks by Augmentation in Embedding
Space, 347–351 (2020).
76. HeydariGorji, A., Rezaei, S., Torabzadehkashi, M., Bobarshad, H., Alves, V . & Chou, P. H.
HyperTune: Dynamic Hyperparameter Tuning For Efficient Distribution of DNN Training
Over Heterogeneous Systems. ArXiv abs/2007.08077 (2020).
77. HeydariGorji, A., Torabzadehkashi, M., Rezaei, S., Bobarshad, H., Alves, V . & Chou, P. H.
STANNIS: Low-Power Acceleration of Deep Neural Network Training Using Computa-
tional Storage. ArXiv abs/2002.07215 (2020).
78. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. & Fei-Fei, L. ImageNet: A large-scale
hierarchical image database (2009).
79. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., et al. Mo-
bileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. ArXiv
abs/1704.04861 (2017).
80. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet Classification with Deep Convolu-
tional Neural Networks. NIPS (2012).
81. Xie, S., Girshick, R. B., Dollár, P., Tu, Z. & He, K. Aggregated Residual Transformations
for Deep Neural Networks. 2017 IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), 5987–5995 (2016).
82. Han, S., Pool, J., Tran, J. & Dally, W. J. Learning both Weights and Connections for Efficient
Neural Network. NIPS (2015).
83. Rastegari, M., Ordonez, V ., Redmon, J. & Farhadi, A. XNOR-Net: ImageNet Classification
Using Binary Convolutional Neural Networks (2016).
84. Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M., et al. EIE: Efficient Inference
Engine on Compressed Deep Neural Network. ISCA (2016).
85. Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J. & Keutzer, K. SqueezeNet:
AlexNet-level accuracy with 50x fewer parameters and 1MB model size. ArXiv abs/1602.07360
(2017).
86. Georgiev, P., Bhattacharya, S., Lane, N. D. & Mascolo, C. Low-resource Multi-task Audio
Sensing for Mobile and Embedded Devices via Shared Deep Neural Network Representa-
tions. IMWUT (2017).
87. Hinton, G. E., Vinyals, O. & Dean, J. Distilling the Knowledge in a Neural Network. ArXiv
abs/1503.02531 (2015).
88. Tan, M., Chen, B., Pang, R., Vasudevan, V . & Le, Q. V . MnasNet: Platform-Aware Neural
Architecture Search for Mobile. CVPR (2018).
89. Kang, Y ., Hauswald, J., Gao, C., Rovinski, A., Mudge, T. N., Mars, J., et al. Neurosurgeon:
Collaborative Intelligence Between the Cloud and Mobile Edge (2017).
90. Eshratifar, A. E., Abrishami, M. S. & Pedram, M. JointDNN: An Efficient Training and
Inference Engine for Intelligent Mobile Cloud Computing Services. IEEE Transactions on
Mobile Computing, 1–1 (2019).
91. Eshratifar, A. E. & Pedram, M. Energy and Performance Efficient Computation Offloading
for Deep Neural Networks in a Mobile Cloud Computing Environment, 111–116 (2018).
92. Teerapittayanon, S., McDanel, B. & Kung, H. T. Distributed Deep Neural Networks Over
the Cloud, the Edge and End Devices. 2017 IEEE 37th International Conference on Dis-
tributed Computing Systems (ICDCS) (2017).
93. Taylor, B., Marco, V . S., Wolff, W., El-khatib, Y . & Wang, Z. Adaptive deep learning model
selection on embedded systems (2018).
94. Eshratifar, A. E., Esmaili, A. & Pedram, M. BottleNet: A Deep Learning Architecture for
Intelligent Mobile Cloud Computing Services, 1–6 (2019).
95. Eshratifar, A. E., Esmaili, A. & Pedram, M. Towards Collaborative Intelligence Friendly
Architectures for Deep Learning, 14–19 (2019).
96. Choi, H. & Bajić, I. V. Deep Feature Compression for Collaborative Object Detection,
3743–3747. doi:10.1109/ICIP.2018.8451100 (2018).
97. Bhattacharya, S. & Lane, N. D. Sparsification and Separation of Deep Learning Layers for
Constrained Resource Inference on Wearables (2016).
98. Lane, N. D., Bhattacharya, S., Georgiev, P., Forlivesi, C., Jiao, L., Qendro, L., et al. DeepX:
A Software Accelerator for Low-Power Deep Learning Inference on Mobile Devices. 2016
15th ACM/IEEE International Conference on Information Processing in Sensor Networks
(IPSN) (2016).
99. Loc, H. N., Lee, Y . & Balan, R. K. DeepMon: Mobile GPU-based Deep Learning Frame-
work for Continuous Vision Applications (2017).
100. Sagi, O. & Rokach, L. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data
Mining and Knowledge Discovery (2018).
101. Van der Maaten, L. & Hinton, G. E. Visualizing Data using t-SNE (2008).
102. Chen, W., Liu, T.-Y ., Lan, Y ., Ma, Z. & Li, H. Ranking Measures and Loss Functions in
Learning to Rank (2009).
103. Breiman, L. Stacked regressions. Machine Learning 24, 49–64 (1996).
104. Ookla. 2019 Speedtest U.S. Mobile Performance Report (2019).
105. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., et al. Automatic dif-
ferentiation in PyTorch (2017).
106. IOTA.
107. DAWEX.
108. Xignite.
109. Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., et al. Deep
Learning Scaling is Predictable, Empirically. ArXiv abs/1712.00409 (2017).
110. Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T., Seliya, N., Wald, R. & Muharemagic,
E. A. Deep learning applications and challenges in big data analytics. Journal of Big Data
2, 1–21 (2014).
111. Toneva, M., Sordoni, A., des Combes, R. T., Trischler, A., Bengio, Y . & Gordon, G. J. An
Empirical Study of Example Forgetting during Deep Neural Network Learning (2019).
112. Ferdowsi, H., Jagannathan, S. & Zawodniok, M. An Online Outlier Identification and Re-
moval Scheme for Improving Fault Detection Performance. IEEE Transactions on Neural
Networks and Learning Systems 25, 908–919. doi:10.1109/TNNLS.2013.2283456 (2014).
115
113. Frenay, B. & Verleysen, M. Classification in the Presence of Label Noise: A Survey. IEEE
Transactions on Neural Networks and Learning Systems 25, 845–869. doi:10.1109/TNNLS.
2013.2292894 (2014).
114. Ngiam, J., Peng, D., Vasudevan, V ., Kornblith, S., Le, Q. V . & Pang, R. Domain Adaptive
Transfer Learning with Specialist Models. ArXiv abs/1811.07056 (2018).
115. Zhu, L., Arik, S.
¨
O., Yang, Y . & Pfister, T. Learning to Transfer Learn. ArXiv abs/1908.11406
(2019).
116. Ghorbani, A. & Zou, J. Data Shapley: Equitable Valuation of Data for Machine Learning
(2019).
117. Yoon, J., Arik, S. & Pfister, T. Data Valuation using Reinforcement Learning. Proceedings
of Machine Learning Research 119 (eds III, H. D. & Singh, A.) 10842–10851 (2020).
118. Rathee, D., Rathee, M., Kumar, N., Chandran, N., Gupta, D., Rastogi, A., et al. CrypT-
Flow2: Practical 2-Party Secure Inference. CCS ’20 325–342. doi:10.1145/3372297.
3417274 (2020).
119. Dwork, C. Differential Privacy (eds Bugliesi, M., Preneel, B., Sassone, V . & Wegener, I.)
1–12 (2006).
120. Dwork, C., McSherry, F., Nissim, K. & Smith, A. Calibrating Noise to Sensitivity in Private
Data Analysis (eds Halevi, S. & Rabin, T.) 265–284 (2006).
121. Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S. & Smith, A. What Can
We Learn Privately?, 531–540. doi:10.1109/FOCS.2008.27 (2008).
122. Ullman, J. Tight Lower Bounds for Locally Differentially Private Selection (2021).
123. Kairouz, P., McMahan, B. H., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., et al. Ad-
vances and Open Problems in Federated Learning working paper or preprint. 2021.
124. Li, A., Guo, J., Yang, H. & Chen, Y . DeepObfuscator: Adversarial Training Framework for
Privacy-Preserving Image Classification. ArXiv abs/1909.04126 (2019).
125. Liu, S., Shrivastava, A., Du, J. & Zhong, L. Better accuracy with quantified privacy: repre-
sentations learned via reconstructive adversarial network (2019).
126. Jang, E., Gu, S. & Poole, B. Categorical Reparameterization with Gumbel-Softmax (2017).
127. Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic Backpropagation and Approximate
Inference in Deep Generative Models. ICML’14 II–1278–II–1286 (2014).
128. Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforce-
ment learning. Machine Learning 8, 229–256 (2004).
129. Kraskov, A., St¨ ogbauer, H. & Grassberger, P. Estimating mutual information. Physical re-
view. E, Statistical, nonlinear, and soft matter physics 69 6 Pt 2, 066138 (2004).
130. Mahendran, A. & Vedaldi, A. Understanding deep image representations by inverting them,
5188–5196. doi:10.1109/CVPR.2015.7299155 (2015).
131. Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of
multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 5
(2018).
132. Xiao, H., Rasul, K. & V ollgraf, R. Fashion-MNIST: a Novel Image Dataset for Benchmark-
ing Machine Learning Algorithms (2017).
133. Krizhevsky, A., Nair, V . & Hinton, G. CIFAR-10 (Canadian Institute for Advanced Re-
search).
134. Nilsback, M.-E. & Zisserman, A. Automated Flower Classification over a Large Number of
Classes (2008).
116
135. Buza, K. Feedback Prediction for Blogs (eds Spiliopoulou, M., Schmidt-Thieme, L. & Jan-
ning, R.) 145–152 (2014).
136. Kohavi, R. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid
(1996).
137. Li, H., Kadav, A., Durdanovic, I., Samet, H. & Graf, H. Pruning Filters for Efficient Con-
vNets. ArXiv abs/1608.08710 (2017).
117