Coded Computing: Mitigating Fundamental Bottlenecks in Large-scale Data Analytics

Author: Songze Li (songzeli@usc.edu)

A Dissertation Submitted to the Faculty of the USC Graduate School
In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Committee Members: Prof. Salman Avestimehr (Chair), Prof. Mahdi Soltanolkotabi, Prof. Leana Golubchik

Department of Electrical Engineering
University of Southern California
Los Angeles, CA 90089
August 2018

Abstract

In this dissertation, we introduce the concept of "coded computing", a novel distributed computing paradigm that uses coding theory to judiciously inject and leverage data/computation redundancy in distributed computing systems, achieving optimal tradeoffs between key resources such as computation, bandwidth, and storage, and significantly mitigating the fundamental performance bottlenecks of large-scale data analytics.

We consider the widely used MapReduce distributed computing framework, in which input files stored across computing nodes are first mapped into intermediate values, and the computed intermediate values are then shuffled between the nodes to be reduced into the final output results. For this framework, we characterize a fundamental inversely proportional tradeoff between the computation load in the Map phase and the communication load in the Shuffle phase. We propose a coded scheme, named "coded distributed computing" (CDC), that achieves this tradeoff. CDC performs redundant Map computations across r nodes following a particular structure, enabling coding opportunities to create coded packets that are simultaneously useful for r nodes, and hence reducing the communication load by a factor of r. For computation tasks with particular algebraic structures, e.g., multi-stage computations and computations with linear aggregation, we further optimize the CDC scheme to achieve additional reductions in bandwidth consumption. We illustrate the practical impact of CDC by developing and evaluating a novel distributed sorting algorithm, named CodedTeraSort, which integrates the coding techniques of CDC into TeraSort, a commonly used Hadoop sorting benchmark. On Amazon EC2 clusters, we empirically demonstrate a 3.39x speedup over the benchmark by optimally trading extra computation for bandwidth consumption using CodedTeraSort.

Beyond cloud computing systems, we extend the idea of CDC to the mobile edge computing environment, where many resource-poor mobile users scattered at the network edge collaborate to meet their computational needs by locally processing partial data and shuffling intermediate computation results via a wireless access point. For this edge computing model, we apply the CDC scheme to develop a scalable wireless distributed computing platform that can accommodate an unlimited number of mobile users with a constant bandwidth requirement, by communicating coded packets that are simultaneously useful for multiple users. This platform provides a promising solution for efficient collaborative edge/fog computing for Internet-of-Things (IoT) devices, substantially alleviating the heavy communication load that a huge number of devices would otherwise exert on the underlying bandwidth-limited wireless channels.
For another mobile edge computing model, in which mobile users offload their computation tasks to edge servers and receive the computation results from the servers, we propose a universal coded edge computing (UCEC) architecture that simultaneously minimizes the computation load at the edge servers and maximizes the physical-layer communication efficiency towards the mobile users. In the proposed UCEC architecture, edge servers create coded inputs of the users, from which they compute coded output results. The edge servers then utilize the computed coded results to create communication messages that zero-force all the interference signals over the air at each user. The proposed scheme is universal in the sense that the coded computations performed at the edge nodes are oblivious of the channel states during the communication process from the edge nodes to the users.

Finally, motivated by the idea of utilizing error correcting codes to handle the performance bottleneck caused by straggling servers, whose computation and/or communication is much slower than that of the other servers, we study the problem of designing optimal coding schemes to speed up MapReduce jobs run over heterogeneous computing clusters in which some of the servers may be stragglers. For this setting, we propose a unified coding scheme that organically superimposes the proposed CDC scheme on top of Maximum-Distance-Separable (MDS) codes applied to individual tasks, allowing a flexible tradeoff between the computation latency limited by the stragglers in the Map phase and the communication load between the surviving nodes in the Shuffle phase.

Acknowledgements

First and foremost, I would like to express my deepest respect and gratitude to my advisor, Prof. Salman Avestimehr. Salman sets a perfect example for me of being an excellent researcher and supervisor. With his outstanding vision, knowledge, and patience, Salman guided me through every step of identifying research directions, formulating problems, finding solutions, and presenting the results. As an advisor, Salman always emphasized working on the "right" problems, from which I learned how to find relevant problems and produce impactful results. I remember that we used to spend hours discussing technical details and the way to present the results, from which I learned that it is essential for a researcher to stay precise and rigorous. I also remember that Salman always spent a lot of effort practicing presentations with me to make sure that the audience gets as excited as we are. From this, I learned that communicating the results to others is as important as obtaining the results themselves. After all these years working with Salman, I gradually completed my transition from a graduate student to a mature researcher. During the summer of 2016, in collaboration with Dr. Mohammad Ali Maddah-Ali, we first identified the opportunity of applying coding theory to the domain of distributed computing, which has become my main research direction since then, and has drawn a lot of attention as a promising and important application of coding theory. Finally, I thank Prof. Avestimehr again for all the invaluable suggestions and help he provided for my future career after the PhD.

I would next like to thank Dr. Mohammad Ali Maddah-Ali from Bell Labs (now at Sharif University of Technology), who has been an amazing collaborator over the last three years. I really appreciate the perspectives of Dr.
Maddah-Ali in identifying problems, and the insights he brought to resolve technical difficulties. I would like to thank Qian Yu from my research group, who has done fantastic work to strengthen many of my results. I would also like to thank my colleague Sucha Supittayapornpong, who helped to implement our coded computing framework on the cloud.

I would like to thank the members of my qualifying exam and dissertation committees, Prof. Andreas Molisch, Prof. Viktor K. Prasanna, Prof. Bhaskar Krishnamachari, Prof. Mahdi Soltanolkotabi, Prof. Yan Liu, and Prof. Leana Golubchik, whose insightful feedback has helped to significantly improve the quality of this dissertation. I would like to thank Prof. Urbashi Mitra for her support and guidance during the early stage of my PhD study.

It has been an amazing experience for me to work with the brilliant group members at USC, including Navid Naderializadeh, David Kao, Mehrdad Kiamari, Qian Yu, Aly El Gamal, Eyal En Gad, Mohammadreza Mousavi Kalan, Saurav Prakash, Chien-Sheng Yang, and Mingchao (Fisher) Yu. It is also a great honor to work and interact with fantastic colleagues at the Communication Sciences Institute (CSI) of USC, including Zheda Li, Rui Wang, Hao Feng, Vinod Kristem, Daoud Burghal, Mingyue Ji, Hao Yu, and Xiaohan Wei. I would also like to thank the staff of CSI, Susan Wiedem, Gerrielyn Ramos, and Corine Wong, who have provided me great help during my years with CSI.

Last but not least, I would like to express my deepest thanks to my beloved family, my wife Aina Che, my son Lucas Li, and my parents Daming Li and Xin Cheng, for always caring for, understanding, and supporting me. This dissertation is dedicated to them.

Contents

Abstract
Acknowledgements
Contents
List of Figures
1 Introduction
2 A Fundamental Tradeoff between Computation and Communication
2.1 Problem Formulation
2.2 Main Results
2.3 Illustrative Examples: Coded Distributed Computing
2.4 General Achievable Scheme: Coded Distributed Computing
2.4.1 Map Phase Design
2.4.2 Coded Data Shuffling
2.4.2.1 Encoding
2.4.2.2 Decoding
2.4.3 Correctness of CDC
2.4.4 Communication Load
2.4.5 Non-Integer Valued Computation Load
2.5 Converse of Theorem 2.1
2.6 Extension of CDC: linear computations and compressed CDC
2.6.1 Linear Reduce functions
2.6.2 Network model
2.6.3 Computation model
2.6.4 Main Results
2.6.5 Description of the compressed CDC scheme
2.7 Extension of CDC: multi-stage dataflows
2.7.1 Problem Formulation: Layered-DAG
2.7.2 CDC for Layered-DAG
3 Coded TeraSort
3.1 Execution Time of Coded Distributed Computing (CDC)
3.2 TeraSort
3.2.1 Algorithm Description
3.2.1.1 File Placement
3.2.1.2 Key Domain Partitioning
3.2.1.3 Map Stage
3.2.1.4 Shuffle Stage
3.2.1.5 Reduce Stage
3.2.2 Performance Evaluation
3.3 Coded TeraSort
3.3.1 Structured Redundant File Placement
3.3.2 Map
3.3.3 Encoding to Create Coded Packets
3.3.4 Multicast Shuffling
3.3.5 Decoding
3.3.6 Reduce
3.4 Evaluation
3.4.1 Implementation Choices
3.4.2 Experiment Setup
3.4.3 Experiment Results
3.5 Conclusion and Future Directions
4 A Scalable Framework for Wireless Distributed Computing
4.1 System Model
4.2 The Proposed CWDC Scheme
4.3 The Proposed CWDC Scheme for the Decentralized Setting
4.4 Optimality of the Proposed CWDC Schemes
4.4.1 Lower Bound on L*_u(U)
4.4.2 Lower Bound on L*_d(U)
5 A Universal Coded Computing Architecture for Mobile Edge Processing
5.1 Problem Formulation
5.1.1 Computation phase
5.1.2 Communication phase
5.2 Motivation and Main Results
5.3 Illustration of the Universal Coded Edge Computing scheme via a simple example
5.4 Universal Coded Edge Computing Architecture
5.4.1 Overview of UCEC
5.4.2 Generating coded inputs
5.4.3 Computation phase
5.4.4 Communication phase
5.5 Robust Universal Coded Edge Computing
5.5.1 Generating coded inputs
5.5.2 Computation phase
5.5.3 Communication phase
5.6 Conclusion
6 A Unified Coding Framework with Straggling Servers
6.1 Problem Formulation
6.1.1 System Model
6.1.2 Distributed Computing Model
6.1.3 Illustrating Example
6.2 Main Results
6.3 Proposed Coded Framework
6.3.1 Example: m = 20, N = 12, K = 6 and μ = 1/2
6.3.2 General Scheme
6.4 Converse of Theorem 6.2
A Converse of Theorem 2.2
B Coded TeraSort Experiment Results
C Constant Multiplicative Gap of Minimum Bandwidth Code
Bibliography

List of Figures

2.1 Comparison of the communication load achieved by Coded Distributed Computing L_coded(r) with that of the uncoded scheme L_uncoded(r), for Q = 10 output functions, N = 2520 input files and K = 10 computing nodes. For r ∈ {1,...,K}, CDC is r times better than the uncoded scheme.
2.2 Illustration of a two-stage distributed computing framework. The overall computation is decomposed into computing a set of Map and Reduce functions.
2.3 Minimum communication load L^*(r,s) = L_coded(r,s) in Theorem 2.2, for Q = 360 output functions, N = 2520 input files and K = 10 computing nodes.
2.4 Illustrations of the conventional uncoded distributed computing scheme with computation load r = 1, and the proposed Coded Distributed Computing scheme with computation load r = 2, for computing Q = 3 functions from N = 6 inputs on K = 3 nodes.
2.5 Illustration of the CDC scheme to compute Q = 6 output functions from N = 6 input files distributedly at K = 4 computing nodes. Each file is mapped by r = 2 nodes and each output function is computed by s = 2 nodes. After the Map phase, every node knows 6 intermediate values, one for each output function, in every file it has mapped. The Shuffle phase proceeds in two rounds. In the first round, each node multicasts a bit-wise XOR of intermediate values to subsets of two nodes. In the second round, each node splits an intermediate value v_{q,n} evenly into two segments v_{q,n} = (v^{(1)}_{q,n}, v^{(2)}_{q,n}), and multicasts two linear combinations of the segments, constructed using coefficients α_1, α_2, and α_3, to the other three nodes.
2.6 A file assignment for N = 6 files and K = 3 nodes.
2.7 File placement onto K = 4 computing nodes. For each j = 1, 2, 3, 4, we place the set of files for job j, {1^{(j)}, 2^{(j)},..., 6^{(j)}}, onto a unique subset of μK + 1 = 3 nodes, following a repetitive pattern where each file is stored on μK = 2 nodes.
2.8 Illustration of the operations in the second stage of compressed CDC, in the subset of Nodes 1, 2, and 3.
Note that in this stage, pre-combined packets from different jobs are utilized to create coded multicast packets.
2.9 A 4-layer DAG.
2.10 A diamond DAG. The reduce factors s_1,...,s_4 are determined by the computation loads r_2, r_3, r_4.
2.11 Illustration of the mapped files in the Map phases of the vertices m_2 and m_3 in the diamond DAG, for the case Q_1 = N_2 = N_3 = 6, r_2 = 2, and r_3 = 1.
3.1 Illustration of conventional TeraSort with K = 4 nodes and key domain partitions [0, 25), [25, 50), [50, 75), [75, 100]. A dotted box represents an input file. An input file is hashed into 4 groups of KV pairs, one for each partition. For each of the 4 partitions, the 4 groups of KV pairs belonging to that partition computed on 4 nodes are all fetched to a corresponding node, which sorts all KV pairs in that partition locally.
3.2 An illustration of the structured redundant file placement in CodedTeraSort with K = 4 nodes and r = 2.
3.3 An illustration of the Map stage at Node 1 in CodedTeraSort with K = 4, r = 2 and the key partitions [0, 25), [25, 50), [50, 75), [75, 100].
3.4 An illustration of the encoding process within a multicast group M = {1, 2, 3}.
3.5 An illustration of the decoding process within a multicast group M = {1, 2, 3}.
3.6 The coordinator-worker system architecture.
3.7 (a) Serial unicast in the Shuffle stage of TeraSort; a solid arrow represents a unicast. (b) Serial multicast in the Multicast Shuffle stage of CodedTeraSort; a group of solid arrows starting at the same node represents a multicast.
4.1 Illustration of the CWDC scheme for an example of 3 mobile users.
4.2 Comparison of the communication loads achieved by the uncoded scheme with those achieved by the proposed CWDC scheme, for a network of K = 20 users. Here the storage size μ ≥ 1/K = 0.05, such that the entire dataset can be stored across the users.
4.3 A wireless distributed computing system.
4.4 A two-stage distributed computing framework decomposed into Map and Reduce functions.
4.5 Comparison of the communication loads achieved by the centralized and the decentralized CWDC schemes, for a network of K = 20 participating users.
4.6 Concentration of the number of users each file is stored at around μK. Each curve demonstrates the normalized fraction of files that are stored by different numbers of users, for a particular number of participating users K. The density functions are computed for a storage size μ = 0.4, and for K = 2^3,..., 2^7.
5.1 A mobile edge computing system consisting of K mobile users and M edge nodes. The edge processing consists of the computation phase and the communication phase. The users' requests are processed at the edge nodes in the computation phase, and the computed results are delivered back to the users in the communication phase.
5.2 Channel-state-informed coded edge computing for the scenario of K = 2 users and M = 2 edge nodes.
Using channel state information to design coded computations allows zero-forcing the interference signal at each user.
6.1 The Latency-Load tradeoff, for a distributed matrix multiplication job of computing N = 840 output vectors using K = 14 servers, each with a storage size μ = 1/2.
6.2 Illustration of the Minimum Bandwidth Code in [1] and the Minimum Latency Code in [2]. (a) Minimum Bandwidth Code. Every row of A is multiplied with the input vectors twice. For k = 1, 2, 3, 4, Server k reduces the output vector y_k. In the Shuffle phase, each server multicasts 3 bit-wise XORs, denoted by ⊕, of the calculated intermediate values, each of which is simultaneously useful for two other servers. (b) Minimum Latency Code. A is encoded into 24 coded rows c_1,..., c_24. Servers 1 and 3 finish their Map computations first. They then exchange a sufficient number (6 for each output vector) of intermediate values to reduce y_1, y_2 at Server 1 and y_3, y_4 at Server 3.
6.3 Comparison of the latency-load pairs achieved by the proposed scheme with the outer bound, for computing N = 180 output vectors using K = 18 servers, each with a storage size μ = 1/3, assuming the distribution function of the Map time in (6.6).
6.4 Storage design when the Map phase is terminated once 4 servers have finished their computations.
6.5 Multicasting 9 coded intermediate values across Servers 1, 2 and 3. Similar coded multicast communications are performed for another 3 subsets of 3 servers.
6.6 General MDS coding and storage design.
6.7 Cut-set of Servers 1,...,t for the compound setting consisting of the ⌈q/t⌉ output assignments in (6.18).

To my wife Aina Che, and my parents Daming Li and Xin Cheng.

Chapter 1
Introduction

Recent years have witnessed a rapid growth of large-scale machine learning and big data analytics, facilitating the development of data-intensive applications such as voice/image recognition, real-time mapping services, autonomous driving, social networks, and augmented/virtual reality. These applications are supported by cloud infrastructures composed of large data centers. For example, to support around 1 billion daily active users, Facebook has built four huge data centers, with another two under construction [3]. Within a data center, a massive amount of user data is stored distributedly on hundreds of thousands of low-end commodity servers, and any big data analytics application has to be performed in a distributed manner within or across data centers. This has motivated the rapid development of scalable, interpretable, and fault-tolerant distributed computing frameworks (see, e.g., [4–8]) that efficiently utilize the underlying hardware resources (e.g., CPUs and GPUs).

It is well known that communicating intermediate computation results (or data shuffling) is one of the major performance bottlenecks for various distributed computing applications, including self-join [9], TeraSort [10], and many machine learning algorithms [11]. For instance, in a Facebook Hadoop cluster, it is observed that 33% of the overall job execution time is spent on data shuffling [11].
Similarly, as observed in [12], 70% of the overall job execution time is spent on data shuffling when running a self-join job on an Amazon EC2 cluster. This bottleneck is becoming worse for training deep neural networks with millions of model parameters (e.g., ResNet-50 [13]), where partial gradients with millions of entries are computed at distributed computing nodes and passed across the network to update the model parameters [14].

Many optimization methods have been proposed to alleviate the communication bottleneck in distributed computing systems. For example, from the algorithm perspective, when the function that reduces the final result is commutative and associative, it has been proposed to pre-combine intermediate results before data shuffling, reducing the amount of data movement through the network [4, 15]. More generally, the line of work on communication complexity [16–18] studies the problem of minimizing the size of the messages exchanged between distributed computing parties, exploiting the algebraic properties of the computed functions. On the other hand, from the system perspective, optimal flow scheduling across network paths has been designed to accelerate the data shuffling process [19, 20], and distributed cache memories have been utilized to speed up the data transfer between consecutive computation stages [21, 22]. Recently, motivated by the fact that training algorithms exhibit tolerance to precision loss of intermediate results, a family of lossy compression (or quantization) algorithms for distributed learning systems has been developed to compress the intermediate results (e.g., gradients), so that the compressed results can be communicated with a smaller bandwidth consumption (see, e.g., [23–25]).

The above approaches are designed for specific types of computations or particular network structures, and are difficult to generalize to arbitrary distributed computing tasks. In this document, we propose to utilize coding theory to slash the communication bottleneck in distributed computing applications. In particular, we consider a general MapReduce-type distributed computing model [4], in which each input file is mapped into multiple intermediate values, one for each of the output functions, and the intermediate values from all input files for each output function are collected and reduced to the final output result. For this model, we propose a coded computing scheme, named "coded distributed computing" (CDC), that trades extra local computation for reduced network bandwidth consumption. For some design parameter r, termed the "computation load", the CDC scheme places and maps each of the input files on r carefully chosen distributed computing nodes, injecting r times more local computation. In return, the redundant computations at the nodes produce side information, which enables opportunities to create coded multicast packets during data shuffling that are simultaneously useful for r nodes. That is, the CDC scheme trades r times more redundant computation for an r-fold reduction in the communication load. Furthermore, we theoretically demonstrate that the inversely proportional tradeoff between computation and communication achieved by CDC is fundamental, i.e., for a given computation load, no other distributed computing scheme can achieve a lower communication load than that achieved by CDC. Having developed the CDC framework for a general
distributed computing model, we further extend it to accommodate the scenario where the Reduce (aggregation) function is linear, and the scenario of multi-stage computations where the entire computation consists of multiple rounds of MapReduce executions. Exploiting the specific structures of these computing scenarios, the extended schemes achieve additional reductions in the communication load on top of the CDC scheme.

Having proposed the CDC framework and characterized its optimal performance in trading extra computation for communication bandwidth, we also empirically demonstrate its practical impact on speeding up distributed computing applications. In particular, we integrate the principle of CDC into TeraSort [26], a distributed sorting algorithm that is a commonly used benchmark in Hadoop MapReduce, developing a novel distributed sorting algorithm named CodedTeraSort. At a high level, CodedTeraSort imposes structured redundancy on the input data, enabling in-network coding opportunities to significantly slash the load of data shuffling, which is a major bottleneck of the run-time performance of TeraSort. Through extensive experiments on Amazon EC2 [27] clusters, we demonstrate that CodedTeraSort achieves a 1.97x–3.39x speedup over TeraSort for typical settings of interest. Despite the extra overhead imposed by coding (e.g., generation of the coding plan, data encoding and decoding), the practically achieved performance gain approximately matches the gain theoretically promised by the proposed CodedTeraSort algorithm.

Having demonstrated the impact of coding on improving the performance of applications run in data centers, we also introduce the concept of coded computing to tackle the scenarios of mobile edge/fog computing, where the communication bottleneck is even more severe due to the low data rates and the large number of mobile users. In particular, we consider a wireless distributed computing platform composed of a cluster of mobile users scattered around the network edge and connected wirelessly through an access point. Each user has limited storage and processing capability, and the users have to collaborate to satisfy their computational needs, which require processing a large dataset. This ad hoc computing model, in contrast to the centralized cloud computing model, is becoming increasingly common in the emerging fog computing paradigm for Internet-of-Things (IoT) applications [28, 29]. For this wireless distributed computing platform, following the principle of the CDC scheme, we propose a coded wireless distributed computing (CWDC) scheme that jointly designs the local storage and computation for each user, and the communication between the users through an intermediate access point. The CWDC scheme achieves a constant bandwidth consumption that is independent of the number of users in the network, which leads to a scalable design of the platform that can simultaneously accommodate an arbitrary number of users. Moreover, we consider a more practically important decentralized setting, in which each user needs to decide its local storage and computation independently, without knowing the existence of any other participating users. In this case, we extend the proposed CWDC scheme to achieve a bandwidth consumption that is very close to that of the centralized setting.
In the above mobile distributed computing model, the computations are performed by the users, each of whom has a part of the dataset, and the access point (or edge node) is only responsible for communication between the users. In this document, we also consider another prevalent mobile edge computing model, where mobile users (e.g., smartphones and smart cars) offload their computation requests to computing nodes distributed at the network edge (e.g., base stations). The edge nodes process the offloaded requests in the "computation phase", and return the results to the users in the "communication phase" through wireless links. With emphasis on the role of coding in both the computation and communication phases, we propose a "universal coded edge computing" (UCEC) scheme to simultaneously minimize the computation load at the edge nodes and maximize the physical-layer communication efficiency towards the mobile users. In the proposed UCEC scheme, edge nodes create coded input requests of the users, from which they compute coded output results. The edge nodes then utilize the computed coded results to create communication messages that zero-force all the interference signals over the air at each user. The proposed scheme is universal in the sense that the coded computations performed at the edge nodes are oblivious of the channel states during the communication process from the edge nodes to the users. Finally, we extend the UCEC scheme to be robust to missing a subset of edge nodes in the communication phase. The extended UCEC scheme enables the delivery of the computation results from any subset of the edge nodes at the cost of slightly more computation, effectively overcoming the problem of losing connections between edge nodes and users, which can be caused by users' mobility or by edge nodes that compute significantly slower than the others.

Other than data shuffling, another major performance bottleneck of distributed computing applications is the effect of stragglers. That is, the execution time of a computation consisting of multiple parallel tasks is limited by the slowest task run on a straggling processor. In Hadoop MapReduce [30], the original open-source implementation of MapReduce, stragglers are constantly detected and the slow tasks are speculatively restarted on other available nodes. Following this idea of straggler detection, more timely straggler detection algorithms and better scheduling algorithms have been developed to further alleviate the straggler effect (see, e.g., [31, 32]). Apart from straggler detection and speculative restart, another straggler mitigation technique is to schedule clones of the same task (see, e.g., [33–37]). The underlying idea of cloning is to execute redundant tasks so that the computation can proceed as soon as the results of the fast-responding clones have returned. Recently, it has been proposed to utilize error correcting codes for straggler mitigation [2, 38–40]. It was shown in [2] that applying the maximum-distance-separable (MDS) code [41] to distributed matrix multiplication significantly improves the robustness against stragglers over the state-of-the-art approaches (e.g., cloning), hence minimizing the overall computation latency. We refer to the above CDC scheme as the minimum bandwidth code, and the above MDS coding strategy as the minimum latency code.
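The short sketch below illustrates the MDS-coded matrix multiplication idea attributed to [2] above. It is a minimal illustration, not the exact construction in the cited work: the real-valued Vandermonde encoding matrix, the parameter names (K, k, A, x), and the choice of which servers straggle are assumptions made here for the example.

```python
import numpy as np

# Illustrative sketch of straggler mitigation via a (K, k) MDS-style code:
# A @ x is recoverable from the results of ANY k of the K servers.
K, k = 6, 4                      # 6 servers; any 4 finished servers suffice
m, n = 8, 5                      # A is (m x n), split row-wise into k blocks
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

blocks = np.split(A, k)          # k row-blocks of A, each of shape (m/k, n)
# Vandermonde generator: any k of its K rows form an invertible matrix.
G = np.vander(np.arange(1, K + 1), k, increasing=True).astype(float)

# Server i stores the coded block sum_j G[i, j] * blocks[j] and multiplies it by x.
coded_results = [sum(G[i, j] * blocks[j] for j in range(k)) @ x for i in range(K)]

# Suppose two servers straggle; decode from any k survivors by inverting
# the corresponding k x k submatrix of G.
survivors = [0, 2, 3, 5]
Y = np.stack([coded_results[i] for i in survivors])    # shape (k, m/k)
decoded = np.linalg.solve(G[survivors, :], Y)          # rows are blocks[j] @ x
y_hat = decoded.reshape(-1)

assert np.allclose(y_hat, A @ x)                       # full result recovered
```

The design choice this sketch highlights is the one the chapter builds on: redundancy is added by encoding the stored data rather than by cloning tasks, so the job finishes as soon as the fastest k servers respond.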
In this document, we propose a unified coding framework for a class of linear distributed computing jobs executed following a MapReduce model, which organically superimposes the minimum bandwidth code on top of the minimum latency code. This framework allows us to characterize a tradeoff between the computation latency in the Map phase and the load of communication between the surviving servers in the Shuffle phase, with the minimum bandwidth code and the minimum latency code as the two extreme points: minimizing the communication load or the computation latency, respectively. On the proposed tradeoff, one can select the optimal operating point to minimize the overall job execution time.

Chapter 2
A Fundamental Tradeoff between Computation and Communication

(This chapter is mainly taken from [1, 42–44], coauthored by the author of this document.)

We consider a general distributed computing framework, motivated by prevalent structures like MapReduce [4] and Spark [5], in which the overall computation is decomposed into two stages: "Map" and "Reduce". First, in the Map stage, distributed computing nodes process parts of the input data locally, generating intermediate values according to their designed Map functions. Next, they exchange the calculated intermediate values among each other (a.k.a. data shuffling), in order to calculate the final output results distributedly using their designed Reduce functions.

Within this framework, data shuffling often limits the performance of distributed computing applications, including self-join [9], TeraSort [10], and machine learning algorithms [11]. For example, in a Facebook Hadoop cluster, it is observed that 33% of the overall job execution time is spent on data shuffling [11]. Similarly, as observed in [12], 70% of the overall job execution time is spent on data shuffling when running a self-join application on an Amazon EC2 cluster [27]. Thus motivated, we ask the fundamental question of whether coding can help distributed computing by reducing the communication load and speeding up the overall computation. Coding is known to be helpful in coping with channel uncertainty in telecommunication, and also in reducing the storage cost in distributed storage systems and cache networks. In this work, we extend the application of coding to distributed computing and propose a framework to substantially reduce the load of data shuffling via coding and some extra computing in the Map phase.

More specifically, we formulate and characterize a fundamental tradeoff relationship between the "computation load" in the Map phase and the "communication load" in the data shuffling phase, and demonstrate that the two are inversely proportional to each other. We propose an optimal coded scheme, named "Coded Distributed Computing" (CDC), which demonstrates that increasing the computation load of the Map phase by a factor of r (i.e., evaluating each Map function at r carefully chosen nodes) can create novel coding opportunities in the data shuffling phase that reduce the communication load by the same factor.

To illustrate our main result, consider a distributed computing framework to compute Q arbitrary output functions from N input files, using K distributed computing nodes. As mentioned earlier, the overall computation is performed by computing a set of Map and Reduce functions distributedly across the K nodes.
In the Map phase, each input file is processed locally, at one of the nodes, to generate Q intermediate values, each corresponding to one of the Q output functions. Thus, at the end of this phase, QN intermediate values are calculated, which can be split into Q subsets of N intermediate values, where each subset is needed to calculate one of the output functions. In the Shuffle phase, for every output function to be calculated, all N intermediate values corresponding to that function are transferred to one of the nodes for reduction. Of course, depending on the node that has been chosen to reduce an output function, a part of the intermediate values is already available locally and does not need to be transferred in the Shuffle phase. This is because the Map phase has been carried out on the same set of nodes, and the results of mapping done at a node can remain in that node to be used for the Reduce phase. This offers some saving in the communication load.

To reduce the communication load even more, we may map each input file on more than one node. Apparently, this increases the fraction of intermediate values that are locally available. However, as we will show, there is a better way to exploit this redundancy in computation to reduce the communication load. The main message of this chapter is to show that, by following a particular pattern in repeating Map computations along with some coding techniques, we can significantly reduce the communication load. Perhaps surprisingly, we show that the gain of coding in reducing the communication load scales with the size of the network.

To be more precise, we define the computation load r, 1 ≤ r ≤ K, as the total number of computed Map functions at the nodes, normalized by N. For example, r = 1 means that none of the Map functions has been re-computed, and r = 2 means that on average each Map function is computed on two nodes. We also define the communication load L, 0 ≤ L ≤ 1, as the total amount of information exchanged across nodes in the shuffling phase, normalized by the size of the QN intermediate values, in order to compute the Q output functions disjointly and uniformly across the K nodes. Based on this formulation, we now ask the following fundamental question:

• Given a computation load r in the Map phase, what is the minimum communication load L^*(r), using any data shuffling scheme, needed to compute the final output functions?

We propose Coded Distributed Computing (CDC), which achieves a communication load of L_coded(r) = (1/r)·(1 − r/K) for r = 1,...,K, and the lower convex envelope of these points. CDC employs a specific strategy to assign the computations of the Map and Reduce functions across the computing nodes, in order to enable novel coding opportunities for data shuffling. In particular, for a computation load r ∈ {1,...,K}, CDC utilizes a carefully designed repetitive mapping of data blocks at r distinct nodes to create coded multicast messages that deliver data simultaneously to a subset of r ≥ 1 nodes. Hence, compared with an uncoded data shuffling scheme, which as we show later achieves a communication load L_uncoded(r) = 1 − r/K, CDC is able to reduce the communication load by exactly a factor of the computation load r.
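To make the comparison concrete, the short sketch below simply evaluates the two load expressions stated above, L_uncoded(r) = 1 − r/K and L_coded(r) = (1/r)·(1 − r/K), for the K = 10 setting of Fig. 2.1; the function names are ours and are purely illustrative.

```python
K = 10  # number of computing nodes, as in Fig. 2.1

def L_uncoded(r: int) -> float:
    """Uncoded shuffling load, 1 - r/K."""
    return 1 - r / K

def L_coded(r: int) -> float:
    """CDC load, (1/r) * (1 - r/K)."""
    return (1 - r / K) / r

for r in range(1, K + 1):
    print(f"r = {r:2d}:  L_uncoded = {L_uncoded(r):.3f}   L_coded = {L_coded(r):.3f}")

# For r = 2 the uncoded load only drops from 0.900 to 0.800, while the CDC
# load drops to 0.400; the gap between the two columns is exactly the factor
# r multicast gain discussed above.
```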
Furthermore, the proposed CDC scheme applies to a more general distributed computing framework in which every output function is computed by more than one node, specifically by s ∈ {1,...,K} nodes, which provides better fault-tolerance in distributed computing.

Figure 2.1: Comparison of the communication load achieved by Coded Distributed Computing L_coded(r) with that of the uncoded scheme L_uncoded(r), for Q = 10 output functions, N = 2520 input files and K = 10 computing nodes. For r ∈ {1,...,K}, CDC is r times better than the uncoded scheme.

We numerically compare the computation-communication tradeoffs of CDC and uncoded data shuffling schemes (i.e., L_coded(r) and L_uncoded(r)) in Fig. 2.1. As illustrated, in the uncoded scheme, which achieves a communication load L_uncoded(r) = 1 − r/K, increasing the computation load r offers only a modest reduction in the communication load. In fact, for any r, this gain vanishes for a large number of nodes K. Consequently, it is not justified to trade computation for communication using uncoded schemes. However, for the coded scheme, which achieves a communication load of L_coded(r) = (1/r)·(1 − r/K), increasing the computation load r significantly reduces the communication load, and this gain does not vanish for large K. For example, as illustrated in Fig. 2.1, when mapping each file at one extra node (r = 2), CDC reduces the communication load by 55.6%, while the uncoded scheme only reduces it by 11.1%.

We also prove an information-theoretic lower bound on the minimum communication load L^*(r). To prove the lower bound, we derive a lower bound on the total number of bits communicated by any subset of nodes, using induction on the size of the subset. To derive the lower bound for a particular subset of nodes, we first establish a lower bound on the number of bits needed by one of the nodes to recover the intermediate values it needs to calculate its assigned output functions, and then utilize the bound on the number of bits communicated by the rest of the nodes in that subset, which is given by the inductive argument. The derived lower bound on L^*(r) matches the communication load achieved by the CDC scheme for any computation load 1 ≤ r ≤ K. As a result, we exactly characterize the optimal tradeoff between the computation load and the communication load as follows:

  L^*(r) = L_coded(r) = (1/r) · (1 − r/K),  r ∈ {1,...,K}.

For general 1 ≤ r ≤ K, L^*(r) is the lower convex envelope of the points {(r, L_coded(r)) : r ∈ {1,...,K}}. Note that for large K, (1/r)·(1 − r/K) ≈ 1/r, hence L^*(r) ≈ 1/r. This result reveals a fundamental inversely proportional relationship between the computation load and the communication load in distributed computing. It also illustrates that the gain of 1/r achieved by CDC is optimal and cannot be improved by any other scheme (since L_coded(r) matches an information-theoretic lower bound on L^*(r) that applies to any data shuffling scheme).

Finally, we extend the CDC scheme to deal with the following two distributed computing scenarios: 1) the Reduce function is a linear combination of the intermediate values from the input files, and 2) multi-stage computations where each stage consists of executing multiple MapReduce jobs. For linear Reduce functions, we propose a coded computing
scheme named "compressed CDC" that first pre-combines intermediate values for the same function, and then applies the coded multicasting techniques of CDC on the combined packets, achieving an additional data compression gain on top of the multicasting gain of CDC in slashing the communication load. For a multi-stage computation task in which the output results of one stage are the input data for the next stage, we can place both data and computations repetitively in a structured manner, and use the CDC scheme to minimize the communication load of the computations in each stage.

Related Works. The problem of characterizing the minimum communication for distributed computing has been previously considered in several settings in both the computer science and information theory communities. In [16], a basic computing model is proposed, where two parties have x and y and aim to compute a Boolean function f(x,y) by exchanging the minimum number of bits between them. Also, the problem of minimizing the required communication for computing the modulo-two sum of distributed binary sources with a symmetric joint distribution was introduced in [45]. Following these two seminal works, a wide range of communication problems in the scope of distributed computing have been studied (see, e.g., [17, 18, 46–49]). The key differences distinguishing the setting in this chapter from most of the prior ones are: 1) we focus on the flow of communication in a general distributed computing framework, motivated by MapReduce, rather than on the structures of the functions or the input distributions; 2) we do not impose any constraint on the numbers of output results, input data files, and computing nodes (they can be arbitrarily large); and 3) we do not assume any special property (e.g., linearity) of the computed functions.

The idea of efficiently creating and exploiting coded multicasting was initially proposed in the context of cache networks in [50, 51], and extended in [52, 53], where caches pre-fetch part of the content in a way that enables coding during the content delivery, minimizing the network traffic. In this work, we propose a framework to study the tradeoff between computation and communication in distributed computing. We demonstrate that the coded multicasting opportunities exploited in the above caching problems also exist in the data shuffling of distributed computing frameworks, and can be created by a strategy of repeating the computations of the Map functions specified by the Coded Distributed Computing (CDC) scheme.

Finally, in a recent work [2], the authors proposed methods for utilizing codes to speed up some specific distributed machine learning algorithms. The problem considered in this work differs from [2] in the following aspects. We propose a general methodology for utilizing coding in data shuffling that can be applied to any distributed computing framework with a MapReduce structure, regardless of the underlying application. In other words, any distributed computing algorithm that fits into the MapReduce framework can benefit from the proposed CDC solution. We also characterize the information-theoretic computation-communication tradeoff in such frameworks. Furthermore, the coding used in [2] is at the application layer (i.e., applying computation on coded data), while in this work we focus on applying codes directly on the shuffled data.
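As a minimal illustration of "applying codes directly on the shuffled data", the sketch below previews the basic multicast opportunity detailed in Example 2.1 of Section 2.3: when each of two receiving nodes already holds the intermediate value the other one needs, a single XOR-ed packet serves both. The byte-string "intermediate values" are placeholders for this illustration.

```python
# One coded multicast serving two nodes at once (previewing Example 2.1):
# Node 1 has mapped both files and sends one XOR-ed packet; Nodes 2 and 3
# each cancel the value they computed locally to recover the one they need.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

v_green_file1 = b"green intermediate value of file 1"  # needed by Node 2; Node 3 mapped File 1
v_blue_file3  = b"blue  intermediate value of file 3"  # needed by Node 3; Node 2 mapped File 3

coded_packet = xor(v_green_file1, v_blue_file3)        # single multicast from Node 1

recovered_at_node2 = xor(coded_packet, v_blue_file3)   # Node 2 cancels what it knows
recovered_at_node3 = xor(coded_packet, v_green_file1)  # Node 3 cancels what it knows

assert recovered_at_node2 == v_green_file1
assert recovered_at_node3 == v_blue_file3
```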
2.1 Problem Formulation

In this section, we formulate a general distributed computing framework motivated by MapReduce, and define the function characterizing the tradeoff between computation and communication.

We consider the problem of computing Q arbitrary output functions from N input files using a cluster of K distributed computing nodes (servers), for some positive integers Q, N, K ∈ N, with N ≥ K. More specifically, given N input files w_1,...,w_N ∈ F_{2^F}, for some F ∈ N, the goal is to compute Q output functions φ_1,...,φ_Q, where φ_q : (F_{2^F})^N → F_{2^B}, q ∈ {1,...,Q}, maps all input files to a length-B binary stream u_q = φ_q(w_1,...,w_N) ∈ F_{2^B}, for some B ∈ N.

Motivated by MapReduce, we assume that, as illustrated in Fig. 2.2, the computation of the output function φ_q, q ∈ {1,...,Q}, can be decomposed as follows:

  φ_q(w_1,...,w_N) = h_q(g_{q,1}(w_1),...,g_{q,N}(w_N)),  (2.1)

where

• The "Map" functions g_n = (g_{1,n},...,g_{Q,n}) : F_{2^F} → (F_{2^T})^Q, n ∈ {1,...,N}, map the input file w_n into Q length-T intermediate values v_{q,n} = g_{q,n}(w_n) ∈ F_{2^T}, q ∈ {1,...,Q}, for some T ∈ N. (Footnote: When mapping a file, we compute Q intermediate values in parallel, one for each of the Q output functions. The main reason for doing this is that parallel processing can be performed efficiently for applications that fit into the MapReduce framework. In other words, mapping a file according to one function is only marginally more expensive than mapping according to all functions. For example, for the canonical Word Count job, while we are scanning a document to count the number of appearances of one word, we can simultaneously count the numbers of appearances of other words with marginally increased computation cost.)

• The "Reduce" functions h_q : (F_{2^T})^N → F_{2^B}, q ∈ {1,...,Q}, map the intermediate values of the output function φ_q from all input files into the output value u_q = h_q(v_{q,1},...,v_{q,N}).

Figure 2.2: Illustration of a two-stage distributed computing framework. The overall computation is decomposed into computing a set of Map and Reduce functions.

Remark 2.1. Note that for every set of output functions φ_1,...,φ_Q, such a Map-Reduce decomposition exists (e.g., setting the g_{q,n}'s to identity functions such that g_{q,n}(w_n) = w_n for all n = 1,...,N, and h_q to φ_q in (2.1)). However, such a decomposition is not unique, and in the distributed computing literature there has been quite some work on developing appropriate decompositions of computations like join, sorting, and matrix multiplication (see, e.g., [4, 15]), so that they can be performed efficiently in a distributed manner. Here we do not impose any constraint on how the Map and Reduce functions are chosen (for example, they can be arbitrary linear or non-linear functions).

The above computation is carried out by K distributed computing nodes, labelled Node 1,..., Node K. They are interconnected through a multicast network. Following the above decomposition, the computation proceeds in three phases: Map, Shuffle, and Reduce.

Map Phase: Node k, k ∈ {1,...,K}, computes the Map functions of a set of files M_k, which are stored on Node k, for some design parameter M_k ⊆ {w_1,...,w_N}. For each file w_n in M_k, Node k computes g_n(w_n) = (v_{1,n},...,v_{Q,n}). We assume that each file is mapped by at least one node, i.e., ∪_{k=1,...,K} M_k = {w_1,...,w_N}.

Definition 2.1 (Computation Load).
We define the computation load, denoted by r, 1 ≤ r ≤ K, as the total number of Map functions computed across the K nodes, normalized by the number of files N, i.e.,

  r ≜ (∑_{k=1}^{K} |M_k|) / N.

The computation load r can be interpreted as the average number of nodes that map each file.

Shuffle Phase: Node k, k ∈ {1,...,K}, is responsible for computing a subset of output functions, whose indices are denoted by a set W_k ⊆ {1,...,Q}. We focus on the case Q/K ∈ N, and utilize a symmetric task assignment across the K nodes to maintain load balance. More precisely, we require 1) |W_1| = ··· = |W_K| = Q/K, and 2) W_j ∩ W_k = ∅ for all j ≠ k.

Remark 2.2. Beyond the symmetric task assignment considered in this work, characterizing the optimal computation-communication tradeoff allowing general asymmetric task assignments is a challenging open problem. As a first step towards this problem, in our follow-up work [54], in which the number of output functions Q is fixed and the computing resources are abundant (e.g., number of computing nodes K ≫ Q), we have shown that asymmetric task assignments can do better than symmetric ones and achieve the optimal run-time performance.

To compute the output value u_q for some q ∈ W_k, Node k needs the intermediate values that are not computed locally in the Map phase, i.e., {v_{q,n} : q ∈ W_k, w_n ∉ M_k}. After Node k, k ∈ {1,...,K}, has finished mapping all the files in M_k, the K nodes proceed to exchange the needed intermediate values. In particular, each node k, k ∈ {1,...,K}, creates an input symbol X_k ∈ F_{2^{ℓ_k}}, for some ℓ_k ∈ N, as a function of the intermediate values computed locally during the Map phase, i.e., for some encoding function ψ_k : (F_{2^T})^{Q|M_k|} → F_{2^{ℓ_k}} at Node k, we have

  X_k = ψ_k({g_n : w_n ∈ M_k}).  (2.2)

Having generated the message X_k, Node k multicasts it to all other nodes. By the end of the Shuffle phase, each of the K nodes receives X_1,...,X_K free of error.

Definition 2.2 (Communication Load). We define the communication load, denoted by L, 0 ≤ L ≤ 1, as

  L ≜ (ℓ_1 + ··· + ℓ_K) / (QNT).

That is, L represents the (normalized) total number of bits communicated by the K nodes during the Shuffle phase. (Footnote: For notational convenience, we define all variables in binary extension fields. However, one can consider arbitrary field sizes. For example, we can consider all intermediate values v_{q,n}, q = 1,...,Q, n = 1,...,N, to be in the field F_{p^T}, for some prime number p and positive integer T, and the symbol communicated by Node k (i.e., X_k) to be in the field F_{s^{ℓ_k}}, for some prime number s and positive integer ℓ_k, for all k = 1,...,K. In this case, the communication load can be defined as L ≜ ((ℓ_1 + ··· + ℓ_K) log s) / (QNT log p).)

Reduce Phase: Node k, k ∈ {1,...,K}, uses the messages X_1,...,X_K communicated in the Shuffle phase, and the local results from the Map phase {g_n : w_n ∈ M_k}, to construct the inputs to the corresponding Reduce functions of W_k. That is, for each q ∈ W_k and some decoding function χ^q_k : F_{2^{ℓ_1}} × ··· × F_{2^{ℓ_K}} × (F_{2^T})^{Q|M_k|} → (F_{2^T})^N, Node k computes

  (v_{q,1},...,v_{q,N}) = χ^q_k(X_1,...,X_K, {g_n : w_n ∈ M_k}).  (2.3)

Finally, Node k, k ∈ {1,...,K}, computes the Reduce function u_q = h_q(v_{q,1},...,v_{q,N}) for all q ∈ W_k.
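As a concrete instance of the decomposition in (2.1), the Word Count job mentioned in the footnote above can be written as follows. This is a minimal sketch: the file contents and the word list are placeholder data, with g_{q,n} counting the occurrences of word q in file n and h_q summing the per-file counts.

```python
# Word Count as a Map-Reduce decomposition of (2.1): the q-th output function
# phi_q counts the total occurrences of words[q] across all N files.
files = {1: "coded computing trades computation for communication",
         2: "coded shuffling reduces communication",
         3: "computation and communication tradeoff"}
words = ["coded", "communication", "computation"]   # Q = 3 output functions

def map_fn(w_n: str) -> list[int]:
    """g_n = (g_{1,n}, ..., g_{Q,n}): one intermediate value per output function."""
    return [w_n.split().count(word) for word in words]

def reduce_fn(intermediate: list[int]) -> int:
    """h_q: combines the intermediate values of function q from all N files."""
    return sum(intermediate)

# Map phase: each file n produces Q intermediate values v_{q,n}.
v = {n: map_fn(w_n) for n, w_n in files.items()}
# Reduce phase: u_q = h_q(v_{q,1}, ..., v_{q,N}).
u = [reduce_fn([v[n][q] for n in files]) for q in range(len(words))]
print(dict(zip(words, u)))   # {'coded': 2, 'communication': 3, 'computation': 2}
```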
We say that a computation-communication pair (r, L) ∈ R^2 is feasible if, for any δ > 0 and sufficiently large N, there exist M_1,...,M_K, W_1,...,W_K, a set of encoding functions {ψ_k}_{k=1}^K, and a set of decoding functions {χ^q_k : q ∈ W_k}_{k=1}^K that achieve a computation-communication pair (r̃, L̃) ∈ Q^2 such that |r − r̃| ≤ δ, |L − L̃| ≤ δ, and Node k can successfully compute all the output functions whose indices are in W_k, for all k ∈ {1,...,K}.

Definition 2.3. We define the computation-communication function of the distributed computing framework as

  L^*(r) ≜ inf{L : (r, L) is feasible}.  (2.4)

L^*(r) characterizes the optimal tradeoff between computation and communication in this framework.

Example (Uncoded Scheme). In the Shuffle phase of a simple "uncoded" scheme, each node receives the needed intermediate values sent uncoded by some other nodes. Since a total of QN intermediate values are needed across the K nodes, and rN · Q/K = rQN/K of them are already available after the Map phase, the communication load achieved by the uncoded scheme is

  L_uncoded(r) = 1 − r/K.  (2.5)

Remark 2.3. After the Map phase, each node knows the intermediate values of all Q output functions for the files it has mapped. Therefore, for a fixed file assignment and any symmetric assignment of the Reduce functions, specified by W_1,...,W_K, we can satisfy the data requirements using the same data shuffling scheme up to a relabelling of the Reduce functions. In other words, the communication load is independent of the assignment of the Reduce functions.

In this chapter, we also consider a generalization of the above framework, which we call the "cascaded distributed computing framework", where after the Map phase each Reduce function is computed by more than one node, specifically by s nodes, for some s ∈ {1,...,K}. This generalized model is motivated by the fact that many distributed computing jobs require multiple rounds of Map and Reduce computations, where the Reduce results of the previous round serve as the inputs to the Map functions of the next round. Computing each Reduce function at more than one node admits data redundancy for the subsequent Map-function computations, which can help to improve the fault-tolerance and reduce the communication load of the next-round data shuffling. We focus on the case Q/C(K,s) ∈ N, where C(n,k) denotes the binomial coefficient, and enforce a symmetric assignment of the Reduce tasks to maintain load balance. In particular, we require that every subset of s nodes computes a disjoint subset of Q/C(K,s) Reduce functions. A feasible computation-communication triple (r, s, L) ∈ R × N × R is defined similarly as before. We define the computation-communication function of the cascaded distributed computing framework as

  L^*(r, s) ≜ inf{L : (r, s, L) is feasible}.  (2.6)

2.2 Main Results

Theorem 2.1. The computation-communication function of the distributed computing framework, L^*(r), is given by

  L^*(r) = L_coded(r) ≜ (1/r) · (1 − r/K),  r ∈ {1,...,K},  (2.7)

for sufficiently large T. For general 1 ≤ r ≤ K, L^*(r) is the lower convex envelope of the points {(r, (1/r)·(1 − r/K)) : r ∈ {1,...,K}}.

We prove the achievability of Theorem 2.1 by proposing a coded scheme, named Coded Distributed Computing, in Section 2.4. We demonstrate that no other scheme can achieve a communication load smaller than the lower convex envelope of the points {(r, (1/r)·(1 − r/K)) : r ∈ {1,...,K}} by proving the converse in Section 2.5.

Remark 2.4.
Remark 2.4. Theorem 2.1 exactly characterizes the optimal tradeoff between the computation load and the communication load in the considered distributed computing framework.

Remark 2.5. For $r \in \{1,\ldots,K\}$, the communication load achieved in Theorem 2.1 is smaller than that of the uncoded scheme in (2.5) by a multiplicative factor of $r$, which equals the computation load and can grow unboundedly as the number of nodes $K$ increases if, e.g., $r = \Theta(K)$. As illustrated in Fig. 2.1, while the communication load of the uncoded scheme decreases linearly as the computation load increases, $L_{\text{coded}}(r)$ achieved in Theorem 2.1 is inversely proportional to the computation load.

Remark 2.6. While increasing the computation load $r$ causes a longer Map phase, the coded achievable scheme of Theorem 2.1 maximizes the reduction of the communication load obtained from the extra computations. Therefore, Theorem 2.1 provides an analytical framework for optimally trading computation power in the Map phase for bandwidth in the Shuffle phase, which helps to minimize the overall execution time of applications whose performance is limited by data shuffling.

Theorem 2.2. The computation-communication function of the cascaded distributed computing framework is characterized by
$$L^*(r,s) = L_{\text{coded}}(r,s) \triangleq \sum_{\ell=\max\{r+1,s\}}^{\min\{r+s,K\}} \frac{\ell \binom{K}{\ell} \binom{\ell-2}{r-1} \binom{r}{\ell-s}}{r \binom{K}{r} \binom{K}{s}}, \quad r \in \{1,\ldots,K\}, \quad (2.8)$$
for any $s \in \{1,\ldots,K\}$ and sufficiently large $T$. For general $1 \le r \le K$, $L^*(r,s)$ is the lower convex envelope of the points $\{(r, L_{\text{coded}}(r,s)) : r \in \{1,\ldots,K\}\}$.

We present the Coded Distributed Computing scheme that achieves the computation-communication function in Theorem 2.2 in Section 2.4, and the converse of Theorem 2.2 in Appendix A.

Remark 2.7. A preliminary part of this result, in particular the achievability for the special case of $s=1$ (i.e., the achievable scheme of Theorem 2.1), was presented in [42]. We note that when $s=1$, Theorem 2.2 recovers Theorem 2.1, i.e., $L^*(r,1) = \frac{1}{r}\cdot(1-\frac{r}{K})$ for $r \in \{1,\ldots,K\}$.

Remark 2.8. For any fixed $s \in \{1,\ldots,K\}$ (the number of nodes that compute each Reduce function), as illustrated in Fig. 2.3, the communication load achieved in Theorem 2.2 outperforms the linear tradeoff between computation and communication, i.e., it decreases superlinearly with respect to the computation load $r$.

[Figure 2.3: Minimum communication load $L^*(r,s) = L_{\text{coded}}(r,s)$ in Theorem 2.2, plotted versus the computation load $r$ for $s=1,2,3$, with $Q=360$ output functions, $N=2520$ input files, and $K=10$ computing nodes.]

Before describing the general achievability scheme for the cascaded distributed computing framework (which includes the basic framework as the special case $s=1$), we first illustrate the key ideas of the proposed Coded Distributed Computing scheme through two examples in the next section, for the cases $s=1$ and $s>1$ respectively.

2.3 Illustrative Examples: Coded Distributed Computing

In this section, we present two illustrative examples of the proposed achievable scheme for Theorem 2.1 and Theorem 2.2, which we call Coded Distributed Computing (CDC), for the cases of $s=1$ (Theorem 2.1) and $s>1$ (Theorem 2.2) respectively.
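Before turning to the examples, a short numeric check of the load expression in Theorem 2.2 may be helpful. The sketch below (our own helper, not from the dissertation) evaluates (2.8) and confirms that setting $s=1$ recovers the load of Theorem 2.1; it also prints a few of the $s=2$ points plotted in Fig. 2.3.

```python
from fractions import Fraction
from math import comb

def L_coded(r, s, K):
    """Evaluate the load in Theorem 2.2, eq. (2.8), for integer r, s in {1, ..., K}."""
    return sum(Fraction(l * comb(K, l) * comb(l - 2, r - 1) * comb(r, l - s),
                        r * comb(K, r) * comb(K, s))
               for l in range(max(r + 1, s), min(r + s, K) + 1))

K = 10
# For s = 1 the expression collapses to the load of Theorem 2.1.
for r in range(1, K):
    assert L_coded(r, 1, K) == Fraction(K - r, K * r)
# A few of the s = 2 points of Fig. 2.3.
print([float(L_coded(r, 2, K)) for r in range(1, 6)])
```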
Example 2.1 (CDC for $s=1$). We consider a MapReduce-type problem in Fig. 2.4 for distributed computing of $Q=3$ output functions, represented by red/circle, green/square, and blue/triangle respectively, from $N=6$ input files, using $K=3$ computing nodes. Nodes 1, 2, and 3 are respectively responsible for the final reduction of the red/circle, green/square, and blue/triangle output functions.

Let us first consider the case where no redundancy is imposed on the computations, i.e., each file is mapped once and the computation load is $r=1$. As shown in Fig. 2.4(a), Node $k$ maps File $2k-1$ and File $2k$ for $k=1,2,3$. In this case, each node maps 2 input files locally, computing all three intermediate values needed for the three output functions from each mapped file. In Fig. 2.4, we represent, for example, the intermediate value of the red/circle function in File $n$ by a red circle labelled $n$, for all $n=1,\ldots,6$; similar representations apply to the green/square and blue/triangle functions. After the Map phase, each node obtains 2 of the 6 intermediate values required to reduce the output function it is responsible for (e.g., Node 1 knows the red circles in Files 1 and 2). Hence, each node needs 4 intermediate values from the other nodes, yielding a communication load of $\frac{4 \times 3}{3 \times 6} = \frac{2}{3}$.

Now, we demonstrate how the proposed CDC scheme trades the computation load to slash the communication load via in-network coding. As shown in Fig. 2.4(b), we double the computation load such that each file is now mapped on two nodes ($r=2$). It is apparent that, since more local computations are performed, each node now only requires 2 other intermediate values, and an uncoded shuffling scheme would achieve a communication load of $\frac{2 \times 3}{3 \times 6} = \frac{1}{3}$. However, we can do much better with coding. As shown in Fig. 2.4(b), instead of unicasting individual intermediate values, every node multicasts the bit-wise XOR, denoted by $\oplus$, of 2 locally computed intermediate values to the other two nodes, simultaneously satisfying their data demands. For example, knowing the blue/triangle in File 3, Node 2 can cancel it from the coded packet sent by Node 1, recovering the needed green/square in File 1. Therefore, this coding incurs a communication load of $\frac{3}{3 \times 6} = \frac{1}{6}$, achieving a $2\times$ gain over uncoded shuffling.

[Figure 2.4: Illustrations of (a) the conventional uncoded distributed computing scheme with computation load $r=1$, and (b) the proposed Coded Distributed Computing scheme with computation load $r=2$, for computing $Q=3$ functions from $N=6$ inputs on $K=3$ nodes.]

From the above example, we see that for the case of $s=1$, i.e., when each of the $Q$ output functions is computed on one node and the computations of the Reduce functions are symmetrically distributed across nodes, the proposed CDC scheme only requires bit-wise XOR as the encoding and decoding operations. However, for the case of $s>1$, as we will show in the following example, the proposed CDC scheme requires computing linear combinations of the intermediate values during the encoding process.
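The following toy Python re-enactment of the $r=2$ exchange may help make the XOR decoding concrete. It is our own sketch under one consistent file placement and packet choice (the file labels may differ from Fig. 2.4); the intermediate values are placeholder bytes.

```python
# File placement with r = 2: each file is mapped by exactly two nodes.
maps = {1: {1, 2, 3, 4}, 2: {1, 2, 5, 6}, 3: {3, 4, 5, 6}}   # files mapped by each node
reduces = {1: "red", 2: "green", 3: "blue"}                   # function reduced by each node

def iv(func, n):
    """Stand-in for the intermediate value v_{func, n} (a toy byte)."""
    return hash((func, n)) & 0xFF

# Each node XORs two locally computed values, each needed by one other node.
packets = {
    1: [("green", 3), ("blue", 1)],
    2: [("red", 5), ("blue", 2)],
    3: [("red", 6), ("green", 4)],
}

recovered = {k: {} for k in maps}
for sender, [(f1, n1), (f2, n2)] in packets.items():
    coded = iv(f1, n1) ^ iv(f2, n2)
    for rx in set(maps) - {sender}:
        # The receiver keeps the component of its own function and cancels the
        # other one, which it already computed locally in the Map phase.
        (wf, wn), (kf, kn) = ((f1, n1), (f2, n2)) if f1 == reduces[rx] else ((f2, n2), (f1, n1))
        assert kn in maps[rx]                      # side information is indeed local
        recovered[rx][(wf, wn)] = coded ^ iv(kf, kn)
        assert recovered[rx][(wf, wn)] == iv(wf, wn)

print({k: sorted(v) for k, v in recovered.items()})
```

With the three multicasts above, each node obtains exactly the two missing intermediate values of its own function, matching the load of $\frac{1}{6}$ computed in the example.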
Example 2.2 (CDC for $s>1$). In this example, we consider a job of computing $Q=6$ output functions from $N=6$ input files, using $K=4$ nodes. We focus on the case where the computation load is $r=2$ and each Reduce function is computed by $s=2$ nodes.

In the Map phase, each file is mapped by $r=2$ nodes. As shown in Fig. 2.5, the sets of files mapped by the 4 nodes are $\mathcal{M}_1 = \{w_1,w_2,w_3\}$, $\mathcal{M}_2 = \{w_1,w_4,w_5\}$, $\mathcal{M}_3 = \{w_2,w_4,w_6\}$, and $\mathcal{M}_4 = \{w_3,w_5,w_6\}$. After the Map phase, Node $k$, $k \in \{1,2,3,4\}$, knows the intermediate values of all $Q=6$ output functions in the files in $\mathcal{M}_k$, i.e., $\{v_{q,n} : q \in \{1,\ldots,6\}, w_n \in \mathcal{M}_k\}$.

In the Reduce phase, we assign the computations of the Reduce functions in a symmetric manner such that every subset of $s=2$ nodes computes a common Reduce function. More specifically, as shown in Fig. 2.5, the sets of indices of the Reduce functions computed by the 4 nodes are $\mathcal{W}_1 = \{1,2,3\}$, $\mathcal{W}_2 = \{1,4,5\}$, $\mathcal{W}_3 = \{2,4,6\}$, and $\mathcal{W}_4 = \{3,5,6\}$. Therefore, for example, Node 1 still needs the intermediate values $\{v_{q,n} : q \in \{1,2,3\}, n \in \{4,5,6\}\}$ through data shuffling in order to compute its assigned Reduce functions $h_1$, $h_2$, $h_3$.

[Figure 2.5: Illustration of the CDC scheme to compute $Q=6$ output functions from $N=6$ input files distributedly on $K=4$ computing nodes. Each file is mapped by $r=2$ nodes and each output function is computed by $s=2$ nodes. After the Map phase, every node knows 6 intermediate values, one for each output function, in every file it has mapped. The Shuffle phase proceeds in two rounds. In the first round, each node multicasts bit-wise XORs of intermediate values to subsets of two nodes. In the second round, each node splits an intermediate value $v_{q,n}$ evenly into two segments $v_{q,n} = (v_{q,n}^{(1)}, v_{q,n}^{(2)})$, and multicasts two linear combinations of the segments, constructed using coefficients $\alpha_1$, $\alpha_2$, and $\alpha_3$, to the other three nodes.]

The data shuffling process consists of two rounds of communication over the multicast network. In the first round, intermediate values are communicated within each subset of 3 nodes. In the second round, intermediate values are communicated within the set of all 4 nodes. In what follows, we describe these two rounds of communication respectively.

Round 1: Subsets of 3 nodes. We first consider the subset $\{1,2,3\}$. During the data shuffling, each node whose index is in $\{1,2,3\}$ multicasts a bit-wise XOR of two locally computed intermediate values to the other two nodes:

• Node 1 multicasts $v_{1,2} \oplus v_{2,1}$ to Node 2 and Node 3,
• Node 2 multicasts $v_{4,1} \oplus v_{1,4}$ to Node 1 and Node 3,
• Node 3 multicasts $v_{4,2} \oplus v_{2,4}$ to Node 1 and Node 2.

Since Node 2 knows $v_{2,1}$ and Node 3 knows $v_{1,2}$ locally, they can respectively decode $v_{1,2}$ and $v_{2,1}$ from the coded message $v_{1,2} \oplus v_{2,1}$. We employ a similar coded shuffling scheme in the other 3 subsets of 3 nodes. After the first round of shuffling,

• Node 1 recovers $(v_{1,4}, v_{1,5})$, $(v_{2,4}, v_{2,6})$, and $(v_{3,5}, v_{3,6})$,
• Node 2 recovers $(v_{1,2}, v_{1,3})$, $(v_{4,2}, v_{4,6})$, and $(v_{5,3}, v_{5,6})$,
• Node 3 recovers $(v_{2,1}, v_{2,3})$, $(v_{4,1}, v_{4,5})$, and $(v_{6,3}, v_{6,5})$,
• Node 4 recovers $(v_{3,1}, v_{3,2})$, $(v_{5,1}, v_{5,4})$, and $(v_{6,2}, v_{6,4})$.

Round 2: All 4 nodes.
We first split each of the intermediate values $v_{6,1}$, $v_{5,2}$, $v_{4,3}$, $v_{3,4}$, $v_{2,5}$, and $v_{1,6}$ into two equal-sized segments, each containing $T/2$ bits, denoted by $v_{q,n}^{(1)}$ and $v_{q,n}^{(2)}$ for an intermediate value $v_{q,n}$. Then, for some coefficients $\alpha_1, \alpha_2, \alpha_3 \in \mathbb{F}_{2^{T/2}}$, Node 1 multicasts the following two linear combinations of three locally computed segments to the other three nodes:
$$v_{4,3}^{(1)} + v_{5,2}^{(1)} + v_{6,1}^{(1)}, \quad (2.9)$$
$$\alpha_1 v_{4,3}^{(1)} + \alpha_2 v_{5,2}^{(1)} + \alpha_3 v_{6,1}^{(1)}. \quad (2.10)$$
Similarly, as shown in Fig. 2.5, each of Node 2, Node 3, and Node 4 multicasts two linear combinations of three locally computed segments to the other three nodes, using the same coefficients $\alpha_1$, $\alpha_2$, and $\alpha_3$.

Having received the above two linear combinations, each of Node 2, Node 3, and Node 4 first subtracts out the one segment it has available locally: $v_{6,1}^{(1)}$ for Node 2, $v_{5,2}^{(1)}$ for Node 3, and $v_{4,3}^{(1)}$ for Node 4. After the subtraction, each of these three nodes recovers the required segments from the two linear combinations. More specifically, Node 2 recovers $v_{4,3}^{(1)}$ and $v_{5,2}^{(1)}$, Node 3 recovers $v_{4,3}^{(1)}$ and $v_{6,1}^{(1)}$, and Node 4 recovers $v_{5,2}^{(1)}$ and $v_{6,1}^{(1)}$. It is not difficult to see that this decoding process is guaranteed to succeed if $\alpha_1$, $\alpha_2$, and $\alpha_3$ are all distinct, which requires a field size $2^{T/2} \ge 3$ (e.g., $T=4$). Following a similar procedure, each node recovers the required segments from the linear combinations multicast by the other three nodes. More specifically, after the second round of data shuffling,

• Node 1 recovers $v_{1,6}$, $v_{2,5}$, and $v_{3,4}$,
• Node 2 recovers $v_{1,6}$, $v_{4,3}$, and $v_{5,2}$,
• Node 3 recovers $v_{2,5}$, $v_{4,3}$, and $v_{6,1}$,
• Node 4 recovers $v_{3,4}$, $v_{5,2}$, and $v_{6,1}$.

We finally note that in the second round of data shuffling, each linear combination multicast by a node is simultaneously useful for the remaining three nodes.

2.4 General Achievable Scheme: Coded Distributed Computing

In this section, we formally prove the upper bounds in Theorems 2.1 and 2.2 by presenting and analyzing the Coded Distributed Computing (CDC) scheme. We focus on the more general case considered in Theorem 2.2 with $s \ge 1$; the scheme for Theorem 2.1 follows by setting $s=1$. We first consider integer-valued computation loads $r \in \{1,\ldots,K\}$, and then generalize the CDC scheme to any $1 \le r \le K$. When $r=K$, every node can map all the input files and compute all the output functions locally, so no communication is needed and $L^*(K,s)=0$ for all $s \in \{1,\ldots,K\}$. In what follows, we focus on the case $r<K$.

We consider a sufficiently large number of input files $N$, with $\binom{K}{r}(\eta_1 - 1) < N \le \binom{K}{r}\eta_1$ for some $\eta_1 \in \mathbb{N}$. We first inject $\binom{K}{r}\eta_1 - N$ empty files into the system to obtain a total of $\bar{N} = \binom{K}{r}\eta_1$ files, which is now a multiple of $\binom{K}{r}$. We note that $\lim_{N \to \infty} \frac{\bar{N}}{N} = 1$. Next, we present the achievable scheme for a system with $\bar{N}$ input files $w_1,\ldots,w_{\bar{N}}$.

2.4.1 Map Phase Design

In the Map phase, the $\bar{N}$ input files are evenly partitioned into $\binom{K}{r}$ disjoint batches of size $\eta_1$, each corresponding to a subset $\mathcal{T} \subset \{1,\ldots,K\}$ of size $r$, i.e.,
$$\{w_1,\ldots,w_{\bar{N}}\} = \bigcup_{\mathcal{T} \subset \{1,\ldots,K\}, |\mathcal{T}|=r} \mathcal{B}_{\mathcal{T}}, \quad (2.11)$$
where $\mathcal{B}_{\mathcal{T}}$ denotes the batch of $\eta_1$ files corresponding to the subset $\mathcal{T}$.
Given this partition, Node $k$, $k \in \{1,\ldots,K\}$, computes the Map functions of the files in $\mathcal{B}_{\mathcal{T}}$ if $k \in \mathcal{T}$, or equivalently, $\mathcal{B}_{\mathcal{T}} \subseteq \mathcal{M}_k$ if $k \in \mathcal{T}$. Since each node belongs to $\binom{K-1}{r-1}$ subsets of size $r$, each node computes $\binom{K-1}{r-1}\eta_1 = \frac{r\bar{N}}{K}$ Map functions, i.e., $|\mathcal{M}_k| = \frac{r\bar{N}}{K}$ for all $k \in \{1,\ldots,K\}$. After the Map phase, Node $k$, $k \in \{1,\ldots,K\}$, knows the intermediate values of all $Q$ output functions in the files in $\mathcal{M}_k$, i.e., $\{v_{q,n} : q \in \{1,\ldots,Q\}, w_n \in \mathcal{M}_k\}$.

2.4.2 Coded Data Shuffling

We recall that we focus on the case where the number of output functions $Q$ satisfies $\frac{Q}{\binom{K}{s}} \in \mathbb{N}$, and enforce a symmetric assignment of the Reduce functions such that every subset of $s$ nodes reduces $\frac{Q}{\binom{K}{s}}$ functions. Specifically, $Q = \binom{K}{s}\eta_2$ for some $\eta_2 \in \mathbb{N}$, and the computations of the Reduce functions are assigned symmetrically across the $K$ nodes as follows. First, the $Q$ Reduce functions are evenly partitioned into $\binom{K}{s}$ disjoint batches of size $\eta_2$, each corresponding to a unique subset $\mathcal{P}$ of $s$ nodes, i.e.,
$$\{1,\ldots,Q\} = \bigcup_{\mathcal{P} \subseteq \{1,\ldots,K\}, |\mathcal{P}|=s} \mathcal{D}_{\mathcal{P}}, \quad (2.12)$$
where $\mathcal{D}_{\mathcal{P}}$ denotes the indices of the batch of $\eta_2$ Reduce functions corresponding to the subset $\mathcal{P}$. Given this partition, Node $k$, $k \in \{1,\ldots,K\}$, computes the Reduce functions whose indices are in $\mathcal{D}_{\mathcal{P}}$ if $k \in \mathcal{P}$, or equivalently, $\mathcal{D}_{\mathcal{P}} \subseteq \mathcal{W}_k$ if $k \in \mathcal{P}$. As a result, each node computes $\binom{K-1}{s-1}\eta_2 = \frac{sQ}{K}$ Reduce functions, i.e., $|\mathcal{W}_k| = \frac{sQ}{K}$ for all $k \in \{1,\ldots,K\}$.

For a subset $\mathcal{S}$ of $\{1,\ldots,K\}$ and $\mathcal{S}_1 \subset \mathcal{S}$ with $|\mathcal{S}_1| = r$, we denote by $\mathcal{V}_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1}$ the set of intermediate values needed by all nodes in $\mathcal{S}\setminus\mathcal{S}_1$, needed by no node outside $\mathcal{S}$, and known exclusively by the nodes in $\mathcal{S}_1$. More formally,
$$\mathcal{V}_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1} \triangleq \{v_{q,n} : q \in \cap_{k \in \mathcal{S}\setminus\mathcal{S}_1} \mathcal{W}_k,\ q \notin \cup_{k \notin \mathcal{S}} \mathcal{W}_k,\ w_n \in \cap_{k \in \mathcal{S}_1} \mathcal{M}_k,\ w_n \notin \cup_{k \notin \mathcal{S}_1} \mathcal{M}_k\}. \quad (2.13)$$

We observe that the set $\mathcal{V}_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1}$ defined above contains the intermediate values of $\binom{r}{|\mathcal{S}|-s}\eta_2$ output functions. This is because the output functions whose intermediate values are included in $\mathcal{V}_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1}$ must be computed exclusively by the nodes in $\mathcal{S}\setminus\mathcal{S}_1$ and a subset of $s-(|\mathcal{S}|-r)$ nodes in $\mathcal{S}_1$; therefore, $\mathcal{V}_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1}$ contains the intermediate values of a total of $\binom{r}{s-(|\mathcal{S}|-r)}\eta_2 = \binom{r}{|\mathcal{S}|-s}\eta_2$ output functions. Since every subset of $r$ nodes maps a unique batch of $\eta_1$ files, $\mathcal{V}_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1}$ contains $|\mathcal{V}_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1}| = \binom{r}{|\mathcal{S}|-s}\eta_1\eta_2$ intermediate values.

Next, we concatenate all the intermediate values in $\mathcal{V}_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1}$ to construct a symbol $U_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1} \in \mathbb{F}_{2^{\binom{r}{|\mathcal{S}|-s}\eta_1\eta_2 T}}$. Then, for $\mathcal{S}_1 = \{\sigma_1,\ldots,\sigma_r\}$, we arbitrarily and evenly split $U_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1}$ into $r$ segments, each containing $\binom{r}{|\mathcal{S}|-s}\eta_1\eta_2 \frac{T}{r}$ bits, i.e.,
$$U_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1} = \left(U_{\mathcal{S}_1,\sigma_1}^{\mathcal{S}\setminus\mathcal{S}_1}, U_{\mathcal{S}_1,\sigma_2}^{\mathcal{S}\setminus\mathcal{S}_1}, \ldots, U_{\mathcal{S}_1,\sigma_r}^{\mathcal{S}\setminus\mathcal{S}_1}\right), \quad (2.14)$$
where $U_{\mathcal{S}_1,\sigma_i}^{\mathcal{S}\setminus\mathcal{S}_1} \in \mathbb{F}_{2^{\binom{r}{|\mathcal{S}|-s}\eta_1\eta_2 T/r}}$ denotes the segment associated with Node $\sigma_i \in \mathcal{S}_1$.

For each $k \in \mathcal{S}$, there are a total of $\binom{|\mathcal{S}|-1}{r-1}$ subsets of $\mathcal{S}$ of size $r$ that contain the element $k$. We index these subsets as $\mathcal{S}^{(k)}[1], \mathcal{S}^{(k)}[2], \ldots, \mathcal{S}^{(k)}[\binom{|\mathcal{S}|-1}{r-1}]$. Within a subset $\mathcal{S}^{(k)}[i]$, the segment associated with Node $k$ is $U_{\mathcal{S}^{(k)}[i],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[i]}$, for all $i = 1,\ldots,\binom{|\mathcal{S}|-1}{r-1}$. We note that each segment $U_{\mathcal{S}^{(k)}[i],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[i]}$, $i = 1,\ldots,\binom{|\mathcal{S}|-1}{r-1}$, is known by all nodes whose indices are in $\mathcal{S}^{(k)}[i]$, and needed by all nodes whose indices are in $\mathcal{S}\setminus\mathcal{S}^{(k)}[i]$.
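The symmetric placements in (2.11) and (2.12) are straightforward to generate programmatically. The sketch below (our own illustration, with hypothetical parameter values) builds the file batches $\mathcal{B}_{\mathcal{T}}$ and Reduce batches $\mathcal{D}_{\mathcal{P}}$ over size-$r$ and size-$s$ subsets, and checks the resulting per-node counts $|\mathcal{M}_k| = \frac{r\bar{N}}{K}$ and $|\mathcal{W}_k| = \frac{sQ}{K}$.

```python
from itertools import combinations
from math import comb

def place(K, r, s, eta1=1, eta2=1):
    """Assign file batches to size-r subsets (2.11) and Reduce batches
    to size-s subsets (2.12); return the resulting M_k and W_k."""
    nodes = range(1, K + 1)
    M = {k: set() for k in nodes}
    W = {k: set() for k in nodes}
    # Files: the i-th batch of eta1 files goes to every node of the i-th size-r subset.
    for i, T in enumerate(combinations(nodes, r)):
        batch = set(range(i * eta1, (i + 1) * eta1))
        for k in T:
            M[k] |= batch
    # Reduce functions: the j-th batch of eta2 functions goes to the j-th size-s subset.
    for j, P in enumerate(combinations(nodes, s)):
        batch = set(range(j * eta2, (j + 1) * eta2))
        for k in P:
            W[k] |= batch
    return M, W

K, r, s, eta1, eta2 = 6, 2, 2, 3, 2
M, W = place(K, r, s, eta1, eta2)
N_bar, Q = comb(K, r) * eta1, comb(K, s) * eta2
assert all(len(M[k]) == r * N_bar // K for k in M)   # |M_k| = r * N_bar / K
assert all(len(W[k]) == s * Q // K for k in W)       # |W_k| = s * Q / K
```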
2.4.2.1 Encoding

The shuffling scheme of CDC consists of multiple rounds, each corresponding to the collection of subsets of the $K$ nodes of a particular size. Within each subset, each node multicasts linear combinations of the segments associated with it to the other nodes in the subset. More specifically, for each subset $\mathcal{S} \subseteq \{1,\ldots,K\}$ of size $\max\{r+1,s\} \le |\mathcal{S}| \le \min\{r+s,K\}$, we define $n_1 \triangleq \binom{|\mathcal{S}|-1}{r-1}$ and $n_2 \triangleq \binom{|\mathcal{S}|-2}{r-1}$. Then, for each $k \in \mathcal{S}$, Node $k$ computes $n_2$ message symbols, denoted by $X_k^{\mathcal{S}}[1], X_k^{\mathcal{S}}[2], \ldots, X_k^{\mathcal{S}}[n_2]$, as follows. For some coefficients $\alpha_1,\ldots,\alpha_{n_1}$ with $\alpha_i \in \mathbb{F}_{2^{\binom{r}{|\mathcal{S}|-s}\eta_1\eta_2 T/r}}$ for all $i = 1,\ldots,n_1$, Node $k$ computes
$$X_k^{\mathcal{S}}[1] = U_{\mathcal{S}^{(k)}[1],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[1]} + U_{\mathcal{S}^{(k)}[2],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[2]} + \cdots + U_{\mathcal{S}^{(k)}[n_1],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[n_1]},$$
$$X_k^{\mathcal{S}}[2] = \alpha_1 U_{\mathcal{S}^{(k)}[1],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[1]} + \alpha_2 U_{\mathcal{S}^{(k)}[2],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[2]} + \cdots + \alpha_{n_1} U_{\mathcal{S}^{(k)}[n_1],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[n_1]},$$
$$\vdots$$
$$X_k^{\mathcal{S}}[n_2] = \alpha_1^{n_2-1} U_{\mathcal{S}^{(k)}[1],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[1]} + \alpha_2^{n_2-1} U_{\mathcal{S}^{(k)}[2],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[2]} + \cdots + \alpha_{n_1}^{n_2-1} U_{\mathcal{S}^{(k)}[n_1],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[n_1]}, \quad (2.15)$$
or equivalently,
$$\begin{bmatrix} X_k^{\mathcal{S}}[1] \\ X_k^{\mathcal{S}}[2] \\ \vdots \\ X_k^{\mathcal{S}}[n_2] \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1 \\ \alpha_1 & \alpha_2 & \cdots & \alpha_{n_1} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_1^{n_2-1} & \alpha_2^{n_2-1} & \cdots & \alpha_{n_1}^{n_2-1} \end{bmatrix}}_{\mathbf{A}^{\mathcal{S}}} \begin{bmatrix} U_{\mathcal{S}^{(k)}[1],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[1]} \\ U_{\mathcal{S}^{(k)}[2],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[2]} \\ \vdots \\ U_{\mathcal{S}^{(k)}[n_1],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[n_1]} \end{bmatrix}. \quad (2.16)$$
We note that the above encoding process is identical at all nodes whose indices are in $\mathcal{S}$, i.e., each of them multiplies the same matrix $\mathbf{A}^{\mathcal{S}}$ in (2.16) with the segments associated with it. Having generated the above message symbols, Node $k$ multicasts them to the other nodes whose indices are in $\mathcal{S}$.

Remark 2.9. When $s=1$, i.e., every output function is computed by one node, the above shuffling scheme only takes one round, over all subsets $\mathcal{S}$ of size $|\mathcal{S}| = r+1$. Instead of multicasting linear combinations, every node in $\mathcal{S}$ can simply multicast the bit-wise XOR of its associated segments to the other $r$ nodes in $\mathcal{S}$.

2.4.2.2 Decoding

For $j \in \mathcal{S}$ and $j \ne k$, there are a total of $\binom{|\mathcal{S}|-2}{r-2}$ subsets of $\mathcal{S}$ that have size $r$ and simultaneously contain $j$ and $k$. Hence, among the $n_1$ segments $U_{\mathcal{S}^{(k)}[1],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[1]}, \ldots, U_{\mathcal{S}^{(k)}[n_1],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[n_1]}$ associated with Node $k$, $\binom{|\mathcal{S}|-2}{r-2}$ of them are already known at Node $j$, and the remaining $n_1 - \binom{|\mathcal{S}|-2}{r-2} = \binom{|\mathcal{S}|-1}{r-1} - \binom{|\mathcal{S}|-2}{r-2} = \binom{|\mathcal{S}|-2}{r-1} = n_2$ segments are needed by Node $j$. We denote the indices of the subsets that contain the element $k$ but not the element $j$ by $b_{jk}^1, b_{jk}^2, \ldots, b_{jk}^{n_2}$, such that $1 \le b_{jk}^1 < b_{jk}^2 < \cdots < b_{jk}^{n_2} \le n_1$ and $j \notin \mathcal{S}^{(k)}[b_{jk}^i]$ for all $i = 1,2,\ldots,n_2$.

After receiving the symbols $X_k^{\mathcal{S}}[1], X_k^{\mathcal{S}}[2], \ldots, X_k^{\mathcal{S}}[n_2]$ from Node $k$, Node $j$ first removes the locally known segments from the linear combinations to generate $n_2$ symbols $Y_{jk}^{\mathcal{S}}[1], Y_{jk}^{\mathcal{S}}[2], \ldots, Y_{jk}^{\mathcal{S}}[n_2]$, such that
$$\begin{bmatrix} Y_{jk}^{\mathcal{S}}[1] \\ Y_{jk}^{\mathcal{S}}[2] \\ \vdots \\ Y_{jk}^{\mathcal{S}}[n_2] \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1 \\ \alpha_{b_{jk}^1} & \alpha_{b_{jk}^2} & \cdots & \alpha_{b_{jk}^{n_2}} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{b_{jk}^1}^{n_2-1} & \alpha_{b_{jk}^2}^{n_2-1} & \cdots & \alpha_{b_{jk}^{n_2}}^{n_2-1} \end{bmatrix}}_{\mathbf{B}_{jk}^{\mathcal{S}}} \begin{bmatrix} U_{\mathcal{S}^{(k)}[b_{jk}^1],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[b_{jk}^1]} \\ U_{\mathcal{S}^{(k)}[b_{jk}^2],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[b_{jk}^2]} \\ \vdots \\ U_{\mathcal{S}^{(k)}[b_{jk}^{n_2}],k}^{\mathcal{S}\setminus\mathcal{S}^{(k)}[b_{jk}^{n_2}]} \end{bmatrix}, \quad (2.17)$$
where $\mathbf{B}_{jk}^{\mathcal{S}}$ is the $n_2 \times n_2$ square sub-matrix of $\mathbf{A}^{\mathcal{S}}$ in (2.16) consisting of the columns of $\mathbf{A}^{\mathcal{S}}$ with indices $b_{jk}^1, b_{jk}^2, \ldots, b_{jk}^{n_2}$. Node $j$ can decode the desired segments from Node $k$ if the matrix $\mathbf{B}_{jk}^{\mathcal{S}}$ is invertible. We note that $\mathbf{B}_{jk}^{\mathcal{S}}$ is a Vandermonde matrix, and it is invertible if $\alpha_{b_{jk}^1}, \alpha_{b_{jk}^2}, \ldots, \alpha_{b_{jk}^{n_2}}$ are all distinct. This holds for all $j \in \mathcal{S}\setminus\{k\}$ if there exist $n_1$ distinct coefficients in $\mathbb{F}_{2^{\binom{r}{|\mathcal{S}|-s}\eta_1\eta_2 T/r}}$, which requires $2^{\binom{r}{|\mathcal{S}|-s}\eta_1\eta_2 T/r} \ge n_1 = \binom{|\mathcal{S}|-1}{r-1}$, or equivalently $T \ge \frac{r \log \binom{|\mathcal{S}|-1}{r-1}}{\binom{r}{|\mathcal{S}|-s}\eta_1\eta_2}$.

Finally, the proposed coded shuffling scheme can successfully deliver all the required intermediate values within all subsets $\mathcal{S}$ with $\max\{r+1,s\} \le |\mathcal{S}| \le \min\{r+s,K\}$, provided that $T$ is sufficiently large, i.e.,
$$T \ge \max_{\max\{r+1,s\} \le |\mathcal{S}| \le \min\{r+s,K\}} \frac{r \log \binom{|\mathcal{S}|-1}{r-1}}{\binom{r}{|\mathcal{S}|-s}\eta_1\eta_2}. \quad (2.18)$$
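The following toy round trip (our own sketch) illustrates the Vandermonde encode/decode step of (2.16)–(2.17) over a small prime field $\mathrm{GF}(p)$ instead of a binary extension field; the choices of $p$, $n_1$, $n_2$, and the segment values are illustrative, not from the dissertation.

```python
# A toy instance of the Vandermonde encoding (2.16) and decoding (2.17).
p = 257                       # field size (any prime > n1 works)
n1, n2 = 3, 2                 # e.g. |S| = 4, r = 2: n1 = C(3,1), n2 = C(2,1)
alphas = [1, 2, 3]            # n1 distinct coefficients
segments = [17, 101, 222]     # the n1 segments U associated with the sender

# Encoder: the n2 Vandermonde rows of A^S applied to the sender's segments.
coded = [sum(pow(a, row, p) * u for a, u in zip(alphas, segments)) % p
         for row in range(n2)]

# A receiver that already knows segments[2] removes it (columns b = {0, 1} remain),
# then inverts the 2x2 Vandermonde sub-matrix B to recover the other two segments.
known_idx, wanted_idx = [2], [0, 1]
y = [(c - sum(pow(alphas[i], row, p) * segments[i] for i in known_idx)) % p
     for row, c in enumerate(coded)]
a0, a1 = alphas[wanted_idx[0]], alphas[wanted_idx[1]]
det = (a1 - a0) % p                    # determinant of [[1, 1], [a0, a1]]
inv = pow(det, p - 2, p)               # field inverse via Fermat's little theorem
u0 = ((a1 * y[0] - y[1]) * inv) % p
u1 = ((y[1] - a0 * y[0]) * inv) % p
assert [u0, u1] == [segments[i] for i in wanted_idx]
print("recovered:", u0, u1)
```

The only property the decoding relies on is that the sub-matrix is Vandermonde with distinct coefficients, which is exactly the condition that leads to the field-size requirement in (2.18).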
2.4.3 Correctness of CDC

We demonstrate the correctness of the above shuffling scheme by showing that, after the Shuffle phase, each node can decode all of the required intermediate values to compute its assigned Reduce functions. We use Node 1 as an example; similar arguments apply to all other nodes. WLOG assume that the Reduce function $h_1$ is to be computed by Node 1. Node 1 needs a total of $\binom{K-1}{r}\eta_1$ distinct intermediate values of $h_1$ from other nodes (it already knows $\frac{r\bar{N}}{K} = \bar{N} - \binom{K-1}{r}\eta_1$ intermediate values of $h_1$ from mapping the files in $\mathcal{M}_1$). By the assignment of the Reduce functions, there exists a subset $\mathcal{S}_2$ of size $s$ containing Node 1 such that all nodes in $\mathcal{S}_2$ compute $h_1$. Then, during the data shuffling within each subset $\mathcal{S}$ containing $\mathcal{S}_2$ (note that by the definition of $\mathcal{V}_{\mathcal{S}_1}^{\mathcal{S}\setminus\mathcal{S}_1}$ in (2.13), the intermediate values of $h_1$ will not be communicated to Node 1 if $\mathcal{S}_2 \nsubseteq \mathcal{S}$, because some node outside $\mathcal{S}$ also wants to compute $h_1$), there are $\binom{s-1}{|\mathcal{S}|-r-1}$ subsets $\mathcal{S}_1$ of $\mathcal{S}$ with size $|\mathcal{S}_1|=r$ such that $1 \notin \mathcal{S}_1$ and $\mathcal{S}\setminus\mathcal{S}_1 \subseteq \mathcal{S}_2$, and thus Node 1 decodes $\binom{s-1}{|\mathcal{S}|-r-1}\eta_1$ distinct intermediate values of $h_1$. Therefore, the total number of distinct intermediate values of $h_1$ that Node 1 decodes over the entire Shuffle phase is
$$\sum_{\ell=\max\{r+1,s\}}^{\min\{r+s,K\}} \binom{s-1}{\ell-r-1} \binom{K-s}{\ell-s} \eta_1 = \binom{K-1}{r}\eta_1, \quad (2.19)$$
which matches the required number of intermediate values for $h_1$. The same holds for all the other Reduce functions assigned to Node 1.

2.4.4 Communication Load

In the above shuffling scheme, for each subset $\mathcal{S} \subseteq \{1,\ldots,K\}$ of size $\max\{r+1,s\} \le |\mathcal{S}| \le \min\{r+s,K\}$, each node $k \in \mathcal{S}$ communicates $n_2 = \binom{|\mathcal{S}|-2}{r-1}$ message symbols, each containing $\binom{r}{|\mathcal{S}|-s}\eta_1\eta_2\frac{T}{r}$ bits. Hence, the nodes whose indices are in $\mathcal{S}$ communicate a total of $|\mathcal{S}|\binom{|\mathcal{S}|-2}{r-1}\binom{r}{|\mathcal{S}|-s}\eta_1\eta_2\frac{T}{r}$ bits. The overall communication load achieved by the proposed CDC scheme is
$$L_{\text{coded}}(r,s) = \lim_{N\to\infty} \frac{\sum_{\ell=\max\{r+1,s\}}^{\min\{r+s,K\}} \binom{K}{\ell}\,\ell\,\binom{\ell-2}{r-1}\binom{r}{\ell-s}\frac{\eta_1\eta_2 T}{r}}{QNT} = \lim_{N\to\infty} \sum_{\ell=\max\{r+1,s\}}^{\min\{r+s,K\}} \frac{\ell \binom{K}{\ell}\binom{\ell-2}{r-1}\binom{r}{\ell-s}\bar{N}}{r\binom{K}{r}\binom{K}{s}N} = \sum_{\ell=\max\{r+1,s\}}^{\min\{r+s,K\}} \frac{\ell \binom{K}{\ell}\binom{\ell-2}{r-1}\binom{r}{\ell-s}}{r\binom{K}{r}\binom{K}{s}}. \quad (2.20)$$

2.4.5 Non-Integer Valued Computation Load

For non-integer valued computation loads $r \ge 1$, we generalize the CDC scheme as follows. We first expand the computation load $r = \alpha r_1 + (1-\alpha)r_2$ as a convex combination of $r_1 \triangleq \lfloor r \rfloor$ and $r_2 \triangleq \lceil r \rceil$, for some $0 \le \alpha \le 1$. Then we partition the set of $\bar{N}$ input files $\{w_1,\ldots,w_{\bar{N}}\}$ into two disjoint subsets $\mathcal{I}_1$ and $\mathcal{I}_2$ of sizes $|\mathcal{I}_1| = \alpha\bar{N}$ and $|\mathcal{I}_2| = (1-\alpha)\bar{N}$. We next apply the CDC scheme described above to the files in $\mathcal{I}_1$ with computation load $r_1$, and to the files in $\mathcal{I}_2$ with computation load $r_2$, computing each of the $Q$ output functions at the same set of $s$ nodes. This results in a communication load of
$$\lim_{N\to\infty} \frac{Q\alpha\bar{N}L_{\text{coded}}(r_1,s)T + Q(1-\alpha)\bar{N}L_{\text{coded}}(r_2,s)T}{QNT} = \alpha L_{\text{coded}}(r_1,s) + (1-\alpha)L_{\text{coded}}(r_2,s), \quad (2.21)$$
where $L_{\text{coded}}(r,s)$ is the communication load achieved by CDC in (2.20) for integer-valued $r,s \in \{1,\ldots,K\}$. Using this generalized CDC scheme, for any two integer-valued computation loads $r_1$ and $r_2$, the points on the line segment connecting $(r_1, L_{\text{coded}}(r_1,s))$ and $(r_2, L_{\text{coded}}(r_2,s))$ are achievable. Therefore, for general $1 \le r \le K$, the lower convex envelope of the achievable points $\{(r, L_{\text{coded}}(r,s)) : r \in \{1,\ldots,K\}\}$ is achievable. This proves the upper bound on the computation-communication function in Theorem 2.2 (and the achievability part of Theorem 2.1 by setting $s=1$).
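As a small sanity check (our own helper code), the snippet below numerically verifies the counting identity (2.19) behind the correctness argument, and evaluates the convex-combination load (2.21) for a non-integer computation load; the function names are ours.

```python
from math import comb, floor, ceil
from fractions import Fraction

def Lc(r, s, K):
    """The integer-r load in (2.20)."""
    return sum(Fraction(l * comb(K, l) * comb(l - 2, r - 1) * comb(r, l - s),
                        r * comb(K, r) * comb(K, s))
               for l in range(max(r + 1, s), min(r + s, K) + 1))

def L_general(r, s, K):
    """Convex combination (2.21) for possibly non-integer 1 <= r <= K."""
    r1, r2 = floor(r), ceil(r)
    if r1 == r2:
        return float(Lc(r1, s, K))
    a = r2 - r
    return a * float(Lc(r1, s, K)) + (1 - a) * float(Lc(r2, s, K))

K = 10
# Check the identity (2.19): the delivered count equals C(K-1, r) for all r, s.
for r in range(1, K):
    for s in range(1, K + 1):
        lhs = sum(comb(s - 1, l - r - 1) * comb(K - s, l - s)
                  for l in range(max(r + 1, s), min(r + s, K) + 1))
        assert lhs == comb(K - 1, r)
print(L_general(2.5, 2, K))
```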
Remark 2.10. The ideas of efficiently creating and exploiting coded multicasting opportunities were introduced in caching problems [50–52]. In this section, we illustrated how coding opportunities can be utilized in distributed computing to slash the load of communicating intermediate values, by designing a particular assignment of extra computations across the distributed computing nodes. We note that the intermediate values calculated in the Map phase mimic the locally stored cache contents in caching problems, providing the "side information" that enables coding in the subsequent Shuffle phase (or content delivery). For the case of $s=1$, where no two nodes are interested in computing a common Reduce function, the coded data shuffling of CDC is similar to a coded transmission strategy for wireless D2D networks proposed in [52], where the side information enabling coded multicasting is pre-fetched in a specific repetitive manner into the caches of the wireless nodes (in CDC, such information is obtained by computing the Map functions locally). When $s$ is larger than 1, i.e., every Reduce function needs to be computed at multiple nodes, our CDC scheme creates novel coding opportunities that exploit both the redundancy of the Map computations and the commonality of the data requests for the Reduce functions across nodes, further reducing the communication load.

Remark 2.11. Generally speaking, we can view the Shuffle phase of the considered distributed computing framework as an instance of the index coding problem [55, 56], in which a central server aims to design a broadcast message (code) of minimum length that simultaneously satisfies the requests of all clients, given the side information stored in the clients' local caches. Note that while a randomized linear network coding approach (see, e.g., [57–59]) is sufficient to implement any multicast communication where the messages are intended for all receivers, it is generally sub-optimal for index coding problems where every client requests different messages. Although the index coding problem is still open in general, for the considered distributed computing scenario, where we have the flexibility of designing the Map computations (and thus the side information), we prove in the next two sections tight lower bounds on the minimum communication load for the cases $s=1$ and $s>1$ respectively, demonstrating the optimality of the proposed CDC scheme.

2.5 Converse of Theorem 2.1

In this section, we prove the lower bound on $L^*(r)$ in Theorem 2.1. For $k \in \{1,\ldots,K\}$, we denote the set of indices of the files mapped by Node $k$ by $\mathcal{M}_k$, and the set of indices of the Reduce functions computed by Node $k$ by $\mathcal{W}_k$. As the first step, we consider the communication load for a given file assignment $\mathcal{M} \triangleq (\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_K)$ in the Map phase. We denote the minimum communication load under the file assignment $\mathcal{M}$ by $L^*_{\mathcal{M}}$.
We denote the number of files that are mapped at exactly $j$ nodes under a file assignment $\mathcal{M}$ by $a_{\mathcal{M}}^j$, for all $j \in \{1,\ldots,K\}$:
$$a_{\mathcal{M}}^j = \sum_{\mathcal{J}\subseteq\{1,\ldots,K\}:|\mathcal{J}|=j} |(\cap_{k\in\mathcal{J}}\mathcal{M}_k)\backslash(\cup_{i\notin\mathcal{J}}\mathcal{M}_i)|. \quad (2.22)$$

[Figure 2.6: A file assignment for $N=6$ files and $K=3$ nodes, in which Node 1 maps Files 1, 3, 5, 6, Node 2 maps Files 4, 5, 6, and Node 3 maps Files 2, 3, 4, 6.]

For example, for the particular file assignment in Fig. 2.6, i.e., $\mathcal{M} = (\{1,3,5,6\},\{4,5,6\},\{2,3,4,6\})$, we have $a_{\mathcal{M}}^1 = 2$, since Files 1 and 2 are each mapped on a single node (Node 1 and Node 3 respectively). Similarly, $a_{\mathcal{M}}^2 = 3$ (Files 3, 4, and 5) and $a_{\mathcal{M}}^3 = 1$ (File 6). For a particular file assignment $\mathcal{M}$, we present a lower bound on $L^*_{\mathcal{M}}$ in the following lemma.

Lemma 2.1. $L^*_{\mathcal{M}} \ge \sum_{j=1}^{K} \frac{a_{\mathcal{M}}^j}{N}\cdot\frac{K-j}{Kj}$.

Next, we first demonstrate the converse of Theorem 2.1 using Lemma 2.1, and then give the proof of Lemma 2.1.

Converse Proof of Theorem 2.1. Clearly, the minimum communication load $L^*(r)$ is lower bounded by the minimum value of $L^*_{\mathcal{M}}$ over all file assignments that admit a computation load of $r$:
$$L^*(r) \ge \inf_{\mathcal{M}:|\mathcal{M}_1|+\cdots+|\mathcal{M}_K|=rN} L^*_{\mathcal{M}}. \quad (2.23)$$
Then, by Lemma 2.1, we have
$$L^*(r) \ge \inf_{\mathcal{M}:|\mathcal{M}_1|+\cdots+|\mathcal{M}_K|=rN} \sum_{j=1}^{K} \frac{a_{\mathcal{M}}^j}{N}\cdot\frac{K-j}{Kj}. \quad (2.24)$$
For every file assignment $\mathcal{M}$ such that $|\mathcal{M}_1|+\cdots+|\mathcal{M}_K|=rN$, the quantities $\{a_{\mathcal{M}}^j\}_{j=1}^{K}$ satisfy
$$a_{\mathcal{M}}^j \ge 0, \quad j\in\{1,\ldots,K\}, \quad (2.25)$$
$$\sum_{j=1}^{K} a_{\mathcal{M}}^j = N, \quad (2.26)$$
$$\sum_{j=1}^{K} j\,a_{\mathcal{M}}^j = rN. \quad (2.27)$$
Since the function $\frac{K-j}{Kj}$ in (2.24) is convex in $j$, and by (2.26) $\sum_{j=1}^{K}\frac{a_{\mathcal{M}}^j}{N}=1$, (2.24) becomes
$$L^*(r) \ge \inf_{\mathcal{M}:|\mathcal{M}_1|+\cdots+|\mathcal{M}_K|=rN} \frac{K - \sum_{j=1}^{K} j\frac{a_{\mathcal{M}}^j}{N}}{K\sum_{j=1}^{K} j\frac{a_{\mathcal{M}}^j}{N}} \overset{(a)}{=} \frac{K-r}{Kr}, \quad (2.28)$$
where (a) follows from the constraint imposed by the computation load in (2.27). The lower bound on $L^*(r)$ in (2.28) holds for general $1\le r\le K$. We can further improve the lower bound for non-integer valued $r$ as follows. For a particular $r \notin \mathbb{N}$, we first find the line $p+qj$, as a function of $1\le j\le K$, connecting the two points $(\lfloor r\rfloor, \frac{K-\lfloor r\rfloor}{K\lfloor r\rfloor})$ and $(\lceil r\rceil, \frac{K-\lceil r\rceil}{K\lceil r\rceil})$. More specifically, we find $p,q\in\mathbb{R}$ such that
$$p+qj\,\big|_{j=\lfloor r\rfloor} = \frac{K-\lfloor r\rfloor}{K\lfloor r\rfloor}, \quad (2.29)$$
$$p+qj\,\big|_{j=\lceil r\rceil} = \frac{K-\lceil r\rceil}{K\lceil r\rceil}. \quad (2.30)$$
Then, by the convexity of the function $\frac{K-j}{Kj}$ in $j$, we have for integer-valued $j$,
$$\frac{K-j}{Kj} \ge p+qj, \quad j=1,\ldots,K. \quad (2.31)$$
Then (2.24) reduces to
$$L^*(r) \ge \inf_{\mathcal{M}:|\mathcal{M}_1|+\cdots+|\mathcal{M}_K|=rN} \sum_{j=1}^{K}\frac{a_{\mathcal{M}}^j}{N}\cdot(p+qj) \quad (2.32)$$
$$= \inf_{\mathcal{M}:|\mathcal{M}_1|+\cdots+|\mathcal{M}_K|=rN} \sum_{j=1}^{K}\frac{a_{\mathcal{M}}^j}{N}\cdot p + \sum_{j=1}^{K}\frac{j\,a_{\mathcal{M}}^j}{N}\cdot q \quad (2.33)$$
$$\overset{(b)}{=} p+qr, \quad (2.34)$$
where (b) follows from the constraints on $\{a_{\mathcal{M}}^j\}_{j=1}^K$ in (2.26) and (2.27). Therefore, $L^*(r)$ is lower bounded by the lower convex envelope of the points $\{(r,\frac{K-r}{Kr}): r\in\{1,\ldots,K\}\}$. This completes the proof of the converse part of Theorem 2.1.

Remark 2.12. Although the considered model only allows each node to send a message generated independently of the other nodes' messages, we can show that even if the data shuffling process is carried out in multiple rounds and dependency between messages is allowed, the lower bound on $L^*(r)$ remains the same.

We devote the rest of this section to the proof of Lemma 2.1. To prove Lemma 2.1, we develop a lower bound on the number of bits communicated by any subset of nodes, by induction on the size of the subset. In particular, for a subset of computing nodes, we first characterize a lower bound on the minimum number of bits required by a particular node in the subset, which is given by a cut-set bound separating this node from all the other nodes in the subset. Then, we combine this bound with the lower bound on the number of bits communicated by the rest of the nodes in the subset, which is given by the inductive argument.
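Before diving into the proof, the short sketch below (our own helper) evaluates the bound of Lemma 2.1 on the file assignment of Fig. 2.6, computing $a_{\mathcal{M}}^j$ directly from the assignment.

```python
from fractions import Fraction

def lower_bound(M, N, K):
    """Lemma 2.1: L*_M >= sum_j (a^j_M / N) * (K - j) / (K * j)."""
    a = {j: 0 for j in range(1, K + 1)}
    for n in range(1, N + 1):
        j = sum(n in M[k] for k in M)      # how many nodes map file n
        a[j] += 1
    return sum(Fraction(a[j], N) * Fraction(K - j, K * j) for j in a if a[j])

# The file assignment of Fig. 2.6: a^1 = 2, a^2 = 3, a^3 = 1.
M = {1: {1, 3, 5, 6}, 2: {4, 5, 6}, 3: {2, 3, 4, 6}}
print(lower_bound(M, N=6, K=3))   # 11/36 for this particular assignment
```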
Proof of Lemma 2.1. For $q\in\{1,\ldots,Q\}$ and $n\in\{1,\ldots,N\}$, let $V_{q,n}$ be i.i.d. random variables uniformly distributed on $\mathbb{F}_{2^T}$, and let the intermediate values $v_{q,n}$ be the realizations of $V_{q,n}$. For some $\mathcal{Q}\subseteq\{1,\ldots,Q\}$ and $\mathcal{N}\subseteq\{1,\ldots,N\}$, we define
$$V_{\mathcal{Q},\mathcal{N}} \triangleq \{V_{q,n}: q\in\mathcal{Q}, n\in\mathcal{N}\}. \quad (2.35)$$
Since each message $X_k$ is generated as a function of the intermediate values that are computed at Node $k$, the following holds for all $k\in\{1,\ldots,K\}$:
$$H(X_k \mid V_{:,\mathcal{M}_k}) = 0, \quad (2.36)$$
where we use ":" to denote the set of all possible indices. The validity of the shuffling scheme requires that for all $k\in\{1,\ldots,K\}$,
$$H(V_{\mathcal{W}_k,:} \mid X_{:}, V_{:,\mathcal{M}_k}) = 0. \quad (2.37)$$
For a subset $\mathcal{S}\subseteq\{1,\ldots,K\}$, we define
$$Y_{\mathcal{S}} \triangleq (V_{\mathcal{W}_{\mathcal{S}},:}, V_{:,\mathcal{M}_{\mathcal{S}}}), \quad (2.38)$$
which contains all the intermediate values required by the nodes in $\mathcal{S}$ and all the intermediate values known locally by the nodes in $\mathcal{S}$ after the Map phase. For any subset $\mathcal{S}\subseteq\{1,\ldots,K\}$ and a file assignment $\mathcal{M}$, we denote the number of files that are exclusively mapped by $j$ nodes in $\mathcal{S}$ by $a_{\mathcal{M}}^{j,\mathcal{S}}$:
$$a_{\mathcal{M}}^{j,\mathcal{S}} \triangleq \sum_{\mathcal{J}\subseteq\mathcal{S}:|\mathcal{J}|=j} |(\cap_{k\in\mathcal{J}}\mathcal{M}_k)\backslash(\cup_{i\notin\mathcal{J}}\mathcal{M}_i)|, \quad (2.39)$$
and the message symbols communicated by the nodes whose indices are in $\mathcal{S}$ by
$$X_{\mathcal{S}} = \{X_k: k\in\mathcal{S}\}. \quad (2.40)$$
Then we prove the following claim.

Claim 2.1. For any subset $\mathcal{S}\subseteq\{1,\ldots,K\}$, we have
$$H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c}) \ge T\sum_{j=1}^{|\mathcal{S}|} a_{\mathcal{M}}^{j,\mathcal{S}}\,\frac{Q}{K}\cdot\frac{|\mathcal{S}|-j}{j}, \quad (2.41)$$
where $\mathcal{S}^c \triangleq \{1,\ldots,K\}\backslash\mathcal{S}$ denotes the complement of $\mathcal{S}$.

We prove Claim 2.1 by induction.

a. If $\mathcal{S}=\{k\}$ for any $k\in\{1,\ldots,K\}$, then obviously
$$H(X_k \mid Y_{\{1,\ldots,K\}\backslash\{k\}}) \ge 0 = T a_{\mathcal{M}}^{1,\{k\}}\,\frac{Q}{K}\cdot\frac{1-1}{1}. \quad (2.42)$$

b. Suppose the statement is true for all subsets of size $S_0$. For any $\mathcal{S}\subseteq\{1,\ldots,K\}$ of size $|\mathcal{S}|=S_0+1$ and any $k\in\mathcal{S}$, we have
$$H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c}) = \frac{1}{|\mathcal{S}|}\sum_{k\in\mathcal{S}} H(X_{\mathcal{S}}, X_k \mid Y_{\mathcal{S}^c}) \quad (2.43)$$
$$= \frac{1}{|\mathcal{S}|}\sum_{k\in\mathcal{S}} \big(H(X_{\mathcal{S}} \mid X_k, Y_{\mathcal{S}^c}) + H(X_k \mid Y_{\mathcal{S}^c})\big) \quad (2.44)$$
$$\ge \frac{1}{|\mathcal{S}|}\sum_{k\in\mathcal{S}} H(X_{\mathcal{S}} \mid X_k, Y_{\mathcal{S}^c}) + \frac{1}{|\mathcal{S}|} H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c}). \quad (2.45)$$
From (2.45), we have
$$H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c}) \ge \frac{1}{|\mathcal{S}|-1}\sum_{k\in\mathcal{S}} H(X_{\mathcal{S}} \mid X_k, Y_{\mathcal{S}^c}) \quad (2.46)$$
$$\ge \frac{1}{S_0}\sum_{k\in\mathcal{S}} H(X_{\mathcal{S}} \mid X_k, V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) \quad (2.47)$$
$$= \frac{1}{S_0}\sum_{k\in\mathcal{S}} H(X_{\mathcal{S}} \mid V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}). \quad (2.48)$$
For each $k\in\mathcal{S}$, we have the following subset versions of (2.36) and (2.37):
$$H(X_k \mid V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) = 0, \quad (2.49)$$
$$H(V_{\mathcal{W}_k,:} \mid X_{\mathcal{S}}, V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) = 0. \quad (2.50)$$
Consequently,
$$H(X_{\mathcal{S}}, V_{\mathcal{W}_k,:} \mid V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) = H(X_{\mathcal{S}} \mid V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) \quad (2.51)$$
$$= H(V_{\mathcal{W}_k,:} \mid V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) + H(X_{\mathcal{S}} \mid V_{\mathcal{W}_k,:}, V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}). \quad (2.52)$$
The first term on the RHS of (2.52) can be lower bounded as follows:
$$H(V_{\mathcal{W}_k,:} \mid V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) = H(V_{\mathcal{W}_k,:} \mid V_{:,\mathcal{M}_k}, V_{\mathcal{W}_{\mathcal{S}^c},:}, V_{:,\mathcal{M}_{\mathcal{S}^c}}) \quad (2.53)$$
$$\overset{(a)}{=} H(V_{\mathcal{W}_k,:} \mid V_{:,\mathcal{M}_k}, V_{:,\mathcal{M}_{\mathcal{S}^c}}) \quad (2.54)$$
$$\overset{(b)}{=} H(V_{\mathcal{W}_k,:} \mid V_{\mathcal{W}_k,\mathcal{M}_k\cup\mathcal{M}_{\mathcal{S}^c}}) \quad (2.55)$$
$$\overset{(c)}{=} \sum_{q\in\mathcal{W}_k} H(V_{\{q\},:} \mid V_{\{q\},\mathcal{M}_k\cup\mathcal{M}_{\mathcal{S}^c}}) \quad (2.56)$$
$$\overset{(d)}{=} \frac{Q}{K}\,T\sum_{j=0}^{S_0} a_{\mathcal{M}}^{j,\mathcal{S}\backslash\{k\}} \quad (2.57)$$
$$\ge \frac{Q}{K}\,T\sum_{j=1}^{S_0} a_{\mathcal{M}}^{j,\mathcal{S}\backslash\{k\}}, \quad (2.58)$$
where (a) follows from the independence of the intermediate values and the fact that $\mathcal{W}_k\cap\mathcal{W}_{\mathcal{S}^c}=\varnothing$ (different nodes compute different output functions), (b) and (c) follow from the independence of the intermediate values, and (d) follows from the independence of the intermediate values and the fact that $|\mathcal{W}_k|=\frac{Q}{K}$. The second term on the RHS of (2.52) can be lower bounded by the induction hypothesis:
$$H(X_{\mathcal{S}} \mid V_{\mathcal{W}_k,:}, V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) = H(X_{\mathcal{S}\backslash\{k\}} \mid Y_{(\mathcal{S}\backslash\{k\})^c}) \quad (2.59)$$
$$\ge T\sum_{j=1}^{S_0} a_{\mathcal{M}}^{j,\mathcal{S}\backslash\{k\}}\,\frac{Q}{K}\cdot\frac{S_0-j}{j}. \quad (2.60)$$
Thus, by (2.48), (2.52), (2.58), and (2.60), we have
$$H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c}) \ge \frac{1}{S_0}\sum_{k\in\mathcal{S}} H(X_{\mathcal{S}} \mid V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) \quad (2.61)$$
$$= \frac{1}{S_0}\sum_{k\in\mathcal{S}} \Big(H(V_{\mathcal{W}_k,:} \mid V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) + H(X_{\mathcal{S}} \mid V_{\mathcal{W}_k,:}, V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c})\Big) \quad (2.62)$$
$$\ge \frac{1}{S_0}\sum_{k\in\mathcal{S}} \Big(T\sum_{j=1}^{S_0} a_{\mathcal{M}}^{j,\mathcal{S}\backslash\{k\}}\,\frac{Q}{K} + T\sum_{j=1}^{S_0} a_{\mathcal{M}}^{j,\mathcal{S}\backslash\{k\}}\,\frac{Q}{K}\cdot\frac{S_0-j}{j}\Big) \quad (2.63)$$
$$= \frac{T}{S_0}\sum_{k\in\mathcal{S}}\sum_{j=1}^{S_0} a_{\mathcal{M}}^{j,\mathcal{S}\backslash\{k\}}\,\frac{Q}{K}\cdot\frac{S_0}{j} \quad (2.64)$$
$$= T\sum_{j=1}^{S_0} \frac{Q}{K}\cdot\frac{1}{j}\sum_{k\in\mathcal{S}} a_{\mathcal{M}}^{j,\mathcal{S}\backslash\{k\}}. \quad (2.65)$$
By the definition of $a_{\mathcal{M}}^{j,\mathcal{S}}$, we have
$$\sum_{k\in\mathcal{S}} a_{\mathcal{M}}^{j,\mathcal{S}\backslash\{k\}} = \sum_{k\in\mathcal{S}}\sum_{n=1}^{N} \mathbb{1}(\text{file } n \text{ is only mapped by some nodes in } \mathcal{S}\backslash\{k\})\cdot\mathbb{1}(\text{file } n \text{ is mapped by } j \text{ nodes}) \quad (2.66)$$
$$= \sum_{n=1}^{N} \mathbb{1}(\text{file } n \text{ is only mapped by } j \text{ nodes in } \mathcal{S})\cdot\sum_{k\in\mathcal{S}} \mathbb{1}(\text{file } n \text{ is not mapped by Node } k) \quad (2.67)$$
$$= \sum_{n=1}^{N} \mathbb{1}(\text{file } n \text{ is only mapped by } j \text{ nodes in } \mathcal{S})\,(|\mathcal{S}|-j) \quad (2.68)$$
$$= a_{\mathcal{M}}^{j,\mathcal{S}}\,(S_0+1-j). \quad (2.69)$$
Applying (2.69) to (2.65) yields
$$H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c}) \ge T\sum_{j=1}^{S_0} a_{\mathcal{M}}^{j,\mathcal{S}}\,\frac{Q}{K}\cdot\frac{S_0+1-j}{j} \quad (2.70)$$
$$= T\sum_{j=1}^{S_0+1} a_{\mathcal{M}}^{j,\mathcal{S}}\,\frac{Q}{K}\cdot\frac{S_0+1-j}{j}. \quad (2.71)$$

c. Thus, for all subsets $\mathcal{S}\subseteq\{1,\ldots,K\}$,
$$H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c}) \ge T\sum_{j=1}^{|\mathcal{S}|} a_{\mathcal{M}}^{j,\mathcal{S}}\,\frac{Q}{K}\cdot\frac{|\mathcal{S}|-j}{j}, \quad (2.72)$$
which proves Claim 2.1.

Finally, applying Claim 2.1 with $\mathcal{S}=\{1,\ldots,K\}$, the set of all $K$ nodes,
$$L^*_{\mathcal{M}} \ge \frac{H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c})}{QNT} \ge \sum_{j=1}^{K} \frac{a_{\mathcal{M}}^j}{N}\cdot\frac{K-j}{Kj}. \quad (2.73)$$
This completes the proof of Lemma 2.1.

2.6 Extension of CDC: linear computations and compressed CDC

For the MapReduce-type distributed computing model defined in Section 2.1, when the Reduce function is commutative and associative, a "combiner function" was proposed in the original MapReduce framework [4] to pre-combine multiple intermediate values with the same key. Then, instead of sending multiple values to the reducer, the mapper sends the pre-combined value, whose size is the same as a single uncombined value; this significantly reduces the bandwidth consumption without any performance loss. In contrast to this compression/combining technique, which reduces the communication load by combining intermediate values for the same computation task, the proposed CDC scheme enables coding opportunities across intermediate results of different computation tasks.

In this section, we propose a new scheme, named compressed coded distributed computing (in short, compressed CDC). It jointly exploits the compression/combining technique and the CDC scheme to significantly reduce the communication load for computation tasks with linear Reduce functions, which are prevalent in data analytics (e.g., distributed gradient descent, where the partial gradients computed at multiple distributed computing nodes are averaged to obtain the final gradient). In particular, the compressed CDC scheme specifies a repetitive storage of the dataset across the distributed computing nodes. Each node, after processing its locally stored files, first pre-combines several intermediate values of a single computation task needed by another node. Having generated multiple such pre-combined packets for different tasks, the node further codes them into a coded multicast packet that is simultaneously useful for multiple tasks. Therefore, compressed CDC enjoys both the intra-computation gain from combining and the inter-computation gain from coded multicasting.

We characterize the achievable communication load of compressed CDC and show that it substantially outperforms both the combining method and the CDC scheme.
In particular, compared with the scheme that relies only on the combining technique, compressed CDC reduces the communication load by a factor that is proportional to the storage size of each computing node, which is significant in the common scenario where large-scale machine learning tasks are executed on commodity servers with relatively small storage. On the other hand, compared with the CDC scheme, whose communication load scales linearly with the size of the input dataset, compressed CDC eliminates this dependency by pre-combining intermediate values of the same task, allowing the system to scale up to handle computations on arbitrarily large datasets.

2.6.1 Linear Reduce functions

We consider the same distributed computing model defined in Section 2.1, computing $Q$ output functions $\phi_1,\phi_2,\ldots,\phi_Q$ from $N$ input files $w_1,w_2,\ldots,w_N$, following a MapReduce structure. Here we focus on a class of computation jobs with linear aggregation, for which the computation of each output function can be decomposed as the sum of $N$ intermediate values computed from the input files, i.e., for $q=1,\ldots,Q$,
$$\phi_q(w_1,\ldots,w_N) = h_q(v_{q,1},v_{q,2},\ldots,v_{q,N}) = v_{q,1}+v_{q,2}+\cdots+v_{q,N}, \quad (2.74)$$
where $v_{q,n} = g_q(w_n)$ is the intermediate value of $\phi_q$ computed from file $w_n$.

So far, we have introduced one computation job that involves computing $Q$ functions. Here, we consider the scenario where $J$ such computation jobs are executed in parallel, for some $J\in\mathbb{N}$. We denote the $N$ input files of job $j$ by $w_{1^{(j)}},\ldots,w_{N^{(j)}}$, and the $Q$ output functions job $j$ wants to compute by $\phi_{1^{(j)}},\ldots,\phi_{Q^{(j)}}$.³

(³As an example, we can consider executing $J$ machine learning tasks (e.g., image classification), each of which has its own dataset and aims to obtain its own set of model parameters. Another example is a navigation application, where $J$ navigation sessions, each of which requires finding the shortest path on a disjoint sector of the map, are executed in parallel.)

2.6.2 Network model

The above $J$ computation jobs are executed distributedly on a computer cluster that consists of $K$ distributed computing nodes, for some $K\in\mathbb{N}$, denoted by Node $1,\ldots,$ Node $K$. Here we assume $K\le N$, and focus on a symmetric setting for the sake of load balancing, in which $K\mid Q$ and each node is responsible for computing $\frac{Q}{K}$ output functions for each job. The $K$ nodes are connected through an error-free broadcast network. Each node has a local storage that can store up to $\mu JN$ input files, i.e., a $\mu$ fraction of the entire dataset containing all input files from all jobs, for some $\mu$ satisfying $\frac{1}{K}\le\mu<1$. Before the computation starts, each node selects and stores $\mu JN$ input files from the dataset. For each node $k$, we denote the set of indices of the locally stored files by $\mathcal{M}_k$. A valid file placement has to satisfy 1) $|\mathcal{M}_k|\le\mu JN$ for all $k=1,2,\ldots,K$ (local storage constraint), and 2) $\cup_{k=1,\ldots,K}\mathcal{M}_k = \cup_{j=1,\ldots,J}\{n^{(j)}: n=1,2,\ldots,N\}$ (the entire dataset needs to be collectively stored across the cluster).

2.6.3 Computation model

Map phase. For each file $w_{n^{(j)}}$ of job $j$ with $n^{(j)}\in\mathcal{M}_k$, Node $k$ maps it into $Q$ intermediate values $v_{1^{(j)},n^{(j)}}, v_{2^{(j)},n^{(j)}}, \ldots, v_{Q^{(j)},n^{(j)}}$, one for each of the $Q$ functions computed in job $j$. We assume that all intermediate values across the $J$ jobs have the same size of $T$ bits, which is the case when, for example, we are training $J$ image classifiers in parallel using the same deep neural network.
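The linear-aggregation structure in (2.74) is easy to picture with a toy example. The sketch below (our own illustration; the gradient-like interpretation is just one instance) builds per-file intermediate values and checks that partial sums computed on different machines add up to the output function, which is precisely the property that pre-combining exploits.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q, dim = 6, 4, 3
files = [rng.normal(size=dim) for _ in range(N)]       # stand-ins for w_1, ..., w_N

def g(q, w):
    """Map function: intermediate value v_{q,n} of output q from file w (toy choice)."""
    return (q + 1) * w

# phi_q = v_{q,1} + ... + v_{q,N}; mappers may pre-combine partial sums over
# the files they hold, which is what the combiner / compressed CDC relies on.
phi = [sum(g(q, w) for w in files) for q in range(Q)]
partial = [sum(g(0, w) for w in files[:3]), sum(g(0, w) for w in files[3:])]
assert np.allclose(phi[0], partial[0] + partial[1])
```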
Shuffle phase. Before the Shuffle phase starts, for each computation job $j$, we assign the tasks of reducing the output functions symmetrically across the nodes, such that each node computes a disjoint subset of $\frac{Q}{K}$ functions. We denote the set of indices of the output functions assigned to Node $k$ for job $j$ by $\mathcal{S}_k^{(j)}$, $j=1,2,\ldots,J$. In the Shuffle phase, each node $k$ produces a message, denoted by $X_k\in\mathbb{F}_{2^{\ell_k}}$, as a function of the intermediate values computed locally in the Map phase (i.e., $\cup_{n^{(j)}\in\mathcal{M}_k}\{v_{1^{(j)},n^{(j)}},v_{2^{(j)},n^{(j)}},\ldots,v_{Q^{(j)},n^{(j)}}\}$), where $\ell_k\in\mathbb{N}$ denotes the length of the message in bits. Having generated $X_k$, Node $k$ broadcasts it to all the other nodes.

Definition 2.4 (Communication Load). We define the communication load, denoted by $L$, as the total number of bits contained in all broadcast messages, normalized by $JQT$, i.e.,
$$L \triangleq \frac{\ell_1+\cdots+\ell_K}{JQT}. \quad (2.75)$$

Reduce phase. For each job $j$ and each $q^{(j)}\in\mathcal{S}_k^{(j)}$, $j=1,2,\ldots,J$, Node $k$ computes the output function $\phi_{q^{(j)}}$ as in (2.74), using the locally computed Map results and the broadcast messages received in the Shuffle phase.

2.6.4 Main Results

For the above formulated distributed computing problem, we first study the effect of applying the compression scheme and the CDC scheme individually on reducing the communication load. Then, we present our main result, which is the communication load achieved by the proposed computing scheme that jointly utilizes compression and CDC.

Exploiting the compression technique, each sender node pre-combines all the intermediate values needed at the receiver node for a particular function, and then sends the pre-combined value. We consider single-job strategies, where the same steps used to handle a single job are repeated for all $J$ jobs. The following communication load can be achieved by solely applying compression:
$$L_{\text{compression}} = \begin{cases} \left\lceil\frac{1}{\mu}\right\rceil - 1, & \frac{1}{K}\le\mu<\frac{1}{2}, \\ 1, & \frac{1}{2}\le\mu<1. \end{cases} \quad (2.76)$$
The communication load achieved by compression depends only on the storage size $\mu$. In the regime $\frac{1}{2}\le\mu<1$, the load $L_{\text{compression}}$ is a constant that does not decrease as the storage size increases. This is because, as long as $\mu<1$, each node has to receive at least one intermediate value for each of the functions it is computing.

When only applying the CDC scheme without compression, we can achieve the communication load
$$L_{\text{CDC}} = \frac{(1-\mu)N}{\mu K}. \quad (2.77)$$
The CDC scheme creates coded multicast packets that are simultaneously useful for $\mu K$ nodes. Hence, for a fixed storage size $\mu$, the achieved communication load $L_{\text{CDC}}$ decreases inversely proportionally with the network size $K$. On the other hand, since the CDC scheme was designed to handle general Reduce functions that require each of the $N$ intermediate values separately as inputs, the load $L_{\text{CDC}}$ also scales linearly with the number of input files $N$.
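To see the two baselines side by side, the following small sketch (our own helper, with hypothetical parameter values) evaluates (2.76) and (2.77): the combining-only load flattens out at 1, while the CDC-only load shrinks with $\mu K$ but grows with the dataset size $N$.

```python
from math import ceil

def L_compression(mu):
    """Eq. (2.76): single-job combining, repeated for every job."""
    return ceil(1 / mu) - 1 if mu < 0.5 else 1

def L_cdc(mu, K, N):
    """Eq. (2.77): coded multicasting alone (gain muK, but linear in N)."""
    return (1 - mu) * N / (mu * K)

K, N = 20, 1000
for muK in (1, 2, 5, 10):
    mu = muK / K
    print(f"mu={mu:.2f}: compression={L_compression(mu)}, CDC={L_cdc(mu, K, N):.1f}")
```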
We propose the compressed coded distributed computing (compressed CDC) scheme, which jointly utilizes the combining and coded multicasting techniques, and achieves a smaller communication load than those achieved by applying each of the two techniques individually. We present the performance of compressed CDC in the following theorem.

Theorem 2.3. To execute $J$ computation jobs with linear aggregation of intermediate results, each of which processes $N$ input files to compute $Q$ output functions, distributedly over $K$ computing nodes each with a local storage of size $\mu$, the proposed compressed CDC scheme achieves the communication load
$$L_{\text{compressed CDC}} = \frac{(1-\mu)(\mu K+1)}{\mu K}, \quad (2.78)$$
for $\mu K\in\{1,2,\ldots,K-1\}$ and sufficiently large $J$.

We describe the general compressed CDC scheme in the next sub-section.

Remark 2.13. Compared with the compression scheme whose communication load is given in (2.76), the proposed compressed CDC scheme reduces the communication load by a factor of $\mu$ when $\frac{1}{K}\le\mu<\frac{1}{2}$, and by a factor of $1-\mu$ when $\frac{1}{2}\le\mu<1$. In scenarios where the cluster consists of many low-end computing nodes with small storage (e.g., $\mu=\frac{1}{K}$), this bandwidth reduction scales with the network size. Also, in contrast to the compression scheme, the load $L_{\text{compressed CDC}}$ keeps decreasing as the storage size $\mu$ increases.

Remark 2.14. Unlike the communication load in (2.77) achieved by the CDC scheme, the communication load achieved by the compressed CDC scheme does not grow with the number of input files. This is accomplished by incorporating the compression technique, i.e., pre-combining multiple intermediate values of the same Reduce function.

2.6.5 Description of the compressed CDC scheme

In this sub-section, we describe the proposed compressed CDC scheme and analyze its communication load. We consider a storage size $\mu$ such that $\mu K\in\{1,2,\ldots,K-1\}$, and take sufficiently many computation jobs to process in parallel, where the number of jobs is $J=\gamma\binom{K}{\mu K+1}$ for some $\gamma\in\mathbb{N}$. The proposed compressed CDC scheme operates on a batch of $\binom{K}{\mu K+1}$ jobs at a time, and repeats the same operations $\gamma$ times to process all the jobs. Therefore, it suffices to describe the scheme for the case $\gamma=1$. Along with the general description of the compressed CDC scheme, we consider the following illustrative example.

Example (compressed CDC). We have a distributed computing cluster that consists of $K=4$ nodes, each with a storage size of $\mu=\frac{1}{2}$. On this cluster, we need to execute $J=\binom{K}{\mu K+1}=4$ MapReduce jobs with linear Reduce functions, each of which processes $N=6$ files to compute $Q=4$ output functions. Each node is responsible for computing one output function for each of the 4 jobs. In particular, Node $k$ computes
$$\phi_{k^{(j)}} = v_{k^{(j)},1^{(j)}} + v_{k^{(j)},2^{(j)}} + \cdots + v_{k^{(j)},6^{(j)}}, \quad (2.79)$$
for all $j=1,\ldots,4$, where $v_{k^{(j)},n^{(j)}}$ is the intermediate value of the function $\phi_{k^{(j)}}$ of job $j$ mapped from the input file $w_{n^{(j)}}$ of job $j$.

File placement. For each job $j$, $j=1,2,\ldots,\binom{K}{\mu K+1}$, all of its input files $w_{1^{(j)}},w_{2^{(j)}},\ldots,w_{N^{(j)}}$ are stored exclusively on a unique subset of $\mu K+1$ nodes, whose set of indices we denote by $\mathcal{K}_j$. Within $\mathcal{K}_j$, each file $w_{n^{(j)}}$ of job $j$ is repeatedly stored on $\mu K$ nodes. In particular, we first evenly partition the files $w_{1^{(j)}},w_{2^{(j)}},\ldots,w_{N^{(j)}}$ into $\mu K+1$ batches, and label each batch by a unique size-$\mu K$ subset of $\mathcal{K}_j$, denoted by $\mathcal{P}_j$. Then, we store all the files in a batch on each of the $\mu K$ nodes whose indices are in the corresponding subset $\mathcal{P}_j$. We denote the set of indices of the files from job $j$ in the batch labelled by a subset $\mathcal{P}_j$ by $\mathcal{B}_{\mathcal{P}_j}$.
The file placement is performed such that for each $\mathcal{P}_j\subset\mathcal{K}_j$ with $|\mathcal{P}_j|=\mu K$ and each $n^{(j)}\in\mathcal{B}_{\mathcal{P}_j}$, we have
$$n^{(j)}\in\mathcal{M}_k, \quad (2.80)$$
for all $k\in\mathcal{P}_j$, where $\mathcal{M}_k$ is the set of indices of all files stored at Node $k$. Applying the above file placement, each node in $\mathcal{K}_j$ stores $\mu K\times\frac{N}{\mu K+1}$ files of job $j$. Since each node belongs to $\binom{K-1}{\mu K}$ subsets of $\{1,2,\ldots,K\}$ of size $\mu K+1$, it stores $\frac{\mu KN}{\mu K+1}\times\binom{K-1}{\mu K}=\mu JN$ files in total, satisfying its local storage constraint.

[Figure 2.7: File placement onto $K=4$ computing nodes. For each $j=1,2,3,4$, the set of files for job $j$, $\{1^{(j)},2^{(j)},\ldots,6^{(j)}\}$, is placed onto a unique subset of $\mu K+1=3$ nodes, following a repetitive pattern where each file is stored on $\mu K=2$ nodes.]

Example (compressed CDC: file placement). As shown in Fig. 2.7, we perform the file placement such that for each $j=1,2,3,4$, the set of files from job $j$, $\{1^{(j)},2^{(j)},\ldots,6^{(j)}\}$, is placed on a unique subset of $\mu K+1=3$ nodes. For example, the files of job 1, $\{1^{(1)},2^{(1)},\ldots,6^{(1)}\}$, are exclusively stored on Nodes 1, 2, and 3. These files are partitioned into 3 batches, i.e., $\mathcal{B}_{\{1,2\}}=\{3^{(1)},4^{(1)}\}$, $\mathcal{B}_{\{1,3\}}=\{1^{(1)},2^{(1)}\}$, and $\mathcal{B}_{\{2,3\}}=\{5^{(1)},6^{(1)}\}$. Then, the files $3^{(1)}$ and $4^{(1)}$ are stored on Nodes 1 and 2, the files $1^{(1)}$ and $2^{(1)}$ are stored on Nodes 1 and 3, and the files $5^{(1)}$ and $6^{(1)}$ are stored on Nodes 2 and 3.

Coded computing. After the file placement, the compressed CDC scheme carries out the computation and data shuffling in subsets of $\mu K+1$ nodes. Within each subset $\mathcal{K}_j$, $j=1,2,\ldots,\binom{K}{\mu K+1}$, which contains the indices of $|\mathcal{K}_j|=\mu K+1$ nodes, the computing scheme proceeds in two stages. In the first stage, the nodes in $\mathcal{K}_j$ process the files they have exclusively stored, i.e., the files of job $j$. In the second stage, they handle the files from other jobs.

Stage 1 (coding for a single job). In the first stage, the nodes in $\mathcal{K}_j$ only process input files and compute output functions for job $j$. For ease of exposition, we drop all job indices in the rest of the description of Stage 1. According to the file placement, each node in $\mathcal{K}$ stores $\frac{\mu KN}{\mu K+1}$ files of job $j$, and each node in a subset $\mathcal{P}$ of $\mu K$ nodes stores all the files in the batch $\mathcal{B}_{\mathcal{P}}$. In the Map phase, each node $k\in\mathcal{K}$ maps all the files of job $j$ it has stored locally, for all output functions of job $j$. We note that after the Map phase, for each subset $\mathcal{P}$ of size $\mu K$ and $k'\in\mathcal{K}\backslash\mathcal{P}$, each of the nodes in $\mathcal{P}$ has computed $\frac{Q}{K}$ intermediate values, one for each of the functions assigned to Node $k'$, from each of the files in the batch $\mathcal{B}_{\mathcal{P}}$. More precisely, these intermediate values are
$$\{v_{q,n}: q\in\mathcal{S}_{k'}, n\in\mathcal{B}_{\mathcal{P}}\}. \quad (2.81)$$
In the Shuffle phase, within each subset $\mathcal{P}\subset\mathcal{K}$ of size $\mu K$, we first perform the following pre-combining operation. For each $k\in\mathcal{P}$, Node $k$ sums up the intermediate values computed in (2.81) to obtain the pre-combined values
$$\bar{v}_{q,\mathcal{P}} = \sum_{n\in\mathcal{B}_{\mathcal{P}}} v_{q,n}, \quad (2.82)$$
for all $q\in\mathcal{S}_{k'}$. Having computed these $\frac{Q}{K}$ pre-combined values $\{\bar{v}_{q,\mathcal{P}}: q\in\mathcal{S}_{k'}\}$, the nodes in $\mathcal{P}$ concatenate them to generate a packet $V_{\mathcal{P}}$, and evenly and arbitrarily split it into $\mu K$ segments. We label the segments by the elements in $\mathcal{P}$; that is, for $\mathcal{P}=\{i_1,i_2,\ldots,i_{\mu K}\}$, we have
$$V_{\mathcal{P}} = (V_{\mathcal{P},i_1},V_{\mathcal{P},i_2},\ldots,V_{\mathcal{P},i_{\mu K}}). \quad (2.83)$$
Finally, each node $k$ in $\mathcal{K}$ generates a coded packet $X_k^{\text{stage 1}}$ by computing the bit-wise XOR (denoted by $\oplus$) of the data segments labelled by $k$, i.e.,
$$X_k^{\text{stage 1}} = \bigoplus_{\mathcal{P}\subset\mathcal{K}: |\mathcal{P}|=\mu K,\, k\in\mathcal{P}} V_{\mathcal{P},k}, \quad (2.84)$$
and multicasts $X_k^{\text{stage 1}}$ to all other nodes in $\mathcal{K}$. After Node $k$ receives a coded packet $X_{k'}^{\text{stage 1}}$ from Node $k'$, it cancels all the segments $V_{\mathcal{P},k'}$ with $k\in\mathcal{P}$, and recovers the intended segment $V_{\mathcal{K}\backslash\{k\},k'}$. Repeating this decoding process for all received coded packets, Node $k$ recovers $V_{\mathcal{K}\backslash\{k\}}$, and hence $\bar{v}_{q,\mathcal{K}\backslash\{k\}}$ for all $q\in\mathcal{S}_k$. Using these values, together with the local Map results, Node $k$ computes the output $\phi_q$ for all $q\in\mathcal{S}_k$. After the first stage of computation, each node in $\mathcal{K}_j$ completes its computation tasks for job $j$. Since each of the coded packets in (2.84) contains $\frac{Q}{K}\times\frac{T}{\mu K}$ bits, the communication load exerted in the Shuffle phase of the first stage is
$$L_{\text{stage 1}} = \frac{\frac{Q}{K}\times\frac{(\mu K+1)T}{\mu K}}{JQT} = \frac{\frac{\mu K+1}{\mu K}}{JK}. \quad (2.85)$$

Example (compressed CDC: coding for a single job). We start describing the proposed scheme in the subset of Nodes 1, 2, and 3. In the first stage of computation, since $\{1,2,3\}=\mathcal{K}_1$, these three nodes focus on processing job 1. After the computations in the Map phase, the 3 nodes combine their local intermediate values such that Node 1 and Node 2 compute $v_{3,3}+v_{3,4}$, Node 1 and Node 3 compute $v_{2,1}+v_{2,2}$, and Node 2 and Node 3 compute $v_{1,5}+v_{1,6}$. Next, each node creates coded multicast packets from the pre-combined packets, following the encoding process of the CDC scheme. By the end of this stage, Nodes 1, 2, and 3 have computed their assigned functions for job 1. The first stage incurs a communication load of $L_{\text{stage 1}} = \frac{3/2}{16} = \frac{3}{32}$.

Stage 2 (coding across jobs). In the second stage, we first take a node $i$ outside $\mathcal{K}_j$, and then for each $k\in\mathcal{K}_j$, we label the job whose input files are exclusively stored on the nodes in $\{i\}\cup\mathcal{K}_j\backslash\{k\}$ as $j_k$. Next, the nodes in $\mathcal{P}_{j_k}=\mathcal{K}_j\backslash\{k\}$ process the files of job $j_k$ in the batch $\mathcal{B}_{\mathcal{P}_{j_k}}$ in the Map phase, and communicate the computed intermediate values needed by Node $k$ in a coded manner.

For a node $i\in\{1,2,\ldots,K\}\backslash\mathcal{K}_j$ and each $k\in\mathcal{K}_j$, the nodes in $\mathcal{P}_{j_k}=\mathcal{K}_j\backslash\{k\}$ share a batch of $\frac{N}{\mu K+1}$ files in $\mathcal{B}_{\mathcal{P}_{j_k}}$ for job $j_k$. In the Map phase, for each $k'\in\mathcal{P}_{j_k}$, Node $k'$ computes $\frac{Q}{K}$ intermediate values, one for each function of job $j_k$ assigned to Node $k$ in $\mathcal{S}_k^{(j_k)}$, from each of the files in the batch $\mathcal{B}_{\mathcal{P}_{j_k}}$. More precisely, each node $k'$ computes the intermediate values
$$\{v_{q^{(j_k)},n^{(j_k)}}: q^{(j_k)}\in\mathcal{S}_k^{(j_k)},\, n^{(j_k)}\in\mathcal{B}_{\mathcal{P}_{j_k}}\}. \quad (2.86)$$
In the Shuffle phase, for each $k\in\mathcal{K}_j$, the nodes in $\mathcal{P}_{j_k}$ first pre-combine the Map results in (2.86) locally to compute
$$\bar{v}_{q^{(j_k)},\mathcal{P}_{j_k}} = \sum_{n^{(j_k)}\in\mathcal{B}_{\mathcal{P}_{j_k}}} v_{q^{(j_k)},n^{(j_k)}}, \quad (2.87)$$
for all $q^{(j_k)}\in\mathcal{S}_k^{(j_k)}$. Next, as in the first stage, the nodes in $\mathcal{P}_{j_k}$ first concatenate the above $\frac{Q}{K}$ pre-combined values $\{\bar{v}_{q^{(j_k)},\mathcal{P}_{j_k}}: q^{(j_k)}\in\mathcal{S}_k^{(j_k)}\}$ to form a packet $V_{\mathcal{P}_{j_k}}$, and then split it into $\mu K$ segments. We label these segments by the elements in $\mathcal{P}_{j_k}$, i.e., for $\mathcal{P}_{j_k}=\{i_1,i_2,\ldots,i_{\mu K}\}$, we have
$$V_{\mathcal{P}_{j_k}} = (V_{\mathcal{P}_{j_k},i_1},V_{\mathcal{P}_{j_k},i_2},\ldots,V_{\mathcal{P}_{j_k},i_{\mu K}}). \quad (2.88)$$
Finally, each node $k'$ in $\mathcal{K}_j$ generates a coded packet $X_{k'}^{\text{stage 2}}$ by computing the bit-wise XOR of the data segments labelled by $k'$, i.e.,
$$X_{k'}^{\text{stage 2}} = \bigoplus_{t\in\mathcal{K}_j\backslash\{k'\}} V_{\mathcal{P}_{j_t},k'}, \quad (2.89)$$
and multicasts $X_{k'}^{\text{stage 2}}$ to all other nodes in $\mathcal{K}_j$.
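The XOR-of-segments multicast used in both stages is easy to emulate with a few bytes. The following toy sketch (our own, with illustrative placeholder segment values) re-enacts the stage-1 exchange of (2.84) for a subset $\mathcal{K}=\{1,2,3\}$ with $\mu K = 2$.

```python
import os

P_sets = [frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]
# V[P][k]: the segment of the pre-combined packet V_P labelled by node k.
V = {P: {k: os.urandom(4) for k in P} for P in P_sets}

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Each node XORs the segments labelled by itself across the subsets containing it.
X = {}
for k in (1, 2, 3):
    segs = [V[P][k] for P in P_sets if k in P]
    X[k] = xor(segs[0], segs[1])

# Node k decodes X[k']: it cancels the segment of the P that contains both k and k',
# recovering the segment of V_{K \ {k}} that it actually needs.
for k in (1, 2, 3):
    for kp in set(X) - {k}:
        known_P = next(P for P in P_sets if k in P and kp in P)
        needed_P = frozenset({1, 2, 3}) - {k}
        recovered = xor(X[kp], V[known_P][kp])
        assert recovered == V[needed_P][kp]
print("all segments recovered")
```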
We note that since the job index $j_t$ (whose input files are exclusively stored on the nodes in $\{i\}\cup\mathcal{K}_j\backslash\{t\}$) is different for different $t$, the above coded packet is generated using intermediate values from different jobs. Having received a coded packet $X_{k'}^{\text{stage 2}}$ from Node $k'$, Node $k$ cancels all the segments $V_{\mathcal{P}_{j_t},k'}$ with $k\in\mathcal{P}_{j_t}$, and recovers the intended segment $V_{\mathcal{P}_{j_k},k'}$. Repeating this decoding process for all received coded packets, Node $k$ recovers $V_{\mathcal{P}_{j_k}}$, and hence $\bar{v}_{q^{(j_k)},\mathcal{P}_{j_k}}$ for all $q^{(j_k)}\in\mathcal{S}_k^{(j_k)}$.

We repeat the above Map and Shuffle phase operations for all $i\in\{1,2,\ldots,K\}\backslash\mathcal{K}_j$. By the end of the second stage, each node in $\mathcal{K}_j$ has recovered partial sums towards computing functions from $K-\mu K-1$ jobs. The communication load incurred in the Shuffle phase for a particular $i$ is $\frac{\frac{Q}{K}\times\frac{\mu K+1}{\mu K}}{JQ}$, and the total communication load of the second stage is
$$L_{\text{stage 2}} = \frac{(K-\mu K-1)\frac{\mu K+1}{\mu K}}{JK}. \quad (2.90)$$

Example (compressed CDC: coding across jobs). We now move on to describe the second stage of compressed CDC within the subset $\mathcal{K}_1=\{1,2,3\}$ via Fig. 2.8, where we represent the functions computed by Nodes 1, 2, and 3 by red/circle, green/square, and blue/triangle respectively, and the intermediate value of a function from a file $n^{(j)}$ by the corresponding color/shape labelled by $n^{(j)}$. In this stage, as shown in Fig. 2.8, each node maps 4 files, two belonging to one job and two belonging to another job. For example, Node 1 maps the files $5^{(2)}$, $6^{(2)}$ from job 2 and the files $1^{(3)}$, $2^{(3)}$ from job 3, producing two blue triangles labelled by $5^{(2)}$ and $6^{(2)}$, and two green squares labelled by $1^{(3)}$ and $2^{(3)}$. During data shuffling, each node first sums up the two intermediate values from the same job to create two pre-combined packets locally (e.g., the sum of the blue triangles labelled by $5^{(2)}$ and $6^{(2)}$, and the sum of the green squares labelled by $1^{(3)}$ and $2^{(3)}$, at Node 1). Then, as shown in Fig. 2.8, each node splits each of the computed sums evenly into two segments, computes the bit-wise XOR of two segments, one from each sum, and multicasts it to the other two nodes. Finally, each node decodes the intended sum from the multicast packets using its locally computed intermediate values. The second stage incurs a communication load of $L_{\text{stage 2}} = \frac{3/2}{16} = \frac{3}{32}$.

[Figure 2.8: Illustration of the operations in the second stage of compressed CDC, in the subset of Nodes 1, 2, and 3. Note that in this stage, pre-combined packets from different jobs are utilized to create coded multicast packets.]

Having performed this two-stage operation on all subsets $\mathcal{K}_j$ of $\mu K+1$ nodes, $j=1,2,\ldots,\binom{K}{\mu K+1}$, each node $k$ has finished computing its assigned functions from $\binom{K-1}{\mu K}$ jobs. For each of the remaining $\binom{K}{\mu K+1}-\binom{K-1}{\mu K}$ jobs, say job $j'$, and each $k'\in\mathcal{K}_{j'}$, Node $k$ has received a partial sum of $\frac{N}{\mu K+1}$ intermediate values for each of the functions in $\mathcal{S}_k^{(j')}$, in the subset $\{k\}\cup\mathcal{K}_{j'}\backslash\{k'\}$. Summing up these $\mu K+1$ partial sums, Node $k$ finishes computing each of its assigned functions from job $j'$. The overall communication load of compressed CDC is
$$L_{\text{compressed CDC}} = \binom{K}{\mu K+1}\times(L_{\text{stage 1}}+L_{\text{stage 2}}) = \frac{(1-\mu)(\mu K+1)}{\mu K}. \quad (2.91)$$
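As a numeric check (our own sketch), the snippet below evaluates the per-subset loads (2.85) and (2.90) for the running example and confirms that they add up to (2.91), which gives $\frac{3}{4}$ for $K=4$ and $\mu=\frac{1}{2}$, compared with 1 for combining alone in (2.76) and $\frac{3}{2}$ for CDC alone in (2.77).

```python
from math import comb

def L_compressed_cdc(mu, K):
    """Eq. (2.91): overall load of compressed CDC (muK must be an integer)."""
    return (1 - mu) * (mu * K + 1) / (mu * K)

K, mu, J = 4, 0.5, 4
stage1 = (mu * K + 1) / (mu * K) / (J * K)                      # (2.85): 3/32 here
stage2 = (K - mu * K - 1) * (mu * K + 1) / (mu * K) / (J * K)   # (2.90): 3/32 here
assert abs(comb(K, int(mu * K) + 1) * (stage1 + stage2)
           - L_compressed_cdc(mu, K)) < 1e-12
print(L_compressed_cdc(mu, K))   # 0.75
```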
After the two-stage computations in the subset{1, 2, 3}, we repeat the same operations in the other subsets of 3 nodes. In the end, taking Node 1 as an example, • In subset{1, 2, 3}, Node 1 computes φ 1 (1), and v 1 (4) ,3 (4) +v 1 (4) ,4 (4), • In subset{1, 2, 4}, Node 1 computes φ 1 (2), and v 1 (4) ,1 (4) +v 1 (4) ,2 (4), • In subset{1, 3, 4}, Node 1 computes φ 1 (3), and v 1 (4) ,5 (4) +v 1 (4) ,6 (4). Finally, Node 1 computesφ 1 (4) by adding up the received partial sums in the 3 subsets. We can verify that Nodes 2, 3, and 4 also successfully recover their assigned functions from the 4 jobs. The overall communication load is L compressed CDC = 3 32 × 2× 4 = 3 4 . Remark 2.15. For the above example, using only the combining technique to process each job, we would have communicated 4 pre-combined packets, one for each node, achieving a communication load L compression = 4 4 = 1. On the other hand, using the CDC scheme that only exploits the coded multicasting opportunities, we would have achieved a communication load of L CDC = 3 2 . 2.7 Extension of CDC: multi-stage dataflows Unlike simple computation tasks like Grep, Join and Sort, many distributed computing applications contain multiples stages of MapReduce computations. Examples of these ap- plications include machine learning algorithms [60], SQL queries for databases [61, 62], and scientific analytics [63]. One can express the computation logic of a multistage application as a directed acyclic graph (DAG) [64], in which each vertex represents a logical step of data transformation, and each edge represents the dataflow across processing vertices. We formalize a distributed computing model for multistage dataflow applications. We express a multistage dataflow as a layered-DAG, in which the processing vertices within a particular computation stage are grouped into a single layer. Each vertex represents a MapReduce-type computation, transforming a set of input files into a set of output files. The set of edges specifies 1) the order of the computations such that the head vertex of an edge does not start its computation until the tail vertex finishes, and 2) the input-output 46 Chapter 2. A Fundamental Tradeoff between Computation and Communication relationships between vertices such that the input files of a vertex consist of the output files of all vertices connected to it through incoming edges. For a given layered-DAG, we propose a coded computing scheme to achieve a set of computation-communication tuples, which characterizes the load of computation for each processing vertex, and the load of communication within each layer. The proposed scheme first specifies the computation loads of the Map and Reduce functions for each vertex (i.e., how many times a Map or a Reduce function should be calculated), and then exploits the CDC scheme to perform the computation for each vertex individually. 2.7.1 Problem Formulation: Layered-DAG We consider a computing task that processes N input files w 1 ,...,w N ∈ F 2 F to generate Q output files u 1 ,...,u Q ∈F 2 B, for some parameters F,B∈N. The overall computation is represented by a layered-DAGG = (V,A), in which the set of verticesV is composed of D layers, denoted byL 1 ,...,L D , for some D∈N. For each d = 1,...,D, we label the ith vertex in Layerd asm d,i , for alli = 1,...,|L d |. See Fig. 2.9 for the illustration of a 4-layer DAG. m 2,1 m 2,3 m 1,1 m 4,1 Layer 1 Layer 2 Layer 3 m 1,2 m 2,2 m 3,1 m 3,2 Layer 4 Figure 2.9: A 4-layer DAG. 
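To make the layered-DAG abstraction concrete, the following minimal Python sketch (illustrative only; apart from the two edges into m 3,1 that are mentioned below, the edge set and all class and variable names are assumptions, not part of any implementation described in this dissertation) represents a 4-layer DAG with the layer sizes of Fig. 2.9, enforces that edges only connect consecutive layers, and computes the number of input files of a vertex as the total number of output files of its predecessors.

# A minimal sketch of a layered-DAG container; the pair (d, i) stands for the
# vertex m_{d,i}, i.e., the i-th vertex of layer d.
class LayeredDAG:
    def __init__(self, layer_sizes):
        self.layer_sizes = layer_sizes          # layer_sizes[d-1] = |L_d|
        self.edges = set()                      # directed edges ((d, i), (d+1, j))

    def add_edge(self, tail, head):
        (d_t, i_t), (d_h, i_h) = tail, head
        # Edges are only allowed between consecutive layers.
        assert d_h == d_t + 1 and 1 <= i_t <= self.layer_sizes[d_t - 1] \
               and 1 <= i_h <= self.layer_sizes[d_h - 1]
        self.edges.add((tail, head))

    def predecessors(self, v):
        return [t for (t, h) in self.edges if h == v]

    def num_inputs(self, v, num_outputs):
        # A vertex's input files are the output files of its predecessors.
        return sum(num_outputs[p] for p in self.predecessors(v))

# A 4-layer DAG with the layer sizes of Fig. 2.9; only the two edges into
# m_{3,1} are stated in the text, the remaining edges are assumed for illustration.
dag = LayeredDAG([2, 3, 2, 1])
for tail, head in [((1, 1), (2, 1)), ((1, 1), (2, 2)), ((1, 2), (2, 2)),
                   ((1, 2), (2, 3)), ((2, 1), (3, 1)), ((2, 3), (3, 1)),
                   ((2, 2), (3, 2)), ((3, 1), (4, 1)), ((3, 2), (4, 1))]:
    dag.add_edge(tail, head)

# If each layer-2 vertex produces 6 output files, m_{3,1} (fed by m_{2,1} and
# m_{2,3}) has 6 + 6 = 12 input files.
print(dag.num_inputs((3, 1), {(2, 1): 6, (2, 2): 6, (2, 3): 6}))   # -> 12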
Each vertex m d,i processes N d,i input files w d,i 1 ,...,w d,i N d,i ∈F 2 F d,i , and computes Q d,i out- put files u d,i 1 ...,u d,i Q d,i ∈ F 2 B d,i , for some system parameters N d,i ,Q d,i ,F d,i ,B d,i ∈ N. In particular, the input files ofG are distributed as the inputs to the vertices in Layer 1, i.e., {w 1 ,...,w N } = ∪ i=1,...,|L 1 | {w 1,i 1 ,...,w 1,i N 1,i }, and the output files ofG are distributed as the outputs of the vertices in Layer D, i.e.,{u 1 ,...,u Q } = ∪ i=1,...,|L D | {u D,i 1 ,...,u D,i Q D,i }. Edges inA are between vertices in consecutive layers, i.e., A⊆ ∪ d=1,...,D−1 ∪ i=1,...,|L d |,j=1,...,|L d+1 | (m d,i ,m d+1,j ). (2.92) 47 Chapter 2. A Fundamental Tradeoff between Computation and Communication The input files of a vertex in Layer d, d = 2,...,D, consist of the output files of the vertices it connects to in the preceding layer. More specifically, for any d∈{2,...,D} and i∈{1,...,|L d |}, N d,i = P j:(m d−1,j ,m d,i )∈A Q d−1,j and {w d,i 1 ,...,w d,i N d,i } = ∪ j:(m d−1,j ,m d,i )∈A ∪ q∈{1,...,Q d−1,j } u d−1,j q . (2.93) For example in Fig. 2.9, the input files to the vertex m 3,1 consist of the output files of the vertices m 2,1 and m 2,3 . As a result, other than the number of input files for the vertices in Layer 1, we only need the number of output files at each vertex as the system parameters. The computation of the output file u d,i q , q = 1,...,Q d,i , of the vertex m d,i , for all d = 1,...,D, i = 1,...,|L d |, is decomposed as follows: u d,i q (w d,i 1 ,...,w d,i N d,i ) =h d,i q (g d,i q,1 (w d,i 1 ),...,g d,i q,N d,i (w d,i N d,i )), (2.94) where • The Map functions ~ g d,i n = (g d,i 1,n ,...,g d,i Q d,i ,n ) : F 2 F d,i → (F 2 T d,i ) Q , n ∈ {1,...,N d,i } maps the input file w d,i n into Q d,i length-T d,i intermediate values {v d,i q,n = g d,i q,n (w d,i n )∈ F 2 T d,i : q = 1,...,Q d,i }, for some T d,i ∈N. • The Reduce functions h d,i q : (F 2 T d,i ) N → F 2 B d,i , q ∈ {1,...,Q d,i } maps the interme- diate values of the output function u d,i q in all input files into the output file u d,i q = h d,i q (v d,i q,1 ,...,v d,i q,N d,i ). We compute the above layered-DAG using a K-server cluster, for some K∈ N. At each time instance, the servers only perform the computations of the vertices within a single layer. Each vertex in a layer is computed by a subset of servers. We denote the set of servers computing the vertex m d,i asK d,i ⊆{1,...,K}, where the selection ofK d,i is a design parameter. For each k∈K d,i , Server k computes a subset of Map functions of m d,i with indicesM d,i k ⊆{1,...,N d,i }, and a subset of Reduce functions with indicesW d,i k ⊆ {1,...,Q d,i }, whereM d,i k andW d,i k are design parameters. We denote the placements of the Map and Reduce functions form d,i as ¯ M d,i ,{M d,i k :k∈K d,i } and ¯ W d,i ,{W d,i k :k∈ K d,i } respectively. Data Locality. We prohibit transferring input files (or output files calculated in the pre- ceding layer) across servers, i.e., every node either stores the needed input files to compute 48 Chapter 2. A Fundamental Tradeoff between Computation and Communication the assigned Map functions (only to initiate the computations in Layer 1) or computes them locally from the assigned Reduce functions in the preceding layer. This implemen- tation provides a better fault-tolerance since the Reduce functions have to be calculated independently across servers. The computation of Layer d, d = 1,...,D, proceeds in three phases: Map, Shuffle, and Reduce. Map phase. 
For each vertex m d,i , i = 1,...,|L d |, in Layer D, each server k inK d,i computes its assigned Map functions~ g d,i n (w d,i n ) = (v d,i 1,n ,...,v d,i Q d,i ,n ), for all n∈M d,i k . Definition 2.5 (Computation Load). We define the computation load of vertex m d,i , d∈ {1,...,D}, i ∈ {1,...,|L d |}, denoted by r d,i , as the total number of Map functions of m d,i computed across the servers inK d,i , normalized by the number of input files N d,i , i.e., r d,i , P k∈K d,i |M d,i k | N d,i . 3 Shuffle phase. Each serverk,k∈{1,...,K}, creates a messageX d k as a function, denoted by ψ d k , of the intermediate values from all input files it has mapped in Layer d, i.e., X d k =ψ d k v d,i q,n :q∈{1,...,Q d,i },n∈M d,i k |L d | i=1 , and multicasts it to a subset of 1≤j≤K− 1 servers. Definition 2.6 (Communication Load). We define the communication load of Layer d, denoted by L d , as the total number of bits communicated in the Shuffle phase of Layer d. 3 By the end of the Shuffle phase, each serverk,k = 1,...,K, recovers all required intermedi- ate values for the assigned Reduce functions in Layerd, i.e.,{v d,i q,1 ,...,v d,i q,N d,i :q∈W d,i k } |L d | i=1 , from either the local Map computations or the multicast messages from the other servers. Reduce phase. Each server k, k = 1,...,K, computes the assigned Reduce functions to generate the output files of the vertices in Layer d, i.e.,{u d,i q = h d,i q (v d,i q,1 ,...,v d,i q,N d,i ) : q∈ W d,i k }, for all i = 1,...,|L d |. We say that a computation-communication tuple {(r d,1 ,...,r d,|L d | ,L d )} D d=1 is achievable if there exists an assignment of the Map and Reduce computations { ¯ M d,1 , ¯ W d,1 ..., ¯ M d,|L d | , ¯ W d,|L d | } D d=1 , and D shuffling schemes such that Server k, k = 1,...,K, can successfully compute all the Reduce functions in W d,i k , for all d∈{1,...,D} and i∈{1,...,|L d |}. 49 Chapter 2. A Fundamental Tradeoff between Computation and Communication Definition 2.7. We define the computation-communication region of a layered-DAG G = (V,A), denoted by C(G), as the closure of the set of all achievable computation- communication tuples. 3 2.7.2 CDC for Layered-DAG We propose a general Coded Distributed Computing (CDC) scheme for an arbitrary layered- DAG, which achieves the computation-communication tuples characterized in the following theorem. Theorem 2.4. For a layered-DAGG = (V,A) of D layers, the following computation- communication tuples are ahievable ∪ {r d,1 ,...,r d,|L d | ∈{1,...,K}} D d=1 {(r d,1 ,...,r d,|L d | ,L u d )} D d=1 , where L u d = P |L d | i=1 L coded (r d,i ,s d,i ,K)Q d,i N d,i T d,i . Here L coded (r,s,K) , min{r+s,K} P `=max{r+1,s} `( K ` )( `−2 r−1 )( r `−s ) r( K r )( K s ) , s d,i , max j:(m d,i ,m d+1,j )∈A r d+1,j , d<D, 1, d =D. , and N d,i = P j:(m d−1,j ,m d,i )∈A Q d−1,j ,∀d = 2,...,D. The above computation-communication tuples are achieved by the proposed CDC scheme for the layered-DAG, which first designs the parameters{s d,1 ,...,s d,|L d | } D d=1 that specify the placements of the computations of the Reduce functions, and then applies the CDC scheme for a cascaded distributed computing framework (see Theorem 2.2) to compute each of the vertices individually. Remark 2.16. The achieved communication load for vertex m d,i , L coded (r d,i ,s d,i ,K)Q d,i N d,i T d,i decreases as r d,i increases (more locally available Map results) and s d,i decreases (more data demands). 
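To make this monotonicity concrete, the following short Python sketch (illustrative only; the choice K = 10 and the ranges of r and s are arbitrary) evaluates the load expression L coded (r,s,K) of Theorem 2.4 numerically, and also checks that it reduces to the single-stage load (1/r)(1 − r/K) when s = 1.

# Numerical check of L_coded(r, s, K) from Theorem 2.4.
from math import comb

def L_coded(r, s, K):
    lo, hi = max(r + 1, s), min(r + s, K)
    num = sum(l * comb(K, l) * comb(l - 2, r - 1) * comb(r, l - s)
              for l in range(lo, hi + 1))
    return num / (r * comb(K, r) * comb(K, s))

K = 10
# The load decreases in the computation load r (for a fixed reduce factor s) ...
print([round(L_coded(r, 3, K), 4) for r in range(1, 8)])
# ... and decreases as the reduce factor s decreases (for a fixed r).
print([round(L_coded(4, s, K), 4) for s in range(1, 7)])
# Sanity check: for s = 1 the expression reduces to (1/r)(1 - r/K).
assert abs(L_coded(4, 1, K) - (1 / 4) * (1 - 4 / K)) < 1e-12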
Due to the specific way the parameter s d,i is chosen in Theorem 1, increasing the computation load r d+1,j of some vertex m d+1,j connected tom d,i can causes d,i to increase. In general, while more Map computations result in a smaller communication load in the current layer, they impose a larger communication load on the preceding layer. Next, we describe and analyze the proposed general CDC scheme to compute a layered- DAG. 50 Chapter 2. A Fundamental Tradeoff between Computation and Communication To start, we employ a uniform resource allocation such that every vertex is computed over all K servers, i.e.,K d,i ={1,...,K} for all d = 1,...,D and i = 1,...,|L d |. Remark 2.17. We note that the communication load L coded (r,s,K) in Theorem 7 is a decreasing function of K. That is, for fixed r and s, performing the computation of a vertex over a smaller number of serves yields a smaller communication load. However, the disadvantages of using less servers are 1) Each server needs to compute more Map and Reduce functions, incurring a higher computation load. 2) It may affect the symmetry of the data placement, increasing the communication load in the next layer (see discussions in the next subsection). For each vertex m d,i , i = 1,...,|L d |, in Layer d, we specify a computation load r d,i ∈ {1,...,K}, such that the computation of each Map function ofm d,i is placed onr d,i servers. We also define the reduce factor of m d,i , denoted by s d,i ∈{1,...,K}, as the number of servers that compute each Reduce function ofm d,i . To satisfy the data locality requirements (explained later), we select the reduce factor s d,i equal to the largest computation load of the vertex connected to m d,i in Layer d + 1, i.e., s d,i = max j:(m d,i ,m d+1,j )∈A r d+1,j , d<D, 1, d =D. (2.95) As an example, for a diamond DAG in Fig. 2.10, since the output files of m 1 will be used as the inputs for both m 2 and m 3 , we should compute each Reduce function of m 1 at s 1 = max{r 2 ,r 3 } servers. Also, since m 2 and m 3 both only connect to m 4 , we shall choose s 2 =s 3 =r 4 . m 2 m 3 m 1 m 4 Layer 1 Layer 2 Layer 3 Figure 2.10: A diamond DAG. The reduce factors s 1 ,...,s 4 are determined by the computation loads r 2 ,r 3 ,r 4 . 51 Chapter 2. A Fundamental Tradeoff between Computation and Communication Having selected the computation load r d,i and the reduce factor s d,i , we employ the CDC scheme to compute the vertex m d,i , over all K servers. We next briefly describe the CDC computation for m d,i . Map Phase Design. TheN d,i input files are evenly partitioned into K r d,i disjoint batches of size N d,i ( K r d,i ) , each of which is labelled by a subsetT ⊆{1,...,K} of size r d,i : {1,...,N d,i } ={B d,i T :T ⊆{1,...,K},|T| =r d,i }, (2.96) whereB d,i T denotes the batch corresponding to the subsetT . Given this partition, Server k, k∈{1,...,K} maps the files inB d,i T if k∈T . Reduce Functions Assignment. The Q d,i Reduce functions are evenly partitioned into K s d,i disjoint batches of size Q d,i ( K s d,i ) , each of which is labelled by a subsetP of s d,i nodes: {1,...,Q d,i }={D d,i P :P⊆{1,...,K},|P|=s d,i }, (2.97) whereD d,i P denotes the batch corresponding to the subsetP. Given this partition, Serverk,k∈{1,...,K} computes the Reduce functions whose indices are inD d,i P if k∈P. Coded Data Shuffling. 
In the Shuffle phase, within a subset of max{r d,i + 1,s d,i }≤ `≤ min{r d,i +s d,i ,K} servers, every r d,i of them shared some intermediate values that are simultaneously needed by the remaining `−r d,i servers. Each server multicasts enough linear combinations of the segments of these intermediate values until they can be decoded by all the intended servers. This achieves a communication load L coded (r d,i ,s d,i ,K) for vertex m d,i , where L coded (r,s,K) = min{r+s,K} P `=max{r+1,s} `( K ` )( `−2 r−1 )( r `−s ) r( K r )( K s ) is given in Theorem 2.4. Next we demonstrate that, the above CDC scheme can be applied to compute every vertex subject to the data locality constraint, using the reduce factors s d,i specified in (2.95). To do that, we focus on the computation of a vertex m d,i in Layer d. WLOG, we assume that m d,i only connects to a single vertexm d−1,k in Layerd− 1, hence the input files ofm d,i are the output files of m d−1,k and N d,i = Q d−1,k . Out of all vertices in Layer d connected to m d−1,k , say vertexm d,j has the largest computation load such that by (2.95), s d−1,k =r d,j , and each of the output files of m d−1,k is available on r d,j servers after the computation of 52 Chapter 2. A Fundamental Tradeoff between Computation and Communication Layer d− 1. By the above assignment of the Reduce functions, a batch of Q d−1,k ( K r d,j ) output files ofm d−1,k (or input files ofm d,i ), denoted byD d−1,k P , are available at allr d,j servers in a subsetP. To execute the Map phase of m d,i , we first evenly partition theD d−1,k P into r d,j r d,i sub-batches of size Q d−1,k ( K r d,j )( r d,j r d,i ) , each of which is sub-labelled by a subset ofP 0 ofr d,i nodes: D d−1,k P ={D d,i P,P 0 :P 0 ⊆P,|P 0 | =r d,i }, (2.98) whereD d,i P,P 0 denotes the sub-batch corresponding toP 0 . Then for each server k∈P, it maps all files inD d,i P,P 0 if k∈P 0 . Finally, we repeat this Map process for all subsetsP of size r d,j . Since every subset of r d,i servers are contained in K−r d,i r d,j −r d,i subsets of size r d,j , they map a total of Q d−1,k ( K r d,j )( r d,j r d,i ) × K−r d,i r d,j −r d,i = Q d−1,k ( K r d,i ) input files of m d,i . This is consistent with the above Map phase design for m d,i , i.e., for all T ⊆{1,...,K} of size r d,i , B d,i T = ∪ P⊆{1,...,K}:|P|=r d,j D d,i P,T , (2.99) whereB d,i T , as defined in (2.96), is the batch of input files of m d,i mapped by servers inT . We demonstrate in Fig. 2.11, the Map computations of the vertices m 2 and m 3 of the diamond DAG in Fig. 2.10 with Q 1 = 6 output files of m 1 , computation loads r 2 = 2 and r 3 = 1, using K = 3 servers. First we select the reduce factor of m 1 , s 1 = max{r 2 ,r 3 } = 2, such that every output file of m 1 , u 1 1 ,...,u 1 6 , is reduced on two servers. Having computed the output files ofm 1 that are also input files ofm 2 andm 3 , each server computes the Map functions of m 2 on all locally available files. However, since m 3 has a computation load r 3 = 1, each file is only mapped once on one server, e.g., u 1 3 and u 1 4 are both available on Server 1 and 2 after computing m 1 , but u 1 3 is mapped only on Server 1 and u 1 4 is mapped only on Server 4 in the Map phase of m 3 . Server 1 Server 2 Server 3 m 1 Reduce m 2 Map m 3 Map Figure 2.11: Illustration of the mapped files in the Map phases of the vertices m 2 and m 3 in the diamond DAG, for the case Q 1 =N 2 =N 3 = 6, r 2 = 2, and r 3 = 1. 53 Chapter 2. 
A Fundamental Tradeoff between Computation and Communication Using the above CDC scheme for each vertex, we can achieve a communication load L u d in Layer d, d = 1,...,D: L u d = |L d | X i=1 L coded (r d,i ,s d,i ,K)Q d,i N d,i T d,i . (2.100) Taking the union over all combinations of the computation loads achieves the computation- communication tuples in Theorem 2.4. Remark 2.18. Having characterized a set of computation-communication tuples using CDC, one can optimize the overall job execution time over the computation loads. Vary- ing computation loads affects the Map time, the Shuffle time and the Reduce time in each layer in different ways. For example, a smaller computation load can lead to a shorter Map time in the current layer and also a shorter Reduce time in the preceding layer, but may cause a long Shuffle phase in the current layer. In general, the design of optimum com- putation loads depends on the system parameters including input/output sizes, sizes of the intermediate values, server processing speeds and the network bandwidth. 54 Chapter 3 Coded TeraSort Having theoretically demonstrated the optimality of the Coded Distributed Computing (CDC) scheme in trading redundant computations for more network bandwidth in the previous chapter, we now aim to demonstrate the practical impact of coding in reducing the data shuffling load of distributed computing, and speeding up the overall computations. Particularly, we focus on “sorting”, which is a basic component of many data analytics application, and has data shuffling as its main bottleneck. In this chapter, we develop a new distributed sorting algorithm, named CodedTeraSort, that imposes structured redundancy in data to enable coding opportunities for efficient data shuffling, which results in speeding up the state-of-the-art algorithms by 1.97×- 3.39× in typical settings of interest. To date, there have been many distributed sorting algorithms developed to perform efficient distributed sorting on commodity hardware (see, e.g., [66, 67]). Out of these algorithms, TeraSort [26], originally developed to sort terabytes of data [68], is a commonly used benchmark in Hadoop MapReduce [30]. In consistence with the general structure of a MapReduce execution, in a TeraSort execution, each server node first maps each data point it stores locally into a particular partition of the key space, then all the data points in the same partition are shuffled to a single node, on which they are sorted within the partition to reduce the final sorted output. This chapter is mainly taken from [65], coauthored by the author of this document. 55 Chapter 3. Coded TeraSort Out of the above three steps, the time spent in the Map and the Reduce phases of the computation can be reduced by paralleling onto more processing nodes, while the shuffle time will almost remain constant. This is because that no matter how large the cluster size is, almost as much as the entire raw dataset of data need to be transferred over the network. Hence, data shuffling often becomes the bottleneck of the performance of the TeraSort algorithm (see, e.g., [10, 12]). In this chapter, we propose to leverage coding to overcome the shuffling bottleneck of TeraSort. In particular, we develop a novel distributed sorting algorithm, named CodedTeraSort, that incorporates the above coding ideas in CDC to inject structured com- putation redundancy in Map phase of TeraSort, in order to cut down its shuffling load. 
At a high-level CodedTeraSort can be explained as the following: • The input data points are split into disjoint files, and each file is stored on multiple carefully selected server nodes to create structured redundancy in data. • Each node maps all files that are assigned to it, following the Map procedure of TeraSort. • Each node utilizes the imposed structured redundancy in data placement to create coded packets for data shuffling, such that the multicast of each coded packet delivers data points to several nodes simultaneously, hence speeding up the data shuffling phase. • Each node decodes the data points that it needs for Reduce phase, from the received coded packets, and follows the Reduce procedure of TeraSort. We empirically evaluate the performance of CodedTeraSort through extensive experiments over Amazon EC2 clusters. While the underlying EC2 networking does not support network- layer multicast, we perform the application-layer multicast for shuffling of coded packets, using the broadcast API MPI Bcast from Open MPI [69]. Compared with the conven- tional TeraSort implementation, we demonstrate that CodedTeraSort achieves 1.97×- 3.39× speedup for typical settings of interest. Despite the extra overhead imposed by coding (e.g., generation of the coding plan, data encoding and decoding) and application- layer multicasting, the practically achieved performance gain approximately matches the gain theoretically promised by the CodedTeraSort algorithm. 56 Chapter 3. Coded TeraSort 3.1 Execution Time of Coded Distributed Computing (CDC) In the general Coded Distributed Computing (CDC) scheme proposed in the previous chap- ter, the computation of each Map task is repeated atr carefully chosen nodes (i.e., incurring computation load ofr), in order to enable the nodes to exchange coded multicast messages that are simultaneously useful for r other nodes. As a result, CDC reduces the commu- nication load by exactly a multiplicative factor of the computation load r, i.e., achieving L CDC (r) = 1 r L uncoded (r) = 1 r (1− r K ) = Θ( 1 r ). (3.1) The above tradeoff can be exploited to reduce the overall execution time of MapReduce, by balancing the computation load in the Map stage and the communication load in the Shuffle stage. To illustrate this, let us consider a MapReduce application for which the overall response time is composed of the time spent executing the Map tasks, denoted by T map , the time spent shuffling intermediate values, denoted by T shuffle , and the time spent executing the Reduce tasks, denoted by T reduce , i.e., T total, MR =T map +T shuffle +T reduce . (3.2) Using CDC, we can leverager× more computations in the Map phase, in order to reduce the communication load by the same multiplicative factor, where r∈N is a design parameter that can be optimized to minimize the overall execution time. Hence, CDC promises that we can achieve the overall execution time of T total, CDC ≈rT map + 1 r T shuffle +T reduce , (3.3) for any 1 ≤ r ≤ K, where K is the total number of nodes on which the distributed computation is executed. To minimize the above execution time, one would choose r ∗ = r T shuffle Tmap or r T shuffle Tmap , resulting in execution time of T ∗ total, CDC ≈ 2 q T shuffle T map +T reduce . (3.4) 57 Chapter 3. Coded TeraSort For example, in a MapReduce application that T shuffle is 10× - 100× larger than T map + T reduce , by comparing from (3.2) and (3.4), we note that CDC can reduce the execution time by approximately 1.5× - 5×. 
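As a back-of-the-envelope illustration of (3.2)-(3.4), the short Python sketch below searches for the integer computation load r that minimizes the modeled CDC execution time and reports the resulting speedup over the uncoded execution; the timing values are placeholders chosen for illustration, not measurements from this chapter.

# Simple model of the execution-time tradeoff in (3.2)-(3.4).
def best_r(T_map, T_shuffle, T_reduce, K):
    t = lambda r: r * T_map + T_shuffle / r + T_reduce      # eq. (3.3)
    r_star = min(range(1, K + 1), key=t)                    # ~ sqrt(T_shuffle / T_map)
    return r_star, t(r_star)

# Illustrative placeholder timings (seconds), with a shuffle-dominated job.
T_map, T_shuffle, T_reduce, K = 2.0, 200.0, 10.0, 16
r_star, t_cdc = best_r(T_map, T_shuffle, T_reduce, K)
t_uncoded = T_map + T_shuffle + T_reduce                    # eq. (3.2)
print(f"r* = {r_star}, modeled speedup = {t_uncoded / t_cdc:.2f}x")
# For T_shuffle roughly 10x-100x of T_map + T_reduce, this model predicts
# multi-fold speedups, in line with the 1.5x-5x range quoted above.

With these placeholder numbers the minimizer is r* = 10 ≈ sqrt(T shuffle /T map ), consistent with the rule of thumb stated above.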
In the rest of this chapter, we demonstrate how to utilize the ideas from CDC in order to develop a new distributed sorting algorithm, named CodedTeraSort, that leverages coding to speedup the conventional sorting algorithm, TeraSort. We will also empirically demon- strate the performance of CodedTeraSort via experiments over Amazon EC2 clusters. 3.2 TeraSort TeraSort [68] is a conventional algorithm for distributed sorting of a large amount of data. The input data that is to be sorted is in the format of key-value (KV) pairs, meaning each input KV pair consists of a key and a value. For example, the domain of the keys can be 10-byte integers, and the domain of the values can be arbitrary strings. TeraSort aims to sort the input data according to their keys, e.g., sorting integers. Let us consider TeraSort for distributed sorting over K nodes, whose indices are denoted by a setK ={1,...,K}. The implementation consists of 5 components: File Placement, Key Domain Partitioning, Map Stage, Shuffle Stage, and Reduce Stage. In File Placement, the entire KV pairs are split into K disjoint files, and each file is placed on one of the K nodes. In Key Domain Partitioning, the domain of the keys is split into K partitions, and each node will be responsible for sorting the KV pairs whose keys fall into one of the partitions. In Map Stage, each node hashes each KV pair in its locally stored file into one of theK partitions, according to its key. In Shuffle Stage, the KV pairs in the same partition are delivered to the node that is responsible for sorting that partition. In Reduce Stage, each node locally sorts KV pairs belonging to its assigned partition. A simple example illustrating TeraSort is shown in Fig. 3.1. We next discuss each component in detail. 3.2.1 Algorithm Description 3.2.1.1 File Placement Let F denote the entire KV pairs to be sorted. They are split into K disjoint input files, denoted by F {1} ,...,F {K} . File F {k} is assigned to and locally stored at Node k. 58 Chapter 3. Coded TeraSort Hash Map 1 17 34 51 69 83 8 23 39 52 72 87 12 28 45 53 78 90 16 30 47 64 80 99 1,17 34 51,69 83 8,23 39 52,72 87 12 28,45 53 78,90 16 30,47 64 80,99 Sort [0,25) 1,17 8,23 12 16 34 39 28,45 30,47 51,69 52,72 53 64 83 87 78,90 80,99 1,8,12,16,17,23 28,30,34,39,45,47 51,52,53,64,69,72 78,80,83,87,90,99 Hash Hash Hash Shuffle Reduce Sort [25,50) Sort [50,75) Sort [75,100] Node 1 Node 2 Node 3 Node 4 Figure 3.1: Illustration of conventional TeraSort with K = 4 nodes and key domain partitions [0, 25), [25, 50), [50, 75), [75, 100]. A dotted box represents an input file. An input file is hashed into 4 groups of KV pairs, one for each partition. For each of the 4 partitions, the 4 groups of KV pairs belonging to that partition computed on 4 nodes are all fetched to a corresponding node, which sorts all KV pairs in that partition locally. 3.2.1.2 Key Domain Partitioning The key domain of the KV pair, denoted by P , is split into K ordered partitions, denoted by P 1 ,...,P K . Specifically, for any p∈ P i and any p 0 ∈ P i+1 , it holds that p < p 0 for all i∈{1,...,K− 1}. For example, when P = [0, 100] and K = 4, the partitions can be P 1 = [0, 25),P 2 = [25, 50),P 3 = [50, 75),P 4 = [75, 100]. Node k is responsible for sorting all KV pairs in the partition P k , for all k∈K. 3.2.1.3 Map Stage In this stage, each node hashes each KV pair in the locally stored file F {k} to the partition its key falls into. 
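As a toy illustration of this step (formalized in the next paragraph), the following Python sketch hashes one input file of Fig. 3.1 into the K = 4 key partitions of the example above; the key values and partition boundaries are taken from that example, and the function names are illustrative rather than part of the actual implementation.

# Toy sketch of TeraSort's key-domain partitioning and Map-stage hashing,
# in the spirit of Fig. 3.1 (K = 4, keys in [0, 100]).
from bisect import bisect_right

K = 4
boundaries = [25, 50, 75]          # partitions [0,25), [25,50), [50,75), [75,100]

def partition_of(key):
    # Index in {0, ..., K-1} of the partition the key falls into.
    return bisect_right(boundaries, key)

def hash_file(kv_pairs):
    # Groups the KV pairs of one file by partition, i.e., one intermediate
    # value per reducer node.
    buckets = {j: [] for j in range(K)}
    for key, value in kv_pairs:
        buckets[partition_of(key)].append((key, value))
    return buckets

# The file stored at Node 1 in Fig. 3.1 (values omitted for brevity).
F1 = [(k, None) for k in (1, 17, 34, 51, 69, 83)]
print({j + 1: [k for k, _ in kvs] for j, kvs in hash_file(F1).items()})
# -> {1: [1, 17], 2: [34], 3: [51, 69], 4: [83]}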
For each of the K key partitions, the hashing procedure on the file F {k} generates an intermediate value that contains the KV pairs in F {k} whose keys belong to that partition. More specifically, we denote the intermediate value of the partition P j from the fileF {k} as I j {k} , and the hashing procedure on the file F {k} is defined as n I 1 {k} ,...,I K {k} o ←Hash F {k} . 3.2.1.4 Shuffle Stage the intermediate value I k {j} calculated at Node j, j6=k, is unicast to Node k from Node j, for all k∈K. Since the intermediate value I k {k} is computed locally at Node k in the Map 59 Chapter 3. Coded TeraSort Table 3.1: Performance of TeraSort sorting 12GB data with K = 16 nodes and 100 Mbps network speed Map Pack Shuffle Unpack Reduce Total (sec.) (sec.) (sec.) (sec.) (sec.) (sec.) 1.86 2.35 945.72 0.85 10.47 961.25 stage, by the end of the Shuffle stage, Nodek knows all intermediate values n I k {1} ,...,I k {K} o of the partition P k from all K files. 3.2.1.5 Reduce Stage In this stage, Nodek locally sorts all KV pairs whose keys fall into the partition P k , for all k∈K. Specifically, it sorts all intermediate values in the partition P k into a sorted list Q k as follows Q k ←Sort n I k {1} ,...,I k {K} o . Since the partitions are created in the ascending order as specified in the above Key Domain Partitioning step, the collection of the K sorted list generated in the Reduce stage, i.e., (Q 1 ,...,Q K ) represents the final sorted list of the entire input data. 3.2.2 Performance Evaluation To understand the performance of TeraSort, we performed an experiment on Amazon EC2 to sort 12GB of data by running TeraSort on 16 nodes. The breakdown of the total execution time is shown in Table 3.1. We observe from Table 3.1 that for a conventional TeraSort execution, 98.4% of the total execution time was spent in data shuffling, which is 508.5× of the time spent in the Map stage. Given the fact that data shuffling dominates the job execution time, the principle of optimally trading computation for communication of CDC can be applied to significantly improve the performance of TeraSort. Following the theoretical characterization of the total execution time achieved by CDC in (3.3), when executing the same sorting job using a coded version of TeraSort with a computation load of r ∗ = lq T shuffle Tmap m = 23 (i.e., each input file is repeatedly mapped on 23 servers), we could theoretically save the total execution time by approximately 10×. This great promise of using CDC to improve the performance of TeraSort motivates us to develop a novel coded distributed sorting algorithm, named 60 Chapter 3. Coded TeraSort CodedTeraSort, which integrates the coding technique of CDC into the above described TeraSort algorithm to reduce the total execution time. We describe CodedTeraSort in detail in the next section. 3.3 Coded TeraSort In this section, we describe the CodedTeraSort algorithm, which is developed by integrating the coding techniques of the Coded Distributed Computing scheme into the above described TeraSort algorithm. CodedTeraSort exploits redundant computations on the input files in the Map stage, enabling in-network coding opportunities to significantly slash the load of data shuffling. CodedTeraSort sorts a group of input KV pairs distributedly over K nodes, through the following 6 stages of operations: 1. Structured Redundant File Placement. The entire input KV pairs are split into many small files, each of which is repeatedly placed on multiple nodes according to a par- ticular pattern. 
2. Map. Each node applies the hashing operation as in TeraSort on each of its assigned files. 3. Encoding to Create Coded Packets. Each node generates coded multicast packets from local results computed in Map stage. 4. Multicast Shuffling. Each node multicasts each of its generated coded packet to a specific set of other nodes. 5. Decoding. Each node locally decodes the required KV pairs from the received coded packets. 6. Reduce. Each node locally sorts the KV pairs within its assigned partition. Next, we describe the above 6 stages in detail. 61 Chapter 3. Coded TeraSort 3.3.1 Structured Redundant File Placement For some parameter r∈{1,...,K}, we first split the entire input KV pairs into N = K r input files. Unlike the file placement of TeraSort, CodedTeraSort places each of the N input files repetitively on r distinct nodes. We label an input file using a unique subsetS ofK with size|S| =r, i.e., the N input files are denoted by {F S :S⊆K,|S| =r}. (3.5) For example, when K = 4 and r = 2, the set of the input files is F {1,2} ,F {1,3} ,F {1,4} ,F {2,3} ,F {2,4} ,F {3,4} . We repetitively place an input file F S on each of the r nodes inS, and hence each node now stores Nr/K = K−1 r−1 files. As illustrated in a simple example in Fig. 3.2 for K = 4 and r = 2, the file F {2,3} is placed on Nodes 2 and 3. Node 2 has files F {1,2} ,F {2,3} ,F {2,4} . We note that this redundant file placement strategy induces a structured distribution of the input files such that every subset of r nodes have a unique file in common. As is done in the TeraSort, the key domain of the input KV pairs is split into K ordered partitions P 1 ,...,P K , and Node k is responsible for sorting all KV pairs in the partition P k in the Reduce stage, for all k∈K. Node 1 Node 2 Node 3 Node 4 1 28 51 78 8 30 52 80 12 34 53 83 17 45 69 90 23 47 72 99 16 39 64 87 Figure 3.2: An illustration of the structured redundant file placement in CodedTeraSort withK = 4 nodes andr = 2. 3.3.2 Map In this stage, each node repeatedly performs the Map stage operation of TeraSort described in Section 3.2.1.3, on each input file placed on that node. Specifically, for each file F S with 62 Chapter 3. Coded TeraSort k∈S that is placed on Node k, Node k hashes the KV pairs in F S to generate a set of K intermediate values n I 1 S ,...,I K S o . Only relevant intermediate values generated in the Map stage are kept locally for further processing. In particular, out of theK intermediate values n I 1 S ,...,I K S o generated from file F S , only I k S and I i S :i∈K\S are kept at Node k. This is because that the intermediate value I i S , required by Node i∈S\{k} in the Reduce stage, is already available at Node i after the Map stage, so Node k does not need to keep them and send them to the nodes in S\{k}. For example, as shown in Fig. 3.3, Node 1 does not keep the intermediate value I 2 {1,2} for Node 2. However, Node 1 keeps I 1 {1,2} ,I 3 {1,2} ,I 4 {1,2} , which are required by Nodes 1, 3, and 4 in the Reduce stage. Node 1 1 28 51 78 8 30 52 80 12 34 53 83 Hash Hash Hash Intermediate values at Node 1 Figure 3.3: An illustration of the Map stage at Node 1 in CodedTeraSort with K = 4, r = 2 and the key partitions [0, 25), [25, 50), [50, 75), [75, 100]. 3.3.3 Encoding to Create Coded Packets After the Map stage, each node has known locally a part of the KV pairs in the partition it is responsible for sorting, i.e.,{I k S :k∈S,|S| =r} for Node k. 
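The following short Python sketch (illustrative only, for the K = 4, r = 2 running example of Figs. 3.2 and 3.3) enumerates this state of knowledge: which files each node maps under the structured redundant placement, how many of its own intermediate values it therefore already holds, and how many it still needs from the other nodes.

# Structured redundant file placement of CodedTeraSort, and the intermediate
# values each node retains after the Map stage; the pair (q, S) labels I^q_S.
from itertools import combinations

K, r = 4, 2
nodes = range(1, K + 1)
files = list(combinations(nodes, r))             # one file F_S per r-subset S

def retained(k):
    # From each locally mapped file F_S (k in S), Node k keeps I^k_S and the
    # values I^i_S for the nodes i outside S.
    kept = set()
    for S in files:
        if k in S:
            kept |= {(q, S) for q in nodes if q == k or q not in S}
    return kept

for k in nodes:
    known = {(q, S) for (q, S) in retained(k) if q == k}
    missing = {(k, S) for S in files if k not in S}
    print(f"Node {k}: maps {sum(k in S for S in files)} files, "
          f"already holds {len(known)} of its own values, needs {len(missing)} more")
# Each node maps C(K-1, r-1) = 3 files, already knows I^k_S for the 3 subsets S
# containing k, and still needs I^k_S for the 3 subsets S not containing k.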
In the next stages of the computation, the server nodes need to communicate with each other to exchange the rest of the required intermediate values to perform local sorting in the Reduce stage. The role of the encoding process is to exploit the structured data redundancy created by the particular repetitive file placement described above, in order to create coded multicast packets that are simultaneously useful for multiple nodes, thus saving the load of commu- nicating intermediate values. For example in Fig. 3.3, Node 1 wants to send I 2 {1,3} = [30] to Node 2 and I 3 {1,2} = [51] to Node 3. Since I 3 {1,2} and I 2 {1,3} are already known at Node 2 and Node 3 respectively after the Map stage, instead of unicasting these two intermediate values individually, Node 1 can rather multicast a coded packet generated by XORing these two values, i.e., [30⊕ 51]. Then Node 2 and 3 can decode their required intermediate values using locally known intermediate values, e.g., Node 2 uses I 3 {1,2} = [51] to decode I 2 {1,3} 63 Chapter 3. Coded TeraSort by computing I 2 {1,3} = [30⊕ 51]⊕ [51] = [30]. By multicasting a coded packet instead of unicasting two uncoded ones, we save the load of communication by 50%. More generally, in the encoding stage, every node creates coded packets that are simulta- neously useful forr other nodes. Specifically, in every subsetM⊆K of|M| =r + 1 nodes, the encoding operation proceeds as follows. • For eacht∈M, the intermediate valueI t M\{t} , which is know at all nodes inM\{t}, is evenly and arbitrarily split into r segments, i.e., I t M\{t} ={I t M\{t},k :k∈M\{t}}, (3.6) where I t M\{t},k denotes the segment corresponding to Node k. • For each k∈M, we generate the coded packet of Node k inM, denoted by E M,k , by XORing all segments corresponding to Node k inM, 1 i.e., E M,k = ⊕ t∈M\{k} I t M\{t},k . (3.7) By the end of the Encoding stage, for each k∈K, Node k has generated K−1 r coded packets, i.e.,{E M,k :k∈M,|M| =r + 1}. Server 1 Server 2 Server 3 Known Encode Multicast Shuffle Decode Figure 3.4: An illustration of the encoding process within a multicast groupM ={1, 2, 3}. 1 All segments are zero-padded to the length of the longest one. 64 Chapter 3. Coded TeraSort In Fig. 3.4, we consider a scenario with r = 2, and illustrate the encoding process in the subsetM ={1, 2, 3}. Exploiting the particular structure imposed in the stage of file placement, each node creates a coded packet that contains data segments useful for the other 2 nodes. We summarize the the pseudocode of the Encoding stage at Node k in Algorithm 1. // At Node k // // Data Segmentation for eachM⊆K with|M| =r + 1 and k∈M do for each t∈M\{k} do Consider file indexF←M\{t} Evenly split I t F to r segments{I t F,j :j∈F} end for end for // Encode for eachM⊆K with|M| =r + 1 and k∈M do Initialize coded packet E M,k ←? for each t∈M\{k} do Consider file indexF←M\{t} E M,k ←E M,k ⊕I t F,k Store n I t F,j :j∈F\{k} o end for Store E M,k end for Algorithm 1: Encoding to Create Coded Packets 3.3.4 Multicast Shuffling After all coded packets are created at the K nodes, the Multicast Shuffling process takes place within each subset ofr+1 nodes. Specifically, within each groupM⊆K of|M| =r+1 nodes, each Node k∈M multicasts its coded packet E M,k to the other nodes inM\{k}. As we have seen in the encoding process, each coded packet is simultaneously useful for r other nodes. 
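As a concrete check of this claim, the following toy Python simulation (not the C++/Open MPI implementation evaluated later in this chapter) runs the Encode and Decode steps of Algorithms 1 and 2 within a single multicast group of r + 1 = 3 nodes, with random byte strings standing in for the intermediate values, and verifies that each node recovers its missing segments from the packets multicast by the other two nodes.

# Toy simulation of the Encode / Multicast / Decode round trip in one group.
import os

r, M = 2, (1, 2, 3)
assert len(M) == r + 1                            # one multicast group of r + 1 nodes
xor = lambda a, b: bytes(x ^ y for x, y in zip(a, b))

def split(value, labels):
    # Evenly split `value` into len(labels) segments, one per node label
    # (equal lengths here; the text zero-pads to the longest segment).
    n = len(value) // len(labels)
    return {k: value[i * n:(i + 1) * n] for i, k in enumerate(labels)}

# I[t] models I^t_{M\{t}}: needed by Node t, known at every node in M \ {t}.
I = {t: os.urandom(8) for t in M}
seg = {t: split(I[t], [k for k in M if k != t]) for t in M}

# Encode (eq. (3.7)): Node k XORs all segments labelled by k into one packet.
E = {}
for k in M:
    packet = None
    for t in M:
        if t != k:
            packet = seg[t][k] if packet is None else xor(packet, seg[t][k])
    E[k] = packet

# Decode (eq. (3.9)) at Node k, for the packet multicast by Node u: cancel the
# segments Node k already holds, leaving its own missing segment.
def decode(k, u):
    out = E[u]
    for t in M:
        if t not in (u, k):
            out = xor(out, seg[t][u])
    return out

for k in M:
    recovered = b"".join(decode(k, u) for u in M if u != k)
    assert recovered == I[k]           # Node k recovers I^k_{M\{k}} exactly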
Therefore, compared with an uncoded shuffling scheme that solely uses unicast communications, the multicast shuffling employed by CodedTeraSort reduces the communi- cation load by exactlyr×. This gain is even higher compared with the TeraSort algorithm, for which no computation is repeated in the Map stage. This is because that even without multicasting, the redundant computations performed in the Map stage of CodedTeraSort 65 Chapter 3. Coded TeraSort already accumulate more locally available data needed for reduction, requiring less data to be shuffled across the network. 3.3.5 Decoding During the stage of Multicast Shuffling, within each multicast groupM⊆K of|M| =r +1 nodes, each Node k∈M receives a coded packet E M,u from Node u, for all u∈M\{k}. By the encoding process in (3.7), we have E M,u = ⊕ t∈M\{u} I t M\{t},u . (3.8) It is apparent that for all t∈M\{u,k}, we have k∈M\{t}, and Node k knows locally the intermediate values I t M\{t} , for all t∈M\{u,k}, from the Map stage. Therefore, it knows locally all the data segments{I t M\{t},u :t∈M\{u,k}}. Then Node k performs the decoding process by XORing these data segments with E M,u , i.e., E M,u ⊕ ⊕ t∈M\{u,k} I t M\{t},u ! =I k M\{k},u , (3.9) which recovers the data segment I k M\{k},u . Similarly, Node k recovers all data segments{I k M\{k},u : u∈M\{k}} from the received coded packets inM, and merge them back to obtain a required intermediate value I k M\{k} . Finally, we repeat the above decoding process for all subsets of size r + 1 that contain k, and Node k decodes the intermediate values{I k M\{k} :k∈M,|M| =r + 1}, which can be equivalently represented by n I k S :k / ∈S,|S| =r o . In Fig. 3.5, we consider a scenario with r = 2, and illustrate the above described decoding process in the subsetM ={1, 2, 3}. In this example, each node receives a multicast coded packet from each of the other two nodes. Each node decodes 2 data segments from the received coded packets, and merge them to recover a required intermediate value. We summarize the the pseudocode of the Decoding stage at Node k in Algorithm 2. 66 Chapter 3. Coded TeraSort Node 1 Node 2 Node 3 Node 4 Multicast Group Figure 3.5: An illustration of the decoding process within a multicast groupM ={1, 2, 3}. // At Node k // for eachM⊆K with|M| =r + 1 and k∈M do Consider file indexF←M\{k} for each u∈F do Initialize decoded segment D k F,u ←E M,u for each Node t∈F\{u} do Consider file indexW←M\{t} D k F,u ←D k F,u ⊕I t W,u end for end for I k F ← Merge n D k F,u :u∈F o Store I k F end for Algorithm 2: Decoding 3.3.6 Reduce After the Decoding stage, we note that Node k has obtained all KV pairs in the partition P k , for all k∈K. In particular, the KV pairs n I k S :k∈S,|S| =r o are obtained locally in the Map stage, and the KV pairs n I k S :k / ∈S,|S| =r o are obtained in the above Decoding stage. In this final stage, Node k, k = 1,...,K, performs the Reduce process as described in Section 3.2.1.5 for the TeraSort algorithm, sorting the KV pairs in partition P k locally. 67 Chapter 3. Coded TeraSort 3.4 Evaluation We imperially demonstrate the performance gain of CodedTeraSort through experiments on Amazon EC2 clusters. In this section, we first present the choices we have made for the implementation. Then, we describe experiment setup. Finally, we discuss the experiment results. 3.4.1 Implementation Choices We first describe the following implementation choices that we have made for both TeraSort and CodedTeraSort algorithms. 
Data Format: All input KV pairs are generated from TeraGen [26] in the standard Hadoop package. Each input KV pair consists of a 10-byte key and a 90-byte value. A key is a 10-byte unsigned integer, and the value is an arbitrary string of 90 bytes. The KV pairs are sorted based on their keys, using the standard integer ordering. Platform and Library: We choose Amazon EC2 as the evaluation platform. We implement both TeraSort and CodedTeraSort algorithms in C++, and use Open MPI library [69] for communications among EC2 instances. Coordinator Worker 1 Worker 2 Worker K ... Amazon EC2 Worker 3 Figure 3.6: The coordinator-worker system architecture. System Architecture: As shown in Fig. 3.6, we employ a system architecture that consists of a coordinator node and K worker nodes, for some K∈N. Each node is run as an EC2 instance. The coordinator node is responsible for creating the key partitions and placing the input files on the local disks of the worker nodes. The worker nodes are responsible for distributedly executing the stages of the sorting algorithms. In-Memory Processing: After the KV pairs are loaded from the local files into the workers’ memories, all intermediate data that are used for encoding, decoding and local sorting are 68 Chapter 3. Coded TeraSort persisted in the memories, and hence there is no disk I/O involved during the executions of the algorithms. Node 1 Node 2 Node 3 Node 4 time Node 1 Node 2 Node 3 Node 4 time (a) serial unicast (b) serial multicast Figure 3.7: (a) Serial unicast in the Shuffle stage of TeraSort; a solid arrow represents a unicast. (b) Serial multicast in the Multicast Shuffle stage of CodedTeraSort; a group of solid arrows starting at the same node represents a multicast. In the TeraSort implementation, each node sequentially steps through Map, Pack, Shuffle, Unpack, and Reduce stages. The Map, Shuffle, and Reduce stages follow the descriptions in Section 3.2. In the Reduce stage, the standard sort std::sort is used to sort each partition locally. To better interpret the experiment results, we add the Pack and the Unpack stages to separate the time of serialization and deserialization from the other stages. The Pack stage serializes each intermediate value to a continuous memory array to ensure that a single TCP flow is created for each intermediate value (which may contain multiple KV pairs) when MPI Send is called 2 . The Unpack stage deserializes the received data to a list of KV pairs. In the Shuffle stage, intermediate values are unicast serially, meaning that there is only one sender node and one receiver node at any time instance. Specifically, as illustrated in Fig. 3.7(a), Node 1 starts to unicast to Nodes 2, 3, and 4 back-to-back. After Node 1 finishes, Node 2 unicasts back-to-back to Nodes 1, 3, and 4. This continues until Node 4 finishes. In the CodedTeraSort implementation, each node sequentially steps through CodeGen, Map, Encode, Multicast Shuffling, Decode, and Reduce stages. The Map, Encode, Multicast Shuffling, Decode, and Reduce stages follow the descriptions in Section 3.3. In the CodeGen (or code generation) stage, firstly, each node generates all file indices, as subsets of r nodes. Then each node uses MPI Comm split to initialize K r+1 multicast groups each containing r + 1 nodes on Open MPI, such that multicast communications will be performed within each of these groups. 
The serialization and deserialization are implemented respectively in the Encode and the Decode stages. In Multicast Shuffling, MPI Bcast is called to multicast a coded packet in a serial manner, so only one node multicasts one of its encoded packets at any time instance. Specifically, as illustrated in Fig. 3.7(b), Node 1 multicasts to the other 2 nodes in each multicast group that Node 1 is in. For example, Node 1 first multicasts to Nodes 2 and 3 in the multicast group {1, 2, 3}. After Node 1 finishes, Node 2 starts multicasting in the same manner. This process continues until Node 4 finishes.
2 Creating a TCP flow per KV pair leads to inefficiency from overhead and convergence issues.
3.4.2 Experiment Setup
We conduct experiments using the following configurations to evaluate the performance of CodedTeraSort and TeraSort on Amazon EC2:
• The coordinator runs on a r3.large instance with 2 processors, 15 GB memory, and 32 GB SSD.
• Each worker node runs on an m3.large instance with 2 processors, 7.5 GB memory, and 32 GB SSD.
• The incoming and outgoing traffic rates of each instance are limited to 100 Mbps. 3
• 12 GB of input data (equivalently 120 M KV pairs) is sorted.
3 This is to alleviate the effects of the bursty behaviors of the transmission rates in the beginning of some TCP sessions. The rates are limited by the traffic control command tc [70].
We evaluate the run-time performance of TeraSort and CodedTeraSort for different combinations of the number of workers K and the parameter r. All experiments are repeated 5 times, and the average values are reported.
3.4.3 Experiment Results
The breakdowns of the execution times with K = 16 workers and K = 20 workers are shown in Tables 3.2 and 3.3 respectively.

Table 3.2: Sorting 12 GB data with K = 16 nodes and 100 Mbps network speed
                        CodeGen  Map     Pack/Encode  Shuffle  Unpack/Decode  Reduce  Total Time  Speedup
                        (sec.)   (sec.)  (sec.)       (sec.)   (sec.)         (sec.)  (sec.)
TeraSort:               –        1.86    2.35         945.72   0.85           10.47   961.25
CodedTeraSort: r = 3    6.06     6.03    5.79         412.22   2.41           13.05   445.56      2.16×
CodedTeraSort: r = 5    23.47    10.84   8.10         222.83   3.69           14.40   283.33      3.39×

Table 3.3: Sorting 12 GB data with K = 20 nodes and 100 Mbps network speed
                        CodeGen  Map     Pack/Encode  Shuffle  Unpack/Decode  Reduce  Total Time  Speedup
                        (sec.)   (sec.)  (sec.)       (sec.)   (sec.)         (sec.)  (sec.)
TeraSort:               –        1.47    2.00         960.07   0.62           8.29    972.45
CodedTeraSort: r = 3    19.32    4.68    4.89         453.37   1.87           9.73    493.86      1.97×
CodedTeraSort: r = 5    140.91   8.59    7.51         269.42   3.70           10.97   441.10      2.20×

We observe an overall 1.97×-3.39× speedup of CodedTeraSort as compared with TeraSort. From the experiment results we make the following observations:
• The total execution time of CodedTeraSort improves over that of TeraSort, whose communication time in the Shuffle stage dominates the computation times of the other stages.
• For CodedTeraSort, the time spent in the CodeGen stage is proportional to $\binom{K}{r+1}$, which is the number of multicast groups.
• The Map time of CodedTeraSort is approximately r times higher than that of TeraSort. This is because each node hashes r times more KV pairs than in TeraSort. Specifically, the ratios of CodedTeraSort's Map time to TeraSort's Map time from Table 3.2 are 6.03/1.86 ≈ 3.2 and 10.84/1.86 ≈ 5.8, and from Table 3.3 are 4.68/1.47 ≈ 3.2 and 8.59/1.47 ≈ 5.8.
• While CodedTeraSort theoretically promises a factor of more than r× reduction in shuffling time, the actual gains observed in the experiments are slightly less than r.
For example, for an experiment with K = 16 nodes and r = 3, as shown in Table 3.2, the speedup of the Shuffle stage is 945.72/412.22≈ 2.3 < 3. This phenomenon is caused by the following two factors. 1) Open MPI’s multicast API (MPI Bcast) has an inherent overhead per a multicast group, for instance, a multicast tree is constructed before multicasting to a set of nodes. 2) Using the MPI Bcast API, the time of multicasting a packet to r nodes is higher than that of unicasting the same packet to a single node. In fact, as measured in [2], the multicasting time increases logarithmically with r. • The sorting times in the Reduce stage of both algorithms depend on the available memories of the nodes. CodedTeraSort inherently has a higher memory overhead, e.g., it requires persisting more intermediate values in the memories than TeraSort for coding purposes, hence its local sorting process takes slightly longer. This can be observed from the Reduce column in Tables 3.2 and 3.3. 71 Chapter 3. Coded TeraSort Further, we observe the following trends from both tables: The impact of redundancy parameter r: As r increases, the shuffling time reduces substan- tially by approximately r times. However, the Map execution time increases linearly with r, and more importantly the CodeGen time increases as K r+1 . Hence, for small values of r (r< 6) we observe overall reduction in execution time, and the speedup increases. How- ever, as we further increaser, the CodeGen time will dominate the execution time, and the speedup decreases. Hence, in our evaluations, we have limited r to be at most 5. 4 The impact of worker numberK: AsK increases, the speedup decreases. This is due to the following two reasons. 1) The number of multicast groups, i.e., K r+1 , grows exponentially withK, resulting in a longer execution time of the CodeGen process. 2) When more nodes participate in the computation, for a fixed r, less amount of KV pairs are hashed at each node locally in the Map stage, resulting in less locally available intermediate values and a higher communication load. In addition to the results in Tables 3.2 and 3.3, more experiments have been performed and their results are presented in Appendix B, in which we observe up to 4.1× speedup. 3.5 Conclusion and Future Directions In this chapter, we integrate the principle of Coded Distributed Computing into the TeraSort algorithm, developing a novel distributed sorting algorithm CodedTeraSort. CodedTeraSort specifies a structured redundant placement of the input files that are to be sorted, such that the same file is repetitively processed at multiple nodes. The results of this redundant processing enable in-network coding opportunities that substantially reduce the load of data shuffling. We also empirically demonstrate the significant performance gain of CodedTeraSort over TeraSort, whose execution is limited by data shuffling. Finally, we end this chapter by highlighting three future research directions. • Beyond Sorting Algorithms. Having successfully demonstrated the impact of coding in improving the performance of TeraSort, we can apply the coding concept to develop 4 The redundancy parameter r is also limited by the total storage available at the nodes. Since for a choice of redundancy parameter r, each piece of input KV pairs should be stored at r nodes, we can not increase r beyond total available storage at the worker nodes input size . 72 Chapter 3. 
Coded TeraSort coded versions of many other distributed computing applications whose performance is limited by data shuffling (e.g., Grep, SelfJoin). In particular, mobile distributed computing applications like mobile augmented reality and recommender systems are of special interest since the communications through wireless links are much slower. • Scalable Coding. We observe from the experiment results that the coding complexity (i.e., the time spent at CodeGen stage) increases as K r+1 . Hence, as the redundancy parameterr gets large the coding overhead (including the time spent in generating the coding plan, encoding, and decoding) becomes comparable with or even longer than the time spent in Map and Reduce stages. It is of great interest to design efficient and scalable coding procedures to maintain a low coding overhead. • Asynchronous Execution. In the experiments, we executed the stages of the compu- tation one after another in a synchronous manner. Also, the data shuffling was per- formed serially such that only one node is communicating (unicasting for TeraSort and multicasting for CodedTeraSort) at a time. It is interesting to explore the impact of coding in an asynchronous setting with parallel communications. 73 Chapter 4 A Scalable Framework for Wireless Distributed Computing Recent years have witnessed a rapid growth of computationally intensive applications on mobile devices, such as mapping services, voice/image recognition, and augmented reality. The current trend for developing these applications is to offload computationally heavy tasks to a “cloud”, which has greater computational resources. While this trend has its merits, there is also a critical need for enabling wireless distributed computing, in which computation is carried out using the computational resources of a cluster of wireless devices collaboratively. Wireless distributed computing eliminates, or at least de-emphasizes, the need for a core computing environment (i.e., the cloud), which is critical in several im- portant applications, such as autonomous control and navigation for vehicles and drones, in which access to the cloud can be very limited. Also as a special case of the emerging “Fog computing architecture” [28], it is expected to provide significant advantages to users by improving the response latency, increasing their computing capabilities, and enabling complex applications in machine learning, data analytics, and autonomous operation (see e.g., [73–75]). The major challenge in developing a scalable framework for wireless distributed computing is the significant communication load, required to exchange the intermediate results among the mobile nodes. In fact, even when the processing nodes are connected via high-bandwidth This chapter is mainly taken from [71, 72], coauthored by the author of this document. 74 Chapter 4. A Scalable Framework for Wireless Distributed Computing inter-server communication bus links (e.g., a Facebook’s Hadoop cluster), it is observed in [11] that 33% of the job execution time is spent on data shuffling. The communication bottleneck is expected to get much more severe as we move to a wireless medium where the communication resources are much more scarce. More generally, as the network size increases, while the computation resources grow linearly with network size, the overall communication bandwidth is fixed and can become the bottleneck. This raises the following fundamental question. 
Is there a scalable framework for wireless distributed computing, in which the required communication load is fixed and independent of the number of users? Our main contribution is to provide an affirmative answer to this question by developing a framework for wireless distributed computing that utilizes a particular repetitive structure of computation assignments at the users, in order to provide coding opportunities that reduce the required communication by a multiplicative factor that grows linearly with the number of users; hence resulting in a scalable design. The developed framework can be considered as an extension of our previously proposed coded distributed computing framework for a wireline setting in [1, 42, 43], into the wireless distributed computing domain. To develop such a framework, we exploit three opportunities in conjunction: 1. Side-Information: When a sub-task has been processed in more than one node, the resulting intermediate outcomes will be available in all those nodes as side-information. This provides some opportunities for coding across the results and creates packets that useful for multiple nodes. 2. Coding: In communication, coding is used to protect data against channel failure. In contrary, here we use coding to develop packets useful to more than one processing nodes. This allows us to exploit the multicasting environment of the wireless medium and save communication overhead. 3. Multicasting: Wireless medium by nature is a multicasting environment. It means that when a signal is transmitted, it can be heard by all the nodes. This phenomenon is often considered restrictive, as it creates interference. However, here, we exploit it to our benefit by creating and transmitting signals that help several nodes simultane- ously. 75 Chapter 4. A Scalable Framework for Wireless Distributed Computing 1 2 3 4 Map 1 2 3 4 1 2 3 4 1 2 3 4 5 6 1 2 1 2 5 6 1 2 5 6 1 2 5 6 3 4 5 6 3 4 5 6 3 4 5 6 3 4 5 6 1 3 4 5 6 2 Needs 5 6 Map Needs 1 2 Map Needs 3 4 (a) Uplink. 1 2 3 4 1 2 3 4 1 2 3 4 1 2 5 6 1 2 5 6 1 2 5 6 3 4 5 6 3 4 5 6 3 4 5 6 5 6 1 2 3 4 C 2 ( , , ) 1 3 4 5 6 2 C 1 ( , , ) 1 3 4 5 6 2 (b) Downlink. Figure 4.1: Illustration of the CWDC scheme for an example of 3 mobile users. Motivating Example Let’s first illustrate our scalable design of wireless distributed computing through an ex- ample. Consider a scenario where 3 mobile users want to run an application (e.g., image recognition). Each user has an input (e.g., an image) to process using a dataset (e.g., a feature repository of objects) provided by the application. However, the local memory of an individual user is too small to store the entire dataset, and they have to collaboratively perform the computation. The entire dataset consists of 6 equally-sized files, and each user can store at most 4 of them. The computation is performed distributedly following a MapReduce-like structure. More specifically, every user computes a Map function, for each of the 3 inputs and each of the 4 files stored locally, generating 12 intermediate values. Then the users communicate with each other via an access point they all wirelessly connect to, which we call data shuffling. After the data shuffling, each user knows the intermediate values of his own input in all 6 files, and passes them to a Reduce function to calculate the final output result. During data shuffling, since each user already has 4 out of 6 intended intermediate values locally, she would need the remaining 2 from the other users. 
Thus, one would expect a communication load of 6 (in number of intermediate values) on the uplink from the users to the access point, and 6 on the downlink, on which the access point simply forwards the intermediate values to the intended users. However, we can take advantage of the opportunities mentioned above to significantly reduce the communication loads. As illustrated in Fig. 4.1, through careful placement of the dataset into the users’ memories, we can design a coded communication scheme in which every user sends a bit-wise XOR, denoted by $\oplus$, of 2 intermediate values on the uplink, and then the access point, without decoding any individual value, simply generates 2 random linear combinations $C_1(\cdot,\cdot,\cdot)$ and $C_2(\cdot,\cdot,\cdot)$ of the received messages and broadcasts them to the users, simultaneously satisfying all data requests. Using this coded approach, we achieve an uplink communication load of 3 and a downlink communication load of 2.

We generalize the above example by designing a coded wireless distributed computing (CWDC) framework that applies to arbitrary types of applications, network sizes and storage sizes. In particular, we propose a specific dataset placement strategy, and a joint uplink-downlink communication scheme exploiting coding at both the mobile users and the access point. For a distributed computing application with $K$ users, each of which can store a $\mu$ fraction of the dataset, the proposed CWDC scheme achieves the (normalized) communication loads

$L_{\text{uplink}} \approx L_{\text{downlink}} \approx \frac{1}{\mu} - 1.$ (4.1)

We note that the proposed scheme is scalable, since the achieved communication loads are independent of $K$. As we show in Fig. 4.2, compared with a conventional uncoded shuffling scheme whose communication load $\mu K \cdot (\frac{1}{\mu} - 1)$ explodes as the network expands, the proposed CWDC scheme reduces the load by a multiplicative factor of $\Theta(K)$, which scales linearly with the network size.

We also extend our scalable design to a decentralized setting, in which the dataset placement is done at each user independently, without knowing the other collaborating users. For such a common scenario in mobile applications, we propose a decentralized scheme with a communication load close to that achieved in the centralized setting, particularly when the number of participating users is large.

Finally, we demonstrate that for both the centralized setting and the decentralized setting with a large number of users, the proposed CWDC schemes achieve the minimum possible communication loads that cannot be improved upon by any other scheme, by developing tight lower bounds.

Figure 4.2: Comparison of the uplink and downlink communication loads achieved by the uncoded scheme with those achieved by the proposed CWDC scheme, for a network of $K = 20$ users. Here the storage size $\mu \geq \frac{1}{K} = 0.05$, such that the entire dataset can be stored across the users.
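To make the shuffling in the motivating example concrete, the following minimal sketch simulates the coded uplink and downlink messages with byte-valued intermediate values and the placement of Fig. 4.1, and checks the decoding at User 1 (Users 2 and 3 are symmetric). The access point’s two coded messages are taken as $W_1 \oplus W_2$ and $W_1 \oplus W_3$, which is one valid instance of the random linear combinations described above; all variable names are illustrative.

```python
import random

# Toy instance of the 3-user CWDC example: 6 files, each user stores 4, and each
# intermediate value v[q][n] (output q, file n) is a single byte.
random.seed(0)
K, N = 3, 6
placement = {1: {1, 2, 3, 4}, 2: {3, 4, 5, 6}, 3: {1, 2, 5, 6}}     # U_1, U_2, U_3
v = {q: {n: random.randrange(256) for n in range(1, N + 1)} for q in range(1, K + 1)}

# Uplink: each user XORs the two segments associated with it (values it stores that
# the other two users need), as in Fig. 4.1(a).
W = {1: v[2][1] ^ v[3][3],      # user 1 stores files 1 and 3
     2: v[1][5] ^ v[3][4],      # user 2 stores files 5 and 4
     3: v[1][6] ^ v[2][2]}      # user 3 stores files 6 and 2

# Downlink: the access point sends 2 coded messages without decoding anything.
C1, C2 = W[1] ^ W[2], W[1] ^ W[3]

# Decoding at user 1 (needs v_{1,5}, v_{1,6}): reconstruct W[1] from local Map outputs,
# recover W[2] and W[3], then cancel the locally known interference terms.
W1_local = v[2][1] ^ v[3][3]                 # computable since files 1 and 3 are stored
W2_hat, W3_hat = C1 ^ W1_local, C2 ^ W1_local
assert W2_hat ^ v[3][4] == v[1][5]           # file 4 is stored, so v_{3,4} is known
assert W3_hat ^ v[2][2] == v[1][6]           # file 2 is stored, so v_{2,2} is known
print("uplink: 3 coded values (vs. 6 uncoded); downlink: 2 (vs. 6 uncoded)")
```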
Prior Works

The idea of applying coding in shuffling the intermediate results of MapReduce-like distributed computing frameworks was recently proposed in [1, 42, 43], for a wireline scenario where the computing nodes can directly communicate with each other through a shared link. In this chapter, we consider a wireless distributed computing environment, in which wireless computing nodes exchange the intermediate results through an access point. More specifically, we extend the coded distributed computing framework in [1, 42, 43] in the following aspects.

• We extend the wireline setting in [1, 43] to a wireless setting, and develop the first scalable framework with constant communication loads for wireless distributed computing.

• During data shuffling, other than designing codes at the users for uplink communication, we also design a novel optimal code at the access point for downlink communication.

• We consider a decentralized setting that is of vital importance for wireless distributed computing, where each user has to decide its local storage content independently. We develop a decentralized dataset placement strategy and an uplink-downlink communication scheme that achieve the minimum communication loads asymptotically.

The idea of efficiently creating and exploiting coded multicasting opportunities was initially proposed in the context of cache networks in [50, 51], and extended in [52, 53], where caches pre-fetch part of the content in a way that enables coding during the content delivery, minimizing the network traffic. In this chapter, we demonstrate that such coding opportunities can also be utilized to significantly reduce the communication load of wireless distributed computing applications. However, the proposed coded framework for wireless distributed computing differs significantly from the coded caching problems, mainly in the following aspects.

• In [50, 51], a central server has the entire dataset and broadcasts coded messages to satisfy the users’ demands. In this work, the access point neither stores any part of the dataset, nor performs any computation. We design new codes at both the users and the access point for data shuffling.

• In the coded caching problems, the cache contents are placed without knowing the users’ demands, while here the dataset placement is performed knowing that each user has her own unique computation request (input).

• Our scheme is faced with the challenge of symmetric computation enforced by the MapReduce-type structure, i.e., a Map function computes intermediate values for all inputs. Such symmetry is not enforced in coded caching problems.

There have also been several recent works on communication design and resource allocation for mobile-edge computation offloading (see, e.g., [76, 77]), in which a part of the computation is offloaded to clouds located at the edges of cellular networks. In contrast to these works, in this chapter our focus is on the scenario in which the “edge” only facilitates the communication required for distributed computing, and all computations are done distributedly at the users.

4.1 System Model

We consider a system that has $K$ mobile users, for some $K \in \mathbb{N}$. As illustrated in Fig. 4.3, all users are connected wirelessly to an access point (e.g., a cellular base station or a Wi-Fi router).
The uplink channels of the $K$ users towards the access point are orthogonal to each other, and the signals transmitted by the access point on the downlink are received by all the users.

Figure 4.3: A wireless distributed computing system.

The system has a dataset (e.g., a feature repository of objects in an image recognition application) that is evenly partitioned into $N$ files $w_1,\dots,w_N \in \mathbb{F}_{2^F}$, for some $N, F \in \mathbb{N}$. Every User $k$ has a length-$D$ input $d_k \in \mathbb{F}_{2^D}$ (e.g., the user’s image in the image recognition application) to process using the $N$ files. To do that, as shown in Fig. 4.3, User $k$ needs to compute

$\phi(\underbrace{d_k}_{\text{input}}; \underbrace{w_1,\dots,w_N}_{\text{dataset}}),$ (4.2)

where $\phi: \mathbb{F}_{2^D} \times (\mathbb{F}_{2^F})^N \to \mathbb{F}_{2^B}$ is an output function that maps the input $d_k$ to an output result (e.g., the returned result after processing the image) of length $B \in \mathbb{N}$.

We assume that every mobile user has a local memory that can store up to a $\mu$ fraction of the dataset (i.e., $\mu N$ files), for some constant parameter $\mu$ that does not scale with the number of users $K$. Throughout this chapter, we consider the case $\frac{1}{K} \leq \mu < 1$, such that each user does not have enough storage for the entire dataset, but the entire dataset can be stored collectively across all the users. We denote the set of indices of the files stored by User $k$ as $\mathcal{U}_k$. The selections of the $\mathcal{U}_k$’s are design parameters, and we refer to the design of $\mathcal{U}_1,\dots,\mathcal{U}_K$ as the dataset placement. The dataset placement is performed prior to the computation (e.g., users download parts of the feature repository when installing the image recognition application).

Remark 4.1. The employed physical-layer network model is rather simple, and one can do better using a more detailed model and more advanced techniques. However, we note that any wireless medium can be converted to our simple model using (1) TDMA on the uplink, and (2) broadcasting at the rate of the weakest user on the downlink. Since the goal of this chapter is to introduce a “coded” framework for scalable wireless distributed computing, we decide to abstract out the physical layer and focus on the amount of data that needs to be communicated.

Distributed Computing Model. Motivated by prevalent distributed computing structures like MapReduce [4] and Spark [5], we assume that the computation for input $d_k$ can be decomposed as

$\phi(d_k; w_1,\dots,w_N) = h(g_1(d_k; w_1),\dots,g_N(d_k; w_N)),$ (4.3)

where, as illustrated in Fig. 4.4,

• The “Map” functions $g_n(d_k; w_n): \mathbb{F}_{2^D} \times \mathbb{F}_{2^F} \to \mathbb{F}_{2^T}$, $n \in \{1,\dots,N\}$, $k \in \{1,\dots,K\}$, map the input $d_k$ and the file $w_n$ into an intermediate value $v_{k,n} = g_n(d_k; w_n) \in \mathbb{F}_{2^T}$, for some $T \in \mathbb{N}$,

• The “Reduce” function $h: (\mathbb{F}_{2^T})^N \to \mathbb{F}_{2^B}$ maps the intermediate values for input $d_k$ in all files into the output value $\phi(d_k; w_1,\dots,w_N) = h(v_{k,1},\dots,v_{k,N})$, for all $k \in \{1,\dots,K\}$.

Remark 4.2. Note that for every set of output functions such a Map-Reduce decomposition exists (e.g., setting the $g_n$’s to identity functions and $h$ to $\phi(d_k; *)$). However, such a decomposition is not unique, and in the distributed computing literature there has been quite some work on developing appropriate decompositions of computations like join, sorting and matrix multiplication (see, e.g., [4, 15]), which are suitable for efficient distributed computing. Here we do not impose any constraint on how the Map and Reduce functions are chosen (for example, they can be arbitrary linear or non-linear functions).
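As a toy illustration of one possible decomposition of the form (4.3), the sketch below computes an output function that counts how many dataset entries match a user’s query: the Map function $g_n$ produces a per-file count, and the Reduce function $h$ sums the counts. The dataset and the function names are made up for illustration and are not tied to any particular application discussed in this chapter.

```python
from typing import List

# A toy decomposition of the form (4.3): the output function counts how many entries
# of the dataset match a user's query (dataset and names are illustrative).
dataset: List[List[str]] = [["cat", "dog"], ["dog"], ["dog", "fish", "cat"]]  # files w_1..w_N

def map_fn(d_k: str, w_n: List[str]) -> int:
    """Map function g_n: intermediate value v_{k,n} = number of matches of d_k in file w_n."""
    return sum(1 for entry in w_n if entry == d_k)

def reduce_fn(intermediate_values: List[int]) -> int:
    """Reduce function h: combines v_{k,1},...,v_{k,N} into phi(d_k; w_1,...,w_N)."""
    return sum(intermediate_values)

d_k = "dog"
v_k = [map_fn(d_k, w_n) for w_n in dataset]   # the Map outputs may be computed at different users
assert reduce_fn(v_k) == 3                    # equals the centralized computation of phi
```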
Figure 4.4: A two-stage distributed computing framework decomposed into Map and Reduce functions.

We focus on applications in which the size of the users’ inputs is much smaller than the size of the computed intermediate values, i.e., $D \ll T$. As a result, the overhead of disseminating the inputs is negligible, and we assume that the users’ inputs $d_1,\dots,d_K$ are known at each user before the computation starts.

Remark 4.3. The above assumption holds for various wireless distributed computing applications. For example, in a mobile navigation application, an input is simply the addresses of the two end locations. The computed intermediate results contain all possible routes between the two end locations, from which the shortest one (or the fastest one considering the traffic conditions) is computed for the user. Similarly, for a set of “filtering” applications like the aforementioned image recognition (or, similarly, augmented reality) and recommendation systems, the inputs are light-weight queries (e.g., the feature vector of an image) that are much smaller than the filtered intermediate results containing all attributes of the related information. For example, an input can be multiple words describing the type of restaurant a user is interested in, and the intermediate results returned by a recommendation system application can be a list of relevant information that includes customers’ comments, pictures, and videos of the recommended restaurants.

Following the decomposition in (4.3), the overall computation proceeds in three phases: Map, Shuffle, and Reduce.

Map Phase: User $k$, $k \in \{1,\dots,K\}$, computes the Map functions of $d_1,\dots,d_K$ based on the files in $\mathcal{U}_k$. For each input $d_q$, $q \in \{1,\dots,K\}$, and each file $w_n$ in $\mathcal{U}_k$, User $k$ computes $g_n(d_q; w_n) = v_{q,n}$.

Shuffle Phase: In order to compute the output value for the input $d_k$, User $k$ needs the intermediate values that are not computed locally in the Map phase, i.e., $\{v_{k,n}: n \notin \mathcal{U}_k\}$. Users exchange the needed intermediate values via the access point they all wirelessly connect to. As a result, the Shuffle phase breaks into two sub-phases: uplink communication and downlink communication.

On the uplink, User $k$ creates a message $W_k$ as a function of the intermediate values computed locally, i.e.,

$W_k = \psi_k(\{v_{q,n}: q \in \{1,\dots,K\}, n \in \mathcal{U}_k\}),$ (4.4)

and communicates $W_k$ to the access point.

Definition 4.1 (Uplink Communication Load). We define the uplink communication load, denoted by $L_u$, as the total number of bits in all uplink messages $W_1,\dots,W_K$, normalized by the number of bits in the $N$ intermediate values required by a user (i.e., $NT$).

We assume that the access point does not have access to the dataset. Upon decoding all the uplink messages $W_1,\dots,W_K$, the access point generates a message $X$ from the decoded uplink messages, i.e.,

$X = \rho(W_1,\dots,W_K),$ (4.5)

and then broadcasts $X$ to all users on the downlink.

Definition 4.2 (Downlink Communication Load). We define the downlink communication load, denoted by $L_d$, as the number of bits in the downlink message $X$, normalized by $NT$.
Reduce Phase: User $k$, $k \in \{1,\dots,K\}$, uses the locally computed results $\{v_{k,n}: n \in \mathcal{U}_k\}$ and the decoded downlink message $X$ to construct the inputs to the corresponding Reduce function, and calculates the output value $\phi(d_k; w_1,\dots,w_N) = h(v_{k,1},\dots,v_{k,N})$.

Example (Uncoded Scheme). As a benchmark, we consider an uncoded scheme, where each user receives the needed intermediate values sent uncodedly by some other users and forwarded by the access point, achieving the communication loads

$L^{\text{uncoded}}_u(\mu) = L^{\text{uncoded}}_d(\mu) = \mu K \cdot \left(\tfrac{1}{\mu} - 1\right).$ (4.6)

We note that the above communication loads of the uncoded scheme grow with the number of users $K$, overwhelming the limited spectral resources. In this chapter, we argue that by utilizing coding at the users and the access point, we can accommodate any number of users with a constant communication load. In particular, we propose in the next section a scalable coded wireless distributed computing (CWDC) scheme that achieves the minimum possible uplink and downlink communication loads simultaneously, i.e.,

$L^{\text{coded}}_u = L^{\text{optimum}}_u \approx \tfrac{1}{\mu} - 1,$ (4.7)
$L^{\text{coded}}_d = L^{\text{optimum}}_d \approx \tfrac{1}{\mu} - 1.$ (4.8)

4.2 The Proposed CWDC Scheme

In this section, we present the proposed CWDC scheme for a centralized setting, in which the dataset placement is designed in a centralized manner, knowing the number and the identities of the users that will participate in the computation.

We first consider storage sizes $\mu \in \{\frac{1}{K}, \frac{2}{K},\dots, 1\}$ such that $\mu K \in \mathbb{N}$. We assume that $N$ is sufficiently large such that $N = \binom{K}{\mu K}\eta$ for some $\eta \in \mathbb{N}$.¹

¹ For a small number of files $N < \binom{K}{\mu K}$, we can apply the coded wireless distributed computing scheme to a smaller subset of users, achieving a part of the gain in reducing the communication load.

Dataset Placement and Map Phase Execution. We evenly partition the indices of the $N$ files into $\binom{K}{\mu K}$ disjoint batches, each containing the indices of $\eta$ files. We denote a batch of file indices as $\mathcal{B}_{\mathcal{T}}$, which is labelled by a unique subset $\mathcal{T} \subset \{1,\dots,K\}$ of size $|\mathcal{T}| = \mu K$. As such, we have

$\{1,\dots,N\} = \{i: i \in \mathcal{B}_{\mathcal{T}}, \mathcal{T} \subset \{1,\dots,K\}, |\mathcal{T}| = \mu K\}.$ (4.9)

User $k$, $k \in \{1,\dots,K\}$, stores locally all the files whose indices are in $\mathcal{B}_{\mathcal{T}}$ if $k \in \mathcal{T}$. That is,

$\mathcal{U}_k = \cup_{\mathcal{T}: |\mathcal{T}| = \mu K, k \in \mathcal{T}}\, \mathcal{B}_{\mathcal{T}}.$ (4.10)

As a result, each of the $N$ files is stored by $\mu K$ distinct users. After the Map phase, User $k$, $k \in \{1,\dots,K\}$, knows the intermediate values of all $K$ output functions in each file whose index is in $\mathcal{U}_k$, i.e., $\{v_{q,n}: q \in \{1,\dots,K\}, n \in \mathcal{U}_k\}$.

In the above motivating example, the 6 files are partitioned into $\binom{3}{2} = 3$ batches, each containing 2 files. Each batch corresponds to 2 users, and each user stores the 2 corresponding batches.

Uplink Communication. For any subset $\mathcal{W} \subset \{1,\dots,K\}$ and any $k \notin \mathcal{W}$, we denote the set of intermediate values needed by User $k$ and known exclusively by the users in $\mathcal{W}$ as $\mathcal{V}^k_{\mathcal{W}}$. More formally,

$\mathcal{V}^k_{\mathcal{W}} \triangleq \{v_{k,n}: n \in \cap_{i \in \mathcal{W}} \mathcal{U}_i,\ n \notin \cup_{i \notin \mathcal{W}} \mathcal{U}_i\}.$ (4.11)

In the above motivating example, we have $\mathcal{V}^1_{\{2,3\}} = \{v_{1,5}, v_{1,6}\}$, $\mathcal{V}^2_{\{1,3\}} = \{v_{2,1}, v_{2,2}\}$ and $\mathcal{V}^3_{\{1,2\}} = \{v_{3,3}, v_{3,4}\}$.

For all subsets $\mathcal{S} \subseteq \{1,\dots,K\}$ of size $\mu K + 1$:

1. For each User $k \in \mathcal{S}$, $\mathcal{V}^k_{\mathcal{S}\setminus\{k\}}$ is the set of intermediate values that are requested by User $k$ and are in the files whose indices are in the batch $\mathcal{B}_{\mathcal{S}\setminus\{k\}}$; they are exclusively known at all users whose indices are in $\mathcal{S}\setminus\{k\}$.
We evenly and arbitrarily split $\mathcal{V}^k_{\mathcal{S}\setminus\{k\}}$ into $\mu K$ disjoint segments $\{\mathcal{V}^k_{\mathcal{S}\setminus\{k\},i}: i \in \mathcal{S}\setminus\{k\}\}$, where $\mathcal{V}^k_{\mathcal{S}\setminus\{k\},i}$ denotes the segment associated with User $i$ in $\mathcal{S}\setminus\{k\}$ for User $k$. That is, $\mathcal{V}^k_{\mathcal{S}\setminus\{k\}} = \cup_{i \in \mathcal{S}\setminus\{k\}} \mathcal{V}^k_{\mathcal{S}\setminus\{k\},i}$.

2. User $i$, $i \in \mathcal{S}$, sends the bit-wise XOR, denoted by $\oplus$, of all the segments associated with it in $\mathcal{S}$, i.e., User $i$ sends the coded segment $W^{\mathcal{S}}_i \triangleq \oplus_{k \in \mathcal{S}\setminus\{i\}} \mathcal{V}^k_{\mathcal{S}\setminus\{k\},i}$.

Since the coded message $W^{\mathcal{S}}_i$ contains $\frac{\eta T}{\mu K}$ bits² for all $i \in \mathcal{S}$, there are a total of $\frac{(\mu K+1)\eta T}{\mu K}$ bits communicated on the uplink in every subset $\mathcal{S}$ of size $\mu K + 1$. Therefore, the uplink communication load achieved by this coded scheme is

$L^{\text{coded}}_u(\mu) = \frac{\binom{K}{\mu K+1}(\mu K+1)\,\eta\, T}{\mu K \cdot NT} = \frac{1}{\mu} - 1, \quad \mu \in \{\tfrac{1}{K}, \tfrac{2}{K},\dots, 1\}.$ (4.12)

² Here we assume that $T$ is sufficiently large such that $\frac{T}{\mu K} \in \mathbb{N}$.

Downlink Communication. For all subsets $\mathcal{S} \subseteq \{1,\dots,K\}$ of size $\mu K + 1$, the access point computes $\mu K$ random linear combinations of the uplink messages generated based on the subset $\mathcal{S}$:

$C^{\mathcal{S}}_j(\{W^{\mathcal{S}}_i: i \in \mathcal{S}\}), \quad j = 1,\dots,\mu K,$ (4.13)

and multicasts them to all users in $\mathcal{S}$. Since each linear combination contains $\frac{\eta T}{\mu K}$ bits, the coded scheme achieves a downlink communication load

$L^{\text{coded}}_d(\mu) = \frac{\binom{K}{\mu K+1}\,\eta\, T}{NT} = \frac{\mu K}{\mu K+1}\cdot\left(\frac{1}{\mu} - 1\right), \quad \mu \in \{\tfrac{1}{K}, \tfrac{2}{K},\dots, 1\}.$ (4.14)

After receiving the random linear combinations $C^{\mathcal{S}}_1,\dots,C^{\mathcal{S}}_{\mu K}$, User $i$, $i \in \mathcal{S}$, cancels all segments she knows locally, i.e., $\cup_{k \in \mathcal{S}\setminus\{i\}} \{\mathcal{V}^k_{\mathcal{S}\setminus\{k\},j}: j \in \mathcal{S}\setminus\{k\}\}$. Consequently, User $i$ obtains $\mu K$ random linear combinations of the required $\mu K$ segments $\{\mathcal{V}^i_{\mathcal{S}\setminus\{i\},j}: j \in \mathcal{S}\setminus\{i\}\}$.

Remark 4.4. The above uplink and downlink communication schemes require coding at both the users and the access point, creating multicast messages that are simultaneously useful for many users. Such an idea of efficiently creating and exploiting coded multicast opportunities was initially proposed for the coded caching problems in [50, 51], and extended to D2D networks in [52].

When $\mu K$ is not an integer, we can first expand $\mu = \alpha\mu_1 + (1-\alpha)\mu_2$ as a convex combination of $\mu_1 \triangleq \lfloor\mu K\rfloor/K$ and $\mu_2 \triangleq \lceil\mu K\rceil/K$. Then we partition the set of the $N$ files into two disjoint subsets $\mathcal{I}_1$ and $\mathcal{I}_2$ of sizes $|\mathcal{I}_1| = \alpha N$ and $|\mathcal{I}_2| = (1-\alpha)N$. We next apply the above coded scheme separately to the files in $\mathcal{I}_1$, where each file is stored at $\mu_1 K$ users, and to the files in $\mathcal{I}_2$, where each file is stored at $\mu_2 K$ users, yielding the following communication loads:

$L^{\text{coded}}_u(\mu) = \alpha\left(\tfrac{1}{\mu_1} - 1\right) + (1-\alpha)\left(\tfrac{1}{\mu_2} - 1\right),$ (4.15)
$L^{\text{coded}}_d(\mu) = \alpha\,\frac{\mu_1 K}{\mu_1 K+1}\left(\tfrac{1}{\mu_1} - 1\right) + (1-\alpha)\,\frac{\mu_2 K}{\mu_2 K+1}\left(\tfrac{1}{\mu_2} - 1\right).$ (4.16)

Hence, for general storage size $\mu$, CWDC achieves the following communication loads:

$L^{\text{coded}}_u(\mu) = \mathrm{Conv}\!\left(\tfrac{1}{\mu} - 1\right),$ (4.17)
$L^{\text{coded}}_d(\mu) = \mathrm{Conv}\!\left(\frac{\mu K}{\mu K+1}\cdot\left(\tfrac{1}{\mu} - 1\right)\right),$ (4.18)

where $\mathrm{Conv}(f(\mu))$ denotes the lower convex envelope of the points $\{(\mu, f(\mu)): \mu \in \{\tfrac{1}{K}, \tfrac{2}{K},\dots, 1\}\}$ for a function $f(\mu)$.

We summarize the performance of the proposed CWDC scheme in the following theorem.

Theorem 4.1. For a wireless distributed computing application with a dataset of $N$ files, and $K$ users that can each store a $\mu \in \{\tfrac{1}{K}, \tfrac{2}{K},\dots, 1\}$ fraction of the files, the proposed CWDC scheme achieves the following uplink and downlink communication loads for sufficiently large $N$:

$L^{\text{coded}}_u(\mu) = \tfrac{1}{\mu} - 1,$ (4.19)
$L^{\text{coded}}_d(\mu) = \frac{\mu K}{\mu K+1}\cdot\left(\tfrac{1}{\mu} - 1\right).$ (4.20)

For general $\tfrac{1}{K} \leq \mu \leq 1$, the achieved loads are as stated in (4.17) and (4.18).
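The following sketch simulates the uplink part of the scheme above for a small instance, checking that every user recovers all of its missing intermediate values and that the normalized uplink load matches (4.12). It assumes byte-valued intermediate values and $\eta = \mu K$ files per batch, so that each segment $\mathcal{V}^k_{\mathcal{S}\setminus\{k\},i}$ is exactly one value; all parameter choices and variable names are illustrative.

```python
import itertools, random

# Sketch of the centralized CWDC uplink scheme (Section 4.2) for K = 4, mu = 1/2.
K, muK = 4, 2
eta = muK                                     # files per batch, so each segment is 1 value
batches = list(itertools.combinations(range(K), muK))
N = len(batches) * eta                        # N = 12 files
batch_of = {n: batches[n // eta] for n in range(N)}
stores = {k: {n for n in range(N) if k in batch_of[n]} for k in range(K)}

random.seed(1)
v = {(q, n): random.randrange(256) for q in range(K) for n in range(N)}   # Map outputs

sent, decoded = 0, {k: {} for k in range(K)}
for S in itertools.combinations(range(K), muK + 1):
    for i in S:                               # user i sends one XORed coded segment
        xor_msg, contents = 0, []
        for k in S:
            if k == i:
                continue
            others = sorted(set(S) - {k})     # users that store the batch B_{S\{k}}
            batch_files = [n for n in range(N) if batch_of[n] == tuple(others)]
            n = batch_files[others.index(i)]  # segment of user k associated with user i
            xor_msg ^= v[(k, n)]
            contents.append((k, n))
        sent += 1
        for (k, n) in contents:               # each user k in S cancels what it stores
            val = xor_msg
            for (kk, nn) in contents:
                if kk != k:
                    val ^= v[(kk, nn)]        # v_{kk,nn} was Mapped locally at user k
            decoded[k][n] = val

assert all(decoded[k][n] == v[(k, n)] for k in range(K) for n in range(N) if n not in stores[k])
print("uplink load =", sent / N, " matches 1/mu - 1 =", K / muK - 1)
```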
Remark 4.5. Theorem 4.1 implies that, for large $K$, $L^{\text{coded}}_u(\mu) \approx L^{\text{coded}}_d(\mu) \approx \frac{1}{\mu} - 1$, which is independent of the number of users. Hence, we can accommodate any number of users without incurring extra communication load, and the proposed scheme is scalable. The reason for this phenomenon is that, as more users join the network, with an appropriate dataset placement, we can create coded multicasting opportunities that reduce the communication loads by a factor that scales linearly with $K$. Such a phenomenon was also observed in the context of cache networks (see, e.g., [50]).

Remark 4.6. As illustrated in Fig. 4.2, the proposed CWDC scheme utilizes coding at the mobile users and the access point to reduce the uplink and downlink communication loads by factors of $\mu K$ and $\mu K + 1$ respectively, which scale linearly with the aggregate storage size of the system. When $\mu = \frac{1}{K}$, which is the minimum storage size required to accomplish distributed computing, the CWDC scheme reduces to the uncoded scheme, in which the access point simply forwards the received uncoded packets.

Remark 4.7. Compared with distributed computing over wired servers, where we only need to design one data shuffling scheme between servers as in [1], here in the wireless setting we jointly design uplink and downlink shuffling schemes, which minimize both the uplink and downlink communication loads.

Remark 4.8. We can view the Shuffle phase as an instance of the index coding problem [55, 56], in which a central server aims to design a broadcast message of minimum length to satisfy the requests of all the clients, given the clients’ local side information. While a random linear network coding approach (see, e.g., [57–59]) is sufficient to implement any multicast communication, it is generally sub-optimal for index coding problems where every client requests different messages. However, for the considered wireless distributed computing scenario, where we are given the flexibility of designing the dataset placement (and thus the side information), we can prove that the proposed CWDC scheme is optimal in minimizing the communication loads (see Section 4.4).

Remark 4.9. We note that the coding opportunities created and exploited in the proposed coded scheme belong to a type of in-network coding, which aims to combat interference in wireless networks and deliver the information bits required by each of the users with maximum spectral efficiency. This type of coding is distinct from source coding, or data compression (see, e.g., [78]), which aims to remove the redundant information in the original intermediate values each of the users requests. Interestingly, the above proposed coded communication scheme can be applied on top of data compression. That is, we can first compress the intermediate values to minimize the number of information bits each user requests, and then apply the proposed coded communication scheme on the compressed values, in order to deliver them to the intended users with minimum utilization of the wireless links.

So far, we have considered the scenario where the dataset placement is designed in a centralized manner, i.e., knowing which users will use the application. However, a more practical scenario is that, before the computation, the dataset placement at each user is performed in a decentralized manner, without knowing when the computation will take place and who will take part in it.
In the next section, we describe how we can extend the proposed CWDC scheme to facilitate the computation in such a decentralized setting.

4.3 The Proposed CWDC Scheme for the Decentralized Setting

We consider a decentralized system, in which, in a computing instance, among the many users who have installed the application, a random and a priori unknown subset of users, denoted by $\mathcal{K}$, participate in the computation. The dataset placement is performed independently at each user by randomly storing a subset of $\mu N$ files, according to a common placement distribution $P$. In this case, we define the information loss of the system, denoted by $\Delta$, as the fraction of the files that are not stored by any participating user.

Once the computation starts, the participating users in $\mathcal{K}$ of size $K$ are fixed, and their identities are revealed to all the participating users. Then they collaboratively perform the computation as in the centralized setting. The participating users process their inputs over the available part of the dataset stored collectively by all participating users. More specifically, every user $k$ in $\mathcal{K}$ now computes

$\phi(\underbrace{d_k}_{\text{input}}; \underbrace{\{w_n: n \in \cup_{k \in \mathcal{K}} \mathcal{U}_k\}}_{\text{available dataset}}).$ (4.21)

In what follows, we present the proposed CWDC scheme for the above decentralized setting, including a random dataset placement strategy, an uplink communication scheme and a downlink communication scheme.

Dataset Placement. We use a uniformly random dataset placement, in which every user independently stores $\mu N$ files uniformly at random. With high probability for large $N$, the information loss approximately equals $(1-\mu)^K$, which converges quickly to 0 as $K$ increases.

Under a decentralized random dataset placement, files are stored by random subsets of users. During data shuffling, we first greedily categorize the available files based on the number of users that store each file; then, for each category, we deliver the corresponding intermediate values in an opportunistic way using the coded communication schemes described in Section 4.2 for the centralized setting.

Uplink Communication. For all subsets $\mathcal{S} \subseteq \{1,\dots,K\}$ with size $|\mathcal{S}| \geq 2$:

1. For each $k \in \mathcal{S}$, we evenly and arbitrarily split $\mathcal{V}^k_{\mathcal{S}\setminus\{k\}}$, defined in (4.11), into $|\mathcal{S}|-1$ disjoint segments $\mathcal{V}^k_{\mathcal{S}\setminus\{k\}} = \{\mathcal{V}^k_{\mathcal{S}\setminus\{k\},i}: i \in \mathcal{S}\setminus\{k\}\}$, and associate the segment $\mathcal{V}^k_{\mathcal{S}\setminus\{k\},i}$ with the user $i \in \mathcal{S}\setminus\{k\}$.

2. User $i$, $i \in \mathcal{S}$, sends the bit-wise XOR, denoted by $\oplus$, of all the segments associated with it in $\mathcal{S}$, i.e., User $i$ sends the coded segment $W^{\mathcal{S}}_i \triangleq \oplus_{k \in \mathcal{S}\setminus\{i\}} \mathcal{V}^k_{\mathcal{S}\setminus\{k\},i}$.³

³ Since the dataset placement is now randomized, we zero-pad all elements in $\{\mathcal{V}^k_{\mathcal{S}\setminus\{k\},i}: k \in \mathcal{S}\setminus\{i\}\}$ to the maximum length $\max_{k \in \mathcal{S}\setminus\{i\}} |\mathcal{V}^k_{\mathcal{S}\setminus\{k\},i}|$ in order to complete the XOR operation.

Using the proposed uniformly random dataset placement, for any subset $\mathcal{S} \subseteq \{1,\dots,K\}$, the number of files exclusively stored by all users in $\mathcal{S}$ can be characterized as $\mu^{|\mathcal{S}|}(1-\mu)^{K-|\mathcal{S}|}N + o(N)$. Thus, when the proposed communication scheme proceeds on a subset $\mathcal{S}$ of size $|\mathcal{S}| = j+1$ users, the resulting uplink communication load converges to $\frac{j+1}{j}\mu^j(1-\mu)^{K-j}$ for large $N$. As a result, we achieve the following total uplink communication load:

$L^{\text{coded}}_{\text{decent},u}(K,\mu) = \sum_{j=1}^{K-1}\binom{K}{j+1}\frac{j+1}{j}\mu^j(1-\mu)^{K-j}.$ (4.22)

Downlink Communication. For all $\mathcal{S} \subseteq \{1,\dots,K\}$ of size $|\mathcal{S}| \geq 2$, the access point computes $|\mathcal{S}|-1$ random linear combinations of the uplink messages generated based on the subset $\mathcal{S}$:

$C^{\mathcal{S}}_j(\{W^{\mathcal{S}}_i: i \in \mathcal{S}\}), \quad j = 1,\dots,|\mathcal{S}|-1,$ (4.23)

and multicasts them to all users in $\mathcal{S}$.
Using similar reasoning as in the calculation of the uplink communication load, the total downlink communication load equals

$L^{\text{coded}}_{\text{decent},d}(K,\mu) = \sum_{j=1}^{K-1}\binom{K}{j+1}\mu^j(1-\mu)^{K-j}.$ (4.24)

Next, we summarize the performance of the decentralized CWDC scheme in the following theorem.

Theorem 4.2. For an application with a dataset of $N$ files, and $K$ users that can each store a $\mu$ fraction of the files, the proposed decentralized CWDC scheme achieves an information loss $\Delta = (1-\mu)^K$ and the following communication loads with high probability for sufficiently large $N$:

$L^{\text{coded}}_{\text{decent},u} = \sum_{j=1}^{K-1}\binom{K}{j+1}\frac{j+1}{j}\mu^j(1-\mu)^{K-j},$ (4.25)
$L^{\text{coded}}_{\text{decent},d} = \sum_{j=1}^{K-1}\binom{K}{j+1}\mu^j(1-\mu)^{K-j}.$ (4.26)

Figure 4.5: Comparison of the uplink and downlink communication loads achieved by the centralized and the decentralized CWDC schemes, for a network of $K = 20$ participating users.

Remark 4.10. In Fig. 4.5, we numerically evaluate the communication loads achieved by the proposed centralized and decentralized schemes, in a network with 20 participating users. We observe that, although the loads of the decentralized scheme are higher than those of the centralized scheme, the communication performances under these two settings are very close to each other. As $K$ becomes large, the information loss achieved by the decentralized CWDC approaches 0, and both loads in (4.25) and (4.26) approach $\frac{1}{\mu} - 1$, which equals the asymptotic loads achieved by the centralized scheme (see Remark 4.5). Hence, when the number of participating users is large, there is little loss in making the system decentralized.

Figure 4.6: Concentration of the number of users each file is stored at around $\mu K$. Each curve shows the normalized fraction of files that are stored by different numbers of users, for a particular number of participating users $K$. The density functions are computed for a storage size $\mu = 0.4$, and for $K = 2^3,\dots,2^7$.

Remark 4.11. To understand why the proposed decentralized scheme performs close to the centralized one when the number of participating users is large, we note that when the uniformly random dataset placement is used, as demonstrated in Fig. 4.6, almost all files are stored by approximately $\mu K$ users for large $K$, which coincides with the optimal dataset placement for the centralized setting. Thus, the coding gains of the proposed decentralized communication schemes are also very close to those of the centralized schemes. Such a phenomenon was also observed in [51] for caching problems with decentralized content placement.
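The loads in Theorems 4.1 and 4.2 are easy to evaluate numerically; the short script below reproduces the kind of comparison shown in Fig. 4.5 for $K = 20$ participating users. It is a sketch for illustration (the chosen $\mu$ values are arbitrary), not the code used to generate the thesis figures.

```python
from math import comb

# Compare the centralized loads of Theorem 4.1 with the decentralized loads of
# Theorem 4.2, for K = 20 participating users and a few illustrative storage sizes.
K = 20

def centralized_loads(mu):
    muK = round(mu * K)                       # mu is chosen as a multiple of 1/K here
    return 1 / mu - 1, muK / (muK + 1) * (1 / mu - 1)

def decentralized_loads(mu):
    Lu = sum(comb(K, j + 1) * (j + 1) / j * mu**j * (1 - mu)**(K - j) for j in range(1, K))
    Ld = sum(comb(K, j + 1) * mu**j * (1 - mu)**(K - j) for j in range(1, K))
    return Lu, Ld

for mu in (0.3, 0.5, 0.8):
    info_loss = (1 - mu) ** K                 # expected fraction of unavailable files
    print(mu, centralized_loads(mu), decentralized_loads(mu), round(info_loss, 6))
```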
4.4 Optimality of the Proposed CWDC Schemes

In this section, we demonstrate in the following two theorems that the proposed CWDC schemes achieve the minimum uplink and downlink communication loads over all schemes, for the centralized setting and the decentralized setting respectively.

Theorem 4.3. For a centralized wireless distributed computing application, using any dataset placement and communication schemes that achieve an uplink load $L_u$ and a downlink load $L_d$, $L_u$ and $L_d$ are lower bounded by $L^{\text{coded}}_u(\mu)$ and $L^{\text{coded}}_d(\mu)$ as stated in Theorem 4.1, respectively.

Remark 4.12. Using Theorems 4.1 and 4.3, we have completely characterized the minimum achievable uplink and downlink communication loads, over all dataset placements and uplink and downlink communication schemes for the centralized setting. This implies that the proposed centralized CWDC scheme simultaneously minimizes both the uplink and downlink communication loads required to accomplish distributed computing, and no other scheme can improve upon it. This also demonstrates that there is no fundamental tension between optimizing uplink and downlink communication in wireless distributed computing.

For a dataset placement $\mathcal{U} = \{\mathcal{U}_k\}_{k=1}^{K}$, we denote the minimum possible uplink and downlink communication loads, achieved by any uplink-downlink communication scheme that accomplishes wireless distributed computing, by $L^*_u(\mathcal{U})$ and $L^*_d(\mathcal{U})$ respectively. We next prove Theorem 4.3 by deriving lower bounds on $L^*_u(\mathcal{U})$ and $L^*_d(\mathcal{U})$ respectively.

4.4.1 Lower Bound on $L^*_u(\mathcal{U})$

For a given dataset placement $\mathcal{U}$, we denote the number of files that are stored at $j$ users as $a^j_{\mathcal{U}}$, for all $j \in \{1,\dots,K\}$, i.e.,

$a^j_{\mathcal{U}} = \sum_{\mathcal{J}\subseteq\{1,\dots,K\}:\,|\mathcal{J}|=j} |(\cap_{k\in\mathcal{J}}\mathcal{U}_k)\setminus(\cup_{i\notin\mathcal{J}}\mathcal{U}_i)|.$ (4.27)

For any $\mathcal{U}$, it is clear that $\{a^j_{\mathcal{U}}\}_{j=1}^{K}$ satisfy

$\sum_{j=1}^{K} a^j_{\mathcal{U}} = N,$ (4.28)
$\sum_{j=1}^{K} j\, a^j_{\mathcal{U}} = \mu N K.$ (4.29)

We start the proof with the following lemma, which characterizes a lower bound on $L^*_u(\mathcal{U})$ in terms of the distribution of the files in the dataset placement $\mathcal{U}$, i.e., $a^1_{\mathcal{U}},\dots,a^K_{\mathcal{U}}$.

Lemma 4.1. $L^*_u(\mathcal{U}) \geq \sum_{j=1}^{K} \frac{a^j_{\mathcal{U}}}{N}\cdot\frac{K-j}{j}$.

Lemma 4.1 can be proved following similar steps to the proof of Lemma 2.1 in Chapter 2, after replacing the downlink broadcast message $X$ with the uplink unicast messages $W_1,\dots,W_K$ in the conditional entropy terms (since $X$ is a function of $W_1,\dots,W_K$).

Next, since the function $\frac{K-j}{j}$ in Lemma 4.1 is convex in $j$, and using (4.28), which gives $\sum_{j=1}^{K}\frac{a^j_{\mathcal{U}}}{N} = 1$, together with (4.29), we have

$L^*_u(\mathcal{U}) \geq \frac{K - \sum_{j=1}^{K} j\frac{a^j_{\mathcal{U}}}{N}}{\sum_{j=1}^{K} j\frac{a^j_{\mathcal{U}}}{N}} = \frac{K-\mu K}{\mu K} = \frac{1}{\mu} - 1.$ (4.30)

We can further improve the lower bound in (4.30) for a particular $\mu$ such that $\mu K \notin \mathbb{N}$. For a given storage size $\mu$, we first find the two points $(\mu_1, \frac{1}{\mu_1}-1)$ and $(\mu_2, \frac{1}{\mu_2}-1)$, where $\mu_1 \triangleq \lfloor\mu K\rfloor/K$ and $\mu_2 \triangleq \lceil\mu K\rceil/K$. Then we find the line $p+qt$ connecting these two points as a function of $t$, $\frac{1}{K} \leq t \leq 1$, for some constants $p, q \in \mathbb{R}$. We note that $p$ and $q$ are different for different $\mu$, and

$p + qt|_{t=\mu_1} = \tfrac{1}{\mu_1} - 1,$ (4.31)
$p + qt|_{t=\mu_2} = \tfrac{1}{\mu_2} - 1.$ (4.32)

Then, by the convexity of the function $\frac{1}{t}-1$, and since $\mu_1$ and $\mu_2$ are consecutive points in the set $\{\frac{1}{K},\frac{2}{K},\dots,1\}$, the function $\frac{1}{t}-1$ cannot be smaller than the line $p+qt$ at the points $t = \frac{1}{K}, \frac{2}{K},\dots, 1$. That is, for all $t \in \{\frac{1}{K},\dots,1\}$,

$\tfrac{1}{t} - 1 \geq p + qt.$ (4.33)

By Lemma 4.1, we have

$L^*_u(\mathcal{U}) \geq \sum_{j=1}^{K} \frac{a^j_{\mathcal{U}}}{N}\cdot\frac{K-j}{j}$ (4.34)
$= \sum_{t=\frac{1}{K},\dots,1} \frac{a^{tK}_{\mathcal{U}}}{N}\cdot\left(\tfrac{1}{t} - 1\right)$ (4.35)
$\geq \sum_{t=\frac{1}{K},\dots,1} \frac{a^{tK}_{\mathcal{U}}}{N}\cdot(p + qt)$ (4.36)
$= p + q\mu.$ (4.37)

Therefore, for general $\frac{1}{K} \leq \mu \leq 1$, $L^*_u(\mathcal{U})$ is lower bounded by the lower convex envelope of the points $\{(\mu, \frac{1}{\mu}-1): \mu \in \{\frac{1}{K}, \frac{2}{K},\dots, 1\}\}$.
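As a small sanity check of the convexity step above, the snippet below evaluates the bound of Lemma 4.1 for two illustrative file-replication profiles with the same aggregate storage $\mu N K$: the symmetric profile used in Section 4.2 attains the value $\frac{1}{\mu}-1$, while an uneven profile yields a strictly larger bound. The numbers are made up for illustration only.

```python
from fractions import Fraction as F

# Evaluate the Lemma 4.1 bound sum_j (a_j/N)*(K-j)/j for two replication profiles
# with the same aggregate storage; K = 10, mu = 0.4, N = 10 are illustrative.
K, N, muK = 10, 10, 4

def lemma41_bound(a):                 # a[j] = number of files stored at exactly j users
    assert sum(a.values()) == N and sum(j * c for j, c in a.items()) == muK * N
    return sum(F(c, N) * F(K - j, j) for j, c in a.items())

symmetric = {muK: N}                  # every file stored at exactly mu*K = 4 users
skewed = {2: 5, 6: 5}                 # same total storage, uneven replication
print(lemma41_bound(symmetric))       # 3/2 = 1/mu - 1
print(lemma41_bound(skewed))          # 7/3 > 1/mu - 1
```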
4.4.2 Lower Bound on $L^*_d(\mathcal{U})$

The lower bound on the minimum downlink communication load $L^*_d(\mathcal{U})$ can be proved following similar steps to the lower bound on the minimum uplink communication load $L^*_u(\mathcal{U})$, after making the following enhancements to the downlink communication system:

• We consider the access point as the $(K+1)$th user, who has stored all $N$ files and has a virtual input to process. Thus the enhanced downlink communication system has $K+1$ users, and the dataset placement for the enhanced system is

$\bar{\mathcal{U}} \triangleq \{\mathcal{U}, \mathcal{U}_{K+1}\},$ (4.38)

where $\mathcal{U}_{K+1}$ is equal to $\{1,\dots,N\}$.

• We assume that every one of the $K+1$ users can broadcast to the rest of the users, where the broadcast message is generated by mapping the locally stored files.

Clearly, the minimum downlink communication load of the system cannot increase after the above enhancements. Thus, a lower bound on the minimum downlink communication load of the enhanced system is also a lower bound for the original system. We can then apply the same arguments as in the proof of Lemma 4.1 to the enhanced downlink system of $K+1$ users, obtaining a lower bound on $L^*_d(\mathcal{U})$, as described in the following corollary.

Corollary 4.1. $L^*_d(\mathcal{U}) \geq \sum_{j=1}^{K} \frac{a^j_{\mathcal{U}}}{N}\cdot\frac{K-j}{j+1}$.

Proof. Applying Lemma 4.1 to the enhanced downlink system yields

$L^*_d(\bar{\mathcal{U}}) \geq \sum_{j=1}^{K+1} \frac{a^j_{\bar{\mathcal{U}}}}{N}\cdot\frac{K+1-j}{j} \geq \sum_{j=2}^{K+1} \frac{a^j_{\bar{\mathcal{U}}}}{N}\cdot\frac{K+1-j}{j}$ (4.39)
$= \sum_{j=1}^{K} \frac{a^{j+1}_{\bar{\mathcal{U}}}}{N}\cdot\frac{K-j}{j+1}.$ (4.40)

Since the access point has stored every file, $a^{j+1}_{\bar{\mathcal{U}}} = a^j_{\mathcal{U}}$ for all $j \in \{1,\dots,K\}$. Therefore, (4.40) can be re-written as

$L^*_d(\mathcal{U}) \geq L^*_d(\bar{\mathcal{U}}) \geq \sum_{j=1}^{K} \frac{a^j_{\mathcal{U}}}{N}\cdot\frac{K-j}{j+1}.$ (4.41)

Then, following the same arguments as in the proof for the minimum uplink communication load, we have

$L^*_d(\mathcal{U}) \geq \frac{K-\mu K}{\mu K+1} = \frac{\mu K}{\mu K+1}\cdot\left(\frac{1}{\mu} - 1\right).$ (4.42)

For general $\frac{1}{K} \leq \mu \leq 1$, $L^*_d(\mathcal{U})$ is lower bounded by the lower convex envelope of the points $\{(\mu, \frac{\mu K}{\mu K+1}(\frac{1}{\mu}-1)): \mu \in \{\frac{1}{K}, \frac{2}{K},\dots, 1\}\}$. This completes the proof of Theorem 4.3.

Theorem 4.4. Consider a decentralized wireless distributed computing application. For any random dataset placement with a placement distribution $P$ that achieves an information loss $\Delta$, and communication schemes that achieve communication loads $L_u$ and $L_d$ with high probability for large $N$, $L_u$ and $L_d$ are lower bounded by $\frac{1}{\mu}-1$ when $K$ is large and $\Delta$ approaches 0.

Remark 4.13. When the number of participating users is large (large $K$), the above lower bound in Theorem 4.4 coincides with the asymptotic loads achieved by the proposed decentralized CWDC scheme stated in Theorem 4.2 (see Remark 4.10). Therefore, the proposed decentralized scheme is asymptotically optimal.

Next, we prove that for any decentralized dataset placement, the minimum achievable communication loads are lower bounded by $\frac{1}{\mu}-1$ when the number of participating users is large and the information loss approaches zero. Hence, the asymptotic communication loads achieved by the proposed decentralized scheme cannot be further improved. Specifically, for a given realization of the dataset placement $\mathcal{U}$ with information loss $\Delta(\mathcal{U})$, we denote the minimum possible uplink and downlink communication loads by $L^*_{\text{decent},u}(\mathcal{U})$ and $L^*_{\text{decent},d}(\mathcal{U})$, and derive lower bounds on $L^*_{\text{decent},u}(\mathcal{U})$ and $L^*_{\text{decent},d}(\mathcal{U})$ respectively.
We note that, given the information loss $\Delta(\mathcal{U})$, a $1-\Delta(\mathcal{U})$ fraction of the files are available across the participating users, all of which need to be processed to compute the outputs (see (4.21)). Among those files, a $\bar{\mu}(\mathcal{U}) \triangleq \frac{\mu}{1-\Delta(\mathcal{U})}$ fraction are stored by each participating user. Following the same steps as in the proofs of the lower bounds for the centralized setting, the minimum communication loads for the dataset placement $\mathcal{U}$ are lower bounded as follows:

$L^*_{\text{decent},u}(\mathcal{U}) \geq \left(\frac{1}{\bar{\mu}(\mathcal{U})} - 1\right)(1-\Delta(\mathcal{U}))$ (4.43)
$= \left(\frac{1-\Delta(\mathcal{U})}{\mu} - 1\right)(1-\Delta(\mathcal{U})),$ (4.44)

$L^*_{\text{decent},d}(\mathcal{U}) \geq \frac{\bar{\mu}(\mathcal{U})K}{\bar{\mu}(\mathcal{U})K+1}\left(\frac{1}{\bar{\mu}(\mathcal{U})} - 1\right)(1-\Delta(\mathcal{U}))$ (4.45)
$= \frac{\mu K}{\mu K + 1-\Delta(\mathcal{U})}\left(\frac{1-\Delta(\mathcal{U})}{\mu} - 1\right)(1-\Delta(\mathcal{U})).$ (4.46)

Since the above bounds hold for any realization of the dataset placement $\mathcal{U}$, for a decentralized dataset placement scheme with a distribution $P$ that achieves an information loss $\Delta(P)$ and communication loads $L^*_{\text{decent},u}(P)$, $L^*_{\text{decent},d}(P)$ with high probability, the following inequalities hold:

$L^*_{\text{decent},u}(P) \geq \left(\frac{1-\Delta(P)}{\mu} - 1\right)(1-\Delta(P)),$ (4.47)
$L^*_{\text{decent},d}(P) \geq \frac{\mu K}{\mu K + 1-\Delta(P)}\left(\frac{1-\Delta(P)}{\mu} - 1\right)(1-\Delta(P)).$ (4.48)

Hence, when the number of active users is large, the achievable uplink and downlink communication loads, for any decentralized dataset placement scheme with a distribution $P$ that achieves a vanishing information loss, are bounded by

$L^*_{\text{decent},u}(P) \geq \frac{1}{\mu} - 1,$ (4.49)
$L^*_{\text{decent},d}(P) \geq \frac{1}{\mu} - 1.$ (4.50)

This completes the proof of Theorem 4.4.

Chapter 5
A Universal Coded Computing Architecture for Mobile Edge Processing

This chapter is mainly taken from [79, 80], coauthored by the author of this document.

Having studied, in the previous chapter, the mobile edge computing scenario where the computations are performed distributedly on the mobile users, we now focus on another prevalent mobile edge computing paradigm, in which the computation requests of mobile users are uploaded to edge servers for processing. For this setting, we study in this chapter how coding techniques can help to minimize the load of computation at the edge servers, and simultaneously maximize the spectral efficiency of delivering the computed results back to the users via the underlying wireless links.

We consider a mobile edge computing (or fog computing) scenario (see, e.g., [76, 81, 82]), in which $K$ mobile users (e.g., smartphones and smart cars), denoted by User 1,..., User $K$, offload their computation tasks to $M$ computing nodes at the network edge (e.g., base stations), denoted by EN 1,..., EN $M$. In particular, User $k$, $k = 1,\dots,K$, has an input request $d_k$, and requests the computation of an output function $\phi(d_k)$, which is performed at the edge nodes. The executions of many edge computing applications, including navigation, recommendation systems, and augmented reality, follow this computing paradigm (see, e.g., [28, 83, 84]). For example, in a navigation application, a group of smart cars upload their locations and intended destinations to the edge, and the edge nodes compute the fastest routes based on the map and traffic information and return them to the cars.

In the above described edge computing paradigm, each user $k$, $k = 1, 2,\dots,K$, as shown in Fig. 5.1, first uploads its request $d_k$ to the network edge. Upon receiving all the input requests, the edge nodes process them in two phases: the computation phase and the communication phase. In the computation phase, the edge nodes compute the output functions from the users’ inputs.
In the communication phase, the edge nodes deliver their computed results, through the underlying wireless links, to the users. For any computing architecture over the above model, the performance metrics we are interested in are the computation load, measured by the total number of functions computed across the edge nodes in the computation phase, and the communication load exerted on the wireless links in the communication phase.

Figure 5.1: A mobile edge computing system consisting of $K$ mobile users and $M$ edge nodes. The edge processing consists of the computation phase and the communication phase. The users’ requests are processed at the edge nodes in the computation phase, and the computed results are delivered back to the users in the communication phase.

To minimize the computation load of the edge computing system, consider for example the case where we have the same number of edge nodes as users (i.e., $M = K$). In the computation phase, we can have each edge node dedicated to processing the request of only one user, and perform orthogonal communications to deliver the results in the communication phase. This approach minimizes the computation load (since each request is processed only once), but the communication is very inefficient (the wireless channel is used $K$ times). The edge nodes can perform more computations to save communication. For example, at the other extreme, we can have every single edge node process all the requests, such that the physical layer in the communication phase emulates a $K \times K$ multiple-input single-output (MISO) broadcast channel, which can be diagonalized into $K$ parallel interference-free channels, and the requested outputs of all users can be delivered simultaneously with a single use of the wireless channel. This approach minimizes the communication load at the cost of a very high computation load ($K^2$ output functions are computed).

The above computing schemes represent two rather trivial points in the set of all possible computation-communication load pairs that accomplish the computation task and deliver the results, which we call the computation-communication load region. In this work, our objective is to formalize and characterize the entire load region for mobile edge processing. In particular, we focus on universal schemes, in which the edge nodes perform the computations without knowing the channel gains towards the users in the upcoming communication phase. Universal computation is in fact a common practice in mobile computing systems (see, e.g., [85, 86]), where the computation phase and the communication phase are executed independently of each other. This is primarily due to the fact that the channel state information (CSI) of the communication phase cannot be predicted ahead of time at the computation phase.

Our main result is a full characterization of the computation-communication load region for linear output functions. In particular, we show that the load region is dominated by one corner point that simultaneously achieves the minimum computation load and the minimum communication load.
To establish this result, we first argue that each edge node should execute coded computations, in which computation tasks are executed on some linear combinations of the inputs, rather than executing the task on each input individually. We then propose a universal coded edge computing (UCEC) scheme, in which coded computations are performed at the edge nodes to create messages that neutralize all interference signals in the upcoming communication phase. In particular, the edge nodes create coded inputs, each as the sum of certain carefully selected users’ inputs, and pass them into the output functions to compute coded results. The coded computation at each node is such that (i) it achieves the minimum computation load of 1 computation unit per user input, (ii) it is independent of the CSI of the communication phase, i.e., it is universal, and (iii) no matter what the CSI of the communication phase is, the coded computation results allow the edge nodes to create messages that cancel all interference signals over the air and achieve the maximum spectral efficiency of 1 symbol/user/channel use.

The coding technique in the proposed UCEC architecture is motivated by the “aligned network diagonalization” technique in [87] and the “aligned interference neutralization” technique in [88], for communications over a two-hop relay network. In particular, we develop the coded computations following the patterns of aligned signals recovered at the relays in [87, 88]. These aligned signals can be used for signaling over the next hop of the relays such that all interference is cancelled at the destinations. We notice and exploit in this work the fact that the aligned signals at the relays are independent of the CSI of the next hop. While this property is completely irrelevant in communication over the two-hop relay network, it allows us to decouple the computation phase from the communication phase and develop universal computing schemes for edge processing.

We also extend the UCEC architecture to account for the situation of missing edge nodes in the communication phase. This phenomenon can be caused by various reasons, including 1) some edge nodes computing significantly slower than the others, or becoming unresponsive before communicating the computation results, and 2) some edge nodes being out of the communication range and getting disconnected from the users due to the users’ mobility. In this scenario, only a random subset of the edge nodes can reach the users in the communication phase, and the computation phase has to be designed without knowing which edge nodes can later communicate with the users. We propose a robust universal architecture that operates at a slightly higher computation load, and still maintains the minimum communication load, for communication from any subset of the edge nodes to the users.

Related works

The above mobile edge computing model has been widely considered in the literature. While most of the works (see, e.g., [76, 89, 90]) focus on finding optimal resource allocation strategies, we demonstrate the key role that coding can play in achieving the optimal computation-communication tradeoff. Recently, various coding techniques have been applied to speed up distributed computing (see, e.g., [1, 2, 42, 54, 65, 71, 91–96]).
In particular, for the edge computing setting, while here we assume that the dataset used to process the users’ requests is entirely available at each edge node, another scenario where each node can only store a part of the dataset was considered in [96], and a coding framework for the edge nodes to collaboratively process users’ requests was proposed, in order to minimize the load of exchanging intermediate computation results between edge nodes. An interesting future direction is to design optimal coding schemes for mobile edge computing models that account for communications both from the edge nodes to the users and between the edge nodes.

5.1 Problem Formulation

We consider a mobile edge computing problem, in which $K$ mobile users (e.g., smart cars) offload their computation tasks to $M$ edge nodes (e.g., base stations), for some $K, M \in \mathbb{N}$. We denote the $K$ users as User 1,..., User $K$, and the $M$ edge nodes as EN 1,..., EN $M$.

User $k$, $k = 1,\dots,K$, has a sequence of input vectors $(d_k[i])_{i=1}^{\infty}$ (e.g., a list of destinations of interest), where for each $i \in \mathbb{N}$, $d_k[i] \in \mathbb{R}^D$, for some $D \in \mathbb{N}$. For each input vector $d_k[i]$, User $k$ wants to compute an output function (e.g., the fastest route) $\phi: \mathbb{R}^D \to \mathbb{R}$. We focus on linear functions such that

$\phi(\alpha d_m[i] + \beta d_n[j]) = \alpha\phi(d_m[i]) + \beta\phi(d_n[j]),$ (5.1)

for all coefficients $\alpha, \beta \in \mathbb{R}$, $m, n \in \{1,\dots,K\}$, and $i, j \in \mathbb{N}$.

Remark 5.1. One example of the above computation is to compute the inner product of a target row vector $\mathbf{a}^T$ and the input vectors, i.e., $\phi(d_k[i]) = \mathbf{a}^T d_k[i]$ for all $k = 1,\dots,K$. This type of computation is the basic component of the matrix algebra that underlies many edge processing applications like navigation and object recognition.

We consider edge computing schemes that process a block of $F$ input vectors at each user, for some design parameter $F \in \mathbb{N}$, i.e., the edge nodes process the $F$ inputs $d_k[1],\dots,d_k[F]$ from User $k$, for all $k = 1,\dots,K$. The overall computation scheme proceeds in two phases: the computation phase and the communication phase. The edge nodes compute the intended output functions in the computation phase, and return the computed results to the users in the communication phase.

5.1.1 Computation phase

In the beginning of the computation phase, EN $m$, $m = 1,\dots,M$, is given $\ell_m$ linear combinations of the users’ inputs, denoted by $\mathcal{L}_{m1}, \mathcal{L}_{m2},\dots, \mathcal{L}_{m\ell_m}$, for some $\ell_m \in \mathbb{N}$, i.e.,

$\mathcal{L}_{mj} = \sum_{i=1}^{F}\sum_{k=1}^{K} \alpha^{(k)}_{mj}[i]\, d_k[i],$ (5.2)

where $\alpha^{(k)}_{mj}[i] \in \mathbb{R}$ is the coefficient in the $j$th linear combination at EN $m$ corresponding to the $i$th input of User $k$, for all $j = 1,\dots,\ell_m$.

For each linearly encoded input $\mathcal{L}_{mj}$, EN $m$ computes a function

$s_{mj} = \phi(\mathcal{L}_{mj}) = \sum_{i=1}^{F}\sum_{k=1}^{K} \alpha^{(k)}_{mj}[i]\, \phi(d_k[i]).$ (5.3)

Definition 5.1 (Computation Load). We define the computation load, denoted by $r$, as the total number of functions computed across all edge nodes, normalized by the total number of required functions. That is, $r \triangleq \frac{\sum_{m=1}^{M}\ell_m}{FK}$.
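Since $\phi$ is linear, Mapping one coded input of the form (5.2) yields the corresponding linear combination of the users’ output values, exactly as stated in (5.3). The snippet below checks this numerically for the inner-product example of Remark 5.1; all dimensions, coefficients and variable names are illustrative.

```python
import numpy as np

# Numerical check of (5.2)-(5.3): because phi is linear (here phi(d) = a.T @ d),
# computing phi on one coded input equals the same linear combination of the outputs.
rng = np.random.default_rng(0)
K, F, D = 3, 2, 5
a = rng.normal(size=D)                        # target vector defining phi
d = rng.normal(size=(K, F, D))                # d[k, i] = i-th input vector of user k+1
alpha = rng.normal(size=(K, F))               # coefficients of a single coded input L_mj

phi = np.einsum("d,kid->ki", a, d)            # phi[k, i] = phi(d_k[i])
L_mj = np.einsum("ki,kid->d", alpha, d)       # coded input formed before the Map step
s_mj = a @ L_mj                               # one computation on the coded input
assert np.allclose(s_mj, np.sum(alpha * phi)) # equals sum_{k,i} alpha^{(k)}[i] * phi(d_k[i])
```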
5.1.2 Communication phase

After the computation phase, only a subset of the $M$ edge nodes communicate their computation results back to the users through the wireless channels. We denote the set of the indices of these nodes as $\mathcal{M}_c \subseteq \{1,\dots,M\}$. This phenomenon of missing edge nodes in the communication phase may be caused by multiple reasons, e.g., 1) nodes being slow or unresponsive, or 2) loss of connectivity when users are moving.

The communication from the edge nodes to the users is performed over $T$ time slots, for some $T \in \mathbb{N}$. The symbol communicated by EN $m$ at time $t$, $t = 1,\dots,T$, denoted by $X_m(t) \in \mathbb{R}$, is generated as a function, denoted by $\psi_m(t)$, of the local computation results at EN $m$, for all $m \in \mathcal{M}_c$, i.e.,

$X_m(t) = \psi_m(t)(\{s_{mj}: j = 1, 2,\dots,\ell_m\}).$ (5.4)

Each edge node has an average power constraint of $P$, i.e.,

$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[X_m(t)^2] \leq P,$ (5.5)

for all $m \in \mathcal{M}_c$. The received symbol at User $k$ in time $t$, $t = 1,\dots,T$, is

$Y_k(t) = \sum_{m \in \mathcal{M}_c} h_{km}(t)X_m(t) + Z_k(t),$ (5.6)

where $h_{km}(t) \in \mathbb{R}$, $k = 1,\dots,K$, $m \in \mathcal{M}_c$, is the channel gain from EN $m$ to User $k$ in time $t$. The channel gains $\{h_{km}(t): k = 1,\dots,K,\ m \in \mathcal{M}_c\}_{t=1}^{T}$ are time-varying, and they are drawn i.i.d. across time and space from a continuous distribution with a bounded second moment. We assume that in the communication phase, at any time $t$, the instantaneous channel state information $\{h_{km}(t): k = 1,\dots,K,\ m \in \mathcal{M}_c\}$ is available at all edge nodes in $\mathcal{M}_c$ and at all mobile users. $Z_k(t) \sim \mathcal{N}(0,1)$ is the additive white Gaussian noise at User $k$ in time $t$.

Definition 5.2 (Communication Load). We define the communication load, denoted by $L$, as the total number of communication time slots $T$, normalized by the number of output functions required by each user, i.e., $L \triangleq \frac{T}{F}$.

After $T$ time slots of communication, User $k$, $k = 1,\dots,K$, for each $i = 1,\dots,F$, reconstructs the intended output function $\phi(d_k[i])$ using a decoding function $\rho_k[i]$. That is,

$\hat{\phi}(d_k[i]) = \rho_k[i](\{Y_k(t)\}_{t=1}^{T}).$ (5.7)

We assume that, across $k = 1,\dots,K$ and $i = 1,\dots,F$, the input vectors $d_k[i]$ are independent random vectors, each of which follows an arbitrary distribution on $\mathbb{R}^D$, and the corresponding output values $\phi(d_k[i])$ are independent random variables, each of which has a bounded second moment. We define the recovery distortion of the output function computed from the input vector $d_k[i]$ at User $k$ as

$D_k[i] = \mathbb{E}\{(\hat{\phi}(d_k[i]) - \phi(d_k[i]))^2\}.$ (5.8)

We say that a computation-communication load pair $(r, L)$ is achievable if there exists a computation scheme with a computation load of at most $r$ and a communication load of at most $L$, such that by the end of the communication phase, User $k$, $k = 1,\dots,K$, can obtain a noisy version of $\phi(d_k[i])$, for all $i = 1,\dots,F$, or more precisely,

$\lim_{P\to\infty}\frac{\log(1/D_k[i])}{\log P} = 1.$ (5.9)

We define the computation-communication load region of this mobile edge computing system, denoted by $\mathcal{C}$, as the closure of the set of all achievable computation-communication load pairs.

In the rest of this chapter, we first consider the scenario in which no edge node is missing in the communication phase, i.e., $\mathcal{M}_c = \{1,\dots,M\}$. In Section 5.5, we extend the proposed edge computing architecture to deal with the scenario of missing edge nodes, i.e., $|\mathcal{M}_c| < M$.

5.2 Motivation and Main Results

For the mobile edge computing scenario formulated in the previous section, since each output function needs to be computed at least once, the minimum computation load is at least 1. On the other hand, in the communication phase, even when we can create parallel communication channels for the $K$ users, we need to use the channel at least $F$ times, one use for delivering each output function required by a single user. Hence, the minimum communication load is also at least 1.
Given the above individual lower bounds on the computation load and the communication load, we ask the following question: Can we achieve the minimum computation load and the minimum communication load simultaneously? That is, is there an edge computing scheme that achieves the load pair (1, 1)?

We first show, through the following example, the achievability of the (1, 1) pair when the edge nodes know the channel gains towards the users in the communication phase while they are performing the local computations. We note that this situation often occurs when the channels from the edge nodes to the users are static or vary very slowly.

Example 5.1 (channel-state-informed coded computing). We consider a mobile edge computing scenario in which $K = 2$ mobile users offload their computation tasks to $M = 2$ edge nodes. User $k$, $k = 1, 2$, has an input vector $d_k \in \mathbb{R}^D$, and wants to compute an inner product between a target vector $\mathbf{a} \in \mathbb{R}^D$ and its input $d_k$, i.e., $\phi_k = \mathbf{a}^T d_k$. In this case, we assume that the four channel gains $h_{11}$, $h_{12}$, $h_{21}$, and $h_{22}$ from the edge nodes to the users in the communication phase are known a priori at the two edge nodes when they are processing the users’ input requests. Utilizing the channel state information, and the linearity of the vector inner-product operation, the edge nodes can create coded inputs, and process them to obtain coded computation results that are ready for interference neutralization over the air at the users.

Computation phase: Before the computation starts, EN 1 generates a linear combination of the two input vectors $\mathcal{L}_1 = -h_{22}d_1 + h_{12}d_2$, and EN 2 also generates a linear combination $\mathcal{L}_2 = h_{21}d_1 - h_{11}d_2$.

Figure 5.2: Channel-state-informed coded edge computing for the scenario of $K = 2$ users and $M = 2$ edge nodes. Using channel state information to design coded computations allows zero-forcing the interference signal at each user.

Then, as shown in Fig. 5.2, EN 1 and EN 2 respectively compute the inner product of the target vector $\mathbf{a}$ and their generated coded inputs to obtain the coded outputs $s_1$ and $s_2$. That is,

$s_1 = \mathbf{a}^T\mathcal{L}_1 = -h_{22}\phi_1 + h_{12}\phi_2,$ (5.10)
$s_2 = \mathbf{a}^T\mathcal{L}_2 = h_{21}\phi_1 - h_{11}\phi_2.$ (5.11)

Each edge node processes a single coded input in the computation phase, achieving a computation load of $r = 1$.

Communication phase: Having computed $s_m$ locally, EN $m$ simply communicates $X_m = \gamma s_m$ to the two users, for $m = 1, 2$, where $\gamma$ is a scaling factor used to satisfy the power constraint. As a result, User $k$, $k = 1, 2$, receives a noisy version of $\phi_k$ free of interference, i.e.,

$Y_1 = \gamma(h_{12}h_{21} - h_{11}h_{22})\phi_1 + Z_1,$ (5.12)
$Y_2 = \gamma(h_{21}h_{12} - h_{22}h_{11})\phi_2 + Z_2.$ (5.13)

From the received signal $Y_k$, $k = 1, 2$, User $k$ can successfully reconstruct its intended output function. Since the wireless channel is accessed simultaneously by both edge nodes only once, we also achieve the minimum communication load of $L = 1$.

Remark 5.2. We have demonstrated through the above example the key impact of coding on simultaneously minimizing the computation load and maximizing the spectral efficiency. In particular, this type of coding over the input requests needs to be communication-aware, such that the computed coded results can be directly utilized to create messages that zero-force the interference signals over the air at each user.
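The interference neutralization in Example 5.1 can be verified numerically; the sketch below reproduces (5.10)–(5.13) with random channel gains and inputs (noise omitted and $\gamma = 1$ for simplicity). The array names are illustrative.

```python
import numpy as np

# Numerical check of Example 5.1: with channel-aware coded Map computations, each
# user's received signal contains only its own output value (gamma = 1, no noise).
rng = np.random.default_rng(1)
D = 4
a = rng.normal(size=D)                              # target vector
d1, d2 = rng.normal(size=D), rng.normal(size=D)     # the two users' inputs
phi1, phi2 = a @ d1, a @ d2
h = rng.normal(size=(2, 2))                         # h[k-1, m-1] = channel gain h_{km}

s1 = a @ (-h[1, 1] * d1 + h[0, 1] * d2)             # EN 1 maps L_1 = -h22*d1 + h12*d2
s2 = a @ ( h[1, 0] * d1 - h[0, 0] * d2)             # EN 2 maps L_2 =  h21*d1 - h11*d2
y1 = h[0, 0] * s1 + h[0, 1] * s2                    # user 1 receives h11*X1 + h12*X2
y2 = h[1, 0] * s1 + h[1, 1] * s2                    # user 2 receives h21*X1 + h22*X2
det = h[0, 1] * h[1, 0] - h[0, 0] * h[1, 1]
assert np.allclose([y1, y2], [det * phi1, det * phi2])   # matches (5.12)-(5.13)
```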
While the above channel-state-informed coded computing scheme achieves the optimal load pair (1, 1) for static channel gains, it is not commonly applicable in mobile edge computing environments. This is because the channel states often change between the computation phase and the communication phase (e.g., when the users are moving), and the channel gains in the communication phase cannot be predicted ahead of time. Therefore, we should focus on universal schemes in which the edge nodes perform their computations without knowing the channel gains towards the users in the future communication phase (i.e., the coefficients in (5.2) are independent of the channel states). Motivated by this phenomenon, we ask the following question:

Is there a universal edge computing architecture that simultaneously achieves the minimum computation load and the minimum communication load, i.e., the load pair (1, 1), without requiring channel state information in the computation phase at the edge nodes?

We answer the above question affirmatively, and present the main result of this chapter in the following theorem.

Theorem 5.1. For a mobile edge computing system with $K$ mobile users and $K$ edge nodes, there exists a universal computation architecture, named universal coded edge computing (UCEC), that achieves the minimum computation load and the minimum communication load simultaneously, i.e., the load pair (1, 1), for time-varying channels and with no channel state information available at the edge nodes in the computation phase.

We prove Theorem 5.1 in Section 5.4 by describing the proposed UCEC architecture and analyzing its performance.

Remark 5.3. The key feature of the proposed UCEC architecture is that in the computation phase, the edge nodes create coded input vectors without using any channel state information, and compute coded output results with a computation load of $r = 1$. In the communication phase, the edge nodes create downlink signals from the coded output results. These signals are designed such that all the interference signals at the users are neutralized over the air, hence achieving the maximum spectral efficiency.

Remark 5.4. Theorem 5.1 implies that when no channel state information is available in the computation phase, the computation-communication load region has a simple shape that is dominated by a single corner point (1, 1). Hence, performing computations without being aware of the channel states does not cause any performance loss.

Remark 5.5. Theorem 5.1 implies that, in contrast to the channel-state-informed coded computing scheme in Example 5.1, we can execute the computation phase separately from the communication phase without losing any performance. For example, we can perform the computations at some remote edge clusters without knowing when and how the computed results will be delivered to the mobile users, and later have the access points close to the users (e.g., base stations) communicate the results, and still simultaneously achieve the minimum computation load and the minimum communication load.

Remark 5.6. The coding techniques of the UCEC architecture are motivated by the "aligned network diagonalization" (AND) technique in [87] and the "aligned interference neutralization" (AIN) technique in [88], developed for communication over two-hop relay networks.
AND and AIN design signals at the sources such that desired linear combinations of message symbols are recovered at the relays. These linear combinations are then utilized to deliver the intended messages to the destinations free of interference. In the UCEC architecture, the edge nodes construct coded computations following the alignment patterns at the relays in AND and AIN. In particular, coded inputs are created as sums of subsets of the users' inputs, without using the channel states of the wireless links towards the users.

Remark 5.7. We can directly apply the proposed UCEC architecture in Theorem 5.1 to the general case of $K$ users and $M$ edge nodes. In particular, when $M > K$, we can use any $K$ out of the $M$ edge nodes to achieve the load pair (1, 1). When $M < K$, we can split the $K$ users into $\lceil K/M \rceil$ partitions of size $M$ (except that one partition has size $K - M\lfloor K/M \rfloor$). Then we repeatedly apply the UCEC scheme between the $M$ edge nodes and each of the user partitions, achieving a load pair $(1, \lceil K/M \rceil)$. Overall, the UCEC architecture achieves the load pair $(1, \lceil K/M \rceil)$ for the case of $K$ users and $M$ edge nodes.

In the next section, we illustrate the key ideas of the proposed UCEC architecture through a simple example.

5.3 Illustration of the Universal Coded Edge Computing Scheme via a Simple Example

We consider the same mobile edge computing system as in Example 5.1, where $K = 2$ mobile users offload their computation tasks to $M = 2$ edge nodes. User $k$, $k = 1, 2$, has a block of $F$ input vectors $\mathbf{d}_k[1], \ldots, \mathbf{d}_k[F]$, and wants to compute the inner products between a target vector $\mathbf{a} \in \mathbb{R}^D$ and its input vectors, i.e., $\phi_k[i] = \mathbf{a}^{\sf T}\mathbf{d}_k[i]$, $i = 1, 2, \ldots, F$. In contrast to Example 5.1, we now do not assume knowledge of the channel states at the edge nodes in the computation phase.

We first consider the case $F = 3$, i.e., User $k$, $k = 1, 2$, wants to compute 3 output values $\phi_k[i] = \mathbf{a}^{\sf T}\mathbf{d}_k[i]$, $i = 1, 2, 3$. Before the computation phase starts, EN 1 generates 2 linear combinations of the inputs,
$$\mathbf{L}_{11} = \mathbf{d}_1[1], \quad (5.14)$$
$$\mathbf{L}_{12} = \mathbf{d}_1[2] + \mathbf{d}_2[1], \quad (5.15)$$
and EN 2 generates a single linear combination
$$\mathbf{L}_2 = \mathbf{d}_1[1] + \mathbf{d}_2[1]. \quad (5.16)$$
We note that the above linear combinations do not depend on the channel gains. In the computation phase, EN 1 computes two functions:
$$s_{11} = \mathbf{a}^{\sf T}\mathbf{L}_{11} = \phi_1[1], \quad (5.17)$$
$$s_{12} = \mathbf{a}^{\sf T}\mathbf{L}_{12} = \phi_1[2] + \phi_2[1]. \quad (5.18)$$
Also, EN 2 computes one function
$$s_2 = \mathbf{a}^{\sf T}\mathbf{L}_2 = \phi_1[1] + \phi_2[1]. \quad (5.19)$$
In the communication phase, the edge nodes deliver the computation results to the users in 2 time slots. To start, EN 1 selects two transmit directions specified by two pre-coding vectors $\mathbf{v}_{11}$ and $\mathbf{v}_{12}$, and EN 2 selects one transmit direction specified by a pre-coding vector $\mathbf{v}_2$, for some $\mathbf{v}_{11}, \mathbf{v}_{12}, \mathbf{v}_2 \in \mathbb{R}^2$. Then, the edge nodes create the following symbols for transmission:
$$\begin{bmatrix} X_1(1) \\ X_1(2) \end{bmatrix} = \mathbf{v}_{11}s_{11} + \mathbf{v}_{12}s_{12} = \mathbf{v}_{11}\phi_1[1] + \mathbf{v}_{12}(\phi_1[2] + \phi_2[1]), \quad (5.20)$$
$$\begin{bmatrix} X_2(1) \\ X_2(2) \end{bmatrix} = \mathbf{v}_2 s_2 = \mathbf{v}_2(\phi_1[1] + \phi_2[1]). \quad (5.21)$$
The received signals at User 1 over the two time slots are
$$\begin{bmatrix} Y_1(1) \\ Y_1(2) \end{bmatrix} = \underbrace{\begin{bmatrix} h_{11}(1) & 0 \\ 0 & h_{11}(2) \end{bmatrix}}_{\mathbf{H}_{11}}\begin{bmatrix} X_1(1) \\ X_1(2) \end{bmatrix} + \underbrace{\begin{bmatrix} h_{12}(1) & 0 \\ 0 & h_{12}(2) \end{bmatrix}}_{\mathbf{H}_{12}}\begin{bmatrix} X_2(1) \\ X_2(2) \end{bmatrix} + \underbrace{\begin{bmatrix} Z_1(1) \\ Z_1(2) \end{bmatrix}}_{\mathbf{Z}_1} \quad (5.22)$$
$$= (\mathbf{H}_{11}\mathbf{v}_{11} + \mathbf{H}_{12}\mathbf{v}_2)\phi_1[1] + \mathbf{H}_{11}\mathbf{v}_{12}\phi_1[2] + \underbrace{(\mathbf{H}_{11}\mathbf{v}_{12} + \mathbf{H}_{12}\mathbf{v}_2)\phi_2[1]}_{\text{interference}} + \mathbf{Z}_1, \quad (5.23)$$
and the received signals at User 2 over the two time slots are
$$\begin{bmatrix} Y_2(1) \\ Y_2(2) \end{bmatrix} = \underbrace{\begin{bmatrix} h_{21}(1) & 0 \\ 0 & h_{21}(2) \end{bmatrix}}_{\mathbf{H}_{21}}\begin{bmatrix} X_1(1) \\ X_1(2) \end{bmatrix} + \underbrace{\begin{bmatrix} h_{22}(1) & 0 \\ 0 & h_{22}(2) \end{bmatrix}}_{\mathbf{H}_{22}}\begin{bmatrix} X_2(1) \\ X_2(2) \end{bmatrix} + \underbrace{\begin{bmatrix} Z_2(1) \\ Z_2(2) \end{bmatrix}}_{\mathbf{Z}_2} \quad (5.24)$$
$$= (\mathbf{H}_{21}\mathbf{v}_{12} + \mathbf{H}_{22}\mathbf{v}_2)\phi_2[1] + \mathbf{H}_{21}\mathbf{v}_{12}\phi_1[2] + \underbrace{(\mathbf{H}_{21}\mathbf{v}_{11} + \mathbf{H}_{22}\mathbf{v}_2)\phi_1[1]}_{\text{interference}} + \mathbf{Z}_2. \quad (5.25)$$
By choosing the transmit directions $\mathbf{v}_{11}$, $\mathbf{v}_{12}$ and $\mathbf{v}_2$ satisfying $\mathbf{H}_{11}\mathbf{v}_{12} = -\mathbf{H}_{12}\mathbf{v}_2$ and $\mathbf{H}_{21}\mathbf{v}_{11} = -\mathbf{H}_{22}\mathbf{v}_2$, the edge nodes can zero-force the above interference signals over the air.

After the communication phase, User 1 recovers a noisy estimate of $\phi_1[1]$ and $\phi_1[2]$, and User 2 recovers a noisy estimate of $\phi_2[1]$. Next, we swap the roles of User 1 and User 2, and perform the same computation and communication operations as before to deliver $\phi_2[2]$ and $\phi_2[3]$ to User 2, and $\phi_1[3]$ to User 1.

Using this scheme, we achieve a computation load of $r = \frac{(2+1)\times 2}{3\times 2} = 1$ and a communication load of $L = \frac{2\times 2}{3} = \frac{4}{3}$. We note that without channel state information in the computation phase, the edge nodes can still perform coded computations to reduce the communication load by 33.3% (the communication load would have been 2 if uncoded computations and orthogonal communications were employed), while maintaining the minimum computation load of 1.

The techniques utilized in the above UCEC scheme are motivated by the "aligned interference neutralization" (AIN) technique in [88] and the "aligned network diagonalization" (AND) technique in [87], for communication over a two-hop wireless relay network consisting of a set of source nodes, a set of relay nodes, and a set of destination nodes. AIN and AND design the transmitted signals at the sources such that interference signals are aligned at the relays. In particular, each aligned signal at the relays is the sum of some message symbols, which does not depend on the channel gains on either hop of the network. On the second hop, the relays create messages that cancel the interference signals over the air at the destinations. For the mobile edge computing problem considered here, in the computation phase, the edge nodes generate coded input vectors following how the message symbols are aligned at the relays in AIN or AND, which is performed oblivious of the channel gains. As a result, the edge nodes can similarly create downlink signals from the coded computation results to zero-force interference over the air at all users.

We finally note that for the above mobile edge computing system consisting of $K = 2$ users and $M = 2$ edge nodes, if in general we consider a block of $F = 2W - 1$ input vectors at each user, for some $W \in \mathbb{N}$, then employing coding techniques motivated by the AIN scheme, we can use the first $W$ time slots to simultaneously communicate $W$ output functions to User 1 and $W - 1$ output functions to User 2, and use the second $W$ time slots to simultaneously communicate $W$ output functions to User 2 and $W - 1$ output functions to User 1. Hence, we can achieve a computation-communication load pair $(1, \frac{2W}{2W-1})$, which approaches the optimal pair (1, 1) as $W$ increases.
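The following Python sketch numerically works through the first round of the above $F = 3$ example. The input dimension, random seed, choice of $\mathbf{v}_2$, and the omission of noise are illustrative assumptions; the key point is that the coded inputs use no channel state information, while the pre-coding directions (chosen only in the communication phase) satisfy $\mathbf{H}_{11}\mathbf{v}_{12} = -\mathbf{H}_{12}\mathbf{v}_2$ and $\mathbf{H}_{21}\mathbf{v}_{11} = -\mathbf{H}_{22}\mathbf{v}_2$.

```python
import numpy as np

rng = np.random.default_rng(2)
Dv = 8                                             # input-vector dimension (illustrative)
a = rng.standard_normal(Dv)                        # common target vector
d1 = [rng.standard_normal(Dv) for _ in range(3)]   # user 1 inputs d1[1..3] (0-indexed below)
d2 = [rng.standard_normal(Dv) for _ in range(3)]   # user 2 inputs
phi1 = [a @ d for d in d1]
phi2 = [a @ d for d in d2]

# Computation phase: coded inputs are channel-independent sums of user inputs, cf. (5.14)-(5.19).
s11 = a @ d1[0]                                    # EN 1: phi1[1]
s12 = a @ (d1[1] + d2[0])                          # EN 1: phi1[2] + phi2[1]
s2  = a @ (d1[0] + d2[0])                          # EN 2: phi1[1] + phi2[1]

# Communication phase: channels become known only now (2 time slots), cf. (5.20)-(5.25).
h = rng.standard_normal((2, 2, 2))                 # h[t, k, m]
H11, H12 = np.diag(h[:, 0, 0]), np.diag(h[:, 0, 1])
H21, H22 = np.diag(h[:, 1, 0]), np.diag(h[:, 1, 1])

v2  = np.array([1.0, 1.0])                         # free transmit direction for EN 2 (assumption)
v12 = -np.linalg.solve(H11, H12 @ v2)              # enforces H11 v12 = -H12 v2
v11 = -np.linalg.solve(H21, H22 @ v2)              # enforces H21 v11 = -H22 v2

X1 = v11 * s11 + v12 * s12                         # EN 1 symbols over the 2 slots
X2 = v2 * s2                                       # EN 2 symbols over the 2 slots
Y1 = H11 @ X1 + H12 @ X2                           # noise omitted to expose the structure
Y2 = H21 @ X1 + H22 @ X2

# At user 1, phi2[1] is zero-forced; the 2 slots give 2 equations in phi1[1], phi1[2].
A1 = np.column_stack([H11 @ v11 + H12 @ v2, H11 @ v12])
est1 = np.linalg.solve(A1, Y1)
# At user 2, phi1[1] is zero-forced; solve for phi2[1] (phi1[2] rides along as a nuisance term).
A2 = np.column_stack([H21 @ v12 + H22 @ v2, H21 @ v12])
est2 = np.linalg.solve(A2, Y2)

print("user 1 recovers phi1[1], phi1[2]:", np.allclose(est1, [phi1[0], phi1[1]]))
print("user 2 recovers phi2[1]:", np.isclose(est2[0], phi2[0]))
```

Running the same two phases with the user roles swapped delivers the remaining output values, matching the loads $r = 1$ and $L = 4/3$ computed above.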
5.4 Universal Coded Edge Computing Architecture

In this section, we prove Theorem 5.1 by describing a universal coded edge computing (UCEC) architecture for a mobile edge computing scenario with $K$ users, $K$ edge nodes, and time-varying channels. The proposed scheme does not use the channel gains between the edge nodes and the users when executing the computation phase, and still asymptotically achieves the optimal computation-communication load pair.

5.4.1 Overview of UCEC

In the general UCEC architecture, the $K$ edge nodes process a sequence of $F > 1$ input requests from each user, and asymptotically achieve both the minimum computation load of $r = 1$ and the minimum communication load of $L = 1$. Specifically, the edge nodes compute a total of $KF + o(F)$ output functions from some coded inputs, and deliver $F$ computation results to each user over $F + o(F)$ time slots. The number of inputs $F$ is determined by a design parameter and the system configuration (i.e., the number of users and the number of edge nodes). Without using the channel states towards the users, each edge node generates some coded inputs, each of which is a summation of $K$ inputs, one from each user. Having locally processed these coded inputs in the computation phase, each node aligns the computed coded results along some transmit directions, such that the received signal at each user is a noisy linear combination of the $F$ output functions intended for that user. Repeating this communication process sufficiently many times, each user obtains enough equations to decode all of its intended functions.

5.4.2 Generating coded inputs

We consider the scenario of processing $F = N^{K^2}$ input vectors from each user, for some $N \in \mathbb{N}$. Each input vector at each user corresponds to a unique element of the set $\Delta_N \triangleq \{0, 1, \ldots, N-1\}^{K^2}$. More specifically, for each $k = 1, 2, \ldots, K$ and each element $\mathbf{p} = (p_{11}, p_{12}, \ldots, p_{KK}) \in \Delta_N$, we label the input of User $k$ corresponding to $\mathbf{p}$ as $\mathbf{d}_k^{\mathbf{p}}$. Before the computation starts, EN $m$, $m = 1, \ldots, K$, for each $(p'_{11}, p'_{12}, \ldots, p'_{KK}) \in \Delta_{N+1} = \{0, 1, \ldots, N\}^{K^2}$, creates a coded input vector $\mathbf{L}_m^{(p'_{11}, \ldots, p'_{KK})}$ as the sum of certain input vectors as follows:
$$\mathbf{L}_m^{(p'_{11}, \ldots, p'_{KK})} = \sum_{k=1}^{K} \mathbf{d}_k^{(p'_{11}, \ldots, p'_{km}-1, \ldots, p'_{KK})}. \quad (5.26)$$

5.4.3 Computation phase

Having prepared the coded inputs for all $\mathbf{p}'$ in $\Delta_{N+1}$, EN $m$, $m = 1, 2, \ldots, K$, computes a coded output from each of them. That is, for each $\mathbf{p}' = (p'_{11}, \ldots, p'_{KK})$, EN $m$ computes the following function:
$$s_m^{\mathbf{p}'} = \phi\big(\mathbf{L}_m^{(p'_{11}, \ldots, p'_{KK})}\big) = \sum_{k=1}^{K} \phi\big(\mathbf{d}_k^{(p'_{11}, \ldots, p'_{km}-1, \ldots, p'_{KK})}\big). \quad (5.27)$$
Here the value of $\phi(\mathbf{d}_k^{\mathbf{p}'})$ is set to 0 if any element of the index tuple equals $N$ or $-1$. We note that the computation performed in (5.27) is universal, i.e., no channel state information is involved. We also note that the coded function $s_m^{\mathbf{p}'}$ computed at EN $m$ resembles the signal aligned at Relay $m$ in the AND scheme proposed in [87]. In the computation phase, each edge node computes $|\Delta_{N+1}| = (N+1)^{K^2}$ functions. The UCEC scheme achieves a computation load of $r = \frac{K(N+1)^{K^2}}{KF} = \frac{(N+1)^{K^2}}{N^{K^2}}$.

5.4.4 Communication phase

The communication phase of UCEC spans $T \triangleq |\Delta_{N+1}| = (N+1)^{K^2}$ time slots. We first denote the channel matrix from the edge nodes to the users at time $t$, $t = 1, \ldots, T$, as
$$\mathbf{H}(t) \triangleq \begin{bmatrix} h_{11}(t) & \cdots & h_{1K}(t) \\ \vdots & \ddots & \vdots \\ h_{K1}(t) & \cdots & h_{KK}(t) \end{bmatrix}. \quad (5.28)$$
At time $t$, since the channel gains $h_{km}(t)$, $k, m = 1, \ldots, K$, are independently and identically drawn from some continuous distribution, $\mathbf{H}(t)$ is invertible with probability 1. Also, since the channel gains $\{h_{km}(t) : k, m = 1, \ldots, K\}$ are independently and identically distributed across time, we have
$$\Pr[|\det(\mathbf{H}(t))| > 0,\ \forall t = 1, 2, \ldots, T] = 1. \quad (5.29)$$
Hence, for any $0 < \varepsilon < 1$, we can find a $\delta$ such that
$$\Pr\big[\,|\{t \in \{1, 2, \ldots, T\} : |\det(\mathbf{H}(t))| > \delta\}| < N^{K^2}\big] \le \varepsilon. \quad (5.30)$$
At time $t$, $t = 1, 2, \ldots, T$, EN $m$ first computes $|\det(\mathbf{H}(t))|$ from the channel matrix $\mathbf{H}(t)$. If $|\det(\mathbf{H}(t))| \le \delta$, EN $m$ remains silent, i.e., $X_m(t) = 0$. Otherwise, EN $m$ inverts $\mathbf{H}(t)$ to obtain $\mathbf{B}(t)$, i.e.,
$$\mathbf{B}(t) = \begin{bmatrix} b_{11}(t) & \cdots & b_{K1}(t) \\ \vdots & \ddots & \vdots \\ b_{1K}(t) & \cdots & b_{KK}(t) \end{bmatrix} \triangleq \mathbf{H}(t)^{-1}, \quad (5.31)$$
and generates $Q(t)^{\mathbf{p}'} = Q(t)^{(p'_{11}, \ldots, p'_{KK})} \triangleq \prod_{1 \le i, j \le K} b_{ij}(t)^{p'_{ij}}$ for each $\mathbf{p}' \in \Delta_{N+1}$. Here $Q(t)^{\mathbf{p}'}$ is the transmit direction that will be used to align the computation result $s_m^{\mathbf{p}'}$ when constructing the communication signal. Next, EN $m$ creates its communication symbol $X_m(t)$ as the following linear combination of its local computation results, for all $m = 1, 2, \ldots, K$:
$$X_m(t) = \gamma \sum_{\mathbf{p}' \in \Delta_{N+1}} Q(t)^{\mathbf{p}'} s_m^{\mathbf{p}'} = \gamma \sum_{\mathbf{p} \in \Delta_N} Q(t)^{\mathbf{p}} \sum_{k=1}^{K} b_{km}(t)\phi(\mathbf{d}_k^{\mathbf{p}}), \quad (5.32)$$
where $\gamma$ is a scaling factor that enforces the power constraint. To demonstrate the existence of a scaling factor $\gamma$ that enforces the power constraint $\mathbb{E}[X_m(t)^2] \le P$, it suffices to show that $\mathbb{E}[|Q(t)^{\mathbf{p}'}|^2] < \infty$ for all $\mathbf{p}' \in \Delta_{N+1}$. To see this, we first note that $Q(t)^{\mathbf{p}'}$ is a monomial in the variables $b_{ij}(t)$, $i, j = 1, \ldots, K$. Based on (5.31), each $b_{ij}(t)$ can be written as a ratio between a polynomial of the variables $h_{ij}(t)$, $i, j = 1, \ldots, K$, and $\det(\mathbf{H}(t))$. Since EN $m$ communicates only when $|\det(\mathbf{H}(t))| > \delta$, we have $\mathbb{E}[|Q(t)^{\mathbf{p}'}|^2] < \infty$. Therefore, we can find a scaling factor $\gamma$ that allows the transmit symbols of all $K$ edge nodes to satisfy the power constraint.

When the channel matrix $\mathbf{H}(t)$ at time $t$, $t = 1, 2, \ldots, T$, satisfies $|\det(\mathbf{H}(t))| > \delta$, the vector of received signals at the $K$ users in time $t$ is
$$\begin{bmatrix} Y_1(t) \\ \vdots \\ Y_K(t) \end{bmatrix} = \mathbf{H}(t)\begin{bmatrix} X_1(t) \\ \vdots \\ X_K(t) \end{bmatrix} + \begin{bmatrix} Z_1(t) \\ \vdots \\ Z_K(t) \end{bmatrix} = \mathbf{H}(t)\,\gamma\sum_{\mathbf{p} \in \Delta_N} Q(t)^{\mathbf{p}} \begin{bmatrix} \sum_{k=1}^K b_{k1}(t)\phi(\mathbf{d}_k^{\mathbf{p}}) \\ \vdots \\ \sum_{k=1}^K b_{kK}(t)\phi(\mathbf{d}_k^{\mathbf{p}}) \end{bmatrix} + \begin{bmatrix} Z_1(t) \\ \vdots \\ Z_K(t) \end{bmatrix}$$
$$= \mathbf{B}(t)^{-1}\gamma\sum_{\mathbf{p} \in \Delta_N} Q(t)^{\mathbf{p}}\, \mathbf{B}(t)\begin{bmatrix} \phi(\mathbf{d}_1^{\mathbf{p}}) \\ \vdots \\ \phi(\mathbf{d}_K^{\mathbf{p}}) \end{bmatrix} + \begin{bmatrix} Z_1(t) \\ \vdots \\ Z_K(t) \end{bmatrix} = \gamma\begin{bmatrix} \sum_{\mathbf{p} \in \Delta_N} Q(t)^{\mathbf{p}}\phi(\mathbf{d}_1^{\mathbf{p}}) \\ \vdots \\ \sum_{\mathbf{p} \in \Delta_N} Q(t)^{\mathbf{p}}\phi(\mathbf{d}_K^{\mathbf{p}}) \end{bmatrix} + \begin{bmatrix} Z_1(t) \\ \vdots \\ Z_K(t) \end{bmatrix}. \quad (5.33)$$
We can see from (5.33) that at User $k$, $k = 1, 2, \ldots, K$, all interference signals, i.e., $\{\phi(\mathbf{d}_{k'}^{\mathbf{p}}) : k' \ne k,\ \mathbf{p} \in \Delta_N\}$, are zero-forced over the air, and the received signal $Y_k(t)$ is a noisy linear combination of the $F = N^{K^2}$ intended functions $\{\phi(\mathbf{d}_k^{\mathbf{p}}) : \mathbf{p} \in \Delta_N\}$ at User $k$.

We know from (5.30) that, with probability at least $1 - \varepsilon$, the number of time slots in which $|\det(\mathbf{H}(t))| > \delta$ is at least $N^{K^2}$. We label the first $N^{K^2}$ time slots in which $|\det(\mathbf{H}(t))| > \delta$ as $t_1, t_2, \ldots, t_{N^{K^2}}$, and each user utilizes the received signals in these time slots to decode its intended functions. In particular, in times $t_1, t_2, \ldots, t_{N^{K^2}}$, User $k$ receives the following signals:
$$\begin{bmatrix} Y_k(t_1) \\ \vdots \\ Y_k(t_{N^{K^2}}) \end{bmatrix} = \gamma\begin{bmatrix} \sum_{\mathbf{p} \in \Delta_N} Q(t_1)^{\mathbf{p}}\phi(\mathbf{d}_k^{\mathbf{p}}) \\ \vdots \\ \sum_{\mathbf{p} \in \Delta_N} Q(t_{N^{K^2}})^{\mathbf{p}}\phi(\mathbf{d}_k^{\mathbf{p}}) \end{bmatrix} + \begin{bmatrix} Z_k(t_1) \\ \vdots \\ Z_k(t_{N^{K^2}}) \end{bmatrix} \quad (5.34)$$
$$= \gamma\sum_{\mathbf{p} \in \Delta_N}\begin{bmatrix} Q(t_1)^{\mathbf{p}} \\ \vdots \\ Q(t_{N^{K^2}})^{\mathbf{p}} \end{bmatrix}\phi(\mathbf{d}_k^{\mathbf{p}}) + \begin{bmatrix} Z_k(t_1) \\ \vdots \\ Z_k(t_{N^{K^2}}) \end{bmatrix}. \quad (5.35)$$
We let $\mathbf{Q}$ denote the $N^{K^2} \times N^{K^2}$ matrix whose columns are $[Q(t_1)^{\mathbf{p}}\ Q(t_2)^{\mathbf{p}} \cdots Q(t_{N^{K^2}})^{\mathbf{p}}]^{\sf T}$, for all $\mathbf{p} \in \Delta_N$. We recall that for each $\ell = 1, 2, \ldots, N^{K^2}$, $\{Q(t_\ell)^{\mathbf{p}} : \mathbf{p} \in \Delta_N\}$ is a set of distinct monomials in the variables $b_{ij}(t_\ell)$, $i, j = 1, \ldots, K$. Hence, by Lemma 1 in [87], $\det(\mathbf{Q})$ is a polynomial in the variables $b_{ij}(t_\ell)$, $\ell = 1, \ldots, N^{K^2}$, $i, j = 1, \ldots, K$, that is not identically zero. Utilizing the relationship between $\mathbf{B}(t)$ and $\mathbf{H}(t)$ in (5.31), we can also represent $\det(\mathbf{Q})$ as a ratio of polynomials in the variables $h_{ij}(t_\ell)$, $\ell = 1, \ldots, N^{K^2}$, $i, j = 1, \ldots, K$, which cannot be identically zero. Therefore, for the same $\varepsilon$ in (5.30), we can find a $\delta' > 0$ such that $\Pr[\det(\mathbf{Q}) \ge \delta'] \ge 1 - \varepsilon$. When $\det(\mathbf{Q}) \ge \delta'$, User $k$ can recover the intended functions $\{\phi(\mathbf{d}_k^{\mathbf{p}}) : \mathbf{p} \in \Delta_N\}$ by computing
$$[\hat{\phi}(\mathbf{d}_k^{\mathbf{p}})]_{\mathbf{p} \in \Delta_N} = \frac{1}{\gamma}\mathbf{Q}^{-1}\begin{bmatrix} Y_k(t_1) \\ \vdots \\ Y_k(t_{N^{K^2}}) \end{bmatrix} = [\phi(\mathbf{d}_k^{\mathbf{p}})]_{\mathbf{p} \in \Delta_N} + \frac{1}{\gamma}\mathbf{Q}^{-1}\begin{bmatrix} Z_k(t_1) \\ \vdots \\ Z_k(t_{N^{K^2}}) \end{bmatrix}, \quad (5.36)$$
with vanishing distortion as the transmit power $P$ increases.

We recall that for the above decoding to succeed at each user, we require that 1) the number of time slots in which $|\det(\mathbf{H}(t))| > \delta$ is at least $N^{K^2}$, and 2) $\det(\mathbf{Q}) \ge \delta'$. Since the marginal probability of each of these two events is at least $1 - \varepsilon$, the joint probability that both events occur is at least $1 - 2\varepsilon$. Therefore, by picking $\varepsilon$ arbitrarily close to 0, each user can reconstruct its intended functions with vanishing distortion, with probability arbitrarily close to 1. The communication load incurred in the communication phase is $L = \frac{T}{F} = \frac{(N+1)^{K^2}}{N^{K^2}}$.

The above UCEC scheme achieves a computation-communication load pair $(r, L) = \big(\frac{(N+1)^{K^2}}{N^{K^2}}, \frac{(N+1)^{K^2}}{N^{K^2}}\big)$, which approaches the optimal pair (1, 1) as $N$ grows.

5.5 Robust Universal Coded Edge Computing

In this section, we extend the proposed UCEC architecture to tackle scenarios in which only a subset of the $M$ edge nodes communicate their computation results to the users in the communication phase. This problem of missing edge nodes can arise for various reasons, including 1) some nodes compute significantly more slowly than the others, or their computations are incomplete due to hardware/software failures, and 2) the users move out of the communication range of some edge nodes, which therefore become unreachable. For the above scenario of missing edge nodes, we propose a robust universal coded edge computing architecture with the following desired features.

• The computation phase is performed oblivious of the channel state information and of which edge nodes will be missing in the communication phase.

• There exists a communication scheme that delivers all intended functions to the users with the maximum spectral efficiency, regardless of which edge nodes are missing in the communication phase.

We state the performance of the robust computing architecture in the following theorem.

Theorem 5.2.
For a mobile edge computing scenario with $K$ users and $M \ge K$ edge nodes, there exists a robust universal coded edge computing scheme that achieves a computation load of $r = \frac{M}{K}$ and a communication load of $L = 1$, and tolerates any $M - K$ missing edge nodes in the communication phase, for time-varying channels and no channel state information in the computation phase at the edge nodes.

Remark 5.8. For the case of $M = K$, Theorem 5.2 reduces to the result of Theorem 5.1, where no edge node is absent in the communication phase. Theorem 5.2 illustrates a tradeoff between the computation load and the system's robustness to missing edge nodes.

Remark 5.9. The edge nodes missing in the communication phase can be viewed as "stragglers" in a distributed computing cluster, i.e., distributed computing workers that do not return their local computation results for final aggregation. Recently, many research works have studied utilizing coding theory (e.g., MDS codes) to optimally inject (coded) redundant computations to alleviate the straggler effect (see, e.g., [2, 38-40]). While most of these works consider a master-worker architecture in which the workers' computation results are delivered to a master for reduction through wired links free of interference, we consider a setting in which the edge nodes need to deliver the computation results to each of the $K$ mobile users through an underlying wireless interference channel. For this setting, we propose a universal edge computing architecture, which specifies a set of coded computations at the edge nodes that do not require knowledge of the channel state information. Then, for any downlink channel realization, the proposed architecture specifies a communication scheme at any subset of $K$ non-straggling edge nodes to return all computation results to the users with the maximum spectral efficiency.

In the rest of this section, we describe the proposed robust architecture and analyze its computation and communication loads. For this architecture to be robust to any subset of $M - K$ missing nodes, each user provides more input requests (compared with the UCEC scheme) to accommodate the increased number of unique pairs between the users and the edge nodes (from $K^2$ to $KM$). This results in an additional computation load at the edge nodes. However, since the nodes compute the results for each of the (user, node) pairs, no matter which $M - K$ edge nodes are missing in the communication phase, we can view the system as if it had $K$ users and $K$ edge nodes, for which all the coded computations have been performed as in the UCEC scheme and none of the edge nodes is missing during the communication phase. Finally, after some grouping of the input requests, the surviving edge nodes can directly use the communication scheme of the UCEC architecture to achieve the maximum spectral efficiency.

5.5.1 Generating coded inputs

In this case, since the system contains $K$ users and $M \ge K$ edge nodes, each user requests to process $F = N^{KM}$ input vectors, for some $N \in \mathbb{N}$. Each input vector at each user corresponds to a unique element of the set $\Gamma_N \triangleq \{0, 1, \ldots, N-1\}^{KM}$. More specifically, for each $k = 1, 2, \ldots, K$ and each element $\mathbf{q} = (q_{11}, q_{12}, \ldots, q_{KM}) \in \Gamma_N$, we label the input of User $k$ corresponding to $\mathbf{q}$ as $\mathbf{d}_k^{\mathbf{q}}$.
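To make the tuple indexing concrete, the following Python sketch builds the index sets and forms each node's coded inputs as channel-independent sums of user inputs with one coordinate shifted, in the spirit of (5.26) and of the robust construction described next; inputs whose shifted index falls outside the valid range contribute a zero vector. The sizes $K = M = 2$, $N = 2$, the vector dimension, and the dictionary-based bookkeeping are illustrative assumptions only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
K, M, N, Dv = 2, 2, 2, 4                 # users, edge nodes, parameter N, vector dim (illustrative)

Gamma_N  = list(itertools.product(range(N), repeat=K * M))       # input index set
Gamma_N1 = list(itertools.product(range(N + 1), repeat=K * M))   # coded-input index set

# Each user holds one input vector per tuple in Gamma_N.
d = {(k, q): rng.standard_normal(Dv) for k in range(K) for q in Gamma_N}
zero = np.zeros(Dv)

def coded_input(m, q_prime):
    """Coded input of edge node m for a tuple q' in Gamma_{N+1}: the sum over users k of
    the input whose (k, m) coordinate is decremented by one; out-of-range indices
    contribute the all-zero vector.  No channel state information is used."""
    total = np.zeros(Dv)
    for k in range(K):
        q = list(q_prime)
        q[k * M + m] -= 1                 # coordinate (k, m) of the length-KM tuple
        total += d.get((k, tuple(q)), zero)
    return total

# Every node prepares one coded input per tuple in Gamma_{N+1}, independently of the channels.
L = {(m, qp): coded_input(m, qp) for m in range(M) for qp in Gamma_N1}
print(len(Gamma_N), "inputs per user,", len(Gamma_N1), "coded inputs per node")
# The per-node count (N+1)^{KM} over K*N^{KM} delivered functions shrinks towards M/K as N grows.
```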
Compared with the above UCEC scheme, which was designed for the case where the $K$ users know a priori which $K$ edge nodes will serve their computation requests, each user increases the number of input requests from $N^{K^2}$ to $N^{KM}$ to accommodate a total of $KM$ unique (user, node) pairs.

Before the computation starts, EN $m$, $m = 1, \ldots, M$, for each $(q'_{11}, q'_{12}, \ldots, q'_{KM}) \in \Gamma_{N+1} = \{0, 1, \ldots, N\}^{KM}$, creates a coded input vector $\mathbf{L}_m^{(q'_{11}, \ldots, q'_{KM})}$ as the sum of certain input vectors as follows:
$$\mathbf{L}_m^{(q'_{11}, \ldots, q'_{KM})} = \sum_{k=1}^{K} \mathbf{d}_k^{(q'_{11}, \ldots, q'_{km}-1, \ldots, q'_{KM})}. \quad (5.37)$$
We note that each edge node generates $|\Gamma_{N+1}| = (N+1)^{KM}$ coded inputs.

5.5.2 Computation phase

Having created the coded inputs for all $\mathbf{q}'$ in $\Gamma_{N+1}$, EN $m$, $m = 1, 2, \ldots, M$, computes a coded output from each of them. That is, for each $\mathbf{q}' = (q'_{11}, q'_{12}, \ldots, q'_{KM})$, EN $m$ computes the following function:
$$s_m^{\mathbf{q}'} = \phi\big(\mathbf{L}_m^{(q'_{11}, \ldots, q'_{KM})}\big) = \sum_{k=1}^{K} \phi\big(\mathbf{d}_k^{(q'_{11}, \ldots, q'_{km}-1, \ldots, q'_{KM})}\big). \quad (5.38)$$
The value of $\phi(\mathbf{d}_k^{\mathbf{q}'})$ is set to 0 if any element of the index tuple is $N$ or $-1$. As in the UCEC scheme, the above computations performed at the edge nodes do not require channel state information, and hence this computing scheme is universal. In the computation phase, each edge node computes $|\Gamma_{N+1}| = (N+1)^{KM}$ functions. The robust edge computing scheme achieves a computation load of $r = \frac{M(N+1)^{KM}}{KF} = \frac{M}{K}\cdot\frac{(N+1)^{KM}}{N^{KM}}$.

5.5.3 Communication phase

Due to the heterogeneity of the computation and communication capabilities, only a random subset of $K$ edge nodes establish communication links towards the users. We denote the set of indices of these nodes as $\mathcal{M}_c = \{i_1, \ldots, i_K\}$. In the communication phase, the edge nodes in $\mathcal{M}_c$ create downlink messages based on their local computation results and communicate them to the $K$ users.

For each $K^2$-dimensional vector $\mathbf{p} = (p_{1i_1}, \ldots, p_{1i_K}, \ldots, p_{Ki_K}) \in \Delta_N = \{0, 1, \ldots, N-1\}^{K^2}$, we define a set of $MK$-dimensional vectors $\mathcal{D}(\mathbf{p})$ as follows:
$$\mathcal{D}(\mathbf{p}) \triangleq \{\mathbf{q} \in \Gamma_N : q_{kj} = p_{kj} \text{ for all } k = 1, \ldots, K \text{ and all } j = i_1, \ldots, i_K\}. \quad (5.39)$$
We note that for each $\mathbf{p} \in \Delta_N$, $\mathcal{D}(\mathbf{p})$ contains $N^{K(M-K)}$ elements. For each User $k$ and each $\mathbf{p} \in \Delta_N$, we index the subset of $N^{K(M-K)}$ input vectors $\{\mathbf{d}_k^{\mathbf{q}} : \mathbf{q} \in \mathcal{D}(\mathbf{p})\}$ as $\mathbf{d}_k^{\mathbf{p}}[1], \mathbf{d}_k^{\mathbf{p}}[2], \ldots, \mathbf{d}_k^{\mathbf{p}}[N^{K(M-K)}]$. Similarly, for each $K^2$-dimensional vector $\mathbf{p}' = (p'_{1i_1}, \ldots, p'_{1i_K}, \ldots, p'_{Ki_K}) \in \Delta_{N+1} = \{0, 1, \ldots, N\}^{K^2}$, we define a set of $MK$-dimensional vectors
$$\mathcal{D}'(\mathbf{p}') \triangleq \{\mathbf{q}' \in \Gamma_{N+1} : q'_{kj} = p'_{kj} \text{ for all } k = 1, \ldots, K \text{ and all } j = i_1, \ldots, i_K\}. \quad (5.40)$$
For each $\mathbf{p}' \in \Delta_{N+1}$, as shown in (5.38) above, EN $m \in \mathcal{M}_c$ has computed a total of $N^{K(M-K)}$ relevant functions in the computation phase, one for each $\mathbf{q}' \in \mathcal{D}'(\mathbf{p}')$,¹ and we index these functions as $s_m^{\mathbf{p}'}[1], s_m^{\mathbf{p}'}[2], \ldots, s_m^{\mathbf{p}'}[N^{K(M-K)}]$. Now, by (5.38), for each $m \in \mathcal{M}_c$ and each $\mathbf{p}' \in \Delta_{N+1}$,
$$s_m^{\mathbf{p}'}[n] = \sum_{k=1}^{K} \phi\big(\mathbf{d}_k^{(p'_{1i_1}, p'_{1i_2}, \ldots, p'_{km}-1, \ldots, p'_{Ki_K})}[n]\big), \quad (5.41)$$
for all $n = 1, 2, \ldots, N^{K(M-K)}$.

¹ By (5.38), at EN $m \in \mathcal{M}_c$, $s_m^{(q'_{11}, \ldots, q'_{KM})} = 0$ if $q'_{kj} = N$ for any $j \in \{1, \ldots, M\}\setminus\mathcal{M}_c$ and $k \in \{1, \ldots, K\}$. Hence, we only consider the functions $s_m^{(q'_{11}, \ldots, q'_{KM})}$ with $q'_{kj} \in \{0, \ldots, N-1\}$ for all $j \in \{1, \ldots, M\}\setminus\mathcal{M}_c$ and $k \in \{1, \ldots, K\}$.

We note that the coded computation result in (5.41) emulates the result computed using the UCEC scheme in (5.27). Therefore, starting from this point, for each $n = 1, 2, \ldots, N^{K(M-K)}$, we can repeat the same communication process as in the UCEC architecture described in Section 5.4.4, with the following channel matrix at time $t$:
$$\mathbf{H}_{\mathcal{M}_c}(t) = \begin{bmatrix} h_{1i_1}(t) & \cdots & h_{1i_K}(t) \\ \vdots & \ddots & \vdots \\ h_{Ki_1}(t) & \cdots & h_{Ki_K}(t) \end{bmatrix}. \quad (5.42)$$
After a total of $|\Delta_{N+1}|N^{K(M-K)} = (N+1)^{K^2}\times N^{K(M-K)}$ time slots, each User $k$ decodes all of its intended functions $\{\phi(\mathbf{d}_k^{\mathbf{p}}[n]) : \mathbf{p} \in \Delta_N,\ n = 1, 2, \ldots, N^{K(M-K)}\} = \{\phi(\mathbf{d}_k^{\mathbf{q}}) : \mathbf{q} \in \Gamma_N\}$ with vanishing distortion, with probability arbitrarily close to 1. This communication process incurs a communication load of $L = \frac{|\Delta_{N+1}|N^{K(M-K)}}{F} = \frac{(N+1)^{K^2}N^{K(M-K)}}{N^{MK}}$.

The above robust edge computing architecture achieves a computation-communication load pair $(r, L) = \big(\frac{M}{K}\cdot\frac{(N+1)^{KM}}{N^{KM}},\ \frac{(N+1)^{K^2}N^{K(M-K)}}{N^{MK}}\big)$, which approaches the pair $(\frac{M}{K}, 1)$ as $N$ grows.

5.6 Conclusion

We proposed a universal coded edge computing (UCEC) architecture for a mobile edge computing system in which mobile users offload their computation requests to edge nodes for processing. We demonstrated that coding is necessary, both for the computation and for the communication of the output results, in order to achieve optimal performance. We also showed that UCEC performs optimal coding in a universal manner for time-varying communication channels, i.e., the coded computation and communication do not use the channel state information. Finally, we extended the UCEC architecture to gain robustness to scenarios of missing edge nodes, such that the delivery of the computation results can be accomplished by any subset of surviving edge nodes.

Chapter 6
A Unified Coding Framework with Straggling Servers

This chapter is mainly taken from [95], coauthored by the author of this document.

In the previous chapter, we have proposed the Coded Distributed Computing (CDC) scheme to significantly reduce the load of shuffling intermediate data in distributed computing applications. Recently, in [2, 97], another type of coding, i.e., Maximum Distance Separable (MDS) codes, was applied to some linear computation tasks (e.g., matrix multiplication) in order to alleviate the effects of straggling servers and shorten the computation phase of distributed computing applications.

In this chapter, we propose a unified coded framework for distributed computing with straggling servers, by introducing a tradeoff between the "latency of computation" and the "load of communication" for linear computation tasks. We show that the coding schemes of CDC and [2] can be viewed as special instances of the proposed coding framework, corresponding to the two extremes of this tradeoff: minimizing either the load of communication or the latency of computation individually. Furthermore, the proposed coding framework provides a natural tradeoff between computation latency and communication load in distributed computing, and allows one to systematically operate at any point on that tradeoff.

More specifically, we focus on a distributed matrix multiplication problem in which, for a matrix $\mathbf{A}$ and $N$ input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$, we want to compute the $N$ output vectors $\mathbf{y}_1 = \mathbf{A}\mathbf{x}_1, \ldots, \mathbf{y}_N = \mathbf{A}\mathbf{x}_N$. The computation cannot be performed on a single server node, since its local memory is too small to hold the entire matrix $\mathbf{A}$. Instead, we carry out this computation using $K$ distributed computing servers collaboratively.
Each server has a local memory large enough to store the equivalent of a $\mu$ fraction of the entries of the matrix $\mathbf{A}$, and it can only perform computations based on the contents stored in its local memory.

Matrix multiplication is one of the building blocks of data analytics and machine learning problems (e.g., regression and classification). Many such big data analytics applications require massive computation and storage power over large-scale datasets, which are nowadays provided collaboratively by clusters of computing servers, using efficient distributed computing frameworks such as Hadoop MapReduce [4], Dryad [61] and Spark [5]. Therefore, optimizing the performance of distributed matrix multiplication is of vital importance for improving the performance of distributed computing applications.

A distributed implementation of matrix multiplication proceeds in three phases: Map, Shuffle and Reduce. In the Map phase, every server multiplies the input vectors with the locally stored matrix that partially represents the target matrix $\mathbf{A}$. When a subset of servers finish their local computations such that their Map results are sufficient to recover the output vectors, we halt the Map computation and start to Shuffle the Map results across the servers, at which point the final output vectors are calculated by specific Reduce functions.

Within the above three-phase implementation, the coding approach of [1] targets minimizing the shuffling load of intermediate Map results. It introduces a particular repetitive structure of Map computations across the servers, and utilizes this redundancy to enable a specific type of network coding in the Shuffle phase (named coded multicasting) to minimize the communication load. We term this coding approach the "Minimum Bandwidth Code". In [71, 72], the Minimum Bandwidth Code was employed in a fully decentralized wireless distributed computing framework, achieving a scalable architecture with a constant load of communication. The other coding approach, in [2], aims at minimizing the latency of Map computations by encoding the Map tasks using MDS codes, so that the run-time of the Map phase is not affected by up to a certain number of straggling servers. This coding scheme, which we term the "Minimum Latency Code", results in a significant reduction of the Map computation latency.

In this chapter, we formalize a tradeoff between the computation latency in the Map phase (denoted by $D$) and the communication (shuffling) load in the Shuffle phase (denoted by $L$) for distributed matrix multiplication (in short, the Latency-Load Tradeoff).

Figure 6.1: The Latency-Load tradeoff, for a distributed matrix multiplication job of computing N = 840 output vectors using K = 14 servers each with a storage size μ = 1/2.

As illustrated in Fig. 6.1, the above two coded schemes correspond to the two extreme points of this tradeoff, minimizing $L$ and $D$ respectively. Furthermore, we propose a unified coded scheme that organically integrates both coding techniques, and allows one to systematically operate at any point on the introduced tradeoff. For a given computation latency, we also prove an information-theoretic lower bound on the minimum communication load required to accomplish the distributed matrix multiplication.
This lower bound is proved by first concatenating multiple instances of the problem with different reduction assignments of the output vectors, and then applying the cut-set bound on subsets of servers. At the two end points of the tradeoff, the proposed scheme achieves the minimum communication load to within a constant factor.

We finally note that there is another tradeoff, between the computation load in the Map phase and the communication load in the Shuffle phase of distributed computing, which was introduced and characterized in [1]. In this chapter, we fix the amount of computation load (determined by the storage size) at each server, and focus on characterizing the tradeoff between the computation latency (determined by the number of servers that finish the Map computations) and the communication load. Hence, the considered tradeoff can be viewed as an extension of the tradeoff in [1] obtained by introducing a third axis, namely the computation latency of the Map phase.

6.1 Problem Formulation

6.1.1 System Model

We consider a matrix multiplication problem in which, given a matrix $\mathbf{A} \in \mathbb{F}_{2^T}^{m\times n}$ for some integers $T$, $m$ and $n$, and $N$ input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathbb{F}_{2^T}^{n}$, we want to compute $N$ output vectors $\mathbf{y}_1 = \mathbf{A}\mathbf{x}_1, \ldots, \mathbf{y}_N = \mathbf{A}\mathbf{x}_N$. We perform the computations using $K$ distributed servers. Each server has a local memory of size $\mu m n T$ bits (i.e., it can store the equivalent of a $\mu$ fraction of the entries of the matrix $\mathbf{A}$), for some $\frac{1}{K} \le \mu \le 1$.¹

We allow applying linear codes for storing the rows of $\mathbf{A}$ at each server. Specifically, Server $k$, $k \in \{1, \ldots, K\}$, designs an encoding matrix $\mathbf{E}_k \in \mathbb{F}_{2^T}^{\mu m\times m}$, and stores
$$\mathbf{U}_k = \mathbf{E}_k\mathbf{A}. \quad (6.1)$$
The encoding matrices $\mathbf{E}_1, \ldots, \mathbf{E}_K$ are design parameters and are referred to as the storage design. The storage design is performed prior to the computation.

¹ Thus enough information to recover the entire matrix $\mathbf{A}$ can be stored collectively on the $K$ servers.

Remark 6.1. For the Minimum Bandwidth Code in [1], each server stores $\mu m$ rows of the matrix $\mathbf{A}$. Thus, the rows of the encoding matrix $\mathbf{E}_k$ are chosen as a size-$\mu m$ subset of the rows of the identity matrix $\mathbf{I}_m$, according to a specific repetition pattern. For the Minimum Latency Code in [2], $\mathbf{E}_k$ is generated randomly such that every server stores $\mu m$ random linear combinations of the rows of $\mathbf{A}$, achieving a $(\mu m K, m)$ MDS code.

6.1.2 Distributed Computing Model

We assume that the input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ are known to all the servers. The overall computation proceeds in three phases: Map, Shuffle, and Reduce.

Map Phase: The role of the Map phase is to compute some coded intermediate values according to the locally stored matrices in (6.1), which can later be used to reconstruct the output vectors. More specifically, for all $j = 1, \ldots, N$, Server $k$, $k = 1, \ldots, K$, computes the intermediate vectors
$$\mathbf{z}_{j,k} = \mathbf{U}_k\mathbf{x}_j = \mathbf{E}_k\mathbf{A}\mathbf{x}_j = \mathbf{E}_k\mathbf{y}_j. \quad (6.2)$$
We denote the latency for Server $k$ to compute $\mathbf{z}_{1,k}, \ldots, \mathbf{z}_{N,k}$ as $S_k$. We assume that $S_1, \ldots, S_K$ are i.i.d. random variables, and denote the $q$th order statistic, i.e., the $q$th smallest of $S_1, \ldots, S_K$, as $S_{(q)}$, for all $q \in \{1, \ldots, K\}$. We focus on a class of distributions of $S_k$ such that
$$\mathbb{E}\{S_{(q)}\} = \mu N g(K, q), \quad (6.3)$$
for some function $g(K, q)$. The Map phase terminates when a subset of servers, denoted by $\mathcal{Q} \subseteq \{1, \ldots, K\}$, have finished their Map computations in (6.2).
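Before describing the Shuffle and Reduce phases, the following Python sketch illustrates the storage design (6.1), the Map computation (6.2), and the order-statistic view of the Map latency. It works over the reals rather than the finite field $\mathbb{F}_{2^T}$ of the formulation, and the shifted-exponential latency model, dimensions, and random-Gaussian encoding matrices are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, N, K = 20, 6, 4, 6          # rows/cols of A, number of input vectors, servers (illustrative)
mu = 0.5                          # each server stores mu*m (coded) rows of A
q = 4                             # wait for the q fastest servers

A = rng.standard_normal((m, n))
X = rng.standard_normal((n, N))   # input vectors as columns

# Storage design (6.1): server k stores U_k = E_k A for a random encoding matrix E_k,
# i.e. mu*m random linear combinations of the rows of A (MDS-style storage).
E = [rng.standard_normal((int(mu * m), m)) for _ in range(K)]
U = [Ek @ A for Ek in E]

# Map phase (6.2): server k computes z_{j,k} = U_k x_j for every input vector x_j.
Z = [Uk @ X for Uk in U]          # Z[k][:, j] corresponds to z_{j,k}

# Map latencies: i.i.d. per-server times (here a shifted-exponential model, as an assumption);
# the phase ends once the q fastest servers (the set Q) are done, i.e. at the q-th order statistic.
S = mu * N * (1 + rng.exponential(size=K))
Q = np.argsort(S)[:q]
D = np.sort(S)[q - 1]
print("set Q of fastest servers:", Q, " Map-phase duration:", round(D, 2))

# Sanity check: the coded rows stored across any q servers suffice to recover A (hence every y_j),
# since they stack to mu*q*m >= m generically independent combinations.
stacked = np.vstack([E[k] for k in Q])
print("rank of stacked encoding matrices:", np.linalg.matrix_rank(stacked), "of", m)
```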
A necessary condition for selecting $\mathcal{Q}$ is that the output vectors $\mathbf{y}_1, \ldots, \mathbf{y}_N$ can be reconstructed by jointly utilizing the intermediate vectors calculated by the servers in $\mathcal{Q}$, i.e., $\{\mathbf{z}_{j,k} : j = 1, \ldots, N,\ k \in \mathcal{Q}\}$. However, one can allow redundant computations in $\mathcal{Q}$, since, if designed properly, they can be used to reduce the load of communicating intermediate results for the servers in $\mathcal{Q}$ to recover the output vectors in the following stages of the computation.

Remark 6.2. The Minimum Bandwidth Code in [1] waits for all servers to finish their computations, i.e., $\mathcal{Q} = \{1, \ldots, K\}$. For the Minimum Latency Code in [2], $\mathcal{Q}$ is the subset of the fastest $\lceil\frac{1}{\mu}\rceil$ servers in performing the Map computations.

Definition 6.1 (Computation Latency). We define the computation latency, denoted by $D$, as the average amount of time spent in the Map phase.

After the Map phase, the job of computing the output vectors $\mathbf{y}_1, \ldots, \mathbf{y}_N$ continues exclusively over the servers in $\mathcal{Q}$. The final computations of the output vectors are distributed uniformly across the servers in $\mathcal{Q}$. We denote the set of indices of the output vectors assigned to Server $k$ as $\mathcal{W}_k$, where $\{\mathcal{W}_k : k \in \mathcal{Q}\}$ satisfy 1) $\mathcal{W}_k \cap \mathcal{W}_{k'} = \varnothing$ for all $k \ne k'$, and 2) $|\mathcal{W}_k| = N/|\mathcal{Q}| \in \mathbb{N}$ for all $k \in \mathcal{Q}$.²

² We assume that $N \gg K$, and that $|\mathcal{Q}|$ divides $N$ for all $\mathcal{Q} \subseteq \{1, \ldots, K\}$.

Shuffle Phase: The goal of the Shuffle phase is to exchange the intermediate values calculated in the Map phase, to help each server recover the output vectors it is responsible for. To do this, every Server $k$ in $\mathcal{Q}$ generates a message $X_k$ from the locally computed intermediate vectors $\mathbf{z}_{1,k}, \ldots, \mathbf{z}_{N,k}$ through an encoding function $\phi_k$, i.e.,
$$X_k = \phi_k(\mathbf{z}_{1,k}, \ldots, \mathbf{z}_{N,k}), \quad (6.4)$$
such that upon receiving all messages $\{X_k : k \in \mathcal{Q}\}$, every Server $k \in \mathcal{Q}$ can recover the output vectors in $\mathcal{W}_k$. We assume that the servers are connected by a shared bus link. After generating $X_k$, Server $k$ multicasts $X_k$ to all the other servers in $\mathcal{Q}$.

Definition 6.2 (Communication Load). We define the communication load, denoted by $L$, as the average total number of bits in all messages $\{X_k : k \in \mathcal{Q}\}$, normalized by $mT$ (i.e., the total number of bits in an output vector).

Reduce Phase: The output vectors are reconstructed distributedly in the Reduce phase. Specifically, Server $k$, $k \in \mathcal{Q}$, uses the locally computed vectors $\mathbf{z}_{1,k}, \ldots, \mathbf{z}_{N,k}$ and the received multicast messages $\{X_{k'} : k' \in \mathcal{Q}\}$ to recover the output vectors with indices in $\mathcal{W}_k$ via a decoding function $\psi_k$, i.e.,
$$\{\mathbf{y}_j : j \in \mathcal{W}_k\} = \psi_k(\mathbf{z}_{1,k}, \ldots, \mathbf{z}_{N,k}, \{X_{k'} : k' \in \mathcal{Q}\}). \quad (6.5)$$
For such a distributed computing system, we say a latency-load pair $(D, L) \in \mathbb{R}^2$ is achievable if there exist a storage design $\{\mathbf{E}_k\}_{k=1}^{K}$, a Map phase computation with latency $D$, and a shuffling scheme with communication load $L$, such that all output vectors can be successfully reduced.

Definition 6.3. We define the latency-load region as the closure of the set of all achievable $(D, L)$ pairs.

6.1.3 Illustrating Example

In order to clarify the formulation, we use the following simple example to illustrate the latency-load pairs achieved by the two coded approaches discussed above. We consider a matrix $\mathbf{A}$ consisting of $m = 12$ rows $\mathbf{a}_1, \ldots, \mathbf{a}_{12}$. We have $N = 4$ input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_4$, and the computation is performed on $K = 4$ servers, each with a storage size $\mu = \frac{1}{2}$. We assume that the Map latency $S_k$, $k = 1, \ldots, 4$, has the shifted-exponential distribution function
$$F_{S_k}(t) = 1 - e^{-(\frac{t}{\mu N}-1)}, \quad \forall t \ge \mu N, \quad (6.6)$$
and by, e.g., [98], the average latency for the fastest $q$, $1 \le q \le 4$, servers to finish the Map computations is
$$D(q) = \mathbb{E}\{S_{(q)}\} = \mu N\Big(1 + \sum_{j=K-q+1}^{K}\frac{1}{j}\Big). \quad (6.7)$$

Figure 6.2: Illustration of the Minimum Bandwidth Code in [1] and the Minimum Latency Code in [2]. (a) Minimum Bandwidth Code. Every row of A is multiplied with the input vectors twice. For k = 1, 2, 3, 4, Server k reduces the output vector y_k. In the Shuffle phase, each server multicasts 3 bit-wise XORs, denoted by ⊕, of the calculated intermediate values, each of which is simultaneously useful for two other servers. (b) Minimum Latency Code. A is encoded into 24 coded rows c_1, ..., c_24. Servers 1 and 3 finish their Map computations first. They then exchange a sufficient number (6 for each output vector) of intermediate values to reduce y_1, y_2 at Server 1 and y_3, y_4 at Server 3.

Minimum Bandwidth Code [1]. The Minimum Bandwidth Code in [1] repeatedly stores each row of $\mathbf{A}$ at $\mu K$ servers with a particular pattern, such that in the Shuffle phase, $\mu K$ required intermediate values can be delivered with a single coded multicast message, which results in a coding gain of $\mu K$. We illustrate this coding technique in Fig. 6.2(a). As shown in Fig. 6.2(a), the Minimum Bandwidth Code repeats the multiplication of each row of $\mathbf{A}$ with all input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_4$, $\mu K = 2$ times across the 4 servers; e.g., $\mathbf{a}_1$ is multiplied at Servers 1 and 2. The Map phase continues until all servers have finished their Map computations, achieving a computation latency $D(4) = 2\times(1 + \sum_{j=1}^{4}\frac{1}{j}) = \frac{37}{6}$. For $k = 1, 2, 3, 4$, Server $k$ will reduce the output vector $\mathbf{y}_k$. In the Shuffle phase, as shown in Fig. 6.2(a), due to the specific repetition of the Map computations, every server multicasts 3 bit-wise XORs, each of which is simultaneously useful for two other servers. For example, upon receiving $\mathbf{a}_1\mathbf{x}_3 \oplus \mathbf{a}_3\mathbf{x}_2$ from Server 1, Server 2 can recover $\mathbf{a}_3\mathbf{x}_2$ by canceling $\mathbf{a}_1\mathbf{x}_3$, and Server 3 can recover $\mathbf{a}_1\mathbf{x}_3$ by canceling $\mathbf{a}_3\mathbf{x}_2$. Similarly, every server decodes the needed values by canceling the interfering values using its local Map results. The Minimum Bandwidth Code achieves a communication load $L = 3\times 4/12 = 1$. The Minimum Bandwidth Code can be viewed as a specific type of network coding [57], or more precisely index coding [55, 56], in which the key idea is to design "side information" at the servers (provided by the Map results), enabling multicasting opportunities in the Shuffle phase that minimize the communication load.

Minimum Latency Code [2]. The Minimum Latency Code in [2] uses MDS codes to generate some redundant Map computations, and assigns the coded computations across many servers. This type of coding takes advantage of the abundance of servers, so that one can terminate the Map phase as soon as enough coded computations are performed across the network, without needing to wait for the remaining straggling servers. We illustrate this coding technique in Fig. 6.2(b). For this example, the Minimum Latency Code first has each Server $k$, $k = 1, \ldots, 4$, independently and randomly generate 6 random linear combinations of the rows of $\mathbf{A}$, denoted by $\mathbf{c}_{6(k-1)+1}, \ldots, \mathbf{c}_{6(k-1)+6}$ (see Fig. 6.2(b)). We note that $\{\mathbf{c}_1, \ldots, \mathbf{c}_{24}\}$ is a $(24, 12)$ MDS code of the rows of $\mathbf{A}$.
Therefore, for any subset $\mathcal{D} \subseteq \{1, \ldots, 24\}$ of size $|\mathcal{D}| = 12$, the intermediate values $\{\mathbf{c}_i\mathbf{x}_j : i \in \mathcal{D}\}$ suffice to recover the output vector $\mathbf{y}_j$. The Map phase terminates once the fastest 2 servers have finished their computations (e.g., Servers 1 and 3), achieving a computation latency $D(2) = 2\times(1 + \frac{1}{3} + \frac{1}{4}) = \frac{19}{6}$. Then Server 1 continues to reduce $\mathbf{y}_1$ and $\mathbf{y}_2$, and Server 3 continues to reduce $\mathbf{y}_3$ and $\mathbf{y}_4$. As illustrated in Fig. 6.2(b), Servers 1 and 3 each unicast the intermediate values they have calculated that are needed by the other server to complete the computation, achieving a communication load $L = 6\times 4/12 = 2$.

From the above descriptions, we note that the Minimum Bandwidth Code uses about twice the time in the Map phase compared with the Minimum Latency Code, and achieves half of the communication load in the Shuffle phase. They represent the two end points of a general latency-load tradeoff characterized in the next section.

6.2 Main Results

The main results of this chapter are 1) a characterization of a set of achievable latency-load pairs obtained by developing a unified coded framework, and 2) an outer bound on the latency-load region. They are stated in the following two theorems.

Theorem 6.1. For a distributed matrix multiplication problem of computing $N$ output vectors using $K$ servers, each with a storage size $\mu \ge \frac{1}{K}$, the latency-load region contains the lower convex envelope of the points
$$\{(D(q), L(q)) : q = \lceil\tfrac{1}{\mu}\rceil, \ldots, K\}, \quad (6.8)$$
in which
$$D(q) = \mathbb{E}\{S_{(q)}\} = \mu N g(K, q), \quad (6.9)$$
$$L(q) = N\sum_{j=s_q}^{\lfloor\mu q\rfloor}\frac{B_j}{j} + N\min\Big\{1 - \bar{\mu} - \sum_{j=s_q}^{\lfloor\mu q\rfloor}B_j,\ \frac{B_{s_q-1}}{s_q-1}\Big\}, \quad (6.10)$$
where $S_{(q)}$ is the $q$th smallest of the $K$ i.i.d. latencies $S_1, \ldots, S_K$ (with some distribution $F$) to compute the Map functions in (6.2), $g(K, q)$ is a function of $K$ and $q$ computed from $F$, $\bar{\mu} \triangleq \frac{\lfloor\mu q\rfloor}{q}$, $B_j \triangleq \frac{\binom{q-1}{j}\binom{K-q}{\lfloor\mu q\rfloor-j}}{\frac{q}{K}\binom{K}{\lfloor\mu q\rfloor}}$, and $s_q \triangleq \inf\{s : \sum_{j=s}^{\lfloor\mu q\rfloor}B_j \le 1 - \bar{\mu}\}$.

We prove Theorem 6.1 in Section 6.3, where we present a unified coded scheme that jointly designs the storage and the data shuffling, achieving the latency in (6.9) and the communication load in (6.10).

Remark 6.3. The Minimum Latency Code and the Minimum Bandwidth Code correspond to $q = \lceil\frac{1}{\mu}\rceil$ and $q = K$, and achieve the two end points $\big(\mathbb{E}\{S_{(\lceil 1/\mu\rceil)}\},\ N - N/\lceil\tfrac{1}{\mu}\rceil\big)$ and $\big(\mathbb{E}\{S_{(K)}\},\ N\frac{1-\lfloor\mu K\rfloor/K}{\lfloor\mu K\rfloor}\big)$, respectively.

Remark 6.4. We numerically evaluate in Fig. 6.3 the latency-load pairs achieved by the proposed coded framework, for computing $N = 180$ output vectors using $K = 18$ servers each with a storage size $\mu = 1/3$, assuming the distribution function of the Map time in (6.6). The achieved tradeoff approximately exhibits an inversely proportional relationship between the latency and the load. For instance, doubling the latency from 120 to 240 reduces the communication load from 43 to 23, a factor of 1.87.

Figure 6.3: Comparison of the latency-load pairs achieved by the proposed scheme with the outer bound, for computing N = 180 output vectors using K = 18 servers each with a storage size μ = 1/3, assuming the distribution function of the Map time in (6.6).

Remark 6.5. The key idea in achieving $D(q)$ and $L(q)$ in Theorem 6.1 is to concatenate the MDS code with repetitive executions of the Map computations, in order to take advantage of both the Minimum Latency Code and the Minimum Bandwidth Code. More specifically, we first generate $\frac{K}{q}m$ MDS-coded rows of $\mathbf{A}$, and then store each of them $\lfloor\mu q\rfloor$ times across the $K$ servers in a specific pattern. As a result, any subset of $q$ servers has a sufficient amount of intermediate results to reduce the output vectors, and we end the Map phase as soon as the fastest $q$ servers finish their Map computations, achieving the latency in (6.9). We also exploit coded multicasting in the Shuffle phase to reduce the communication load.

In the load expression (6.10), $B_j$, $j \le \lfloor\mu q\rfloor$, represents the (normalized) number of coded rows of $\mathbf{A}$ repeatedly stored/computed at $j$ servers. By multicasting coded packets simultaneously useful for $j$ servers, $B_j$ intermediate values can be delivered to a server with a communication load of $\frac{B_j}{j}$, achieving a coding gain of $j$. We greedily utilize the coding opportunities with a larger coding gain until we get close to satisfying the demand of each server, which accounts for the first term in (6.10). The second term then results from two follow-up strategies: 1) communicate the rest of the demands uncoded, or 2) continue coded multicasting with a smaller coding gain (i.e., $j = s_q - 1$), which may however deliver more than what is needed for reduction.

Theorem 6.2. The latency-load region is contained in the lower convex envelope of the points
$$\{(D(q), \bar{L}(q)) : q = \lceil\tfrac{1}{\mu}\rceil, \ldots, K\}, \quad (6.11)$$
in which $D(q)$ is given by (6.9) and
$$\bar{L}(q) = N\max_{t=1,\ldots,q-1}\frac{1-\min\{t\mu, 1\}}{\lceil\frac{q}{t}\rceil(q-t)}\,q. \quad (6.12)$$
We prove Theorem 6.2 in Section 6.4, by deriving an information-theoretic lower bound on the minimum communication load required for a given computation latency, under any storage design and data shuffling scheme.

Remark 6.6. We numerically compare the outer bound in Theorem 6.2 and the achievable inner bound in Theorem 6.1 in Fig. 6.3, from which we make the following observations.

• At the minimum latency point, i.e., when $q = 1/\mu = 3$ servers finish the Map computations, the proposed coded scheme achieves 1.33× the minimum communication load. In general, when $q = 1/\mu \in \mathbb{N}$, the lower bound in Theorem 6.2 gives $\bar{L}(\frac{1}{\mu}) = N/\lceil\frac{q}{t}\rceil\big|_{t=q-1} = N/\lceil\frac{1}{1-\mu}\rceil = \frac{N}{2}$. The proposed coded scheme, or the Minimum Latency Code in this case, achieves the load $L(\frac{1}{\mu}) = N(1-\mu)$. Thus the proposed scheme always achieves the lower bound to within a factor of 2 at the minimum latency point.

• At the maximum latency point, i.e., when all $K = 18$ servers finish the Map computations, the proposed coded scheme achieves 2.67× the lower bound on the minimum communication load. In general, for $q = K$ and $\mu K \in \mathbb{N}$, we demonstrate in Appendix C that the proposed coded scheme, or the Minimum Bandwidth Code in this case, achieves a communication load $L(K) = N(1-\mu)/(\mu K)$ that is within a factor of $3+\sqrt{5}$ of the lower bound $\bar{L}(K)$.

• For intermediate latencies from 70 to 270, the communication load achieved by the proposed scheme is within a multiplicative gap of at most 4.2× from the lower bound. In general, a complete characterization of the latency-load region (or an approximation to within a constant gap for all system parameters) remains open.
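The achievable tradeoff of Theorem 6.1 can be tabulated directly from (6.9) and (6.10). The following Python sketch does so for the parameters of Fig. 6.3 (N = 180, K = 18, μ = 1/3) under the shifted-exponential Map-time model of (6.6)-(6.7); exact rational arithmetic is used only to avoid floating-point edge cases in the definition of $s_q$, and is an implementation choice rather than part of the scheme.

```python
from fractions import Fraction
from math import comb, ceil, floor, inf

# Parameters of the numerical example in Remark 6.4 / Fig. 6.3.
N, K, mu = 180, 18, Fraction(1, 3)

def D_latency(q):
    """Computation latency (6.9) under the shifted-exponential Map-time model (6.6)-(6.7)."""
    return float(mu * N) * (1 + sum(1 / j for j in range(K - q + 1, K + 1)))

def L_achievable(q):
    """Communication load L(q) in (6.10) of the proposed unified coded scheme."""
    r = floor(mu * q)                          # each coded batch is stored at r servers
    mub = Fraction(r, q)                       # effective storage mu-bar
    B = {j: Fraction(K * comb(q - 1, j) * comb(K - q, r - j), q * comb(K, r))
         for j in range(r + 1)}
    s_q = min(s for s in range(1, r + 1)
              if sum(B[j] for j in range(s, r + 1)) <= 1 - mub)
    coded = sum(B[j] / j for j in range(s_q, r + 1))          # greedy coded multicasting
    rest = 1 - mub - sum(B[j] for j in range(s_q, r + 1))     # residual (uncoded) demand
    alt = B[s_q - 1] / (s_q - 1) if s_q > 1 else inf          # smaller-gain multicasting option
    return float(N * (coded + min(rest, alt)))

for q in range(ceil(1 / mu), K + 1):
    print(f"q={q:2d}   D(q)={D_latency(q):6.1f}   L(q)={L_achievable(q):6.1f}")
```

The two end points of the printed table, (≈70.6, 120) and (≈269.7, 20), match the Minimum Latency Code and Minimum Bandwidth Code points of Remark 6.3 for these parameters.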
6.3 Proposed Coded Framework

In this section, we prove Theorem 6.1 by proposing and analyzing a general coded framework that achieves the latency-load pairs in (6.8). We first demonstrate the key ideas of the proposed scheme through the following example, and then give the general description of the scheme.

6.3.1 Example: m = 20, N = 12, K = 6 and μ = 1/2

We consider the problem of multiplying a matrix $\mathbf{A} \in \mathbb{F}_{2^T}^{m\times n}$ of $m = 20$ rows with $N = 12$ input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_{12}$ to compute 12 output vectors $\mathbf{y}_1 = \mathbf{A}\mathbf{x}_1, \ldots, \mathbf{y}_{12} = \mathbf{A}\mathbf{x}_{12}$, using $K = 6$ servers each with a storage size $\mu = \frac{1}{2}$. We assume that we can afford to wait for $q = 4$ servers to finish their computations in the Map phase, and we describe the proposed storage design and shuffling scheme.

Storage Design. As illustrated in Fig. 6.4, we first independently generate 30 random linear combinations $\mathbf{c}_1, \ldots, \mathbf{c}_{30} \in \mathbb{F}_{2^T}^{n}$ of the 20 rows of $\mathbf{A}$, achieving a (30, 20) MDS code of the rows of $\mathbf{A}$. Then we partition these coded rows $\mathbf{c}_1, \ldots, \mathbf{c}_{30}$ into 15 batches, each of size 2, and store every batch of coded rows at a unique pair of servers.

Figure 6.4: Storage design when the Map phase is terminated once 4 servers have finished their computations.

WLOG, due to the symmetry of the storage design, we assume that Servers 1, 2, 3 and 4 are the first 4 servers to finish their Map computations. Then we assign the Reduce tasks such that Server $k$ reduces the output vectors $\mathbf{y}_{3(k-1)+1}$, $\mathbf{y}_{3(k-1)+2}$ and $\mathbf{y}_{3(k-1)+3}$, for all $k \in \{1, \ldots, 4\}$.

After the Map phase, Server 1 has computed the intermediate values $\{\mathbf{c}_1\mathbf{x}_j, \ldots, \mathbf{c}_{10}\mathbf{x}_j : j = 1, \ldots, 12\}$. For Server 1 to recover $\mathbf{y}_1 = \mathbf{A}\mathbf{x}_1$, it needs any subset of 10 intermediate values $\mathbf{c}_i\mathbf{x}_1$ with $i \in \{11, \ldots, 30\}$ from Servers 2, 3 and 4 in the Shuffle phase. Similar data demands hold for all 4 servers and the output vectors they are reducing. Therefore, the goal of the Shuffle phase is to exchange these needed intermediate values to accomplish successful reductions.

Coded Shuffle. We first group the 4 servers into 4 subsets of size 3 and perform coded shuffling within each subset. The goal is to deliver the intermediate values calculated exclusively at two servers in each subset and needed by another server in that subset. We illustrate the coded shuffling scheme for Servers 1, 2 and 3 in Fig. 6.5. Each server multicasts 3 bit-wise XORs, denoted by ⊕, of its locally computed intermediate values to the other two. The intermediate values used to create the multicast messages are the ones known exclusively at two servers and needed by another one. After receiving 2 multicast messages, each server recovers 6 needed intermediate values. For instance, Server 1 recovers $\mathbf{c}_{11}\mathbf{x}_1$, $\mathbf{c}_{11}\mathbf{x}_2$ and $\mathbf{c}_{11}\mathbf{x}_3$ by canceling $\mathbf{c}_2\mathbf{x}_7$, $\mathbf{c}_2\mathbf{x}_8$ and $\mathbf{c}_2\mathbf{x}_9$ respectively, and then recovers $\mathbf{c}_{12}\mathbf{x}_1$, $\mathbf{c}_{12}\mathbf{x}_2$ and $\mathbf{c}_{12}\mathbf{x}_3$ by canceling $\mathbf{c}_4\mathbf{x}_4$, $\mathbf{c}_4\mathbf{x}_5$ and $\mathbf{c}_4\mathbf{x}_6$ respectively.

Figure 6.5: Multicasting 9 coded intermediate values across Servers 1, 2 and 3. Similar coded multicast communications are performed for another 3 subsets of 3 servers.

Similarly, we perform the above coded shuffling of Fig. 6.5 for another 3 subsets of 3 servers.
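The XOR-and-cancel mechanics within one 3-server subset can be mimicked with a few lines of Python. In the sketch below the intermediate values are represented by small random byte blocks; the block size, node labels, and dictionary layout are illustrative assumptions, not the data structures of an actual implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
nodes = [1, 2, 3]
seg_bytes = 4          # size of each exchanged segment (illustrative)

# need[k]: the block of intermediate values needed by server k and known exclusively at the
# other two servers of the subset; it is split into one segment per holder.
need = {k: {i: rng.integers(0, 256, seg_bytes, dtype=np.uint8)
            for i in nodes if i != k}
        for k in nodes}

# Each server i multicasts ONE packet: the bit-wise XOR of the two segments it holds
# (the segment of need[k] associated with i, for both k != i).
packet = {i: np.bitwise_xor.reduce([need[k][i] for k in nodes if k != i])
          for i in nodes}

# Decoding at server k: from the packet of server i, cancel the locally known segment
# need[k_other][i] to recover the missing segment need[k][i].
for k in nodes:
    for i in (n for n in nodes if n != k):
        k_other = next(n for n in nodes if n not in (k, i))
        recovered = np.bitwise_xor(packet[i], need[k_other][i])
        assert np.array_equal(recovered, need[k][i])
print("each server recovers its 2 missing segments from 2 coded multicasts")
```

Each coded packet serves two servers at once, which is exactly the coding gain of 2 exploited in this example.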
After coded multicasting within the 4 subsets of 3 servers, each server recovers 18 needed intermediate values (6 for each of the output vectors it is reducing). As mentioned before, since each server needs a total of $3\times(20-10) = 30$ intermediate values to reduce the 3 assigned output vectors, it needs another $30 - 18 = 12$ after decoding all multicast messages. We satisfy the residual data demands by simply having the servers unicast enough (i.e., $12\times 4 = 48$) intermediate values for reduction. Overall, $9\times 4 + 48 = 84$ (possibly coded) intermediate values are communicated, achieving a communication load of $L = 84/20 = 4.2$.

6.3.2 General Scheme

We first describe the storage design, the Map phase computation, and the data shuffling scheme that achieve the latency-load pairs $(D(q), L(q))$ in (6.8), for all $q \in \{\lceil\frac{1}{\mu}\rceil, \ldots, K\}$. Given these achieved pairs, we can "memory share" across them to achieve their lower convex envelope, as stated in Theorem 6.1. For ease of exposition, we assume that $\mu q \in \mathbb{N}$. Otherwise, we can replace $\mu$ with $\bar{\mu} = \frac{\lfloor\mu q\rfloor}{q}$ and apply the proposed scheme for a storage size of $\bar{\mu}$.

Storage Design. We first use a $(\frac{K}{q}m, m)$ MDS code to encode the $m$ rows of matrix $\mathbf{A}$ into $\frac{K}{q}m$ coded rows $\mathbf{c}_1, \ldots, \mathbf{c}_{\frac{K}{q}m}$ (e.g., $\frac{K}{q}m$ random linear combinations of the rows of $\mathbf{A}$). Then, as shown in Fig. 6.6, we evenly partition the $\frac{K}{q}m$ coded rows into $\binom{K}{\mu q}$ disjoint batches, each containing $\frac{m}{\frac{q}{K}\binom{K}{\mu q}}$ coded rows.³ Each batch, denoted by $\mathcal{B}_\mathcal{T}$, is labelled by a unique subset $\mathcal{T}\subset\{1,\ldots,K\}$ of size $|\mathcal{T}| = \mu q$. That is,
$$\{1, \ldots, \tfrac{K}{q}m\} = \{\mathcal{B}_\mathcal{T} : \mathcal{T}\subset\{1,\ldots,K\},\ |\mathcal{T}| = \mu q\}. \quad (6.13)$$
Server $k$, $k \in \{1, \ldots, K\}$, stores the coded rows in $\mathcal{B}_\mathcal{T}$ as rows of $\mathbf{U}_k$ if $k \in \mathcal{T}$.

³ We focus on matrix multiplication problems for large matrices, and assume that $m \gg \frac{q}{K}\binom{K}{\mu q}$, for all $q \in \{\frac{1}{\mu}, \ldots, K\}$.

Figure 6.6: General MDS coding and storage design.

In the above example, $q = 4$, and $\frac{K}{q}m = \frac{6}{4}\times 20 = 30$ coded rows of $\mathbf{A}$ are partitioned into $\binom{K}{\mu q} = \binom{6}{2} = 15$ batches, each containing $\frac{30}{15} = 2$ coded rows. Every node is in 5 subsets of size two, thus storing $5\times 2 = 10$ coded rows of $\mathbf{A}$.

Map Phase Execution. Each server computes the inner products between each of its locally stored coded rows of $\mathbf{A}$ and each of the input vectors, i.e., Server $k$ computes $\mathbf{c}_i\mathbf{x}_j$ for all $j = 1, \ldots, N$ and all $i \in \{\mathcal{B}_\mathcal{T} : k \in \mathcal{T}\}$. We wait for the fastest $q$ servers to finish their Map computations before halting the Map phase, achieving the computation latency $D(q)$ in (6.9). We denote the set of indices of these servers as $\mathcal{Q}$.

The computation then moves on exclusively over the $q$ servers in $\mathcal{Q}$, each of which is assigned to reduce $\frac{N}{q}$ out of the $N$ output vectors $\mathbf{y}_1 = \mathbf{A}\mathbf{x}_1, \ldots, \mathbf{y}_N = \mathbf{A}\mathbf{x}_N$. We recall that the set of indices of the output vectors reduced by Server $k \in \mathcal{Q}$ is denoted by $\mathcal{W}_k$. For a feasible shuffling scheme to exist such that the Reduce phase can be successfully carried out, every subset of $q$ servers (since we cannot predict which $q$ servers will finish first) should collectively store at least $m$ distinct coded rows $\mathbf{c}_i$, $i \in \{1, \ldots, \frac{K}{q}m\}$. Next, we explain how the proposed storage design meets this requirement. First, the $q$ servers in $\mathcal{Q}$ collectively provide a storage size equivalent to $\mu q m$ rows.
The computation then moves on exclusively over the $q$ servers in $\mathcal{Q}$, each of which is assigned to reduce $\frac{N}{q}$ out of the $N$ output vectors $y_1 = Ax_1, \ldots, y_N = Ax_N$. We recall that the set of indices of the output vectors reduced by Server $k \in \mathcal{Q}$ is denoted by $\mathcal{W}_k$. For a feasible shuffling scheme to exist such that the Reduce phase can be successfully carried out, every subset of $q$ servers (since we cannot predict which $q$ servers will finish first) should have collectively stored at least $m$ distinct coded rows $c_i$, $i \in \{1, \ldots, \frac{K}{q}m\}$. Next, we explain how our proposed storage design meets this requirement. First, the $q$ servers in $\mathcal{Q}$ collectively provide a storage size equivalent to $\mu q m$ rows. Then, since each coded row is stored by $\mu q$ out of all $K$ servers, it can be stored by at most $\mu q$ servers in $\mathcal{Q}$, and thus the servers in $\mathcal{Q}$ collectively store at least $\frac{\mu q m}{\mu q} = m$ distinct coded rows.

Coded Shuffle. For $\mathcal{S} \subset \mathcal{Q}$ and $k \in \mathcal{Q} \setminus \mathcal{S}$, we denote the set of intermediate values needed by Server $k$ and known exclusively by the servers in $\mathcal{S}$ as $\mathcal{V}^k_{\mathcal{S}}$. More formally:

$\mathcal{V}^k_{\mathcal{S}} \triangleq \{c_i x_j : j \in \mathcal{W}_k, \; i \in \{\mathcal{B}_{\mathcal{T}} : \mathcal{T} \cap \mathcal{Q} = \mathcal{S}\}\}$.  (6.14)

Due to the proposed storage design, for a particular $\mathcal{S}$ of size $j$, $\mathcal{V}^k_{\mathcal{S}}$ contains $\frac{N}{q} \cdot \frac{\binom{K-q}{\mu q - j} m}{\frac{q}{K}\binom{K}{\mu q}}$ intermediate values. In the above example, we have $\mathcal{V}^1_{\{2,3\}} = \{c_{11} x_j, c_{12} x_j : j = 1, 2, 3\}$, $\mathcal{V}^2_{\{1,3\}} = \{c_3 x_j, c_4 x_j : j = 4, 5, 6\}$, and $\mathcal{V}^3_{\{1,2\}} = \{c_1 x_j, c_2 x_j : j = 7, 8, 9\}$.

In the Shuffle phase, servers in $\mathcal{Q}$ create and multicast coded packets that are simultaneously useful for multiple other servers, until every server in $\mathcal{Q}$ recovers at least $m$ intermediate values for each of the output vectors it is reducing. The proposed shuffling scheme is greedy in the sense that every server in $\mathcal{Q}$ will always try to multicast coded packets simultaneously useful for the largest number of servers. The proposed shuffle scheme proceeds as follows. For each $j = \mu q, \mu q - 1, \ldots, s_q$, where

$s_q \triangleq \inf\Big\{s : \sum_{j=s}^{\mu q} \frac{\binom{q-1}{j}\binom{K-q}{\mu q - j}}{\frac{q}{K}\binom{K}{\mu q}} \leq 1 - \mu\Big\}$,

and every subset $\mathcal{S} \subseteq \mathcal{Q}$ of size $j+1$:

1. For each $k \in \mathcal{S}$, we evenly and arbitrarily split $\mathcal{V}^k_{\mathcal{S}\setminus\{k\}}$ into $j$ disjoint segments $\mathcal{V}^k_{\mathcal{S}\setminus\{k\}} = \{\mathcal{V}^k_{\mathcal{S}\setminus\{k\}, i} : i \in \mathcal{S}\setminus\{k\}\}$, and associate the segment $\mathcal{V}^k_{\mathcal{S}\setminus\{k\}, i}$ with the server $i \in \mathcal{S}\setminus\{k\}$.

2. Server $i$, $i \in \mathcal{S}$, multicasts the bit-wise XOR, denoted by $\oplus$, of all the segments associated with it in $\mathcal{S}$, i.e., Server $i$ multicasts $\bigoplus_{k \in \mathcal{S}\setminus\{i\}} \mathcal{V}^k_{\mathcal{S}\setminus\{k\}, i}$ to the other servers in $\mathcal{S}\setminus\{i\}$.

For every pair of servers $k$ and $i$ in $\mathcal{S}$, after Server $k$ receives the coded message $X^{\mathcal{S}}_i$ from Server $i$, since Server $k$ has computed locally the segments $\mathcal{V}^{k'}_{\mathcal{S}\setminus\{k'\}, i}$ for all $k' \in \mathcal{S}\setminus\{i, k\}$, it can cancel them from the message $\bigoplus_{k \in \mathcal{S}\setminus\{i\}} \mathcal{V}^k_{\mathcal{S}\setminus\{k\}, i}$ sent by Server $i$, and recover the intended segment $\mathcal{V}^k_{\mathcal{S}\setminus\{k\}, i}$.

For each $j$ in the above coded shuffling scheme, each server in $\mathcal{Q}$ recovers $\frac{\binom{q-1}{j}\binom{K-q}{\mu q - j} m}{\frac{q}{K}\binom{K}{\mu q}}$ intermediate values for each of the output vectors it is reducing. Therefore, $s_q + 1$ is the smallest size of the subsets in which the above coded multicasting needs to be performed before enough intermediate values for reduction are delivered. In each subset $\mathcal{S}$ of size $j+1$, each server $i \in \mathcal{S}$ multicasts a coded segment of size $\frac{|\mathcal{V}^k_{\mathcal{S}\setminus\{k\}}|}{j}$ for some $k \neq i$, so the total communication load so far, for $B_j = \frac{\binom{q-1}{j}\binom{K-q}{\mu q - j}}{\frac{q}{K}\binom{K}{\mu q}}$, is

$\sum_{j=s_q}^{\mu q} \binom{q}{j+1} \cdot \frac{j+1}{j} \cdot \frac{N}{q} \cdot \frac{\binom{K-q}{\mu q - j}}{\frac{q}{K}\binom{K}{\mu q}} = \sum_{j=s_q}^{\mu q} \frac{N B_j}{j}$.  (6.15)

Next, we can continue to finish the data shuffling in two different ways. The first approach is to have the servers in $\mathcal{Q}$ communicate with each other uncoded intermediate values, until every server has exactly $m$ intermediate values for each of the output vectors it is responsible for. Using this approach, we will have a total communication load of

$L_1 = \sum_{j=s_q}^{\mu q} \frac{N B_j}{j} + N\Big(1 - \mu - \sum_{j=s_q}^{\mu q} B_j\Big)$.  (6.16)

The second approach is to continue the above 2 steps for $j = s_q - 1$. Using this approach, we will have a total communication load of $L_2 = \sum_{j=s_q-1}^{\mu q} \frac{N B_j}{j}$. Then we take the approach with the smaller communication load, and achieve $L(q) = \min\{L_1, L_2\}$.
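For a quick numeric illustration, the following sketch (an assumed helper, not from the thesis) evaluates $B_j$, the threshold $s_q$, and $L(q) = \min\{L_1, L_2\}$ from (6.15) and (6.16); with the parameters of the example in Section 6.3.1 it returns 4.2, matching the load obtained there.

```python
# Numeric evaluation of the shuffle-load formulas (6.15)-(6.16).
from fractions import Fraction
from math import comb

def shuffle_load(K, q, mu, N):
    mu = Fraction(mu)                        # pass mu exactly, e.g. 0.5 -> 1/2
    t = int(mu * q)                          # mu*q is assumed to be an integer
    B = {j: Fraction(comb(q - 1, j) * comb(K - q, t - j) * K, q * comb(K, t))
         for j in range(0, t + 1)}
    # s_q: smallest s with sum_{j=s}^{mu*q} B_j <= 1 - mu
    s_q = min(s for s in range(1, t + 1)
              if sum(B[j] for j in range(s, t + 1)) <= 1 - mu)
    coded = sum(Fraction(N) * B[j] / j for j in range(s_q, t + 1))
    L1 = coded + N * (1 - mu - sum(B[j] for j in range(s_q, t + 1)))   # finish uncoded
    L2 = coded + (N * B[s_q - 1] / (s_q - 1) if s_q > 1 else Fraction(10**9))
    return min(L1, L2)

print(float(shuffle_load(K=6, q=4, mu=0.5, N=12)))   # 4.2, as in Section 6.3.1
```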
Remark 6.7. The ideas of efficiently creating and exploiting coded multicasting opportunities have been introduced in caching problems [50-52]. In this section, we illustrated how to create and utilize such coding opportunities in distributed computing to slash the communication load in the presence of straggling servers.

6.4 Converse of Theorem 6.2

In this section, we prove the outer bound on the latency-load region in Theorem 6.2.

We start by considering a distributed matrix multiplication scheme that stops the Map phase when $q$ servers have finished their computations. For such a scheme, as given by (6.9), the computation latency $D(q)$ is the expected value of the $q$th order statistic of the Map computation times at the $K$ servers. WLOG, we can assume that Servers $1, \ldots, q$ first finish their Map computations, and they will be responsible for reducing the $N$ output vectors $y_1, \ldots, y_N$. To proceed, we first partition $y_1, \ldots, y_N$ into $q$ groups $\mathcal{G}_1, \ldots, \mathcal{G}_q$, each of size $N/q$, and define the output assignment

$\mathcal{A} = \big(\mathcal{W}^{\mathcal{A}}_1, \mathcal{W}^{\mathcal{A}}_2, \ldots, \mathcal{W}^{\mathcal{A}}_q\big)$,  (6.17)

where $\mathcal{W}^{\mathcal{A}}_k$ denotes the group of output vectors reduced by Server $k$ in the output assignment $\mathcal{A}$.

Next we choose an integer $t \in \{1, \ldots, q-1\}$, and consider the following $\lceil\frac{q}{t}\rceil$ output assignments, which are circular shifts of $(\mathcal{G}_1, \ldots, \mathcal{G}_q)$ with step size $t$:

$\mathcal{A}_1 = (\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_q)$,
$\mathcal{A}_2 = (\mathcal{G}_{t+1}, \ldots, \mathcal{G}_q, \mathcal{G}_1, \ldots, \mathcal{G}_t)$,
$\vdots$
$\mathcal{A}_{\lceil q/t\rceil} = \big(\mathcal{G}_{(\lceil q/t\rceil - 1)t+1}, \ldots, \mathcal{G}_q, \mathcal{G}_1, \ldots, \mathcal{G}_{(\lceil q/t\rceil - 1)t}\big)$.  (6.18)

Remark 6.8. We note that by the Map computation in (6.2), at each server all the input vectors $x_1, \ldots, x_N$ are multiplied by the same matrix (i.e., $U_k$ at Server $k$). Therefore, for the same set of $q$ servers and their storage contents, a feasible data shuffling scheme for one of the above output assignments is also feasible for all other $\lceil\frac{q}{t}\rceil - 1$ assignments by relabelling the output vectors. As a result, the minimum communication loads for all of the above output assignments are identical.

For a shuffling scheme admitting an output assignment $\mathcal{A}$, we denote the message sent by Server $k \in \{1, \ldots, q\}$ as $X^{\mathcal{A}}_k$, with a size of $R^{\mathcal{A}}_k m T$ bits.

Now we focus on Servers $1, \ldots, t$ and consider the compound setting that includes all $\lceil\frac{q}{t}\rceil$ output assignments in (6.18). We observe that, as shown in Fig. 6.7, in this compound setting the first $t$ servers should be able to recover all output vectors $(y_1, \ldots, y_N) = (\mathcal{G}_1, \ldots, \mathcal{G}_q)$ using their local computation results $\{U_k x_1, \ldots, U_k x_N : k = 1, \ldots, t\}$ and the received messages in all the output assignments $\{X^{\mathcal{A}_1}_k, \ldots, X^{\mathcal{A}_{\lceil q/t\rceil}}_k : k = t+1, \ldots, q\}$. Thus we have the following cut-set bound for the first $t$ servers:

$\mathrm{rank}\begin{bmatrix} U_1 \\ U_2 \\ \vdots \\ U_t \end{bmatrix} NT + \sum_{j=1}^{\lceil q/t\rceil} \sum_{k=t+1}^{q} R^{\mathcal{A}_j}_k m T \geq N m T$.  (6.19)

Figure 6.7: Cut-set of Servers $1, \ldots, t$ for the compound setting consisting of the $\lceil\frac{q}{t}\rceil$ output assignments in (6.18).

Next we consider $q$ subsets of servers, each of size $t$: $\mathcal{N}_i \triangleq \{i, i+1, \ldots, i+t-1\}$, $i = 1, \ldots, q$, where the addition is modulo $q$. Similarly, we have the following cut-set bound for $\mathcal{N}_i$:

$\mathrm{rank}\begin{bmatrix} U_i \\ U_{i+1} \\ \vdots \\ U_{i+t-1} \end{bmatrix} NT + \sum_{j=1}^{\lceil q/t\rceil} \sum_{k \notin \mathcal{N}_i} R^{\mathcal{A}_j}_k m T \geq N m T$.  (6.20)

Summing up these $q$ cut-set bounds, we have

$NT \sum_{i=1}^{q} \mathrm{rank}\begin{bmatrix} U_i \\ U_{i+1} \\ \vdots \\ U_{i+t-1} \end{bmatrix} + \sum_{i=1}^{q} \sum_{j=1}^{\lceil q/t\rceil} \sum_{k \notin \mathcal{N}_i} R^{\mathcal{A}_j}_k m T \geq q N m T$,  (6.21)

$\Rightarrow \sum_{j=1}^{\lceil q/t\rceil} \sum_{i=1}^{q} \sum_{k \notin \mathcal{N}_i} R^{\mathcal{A}_j}_k \geq q N - q N \min\{\mu t, 1\}$,  (6.22)
$\Rightarrow \lceil\tfrac{q}{t}\rceil (q-t) L \overset{(a)}{\geq} (1 - \min\{t\mu, 1\}) q N$,  (6.23)

where (a) results from the fact mentioned in Remark 6.8 that the communication load is independent of the output assignment. Since (6.23) holds for all $t = 1, \ldots, q-1$, we have

$L \geq \bar{L}(q) = N \max_{t=1,\ldots,q-1} \frac{1 - \min\{t\mu, 1\}}{\lceil\frac{q}{t}\rceil (q-t)} \, q$.  (6.24)

We assume that the Map phase terminates when $q$ servers finish the computations with probability $P(q)$, for all $q \in \{\lceil\frac{1}{\mu}\rceil, \ldots, K\}$. Then the communication load for a latency $\mathbb{E}_q[D(q)]$, which is a convex combination of $\{\mathbb{E}[S_{(q)}] : q = \lceil\frac{1}{\mu}\rceil, \ldots, K\}$, is lower bounded by $\mathbb{E}_q[\bar{L}(q)]$, which is the same convex combination of $\{\bar{L}(q) : q = \lceil\frac{1}{\mu}\rceil, \ldots, K\}$. Considering all distributions of $q$, we achieve all points on the lower convex envelope of the points $\{(\mathbb{E}[S_{(q)}], \bar{L}(q)) : q = \lceil\frac{1}{\mu}\rceil, \ldots, K\}$ as an outer bound on the latency-load region.

Appendix A
Converse of Theorem 2.2

In this appendix, we prove the lower bound on $L^*(r,s)$ in Theorem 2.2, which generalizes the converse result of Theorem 2.1 to the case $s > 1$. Since the lower bound on $L^*(r, 1)$ in Theorem 2.2 exactly matches the lower bound on $L^*(r)$ in Theorem 2.1, we focus on the case $s > 1$ (i.e., each Reduce function is calculated by 2 or more nodes) throughout this appendix.

We denote the minimum communication load under a particular file assignment $\mathcal{M}$ as $L^*_{\mathcal{M}}(s)$, and we present a lower bound on $L^*_{\mathcal{M}}(s)$ in the following lemma.

Lemma A.1. $L^*_{\mathcal{M}}(s) \geq \sum_{j=1}^{K} \frac{a^j_{\mathcal{M}}}{N} \sum_{\ell=\max\{j,s\}}^{\min\{j+s,K\}} \frac{\binom{K-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1}$, where $a^j_{\mathcal{M}}$ is defined in (2.22).

In the rest of this appendix, we first prove the converse part of Theorem 2.2 by showing $L^*(r,s) \geq \sum_{\ell=\max\{r,s\}}^{\min\{r+s,K\}} \frac{\binom{K-r}{\ell-r}\binom{r}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-r}{\ell-1}$, and then give the proof of Lemma A.1.

Converse Proof of Theorem 2.2. The minimum communication load $L^*(r,s)$ is lower bounded by the minimum value of $L^*_{\mathcal{M}}(s)$ over all possible file assignments having a computation load of $r$:

$L^*(r,s) \geq \inf_{\mathcal{M} : |\mathcal{M}_1| + \cdots + |\mathcal{M}_K| = rN} L^*_{\mathcal{M}}(s)$.  (A.1)

Then by Lemma A.1, we have

$L^*(r,s) \geq \inf_{\mathcal{M} : |\mathcal{M}_1| + \cdots + |\mathcal{M}_K| = rN} \sum_{j=1}^{K} \frac{a^j_{\mathcal{M}}}{N} \sum_{\ell=\max\{j,s\}}^{\min\{j+s,K\}} \frac{\binom{K-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1}$.  (A.2)

For every file assignment $\mathcal{M}$ such that $|\mathcal{M}_1| + \cdots + |\mathcal{M}_K| = rN$, $\{a^j_{\mathcal{M}}\}_{j=1}^{K}$ satisfy the same conditions as in the case of $s = 1$, given in (2.25), (2.26) and (2.27). For a general computation load $1 \leq r \leq K$, and the function

$L_{\mathrm{coded}}(r,s) = \sum_{\ell=\max\{r+1,s\}}^{\min\{r+s,K\}} \frac{\ell \binom{K}{\ell}\binom{\ell-2}{r-1}\binom{r}{\ell-s}}{r\binom{K}{r}\binom{K}{s}} = \sum_{\ell=\max\{r,s\}}^{\min\{r+s,K\}} \frac{\binom{K-r}{\ell-r}\binom{r}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-r}{\ell-1}$

as defined in (2.20), we first find the line $p + qj$, as a function of $1 \leq j \leq K$, connecting the two points $(\lfloor r\rfloor, L_{\mathrm{coded}}(\lfloor r\rfloor, s))$ and $(\lceil r\rceil, L_{\mathrm{coded}}(\lceil r\rceil, s))$. More specifically, we find $p, q \in \mathbb{R}$ such that

$p + qj\,\big|_{j=\lfloor r\rfloor} = L_{\mathrm{coded}}(\lfloor r\rfloor, s)$,  (A.3)
$p + qj\,\big|_{j=\lceil r\rceil} = L_{\mathrm{coded}}(\lceil r\rceil, s)$.  (A.4)

Then by the convexity of the function $L_{\mathrm{coded}}(j,s)$ in $j$, we have for integer-valued $j = 1, \ldots, K$,

$L_{\mathrm{coded}}(j,s) = \sum_{\ell=\max\{j,s\}}^{\min\{j+s,K\}} \frac{\binom{K-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1} \geq p + qj, \quad j = 1, \ldots, K$.  (A.5)

Then (A.2) reduces to

$L^*(r,s) \geq \inf_{\mathcal{M} : |\mathcal{M}_1| + \cdots + |\mathcal{M}_K| = rN} \sum_{j=1}^{K} \frac{a^j_{\mathcal{M}}}{N} \cdot (p + qj)$  (A.6)
$= \inf_{\mathcal{M} : |\mathcal{M}_1| + \cdots + |\mathcal{M}_K| = rN} \Big( \sum_{j=1}^{K} \frac{a^j_{\mathcal{M}}}{N} \cdot p + \sum_{j=1}^{K} \frac{j a^j_{\mathcal{M}}}{N} \cdot q \Big)$  (A.7)
$\overset{(a)}{=} p + qr$,  (A.8)

where (a) is due to the constraints on $\{a^j_{\mathcal{M}}\}_{j=1}^{K}$ in (2.26) and (2.27). Therefore, $L^*(r,s)$ is lower bounded by the lower convex envelope of the points $\{(r, L_{\mathrm{coded}}(r,s)) : r \in \{1, \ldots, K\}\}$. This completes the proof of the converse part of Theorem 2.2.
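The following small numeric sketch (assumed parameters $K = 10$, $s = 2$; not part of the thesis) evaluates $L_{\mathrm{coded}}(r, s)$ in its second form and checks the discrete convexity in $r$ that the interpolation argument above relies on.

```python
# Evaluate L_coded(r, s) and check discrete convexity in r (second differences >= 0).
from fractions import Fraction
from math import comb

def L_coded(r, s, K):
    return sum(Fraction(comb(K - r, l - r) * comb(r, l - s), comb(K, s))
               * Fraction(l - r, l - 1)
               for l in range(max(r, s), min(r + s, K) + 1))

K, s = 10, 2
vals = [L_coded(r, s, K) for r in range(1, K + 1)]
assert all(vals[i - 1] + vals[i + 1] >= 2 * vals[i] for i in range(1, K - 1))
print([float(v) for v in vals[:4]])
```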
The proof of Lemma A.1 follows the same steps as the proof of Lemma 1, where a lower bound on the number of bits communicated by any subset of nodes, for the case of $s > 1$, is established by induction.

Proof of Lemma A.1. We first prove the following claim.

Claim A.1. For any subset $\mathcal{S} \subseteq \{1, \ldots, K\}$, we have

$H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c}) \geq QT \sum_{j=1}^{|\mathcal{S}|} a^{j,\mathcal{S}}_{\mathcal{M}} \sum_{\ell=\max\{j,s\}}^{\min\{j+s,|\mathcal{S}|\}} \frac{\binom{|\mathcal{S}|-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1}$,  (A.9)

where $a^{j,\mathcal{S}}_{\mathcal{M}}$ is defined in (2.39).

We prove Claim A.1 by induction.

a. If $\mathcal{S} = \{k\}$ for any $k \in \{1, \ldots, K\}$, obviously

$H(X_k \mid Y_{\{1,\ldots,K\}\setminus\{k\}}) \geq 0 = QT a^{1,\{k\}}_{\mathcal{M}} \sum_{\ell=s}^{1} \frac{\binom{0}{\ell-1}\binom{1}{\ell-s}}{\binom{K}{s}}$,  (A.10)

where the sum is empty since $s > 1$.

b. Suppose the statement is true for all subsets of size $S_0$. For any $\mathcal{S} \subseteq \{1, \ldots, K\}$ of size $|\mathcal{S}| = S_0 + 1$, and all $k \in \mathcal{S}$, we have, as derived in (2.62):

$H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c}) \geq \frac{1}{S_0} \sum_{k \in \mathcal{S}} \big( H(X_{\mathcal{S}} \mid V_{\mathcal{W}_k,:}, V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) + H(V_{\mathcal{W}_k,:} \mid V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) \big)$,  (A.11)

where $Y_{\mathcal{S}^c} = (V_{\mathcal{W}_{\mathcal{S}^c},:}, V_{:,\mathcal{M}_{\mathcal{S}^c}})$. The first term on the RHS of (A.11) is lower bounded by the induction assumption:

$H(X_{\mathcal{S}} \mid V_{\mathcal{W}_k,:}, V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) = H(X_{\mathcal{S}\setminus\{k\}} \mid Y_{(\mathcal{S}\setminus\{k\})^c})$  (A.12)
$\geq QT \sum_{j=1}^{S_0} a^{j,\mathcal{S}\setminus\{k\}}_{\mathcal{M}} \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0\}} \frac{\binom{S_0-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1}$.  (A.13)

The second term on the RHS of (A.11) can be calculated based on the independence of the intermediate values:

$H(V_{\mathcal{W}_k,:} \mid V_{:,\mathcal{M}_k}, Y_{\mathcal{S}^c}) = H(V_{\mathcal{W}_k,:} \mid V_{:,\mathcal{M}_k}, V_{\mathcal{W}_{\mathcal{S}^c},:}, V_{:,\mathcal{M}_{\mathcal{S}^c}})$  (A.14)
$\overset{(a)}{=} H(V_{\mathcal{W}_k,:} \mid V_{\mathcal{W}_k, \mathcal{M}_k \cup \mathcal{M}_{\mathcal{S}^c}}, V_{\mathcal{W}_{\mathcal{S}^c},:})$  (A.15)
$\overset{(b)}{=} \sum_{q \in \mathcal{W}_k} H(V_{\{q\},:} \mid V_{\{q\}, \mathcal{M}_k \cup \mathcal{M}_{\mathcal{S}^c}}, V_{\mathcal{W}_{\mathcal{S}^c},:})$  (A.16)
$\overset{(c)}{=} \frac{Q}{\binom{K}{s}} \binom{|\mathcal{S}|-1}{s-1} T \sum_{j=0}^{S_0} a^{j,\mathcal{S}\setminus\{k\}}_{\mathcal{M}}$  (A.17)
$\geq \frac{Q}{\binom{K}{s}} \binom{|\mathcal{S}|-1}{s-1} T \sum_{j=1}^{S_0} a^{j,\mathcal{S}\setminus\{k\}}_{\mathcal{M}}$,  (A.18)

where (a) and (b) are due to the independence of the intermediate values, and (c) is due to the uniform distribution of the output functions such that each node in $\mathcal{S}$ calculates $\frac{Q}{\binom{K}{s}} \cdot \binom{|\mathcal{S}|-1}{s-1}$ output functions computed exclusively by $s$ nodes in $\mathcal{S}$.

Thus by (A.11), (A.13), and (A.18), we have

$H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c}) \geq \frac{QT}{S_0} \sum_{k \in \mathcal{S}} \sum_{j=1}^{S_0} a^{j,\mathcal{S}\setminus\{k\}}_{\mathcal{M}} \Bigg( \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0\}} \frac{\binom{S_0-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1} + \frac{\binom{S_0}{s-1}}{\binom{K}{s}} \Bigg)$  (A.19)
$= \frac{QT}{S_0} \sum_{j=1}^{S_0} \Bigg( \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0\}} \frac{\binom{S_0-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1} + \frac{\binom{S_0}{s-1}}{\binom{K}{s}} \Bigg) \sum_{k \in \mathcal{S}} a^{j,\mathcal{S}\setminus\{k\}}_{\mathcal{M}}$  (A.20)
$= QT \sum_{j=1}^{S_0} \frac{S_0+1-j}{S_0} \Bigg( \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0\}} \frac{\binom{S_0-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1} + \frac{\binom{S_0}{s-1}}{\binom{K}{s}} \Bigg) a^{j,\mathcal{S}}_{\mathcal{M}}$  (A.21)
$= QT \sum_{j=1}^{S_0+1} \frac{S_0+1-j}{S_0} \Bigg( \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0\}} \frac{\binom{S_0-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1} + \frac{\binom{S_0}{s-1}}{\binom{K}{s}} \Bigg) a^{j,\mathcal{S}}_{\mathcal{M}}$.  (A.22)

For each $j \in \{1, \ldots, S_0+1\}$ in (A.22), we have

$\frac{S_0+1-j}{S_0} \Bigg( \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0\}} \frac{\binom{S_0-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1} + \frac{\binom{S_0}{s-1}}{\binom{K}{s}} \Bigg)$
$= \frac{S_0+1-j}{S_0 \binom{K}{s}} \Bigg( \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0\}} \binom{S_0-j}{\ell-j}\binom{j}{\ell-s} \frac{\ell-j}{\ell-1} + \sum_{\ell=\max\{j+1,s\}}^{\min\{j+s,S_0+1\}} \binom{S_0-j}{\ell-j-1}\binom{j}{\ell-s} \Bigg)$  (A.23)
$= \frac{S_0+1-j}{S_0 \binom{K}{s}} \Bigg( \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0+1\}} \binom{S_0-j}{\ell-j}\binom{j}{\ell-s} \frac{\ell-j}{\ell-1} + \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0+1\}} \binom{S_0-j}{\ell-j-1}\binom{j}{\ell-s} \Bigg)$  (A.24)
$= \frac{1}{\binom{K}{s}} \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0+1\}} \binom{S_0+1-j}{\ell-j}\binom{j}{\ell-s} \Big( \frac{S_0-\ell+1}{S_0} \cdot \frac{\ell-j}{\ell-1} + \frac{\ell-j}{S_0} \Big)$  (A.25)
$= \frac{1}{\binom{K}{s}} \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0+1\}} \binom{S_0+1-j}{\ell-j}\binom{j}{\ell-s} \frac{\ell-j}{\ell-1}$.  (A.26)

Applying (A.26) to (A.22) yields

$H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c}) \geq QT \sum_{j=1}^{S_0+1} a^{j,\mathcal{S}}_{\mathcal{M}} \sum_{\ell=\max\{j,s\}}^{\min\{j+s,S_0+1\}} \frac{\binom{S_0+1-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1}$  (A.27)
$= QT \sum_{j=1}^{|\mathcal{S}|} a^{j,\mathcal{S}}_{\mathcal{M}} \sum_{\ell=\max\{j,s\}}^{\min\{j+s,|\mathcal{S}|\}} \frac{\binom{|\mathcal{S}|-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1}$.  (A.28)

Since (A.28) holds for all subsets $\mathcal{S}$ of size $|\mathcal{S}| = S_0 + 1$, we have proven Claim A.1.

Then by Claim A.1, letting $\mathcal{S} = \{1, \ldots, K\}$ be the set of all $K$ nodes,

$L^*_{\mathcal{M}}(s) \geq \frac{H(X_{\mathcal{S}} \mid Y_{\mathcal{S}^c})}{QNT} \geq \sum_{j=1}^{K} \frac{a^j_{\mathcal{M}}}{N} \sum_{\ell=\max\{j,s\}}^{\min\{j+s,K\}} \frac{\binom{K-j}{\ell-j}\binom{j}{\ell-s}}{\binom{K}{s}} \cdot \frac{\ell-j}{\ell-1}$.  (A.29)

This completes the proof of Lemma A.1.
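As a numerical check (not part of the thesis) on the combinatorial step (A.19)-(A.26), the short script below confirms the underlying identity $\frac{S_0+1-j}{S_0}\big(g(S_0, j) + \binom{S_0}{s-1}\big) = g(S_0+1, j)$, where $g(P, j)$ abbreviates the inner sum $\sum_{\ell} \binom{P-j}{\ell-j}\binom{j}{\ell-s}\frac{\ell-j}{\ell-1}$ (the common factor $\frac{1}{\binom{K}{s}}$ cancels on both sides).

```python
# Numeric verification of the identity behind (A.19)-(A.26) for small parameters.
from fractions import Fraction
from math import comb

def g(P, j, s):
    return sum(Fraction(comb(P - j, l - j) * comb(j, l - s) * (l - j), l - 1)
               for l in range(max(j, s), min(j + s, P) + 1))

for s in (2, 3, 4):
    for P in range(s, 12):
        for j in range(1, P + 1):
            lhs = Fraction(P + 1 - j, P) * (g(P, j, s) + comb(P, s - 1))
            assert lhs == g(P + 1, j, s)
print("identity verified for s in {2,3,4} and P < 12")
```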
Appendix B
Coded TeraSort Experiment Results

In this appendix, we summarize additional EC2 experiment results of TeraSort and CodedTeraSort in Tables B.1-B.9. We observe a speedup of CodedTeraSort ranging from 1.59x to 4.11x.

Table B.1: Sorting 11 GB data with K = 12 worker nodes and 100 Mbps network speed (times in seconds)
                        CodeGen   Map     Pack/Encode   Shuffle    Unpack/Decode   Reduce   Total Time   Speedup
TeraSort                   –      2.20       2.51        844.10         1.24        13.34     863.39        –
CodedTeraSort: r = 3      1.34    8.38       7.43        266.20         3.11        17.52     303.98      2.84x
CodedTeraSort: r = 5      2.66   15.94       8.65        192.08         4.15        18.50     241.98      3.57x

Table B.2: Sorting 11 GB data with K = 12 worker nodes and 50 Mbps network speed (times in seconds)
                        CodeGen   Map     Pack/Encode   Shuffle    Unpack/Decode   Reduce   Total Time   Speedup
TeraSort                   –      2.20       2.51       1655.63         1.23        13.37    1674.94        –
CodedTeraSort: r = 3      1.34    8.42       7.38        558.74         3.11        17.46     596.46      2.81x
CodedTeraSort: r = 5      2.65   15.63       8.67        380.18         4.13        18.33     429.59      3.90x

Table B.3: Sorting 11 GB data with K = 12 worker nodes and 20 Mbps network speed (times in seconds)
                        CodeGen   Map     Pack/Encode   Shuffle    Unpack/Decode   Reduce   Total Time   Speedup
TeraSort                   –      2.22       2.51       4093.30         1.23        13.39    4112.67        –
CodedTeraSort: r = 3      1.31    8.30       7.30       1348.29         3.13        17.22    1385.54      2.97x
CodedTeraSort: r = 5      2.55   15.62       8.68        952.34         4.13        18.23    1001.53      4.11x

Table B.4: Sorting 11 GB data with K = 16 worker nodes and 100 Mbps network speed (times in seconds)
                        CodeGen   Map     Pack/Encode   Shuffle    Unpack/Decode   Reduce   Total Time   Speedup
TeraSort                   –      1.65       2.11        865.98         0.77         9.56     880.06        –
CodedTeraSort: r = 3      5.59    5.55       5.29        374.71         2.15        11.73     405.02      2.17x
CodedTeraSort: r = 5     24.06    9.83       7.41        203.99         3.40        13.05     261.74      3.36x

Table B.5: Sorting 11 GB data with K = 16 worker nodes and 50 Mbps network speed (times in seconds)
                        CodeGen   Map     Pack/Encode   Shuffle    Unpack/Decode   Reduce   Total Time   Speedup
TeraSort                   –      1.64       2.11       1696.27         0.77         9.58    1710.38        –
CodedTeraSort: r = 3      5.51    5.44       5.25        745.63         2.15        10.97     774.95      2.21x
CodedTeraSort: r = 5     23.00    9.99       7.42        403.79         3.42        13.11     460.72      3.71x

Table B.6: Sorting 11 GB data with K = 16 worker nodes and 20 Mbps network speed (times in seconds)
                        CodeGen   Map     Pack/Encode   Shuffle    Unpack/Decode   Reduce   Total Time   Speedup
TeraSort                   –      1.69       2.11       4189.35         0.77         9.59    4203.51        –
CodedTeraSort: r = 3      5.61    5.47       5.27       1859.61         2.16        10.73    1888.86      2.23x
CodedTeraSort: r = 5     24.02    9.92       7.42       1008.52         3.40        12.98    1066.25      3.94x

Table B.7: Sorting 11 GB data with K = 20 worker nodes and 100 Mbps network speed (times in seconds)
                        CodeGen   Map     Pack/Encode   Shuffle    Unpack/Decode   Reduce   Total Time   Speedup
TeraSort                   –      1.33       1.81        879.50         0.56         7.45     890.65        –
CodedTeraSort: r = 3     34.57    4.27       4.47        430.00         1.71         8.78     483.79      1.84x
CodedTeraSort: r = 5    274.59    7.82       6.89        257.31         3.44         9.98     560.03      1.59x

Table B.8: Sorting 11 GB data with K = 20 worker nodes and 50 Mbps network speed (times in seconds)
                        CodeGen   Map     Pack/Encode   Shuffle    Unpack/Decode   Reduce   Total Time   Speedup
TeraSort                   –      1.33       1.81       1721.21         0.56         7.44    1732.36        –
CodedTeraSort: r = 3     34.40    4.27       4.47        824.50         1.68         8.15     877.46      1.97x
CodedTeraSort: r = 5    275.28    7.75       6.89        493.47         3.44         9.89     796.72      2.17x

Table B.9: Sorting 11 GB data with K = 20 worker nodes and 20 Mbps network speed (times in seconds)
                        CodeGen   Map     Pack/Encode   Shuffle    Unpack/Decode   Reduce   Total Time   Speedup
TeraSort                   –      1.34       1.82       4247.45         0.56         7.45    4258.62        –
CodedTeraSort: r = 3     33.84    4.31       4.47       2031.15         1.68         7.81    2083.26      2.04x
CodedTeraSort: r = 5    275.10    7.73       6.88       1225.50         3.44         9.82    1528.47      2.78x
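As a quick arithmetic check on how these tables are read (not part of the thesis artifacts): the Total Time column equals the sum of the per-phase times, and the Speedup column is the ratio of the TeraSort total to the CodedTeraSort total. The snippet below reproduces the 2.84x entry of Table B.1.

```python
# Recomputing the totals and the speedup entry of Table B.1 (K = 12, 100 Mbps).
terasort = [0.0, 2.20, 2.51, 844.10, 1.24, 13.34]   # CodeGen..Reduce (CodeGen is "-")
coded_r3 = [1.34, 8.38, 7.43, 266.20, 3.11, 17.52]

total_plain, total_coded = sum(terasort), sum(coded_r3)
print(round(total_plain, 2), round(total_coded, 2))   # 863.39  303.98
print(f"speedup = {total_plain / total_coded:.2f}x")  # 2.84x
```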
Appendix C
Constant Multiplicative Gap of Minimum Bandwidth Code

In this appendix, we prove that when all $K$ servers finish their Map computations, i.e., $\mathcal{Q} = \{1, \ldots, K\}$ and we operate at the point with the maximum latency, the communication load achieved by the proposed coded scheme (or the Minimum Bandwidth Code) is within a constant multiplicative factor of the lower bound on the communication load in Theorem 6.2. More specifically,

$\frac{L(K)}{\bar{L}(K)} < 3 + \sqrt{5}$,  (C.1)

when $\mu K$ is an integer (see Footnote 1), where $L(K)$ and $\bar{L}(K)$ are respectively given by (6.10) and (6.12).

Proof. For $\mu K \in \mathbb{N}$, we have $L(K) = N\frac{1-\mu}{\mu K}$, and

$\frac{L(K)}{\bar{L}(K)} = \frac{\frac{1-\mu}{\mu K}}{\max\limits_{t=1,\ldots,K-1} \frac{1-\min\{t\mu,1\}}{\lceil\frac{K}{t}\rceil(K-t)} K}$.  (C.2)

We proceed to bound the RHS of (C.2) in the following two cases.

1) $1 \leq \frac{1}{\mu} \leq 3 + \sqrt{5}$. We set $t = 1$ in (C.2) to have

$\frac{L(K)}{\bar{L}(K)} \leq \frac{\frac{1-\mu}{\mu K}}{\frac{1-\mu}{K-1}} < \frac{1}{\mu} \leq 3 + \sqrt{5}$.  (C.3)

2) $\frac{1}{\mu} > 3 + \sqrt{5}$. Since $\mu K \geq 1$, we have $K-1 \geq \lceil\frac{K}{2}\rceil \geq \lceil\frac{1}{2\mu}\rceil$. In this case, we set $t = \lceil\frac{1}{2\mu}\rceil$ in (C.2) to have

$\frac{L(K)}{\bar{L}(K)} \leq \frac{(1-\mu)\big\lceil\frac{K}{\lceil\frac{1}{2\mu}\rceil}\big\rceil \big(K - \lceil\frac{1}{2\mu}\rceil\big)}{\mu K^2 \big(1 - \mu\lceil\frac{1}{2\mu}\rceil\big)}$  (C.4)
$\leq \frac{2(1-\mu)\big(K - \lceil\frac{1}{2\mu}\rceil\big)}{K\big(1 - \mu\lceil\frac{1}{2\mu}\rceil\big)} < \frac{2(1-\mu)}{1 - \mu\lceil\frac{1}{2\mu}\rceil}$  (C.5)
$\leq \frac{2(1-\mu)}{1 - \mu\big(\frac{1}{2\mu} + 1\big)}$  (C.6)
$= 4 + \frac{4}{\frac{1}{\mu} - 2} < 3 + \sqrt{5}$.  (C.7)

Comparing (C.3) and (C.7) completes the proof.

Footnote 1: This always holds true for large $K$.
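The following numeric sketch (assumed parameter ranges, not from the thesis) evaluates the ratio $L(K)/\bar{L}(K)$ directly from (6.10) and (6.24) over a grid of $(\mu, K)$ with $\mu K$ an integer, and confirms that it stays below $3 + \sqrt{5}$.

```python
# Numeric check of the constant gap of Appendix C over a grid of (mu, K).
from fractions import Fraction
from math import ceil, sqrt

def load_ratio(K, mu):
    L = (1 - mu) / (mu * K)                               # L(K)/N, mu*K integer
    L_bar = max((Fraction(1) - min(t * mu, 1)) * K
                / (ceil(Fraction(K, t)) * (K - t))
                for t in range(1, K))                     # \bar L(K)/N from (6.24)
    return L / L_bar

worst = max(load_ratio(K, Fraction(s, K)) for K in range(2, 25) for s in range(1, K))
print(float(worst), "<", 3 + sqrt(5))
assert worst < 3 + sqrt(5)
```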
Bibliography

[1] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, “A fundamental tradeoff between computation and communication in distributed computing,” IEEE Transactions on Information Theory, vol. 64, no. 1, Jan. 2018.
[2] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, Mar. 2018.
[3] “The facebook data center FAQ,” available online: http://www.datacenterknowledge.com/the-facebook-data-center-faq/. Accessed on Jan. 26, 2018.
[4] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Sixth USENIX Symposium on Operating System Design and Implementation, Dec. 2004.
[5] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in Proceedings of the 2nd USENIX HotCloud, vol. 10, June 2010, p. 10.
[6] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in neural information processing systems (NIPS), 2011, pp. 693–701.
[7] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Large-scale matrix factorization with distributed stochastic gradient descent,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011, pp. 69–77.
[8] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin, “A fast parallel sgd for matrix factorization in shared memory systems,” in Proceedings of the 7th ACM conference on Recommender systems. ACM, 2013, pp. 249–256.
[9] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. Vijaykumar, “Tarazu: optimizing MapReduce on heterogeneous clusters,” in ACM SIGARCH Computer Architecture News, vol. 40, no. 1, Mar. 2012, pp. 61–74.
[10] Y. Guo, J. Rao, and X. Zhou, “iShuffle: Improving Hadoop performance with shuffle-on-write,” in Proceedings of the 10th International Conference on Autonomic Computing, June 2013, pp. 107–117.
[11] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, “Managing data transfers in computer clusters with orchestra,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 98–109, Aug. 2011.
[12] Z. Zhang, L. Cherkasova, and B. T. Loo, “Performance modeling of MapReduce jobs in heterogeneous cloud environments,” in IEEE Sixth International Conference on Cloud Computing, June 2013, pp. 839–846.
[13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
[14] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project adam: Building an efficient and scalable deep learning training system,” in 11th USENIX Symposium on Operating Systems Design and Implementation, vol. 14, 2014, pp. 571–582.
[15] A. Rajaraman and J. D. Ullman, Mining of massive datasets. Cambridge University Press, 2011.
[16] A. C.-C. Yao, “Some complexity questions related to distributive computing (preliminary report),” in Proceedings of the eleventh annual ACM symposium on Theory of computing, Apr. 1979, pp. 209–213.
[17] A. Orlitsky and A. El Gamal, “Average and randomized communication complexity,” IEEE Transactions on Information Theory, vol. 36, no. 1, pp. 3–16, Jan. 1990.
[18] E. Kushilevitz and N. Nisan, Communication Complexity. Cambridge University Press, 2006.
[19] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, “VL2: a scalable and flexible data center network,” ACM SIGCOMM computer communication review, vol. 39, no. 4, pp. 51–62, Oct. 2009.
[20] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, “Hedera: Dynamic flow scheduling for data center networks,” 7th USENIX Symposium on Networked Systems Design and Implementation, Apr. 2010.
[21] S. Zhang, J. Han, Z. Liu, K. Wang, and S. Feng, “Accelerating MapReduce with distributed memory cache,” 15th IEEE International Conference on Parallel and Distributed Systems (ICPADS), pp. 472–478, Dec. 2009.
[22] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, “Twister: a runtime for iterative MapReduce,” Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 810–818, June 2010.
[23] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns,” Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[24] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “Qsgd: Communication-efficient sgd via gradient quantization and encoding,” Advances in neural information processing systems (NIPS), pp. 1707–1718, 2017.
[25] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” Advances in neural information processing systems (NIPS), pp. 1508–1518, 2017.
[26] “Hadoop TeraSort,” https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html. Accessed on Jan. 30, 2018.
[27] “Amazon Elastic Compute Cloud (EC2),” https://aws.amazon.com/ec2/. Accessed on Jan. 30, 2018.
[28] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli, “Fog computing and its role in the internet of things,” in Proceedings of the 1st edition of the MCC workshop on Mobile cloud computing. ACM, 2012, pp. 13–16.
[29] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet of Things Journal, 2016.
[30] “Apache Hadoop,” http://hadoop.apache.org. Accessed on Jan. 30, 2018.
[31] G. Ananthanarayanan, S. Kandula, A. G. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, “Reining in the outliers in map-reduce clusters using mantri,” in OSDI, vol. 10, no. 1, 2010, p. 24.
[32] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving MapReduce performance in heterogeneous environments,” OSDI, vol. 8, no. 4, p. 7, Dec. 2008.
[33] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, “Effective straggler mitigation: Attack of the clones,” in 10th USENIX Symposium on Networked Systems Design and Implementation, 2013, pp. 185–198.
[34] K. Gardner, S. Zbarsky, S. Doroudi, M. Harchol-Balter, and E. Hyytia, “Reducing latency via redundant requests: Exact analysis,” ACM SIGMETRICS Performance Evaluation Review, vol. 43, no. 1, pp. 347–360, 2015.
[35] K. Lee, R. Pedarsani, and K. Ramchandran, “On scheduling redundant requests with cancellation overheads,” in 53rd Annual Allerton Conference on Communication, Control, and Computing. IEEE, 2015, pp. 99–106.
[36] M. Chaubey and E. Saule, “Replicated data placement for uncertain scheduling,” in IEEE International Parallel and Distributed Processing Symposium Workshop, 2015, pp. 464–472.
[37] N. B. Shah, K. Lee, and K. Ramchandran, “When do redundant requests reduce latency?” IEEE Transactions on Communications, vol. 64, no. 2, pp. 715–722, 2016.
[38] S. Dutta, V. Cadambe, and P. Grover, “Short-dot: Computing large linear transforms distributedly using coded short dot products,” NIPS, pp. 2100–2108, 2016.
[39] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 70. International Convention Centre, Sydney, Australia: PMLR, Aug. 2017, pp. 3368–3376.
[40] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” NIPS, pp. 4406–4416, 2017.
[41] S. Lin and D. J. Costello, Error control coding. Pearson, 2004.
[42] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded MapReduce,” 53rd Annual Allerton Conference on Communication, Control, and Computing, Sept. 2015.
[43] ——, “Fundamental tradeoff between computation and communication in distributed computing,” IEEE International Symposium on Information Theory, July 2016.
[44] ——, “Coded distributed computing: Straggling servers and multistage dataflows,” 54th Annual Allerton Conference on Communication, Control, and Computing, Sept. 2016.
[45] J. Korner and K. Marton, “How to encode the modulo-two sum of binary sources,” IEEE Transactions on Information Theory, vol. 25, no. 2, pp. 219–221, Mar. 1979.
[46] K. Becker and U. Wille, “Communication complexity of group key distribution,” in Proceedings of the 5th ACM conference on Computer and communications security, Nov. 1998, pp. 1–6.
[47] A. Orlitsky and J. Roche, “Coding for computing,” IEEE Transactions on Information Theory, vol. 47, no. 3, pp. 903–917, Mar. 2001.
[48] B. Nazer and M. Gastpar, “Computation over multiple-access channels,” IEEE Transactions on Information Theory, vol. 53, no. 10, pp. 3498–3516, Oct. 2007.
[49] A. Ramamoorthy and M. Langberg, “Communicating the sum of sources over a network,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 4, pp. 655–665, Apr. 2013.
[50] M. A. Maddah-Ali and U. Niesen, “Fundamental limits of caching,” IEEE Transactions on Information Theory, vol. 60, no. 5, pp. 2856–2867, Mar. 2014.
[51] ——, “Decentralized coded caching attains order-optimal memory-rate tradeoff,” IEEE/ACM Transactions on Networking, Apr. 2014.
[52] M. Ji, G. Caire, and A. F. Molisch, “Fundamental limits of caching in wireless D2D networks,” IEEE Transactions on Information Theory, vol. 62, no. 2, pp. 849–869, Feb. 2016.
[53] N. Karamchandani, U. Niesen, M. A. Maddah-Ali, and S. Diggavi, “Hierarchical coded caching,” IEEE International Symposium on Information Theory, pp. 2142–2146, June 2014.
[54] Q. Yu, S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “How to optimally allocate resources for coded distributed computing?” IEEE International Conference on Communications (ICC), pp. 1–7, May 2017.
[55] Y. Birk and T. Kol, “Coding on demand by an informed source (ISCOD) for efficient broadcast of different supplemental data to caching clients,” IEEE Transactions on Information Theory, vol. 52, no. 6, pp. 2825–2830, June 2006.
[56] Z. Bar-Yossef, Y. Birk, T. Jayram, and T. Kol, “Index coding with side information,” IEEE Transactions on Information Theory, vol. 57, no. 3, pp. 1479–1494, Mar. 2011.
[57] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, “Network information flow,” IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1204–1216, July 2000.
[58] R. Koetter and M. Medard, “An algebraic approach to network coding,” IEEE/ACM Transactions on Networking, vol. 11, no. 5, pp. 782–795, Oct. 2003.
[59] T. Ho, R. Koetter, M. Medard, D. R. Karger, and M. Effros, “The benefits of coding over routing in a randomized setting,” IEEE International Symposium on Information Theory, pp. 442–, June 2003.
[60] C. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun, “Map-Reduce for machine learning on multicore,” Advances in neural information processing systems, vol. 19, 2007.
[61] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in ACM SIGOPS Operating Systems Review, vol. 41, no. 3, June 2007, pp. 59–72.
[62] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, “HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 922–933, Aug. 2009.
[63] J. Ekanayake, T. Gunarathne, G. Fox, A. S. Balkir, C. Poulain, N. Araujo, and R. Barga, “DryadLINQ for scientific analyses,” in Fifth IEEE International Conference on e-Science, 2009, pp. 329–336.
[64] B. Saha, H. Shah, S. Seth, G. Vijayaraghavan, A. Murthy, and C. Curino, “Apache Tez: A unifying framework for modeling and building data processing applications,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, May 2015, pp. 1357–1369.
[65] S. Li, S. Supittayapornpong, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded TeraSort,” 6th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, May 2017.
[66] S. G. Akl, Parallel sorting algorithms. Academic press, 2014, vol. 12.
[67] D. Pasetto and A. Akhriev, “A comparative study of parallel sort algorithms,” in Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion. ACM, 2011, pp. 203–204.
[68] O. O'Malley, “Terabyte sort on Apache Hadoop,” Yahoo, Tech. Rep., May 2008.
[69] “Open MPI: Open source high performance computing,” https://www.open-mpi.org/.
[70] “tc - show / manipulate traffic control settings,” http://lartc.org/manpages/tc.txt.
[71] S. Li, Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “A scalable framework for wireless distributed computing,” IEEE/ACM Transactions on Networking, vol. 25, no. 5, pp. 2643–2654, Oct. 2017.
[72] ——, “Edge-facilitated wireless distributed computing,” IEEE GLOBECOM, Dec. 2016.
[73] U. Drolia, R. Martins, J. Tan, A. Chheda, M. Sanghavi, R. Gandhi, and P. Narasimhan, “The case for mobile edge-clouds,” in IEEE 10th International Conference on Ubiquitous Intelligence and Computing (UIC), 2013, pp. 209–215.
[74] D. Datla, X. Chen, T. Tsou, S. Raghunandan, S. S. Hasan, J. H. Reed, C. B. Dietrich, T. Bose, B. Fette, and J.-H. Kim, “Wireless distributed computing: a survey of research challenges,” IEEE Commun. Mag., vol. 50, no. 1, pp. 144–152, 2012.
[75] G. Huerta-Canepa and D. Lee, “A virtual cloud computing provider for mobile devices,” in Proceedings of the 1st ACM Workshop on Mobile Cloud Computing & Services: Social Networks and Beyond, 2010, p. 6.
[76] S. Barbarossa, S. Sardellitti, and P. Di Lorenzo, “Communicating while computing: Distributed mobile cloud computing over 5G heterogeneous networks,” IEEE Signal Processing Magazine, vol. 31, no. 6, pp. 45–55, 2014.
[77] S. Khalili and O. Simeone, “Inter-layer per-mobile optimization of cloud mobile computing: A message-passing approach,” e-print arXiv:1509.01596, 2015, submitted to IEEE Transactions on Communications.
[78] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, “Improving I/O forwarding throughput with data compression,” in IEEE International Conference on Cluster Computing, July 2011, pp. 438–445.
[79] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Communication-aware computing for edge processing,” in 2017 IEEE International Symposium on Information Theory (ISIT), June 2017, pp. 2885–2889.
[80] ——, “Architectures for coded mobile edge computing,” Fog World Congress, 2017.
[81] M. T. Beck, M. Werner, S. Feld, and S. Schimper, “Mobile edge computing: A taxonomy,” in Proc. of the 6th International Conference on Advances in Future Internet. IARIA, 2014.
[82] A. Ahmed and E. Ahmed, “A survey on mobile edge computing,” in Proc. of the 10th International Conference on Intelligent Systems and Control (ISCO). IEEE, 2016, pp. 1–8.
[83] Y. C. Hu, M. Patel, D. Sabella, N. Sprecher, and V. Young, “Mobile edge computing - a key technology towards 5G,” ETSI White Paper, vol. 11, 2015.
[84] S. Yi, C. Li, and Q. Li, “A survey of fog computing: concepts, applications and issues,” in Proc. of the 2015 Workshop on Mobile Big Data. ACM, 2015, pp. 37–42.
[85] H. T. Dinh, C. Lee, D. Niyato, and P. Wang, “A survey of mobile cloud computing: architecture, applications, and approaches,” Wireless communications and mobile computing, vol. 13, no. 18, 2013.
[86] K. Kumar, J. Liu, Y.-H. Lu, and B. Bhargava, “A survey of computation offloading for mobile systems,” Mobile Networks and Applications, vol. 18, no. 1, pp. 129–140, 2013.
[87] I. Shomorony and A. S. Avestimehr, “Degrees of freedom of two-hop wireless networks: Everyone gets the entire cake,” IEEE Transactions on Information Theory, vol. 60, no. 5, pp. 2417–2431, 2014.
[88] T. Gou, S. A. Jafar, C. Wang, S.-W. Jeon, and S.-Y. Chung, “Aligned interference neutralization and the degrees of freedom of the 2×2×2 interference channel,” IEEE Transactions on Information Theory, vol. 58, no. 7, pp. 4381–4395, 2012.
[89] S. Sardellitti, G. Scutari, and S. Barbarossa, “Joint optimization of radio and computational resources for multicell mobile-edge computing,” IEEE Transactions on Signal and Information Processing over Networks, vol. 1, no. 2, pp. 89–103, 2015.
[90] C. You, K. Huang, H. Chae, and B.-H. Kim, “Energy-efficient resource allocation for mobile-edge computation offloading,” IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1397–1411, 2017.
[91] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded distributed computing: Straggling servers and multistage dataflows,” 54th Allerton Conference, Sept. 2016.
[92] S. Li, Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded distributed computing: Fundamental limits and practical challenges,” 50th Asilomar Conference, pp. 509–513, Nov. 2016.
[93] A. Reisizadeh, S. Prakash, R. Pedarsani, and S. Avestimehr, “Coded computation over heterogeneous clusters,” IEEE ISIT, pp. 2408–2412, 2017.
[94] M. Kiamari, C. Wang, and A. S. Avestimehr, “On heterogeneous coded distributed computing,” IEEE GLOBECOM, Dec. 2017.
[95] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “A unified coding framework for distributed computing with straggling servers,” IEEE NetCod, Dec. 2016.
[96] ——, “Coding for distributed fog computing,” IEEE Communications Magazine, vol. 55, no. 4, pp. 34–40, Apr. 2017.
[97] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” NIPS Workshop on Machine Learning Systems, Dec. 2015.
[98] B. C. Arnold, N. Balakrishnan, and H. N. Nagaraja, A first course in order statistics. Siam, 1992, vol. 54.