Coded Computing: A Transformative Framework for Resilient, Secure, Private, and Communication Efficient Large Scale Distributed Computing

by Qian Yu

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2020

Copyright 2020 Qian Yu

To My Dearest Parents.

Acknowledgements

First, I would like to express my deepest gratitude to my advisor, Prof. Salman Avestimehr, for his support and guidance over my years at USC. Ever since I joined Prof. Avestimehr's research group, we have had individual meetings on a weekly basis. He always comes up with explicit and detailed suggestions on how to present and organize research results. He taught me how to convert a proof into a formal publication, a process that many beginning researchers struggle with and find difficult to learn on their own; I consider it the best possible support one can have from an advisor. Prof. Avestimehr also constantly encouraged me to keep the big picture in mind and to focus on one good direction whenever I got too distracted by other technically interesting problems.

I would like to thank Dr. Mohammad Ali Maddah-Ali, with whom I have collaborated on several papers on caching and coded computing, for his insightful comments connecting different research problems. Whenever we solved an open problem, he often came up with new directions where the developed techniques could be applied as a solution.

I would also like to thank my labmate, Dr. Songze Li, for his advice on research and life. We first met during an MHI research festival, where I was immediately attracted by an open problem mentioned in his presentation, on characterizing a fundamental tradeoff between communication and computation in distributed computing.
I was given the opportunity to work on that problem and developed a communication lower bound that closed the remaining gap, which became the first result of my PhD studies.

I am extremely grateful to my intern host Dr. Ananda Theertha Suresh and my colleague Dr. Aditya Krishna Menon at Google, for the valuable internship experience and the opportunity to work and collaborate with researchers with industry backgrounds. I would also like to thank my colleague Dr. Lin Chen for introducing some very interesting open problems on submodular and online convex optimization, and for his patience in reading my handwritten proofs.

Thanks to my thesis committee members Prof. Haipeng Luo, Prof. Antonio Ortega, and Prof. Mahdi Soltanolkotabi, and my qualifying exam committee members Prof. Paul Bogdan, Prof. Michael J. Neely, Prof. Antonio Ortega, and Prof. Meisam Razaviyayn, for their valuable time, feedback, and suggestions. Thanks to Prof. Solomon W. Golomb, Prof. Robert A. Scholtz, Prof. Antonio Ortega, Prof. Salman Avestimehr, Prof. P. Vijay Kumar, Prof. Sami Assaf, Prof. Mahdi Soltanolkotabi, Prof. Larry Goldstein, and Prof. Fedor Malikov at USC for their wonderful lectures. Thanks to my collaborators Seyed Mohammadreza Mousavi Kalan, Hannah Lawrence, Dr. Lin Chen, Dr. Hossein Esfandiari, Dr. Thomas Fu, Dr. Vahab Mirrokni, Prof. Muriel Médard, Prof. Mahdi Soltanolkotabi, Prof. Amin Karbasi, and Prof. Netanel Raviv; it has been a great pleasure working with them. Last but not least, I would like to thank my friends and colleagues Eyal En Gad, Navid Naderializadeh, Mehrdad Kiamari, Saurav Prakash, Chien-Sheng Yang, Mingchao Fisher Yu, Xiaohan Wei, and Ke Wu, for their advice and support.

Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Overview of Coded Computing for Resilient, Secure, and Private Distributed Computing
  1.2 Overview of Coded Computing for Communication Efficiency
  1.3 Main Contributions

I Optimal Codes for Straggler Mitigation

2 Polynomial Codes: an Optimal Design for High-Dimensional Coded Matrix Multiplication
  2.1 System Model, Problem Formulation, and Main Result
    2.1.1 Problem Formulation
    2.1.2 Main Result
  2.2 Polynomial Code and Its Optimality
    2.2.1 Motivating Example
    2.2.2 Polynomial Code
    2.2.3 Optimality of Polynomial Code for Recovery Threshold
    2.2.4 Optimality of Polynomial Code for Other Performance Metrics
  2.3 Experiment Results
  2.4 Concluding Remarks and Polynomial Coded Computing

3 Polynomial Codes for Distributed Convolution
  3.1 Problem Formulation and Main Results
  3.2 Proof of Theorem 3.1
  3.3 Proof of Theorem 3.2
  3.4 Proof of Theorem 3.3

4 Entangled Polynomial Codes: Optimal Codes for Block Matrix Multiplication
  4.1 System Model and Problem Formulation
  4.2 Main Results
  4.3 Entangled Polynomial Code
    4.3.1 Illustrating Example
    4.3.2 General Coding Design
    4.3.3 Computational Complexities
  4.4 Converses
    4.4.1 Matching Converses for Linear Codes
    4.4.2 Information Theoretic Converse for Nonlinear Codes
  4.5 Factor-of-2 Characterization of the Optimum Linear Recovery Threshold
    4.5.1 Computational Complexities
  4.6 Concluding Remarks

5 Coded Fourier Transform
  5.1 System Model and Main Results
  5.2 Coded FFT: the Optimal Computation Strategy
    5.2.1 Motivating Example
    5.2.2 General Description of Coded FFT
    5.2.3 Decoding Complexity of Coded FFT
  5.3 Optimality of Coded FFT
  5.4 n-dimensional Coded FFT
    5.4.1 System Model and Main Results
    5.4.2 General Description of n-dimensional Coded FFT
    5.4.3 Optimality of n-dimensional Coded FFT
  5.5 Coded FFT with Multiple Inputs
    5.5.1 System Model and Main Results
    5.5.2 General Description of Coded FFT with Multiple Inputs
    5.5.3 Optimality of Coded FFT with Multiple Inputs
  5.6 Concluding Remarks

II Optimal Codes for Secure and Private Computation

6 Lagrange Coded Computing: Optimal Design for Resiliency, Security, and Privacy
  6.1 Problem Formulation and Examples
  6.2 Main Results and Prior Works
    6.2.1 LCC vs. Prior Works
  6.3 Lagrange Coded Computing
    6.3.1 Illustrating Example
    6.3.2 General Description
  6.4 Optimality of LCC

7 Harmonic Coding: An Optimal Linear Code for Privacy-Preserving Gradient-Type Computation
  7.1 Problem Formulation
  7.2 Main Results
  7.3 Achievability Scheme
    7.3.1 Example for K = 2, deg g = 2
    7.3.2 General Scheme
  7.4 Converse
  7.5 Conclusion
    7.5.1 Random Key Access and Extra Computing Power at Master
    7.5.2 An Optimum Scheme for Char F = deg g
8 Entangled Polynomial Codes for Secure, Private, and Batch Distributed Matrix Multiplication: Breaking the "Cubic" Barrier
  8.1 Preliminaries
  8.2 Secure, Private, Batch Distributed Matrix Multiplication and Main Results
    8.2.1 Secure Distributed Matrix Multiplication
    8.2.2 Private Distributed Matrix Multiplication
    8.2.3 Batch Distributed Matrix Multiplication
  8.3 Achievability Schemes for Secure Distributed Matrix Multiplication
  8.4 Achievability Schemes for Private Distributed Matrix Multiplication
  8.5 Achievability Schemes for Batch Distributed Matrix Multiplication

III Optimal Codes for Communication

9 Fundamental Limits of Communication for Coded Distributed Computing
  9.1 Problem Formulation
  9.2 Main Results
  9.3 Converse of Theorem 9.1
  9.4 Conclusion and Future Directions

10 The Exact Rate-Memory Tradeoff for Caching with Uncoded Prefetching
  10.1 System Model and Problem Definition
    10.1.1 System Model
    10.1.2 Problem Definition
  10.2 Main Results
  10.3 The Optimal Caching Scheme
    10.3.1 Motivating Example
    10.3.2 General Schemes
  10.4 Converse
  10.5 Extension to the Decentralized Setting
    10.5.1 System Model and Problem Formulation
    10.5.2 Exact Rate-Memory Tradeoff for Decentralized Setting
  10.6 Concluding Remarks

11 Characterizing the Rate-Memory Tradeoff in Cache Networks within a Factor of 2
  11.1 System Model and Problem Formulation
    11.1.1 System Model
    11.1.2 Problem Definition
    11.1.3 Related Works
  11.2 Main Results
  11.3 Proof of Theorem 11.1 for Peak Rate
    11.3.1 Proof of Inequality (11.7)
    11.3.2 Proof of Inequality (11.9)
  11.4 Proof of Theorem 11.2
  11.5 Conclusion

Bibliography

A Supplement to Chapter 2
  A.1 Optimality of Polynomial Code in Latency and Communication Load
    A.1.1 Proof of Theorem 2.2
    A.1.2 Proof of Theorem 2.3

B Supplement to Chapter 4
  B.1 An Equivalence Between Fault Tolerance and Straggler Mitigation
    B.1.1 Problem Formulation
    B.1.2 Proof of Theorem 4.4

C Supplement to Chapter 6
  C.1 Algorithmic Illustration of LCC
  C.2 Coding Complexities of LCC
  C.3 The Uncoded Version of LCC
  C.4 Proof of Lemma 6.1
  C.5 Optimality on the Resiliency-Security-Privacy Tradeoff for Multilinear Functions
  C.6 Optimality on the Resiliency-Privacy Tradeoff for General Multivariate Polynomials
  C.7 Proof of Lemma C.2

D Supplement to Chapter 7
  D.1 Proof of Lemma 7.1
  D.2 Proof of Lemma 7.2

E Supplement to Chapter 9
  E.1 Proof of Lemma 9.1

F Supplement to Chapter 10
  F.1 Proof of Lemma 10.2
  F.2 Minimum Peak Rate for Centralized Caching
  F.3 Proof of Theorem 10.2
    F.3.1 The Optimal Decentralized Caching Scheme
    F.3.2 Converse
  F.4 Proof of Corollary 10.2
  F.5 Upper Bounds on Decoding Complexity

G Supplement to Chapter 11
  G.1 Proof of Theorem 11.3 for Peak Rate
  G.2 Proof of Theorem 11.4
  G.3 Proof of Lemma 11.1
  G.4 Proof of Lemma 11.2
  G.5 Proof of Lemma 11.3
  G.6 Proof of Lemma G.2
  G.7 Proof of Theorem 11.1 for Average Rate
  G.8 The Exact Rate-Memory Tradeoff for the Two-User Case
  G.9 Proof of Theorem 11.3 for Average Rate
  G.10 Convexity of R_u(N, K, r) and R_{u,ave}(N, K, r)

List of Tables

6.1 A comparison between BGW-based designs and LCC. The computational complexity is normalized by that of a single evaluation of f; randomness, which refers to the number of random entries used in the encoding functions, is normalized by the length of each input X_i.

List of Figures

1.1 An illustration of coded computation. A collection of workers aims to compute a function g given an input dataset, where each worker can return an evaluation of a function f with possibly coded data assignments.
By carefully designing the coding functions (the c_i's), the final results can be efficiently recovered after computation is applied to the coded data, in the presence of stragglers, while providing security and privacy against malicious and colluding workers.

2.1 Overview of the distributed matrix multiplication framework. Coded data are initially stored distributedly at the N workers according to the data assignment. Each worker computes the product of the two stored matrices and returns it to the master. By carefully designing the computation strategy, the master can decode given the computing results from a subset of workers, without having to wait for the stragglers (worker 1 in this example).

2.2 Illustration of (a) the 1D MDS code, and (b) the product code.

2.3 Comparison of the recovery thresholds achieved by the proposed polynomial code and the state of the art (1D MDS code [1] and product code [2]), where each worker can store a 1/10 fraction of each input matrix. The polynomial code attains the optimum recovery threshold K* and significantly improves upon the state of the art.

2.4 Example using the polynomial code, with 5 workers that can each store half of each input matrix. (a) Computation strategy: each worker i stores A_0 + iA_1 and B_0 + i^2 B_1, and computes their product. (b) Decoding: the master waits for results from any 4 workers, and decodes the output using a fast polynomial interpolation algorithm.

2.5 Comparison of the polynomial code and the uncoded scheme. We implement both using Python and the mpi4py library and deploy them on an Amazon EC2 cluster of 18 instances. We measure the computation latency of both algorithms and plot their CCDF. The polynomial code reduces the tail latency by 37%, even taking the decoding overhead into account.
4.1 Overview of the distributed matrix multiplication problem. Each worker computes the product of the two stored encoded submatrices (Ã_i and B̃_i) and returns the result to the master. By carefully designing the coding strategy, the master can decode the multiplication result of the input matrices from a subset of workers, without having to wait for stragglers (worker 1 in this example).

4.2 Comparison of the recovery thresholds achieved by the uncoded repetition scheme, the random linear code, the short-MDS (or short-dot) code [1,3], and our proposed entangled polynomial code, given problem parameters p = m = 3, n = 1. The entangled polynomial code orderwise improves upon all other approaches. It also achieves the optimum linear recovery threshold in this scenario.

4.3 Example using the entangled polynomial code, with 5 workers that can each store half of each input matrix. (a) Computation strategy: each worker i stores A_0 + iA_1 and iB_0 + B_1, and computes their product. (b) Decoding: the master waits for results from any 3 workers, and decodes the output using polynomial interpolation.

5.1 Overview of the distributed Fourier transform framework. Coded data are initially stored distributedly at the N workers according to the data assignment. Each worker computes an intermediate result based on its stored vector and returns it to the master. By designing the computation strategy, the master can decode given the computing results from a subset of workers, without having to wait for the stragglers (worker 1 in this example).

5.2 Example using coded FFT, with 3 workers that can each store and process half of the input. (a) Computation strategy: each worker i stores a linear combination of the interleaved versions of the input according to an MDS code, and computes its DFT.
(b) Decoding: the master waits for results from any 2 workers, and recovers the final output by first decoding the MDS code, then computing the transformed vector following steps similar to those in the Cooley-Tukey algorithm.

6.1 An overview of the problem considered in this chapter, where the goal is to evaluate a not-necessarily-linear function f on a given dataset X = (X_1, X_2, ..., X_K) using N workers. Each worker applies f to a possibly coded version of the inputs (denoted by X̃_i's). By carefully designing the coding strategy, the master can decode all the required results from a subset of workers, in the presence of stragglers (workers s_1, ..., s_S) and Byzantine workers (workers m_1, ..., m_A), while keeping the dataset private from colluding workers (workers c_1, ..., c_T).

7.1 An overview of the framework considered in this chapter. The goal is to design a privacy-preserving coded computing scheme for any gradient-type function, using the minimum possible number of workers. The master aims to recover f(X_1, ..., X_K) = g(X_1) + ... + g(X_K) given an input dataset X = (X_1, ..., X_K), for a not-necessarily-linear function g. Each worker i takes a coded version of the inputs (denoted by X̃_i) and computes g(X̃_i). By carefully designing the coding strategy, the master can decode the function given the computing results from the workers, while keeping the dataset private from any curious worker (workers 3 and N in this example).

9.1 Illustration of a two-stage distributed computing framework. The overall computation is decomposed into computing a set of Map and Reduce functions.

10.1 Caching system considered in this chapter. The figure illustrates the case where K = N = 3 and M = 1.

10.2 Numerical comparison between the optimal tradeoff and the state of the art for the centralized setting.
Our results strictly improve upon the prior art in both achievability and converse, for both average rate and peak rate.

10.3 A caching system with 6 users, 3 files, a local cache size of 1 file at each user, and a demand where each file is requested by 2 users.

10.4 Dividing D into 5 types, for a caching problem with 4 files and 4 users.

10.5 Numerical comparison between the optimal tradeoff and the state of the art for the decentralized setting. Our results strictly improve upon the prior art in both achievability and converse, for both average rate and peak rate.

10.6 Achievable peak communication rates for centralized schemes that allow coded prefetching. For N = 20, K = 40, we compare our proposed achievability scheme with prior-art coded-prefetching schemes [4,5], prior-art converse bounds [6–8], and two recent results [9,10]. The achievability scheme presented in this chapter achieves the best performance to date in most cases, and is within a factor of 2 of optimal as shown in [9], even compared with schemes that allow coded prefetching.

11.1 Numerical comparison among the two converse bounds presented in Theorem 11.2 and Theorem 11.4, and the upper bound achieved in [11]. Our converse bounds tightly characterize the peak rate-memory tradeoff in all three presented scenarios.

Abstract

Modern computing applications often require handling massive amounts of data in a distributed setting, where significant issues of resiliency, security, or privacy can arise. This dissertation presents new computing designs and optimality proofs that address these issues through coding and information-theoretic approaches.

The first part of this thesis focuses on a standard setup, where the computation is carried out using a set of worker nodes, each of which can store and process a fraction of the input dataset.
The goal is to find computation schemes providing optimal resiliency against stragglers, given the computation task, the number of workers, and the functions computed by the workers. Resiliency is measured in terms of the recovery threshold, defined as the minimum number of workers to wait for in order to compute the final result. We propose optimal solutions for broad classes of computation tasks, from basic building blocks such as matrix multiplication (entangled polynomial codes), Fourier transform (coded FFT), and convolution (polynomial codes), to general functions such as multivariate polynomial evaluation (Lagrange coded computing). We develop optimal computing strategies by introducing a general coding framework called "polynomial coded computing", which exploits the algebraic structure of the computation task and creates computation redundancy in a novel coded form across workers. Polynomial coded computing allows for order-wise improvements over the state of the art and significantly generalizes classical coding-theoretic results to go beyond linear computations. The encoding and decoding processes of polynomial coded computing designs can be mapped to polynomial evaluation and interpolation, which can be computed efficiently.

We then show that polynomial coded computing can be extended to provide unified frameworks that also enable security and privacy in the computation. We present optimal designs for three important problems: distributed matrix multiplication, multivariate polynomial evaluation, and gradient-type computation. We prove their optimality by developing information-theoretic and linear-algebraic converse bounding techniques.

Finally, we consider the problem of coding for communication reduction. In the context of distributed computation, we focus on a MapReduce-type framework, where the workers need to shuffle their intermediate results to finish the computation.
We aim to understand how to optimally exploit extra computing power to reduce communication, i.e., to establish a fundamental tradeoff between computation and communication. We prove a lower bound on the needed communication load for general allocations of the task assignments, by introducing a novel information-theoretic converse bounding approach. The presented lower bound exactly matches the inverse-proportional coding gain achieved by coded distributed computing schemes, completely characterizing the optimal computation-communication tradeoff. The proposed converse bounding approach strictly improves upon conventional cut-set bounds and can be widely applied to prove exact optimality results for more general settings, as well as for more classical communication problems. We also investigate a problem called coded caching, where a single server is connected to multiple users in a cache network through a shared bottleneck link. Each user has an isolated memory that can be used to prefetch content. The server then needs to deliver the users' demands efficiently in a subsequent delivery phase. We propose caching and delivery designs that improve upon the state-of-the-art schemes under both centralized and decentralized settings, for both peak and average communication rates. Moreover, by developing information-theoretic bounds, we prove that the proposed designs are exactly optimal among all schemes that use uncoded prefetching, and optimal within a factor of 2.00884 among schemes with coded prefetching.

Chapter 1

Introduction

1.1 Overview of Coded Computing for Resilient, Secure, and Private Distributed Computing

A main challenge in distributed computing is to design schemes that operate in the presence of stragglers, i.e., workers that are slow or fail to return their computing results. Commonly, stragglers are handled using "uncoded repetition", where the same computation tasks are duplicated and assigned to multiple worker machines.
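The contrast between repetition and coding can be made concrete with a toy example (a hypothetical setup for illustration, not taken from the thesis): split a job into K = 2 halves and run it on N = 4 workers. Two-fold repetition fails whenever both replicas of the same half straggle, while an MDS-style coded assignment recovers from any 2 of the 4 workers.

```python
from itertools import combinations

N = 4
# 2-fold repetition: workers 0 and 1 hold half "A"; workers 2 and 3 hold half "B"
replica = {0: "A", 1: "A", 2: "B", 3: "B"}

def repetition_ok(alive):
    # recoverable iff some surviving worker holds each half
    return {"A", "B"} <= {replica[w] for w in alive}

def coded_ok(alive):
    # MDS-style code: any 2 of the 4 coded results suffice
    return len(alive) >= 2

# enumerate every pattern of exactly 2 stragglers
patterns = list(combinations(range(N), 2))
rep_fail = sum(not repetition_ok(set(range(N)) - set(p)) for p in patterns)
cod_fail = sum(not coded_ok(set(range(N)) - set(p)) for p in patterns)
print(rep_fail, cod_fail)  # repetition fails on 2 of the 6 patterns; the coded scheme on none
```

This "any R of N workers suffice" guarantee is exactly what the recovery threshold defined later in this section formalizes.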
Coded computation has recently been proposed as an effective approach to mitigate the straggler effect, and computing strategies have been proposed for a variety of computation tasks, including matrix multiplication [1,3,14–16], convolution [14,17], Fourier transformation [18,19], element-wise multiplication [16], and multivariate polynomial evaluation [20]. The main idea of coded computing is to assign each worker data or tasks in carefully designed coded forms, such that the final result can still be recovered after possibly non-linear computation is applied to the coded data.

In a standard framework of coded computation (illustrated in Figure 1.1), we aim to design a computing scheme to compute a function g over an input dataset X using N distributed workers. Each worker computes a single evaluation of some function f, which can be viewed as a building block for computing g. A conventional approach in distributed computation is to assign each worker an uncoded fraction of the input dataset and to recover the final result from the evaluations on these uncoded assignments. We present two examples as follows. (This chapter is based on [12,13].)

Matrix multiplication (column-wise partition) [14]. Consider a scenario where the goal is to compute the product A^T B given two large matrices A ∈ F^{s×t} and B ∈ F^{s×r}. Here the input dataset is X = (A, B), and the computation task is g(A, B) = A^T B. After partitioning the input matrices column-wise into m and n submatrices of equal sizes, denoted A_0, ..., A_{m−1} and B_0, ..., B_{n−1}, the final result is essentially the collection of all mn pairwise submatrix products A_i^T B_j. If each worker can compute a single submatrix product of the same sizes, i.e., f is the multiplication of two matrices of sizes F^{(t/m)×s} and F^{s×(r/n)}, an uncoded design using K = mn workers can be constructed by assigning the workers distinct pairs (A_i, B_j) as inputs.

Polynomial evaluation [20].
Another example is to evaluate multivariate polynomials on a dataset X = (X_1, ..., X_K). Explicitly, given a general polynomial f, the goal is to compute g(X) = (f(X_1), ..., f(X_K)). If each worker can compute a single evaluation, then an uncoded design using K workers can be obtained by assigning each worker i the input X_i.

Figure 1.1: An illustration of coded computation. A collection of workers aims to compute a function g given an input dataset, where each worker can return an evaluation of a function f with possibly coded data assignments. By carefully designing the coding functions (the c_i's), the final results can be efficiently recovered after computation is applied to the coded data, in the presence of stragglers, while providing security and privacy against malicious and colluding workers.

A coded computing design that uses N workers first encodes the dataset X using N encoding functions c_1, ..., c_N, then assigns c_i(X) to each worker i as the coded input (see Figure 1.1). In the presence of stragglers, the decoder waits for a subset of the fastest workers until g(X) can be recovered from the returned results. We say a coded computing scheme achieves a recovery threshold of R if the master can correctly decode the final output given the computing results from any subset of R workers. This is an equivalent measure of the number of stragglers that can be tolerated, and it is also directly connected to other performance metrics such as fault tolerance and security against Byzantine adversaries. The goal is to design encoding functions that achieve the minimum possible recovery threshold given f, g, and N, under possible additional constraints due to privacy and coding-complexity requirements.¹

1.2 Overview of Coded Computing for Communication Efficiency

Another main challenge in distributed computation is to find algorithms that efficiently utilize communication bandwidth resources.
Also, as we distribute computations across many servers, massive amounts of data must be moved between them to execute the computational tasks, often over many iterations of a running algorithm, and this creates a substantial bandwidth bottleneck [21]. It is shown in [22–27] that coded computing can also significantly mitigate this communication bottleneck, by injecting structured computation redundancy that enables coded multicasting opportunities when shuffling intermediate variables, reducing the needed communication load.

In particular, we consider a general setup where a set of $K$ distributed computing nodes is connected through a shared communication link. During a shuffling stage, there is a list of intermediate variables to be exchanged between the computing nodes, each stored by a subset of nodes and needed by another subset. Each node can send a multicast message to the rest of the computing nodes based on its locally available information. Due to the nature of the distributed computation setup, we usually have some freedom to jointly design the communication scheme and the allocation and demands of the intermediate variables; the goal is to minimize the communication load given constraints or requirements on storage or computation. This setup also appears in a closely related problem called coded caching [28].

1.3 Main Contributions

The rest of this thesis is organized as follows. In Chapter 2, we present an optimal straggler-mitigating code for the matrix multiplication problem mentioned earlier in Section 1.1. The solution of this problem is generalized to a Polynomial Coded Computing (PCC) framework, which leads to optimal codes for various computations, including convolution (Chapter 3), block matrix multiplication (Chapter 4), discrete Fourier transformation (Chapter 5), and multivariate polynomial evaluation (Chapter 6 in Part II). We also prove the optimality of these codes.
¹To ensure that the complexities of the encoding and decoding functions are reasonably low, most related works focus on linear coding designs.

In the second part of this thesis, we take two other important aspects of computation, security and privacy, into account and demonstrate that the optimality of the codes we presented for straggler mitigation also extends to these settings. We generalize our codes for two example computing scenarios, multivariate polynomial evaluation (Chapter 6) and block matrix multiplication (Chapter 7), and prove their optimality. In Chapter 8, we consider a class of functions called gradient-type computation and present new coding ideas that provide privacy of the data while using the optimum number of workers.

Finally, in the third part of this thesis, we focus on the communication aspect of distributed computation tasks. In Chapter 9, we develop a fundamental tradeoff between computation and communication in a distributed MapReduce framework, by proving a tight lower bound on communication rates that applies to general allocations of computing assignments. The proposed lower bound can also be applied to more classical communication problems such as coded caching. We develop optimal achievability codes and prove their exact optimality among designs with uncoded prefetching in Chapter 10. We then prove in Chapter 11 that they achieve the optimum communication rate within a factor of approximately 2 even when designs with coded prefetching are taken into account.

Part I: Optimal Codes for Straggler Mitigation

Chapter 2: Polynomial Codes: an Optimal Design for High-Dimensional Coded Matrix Multiplication

Matrix multiplication is one of the key building blocks underlying many data analytics and machine learning algorithms. Many such applications require massive computation and storage power to process large-scale datasets.
As a result, distributed computing frameworks such as Hadoop MapReduce [29] and Spark [30] have gained significant traction, as they enable processing of data sizes on the order of tens of terabytes and more. As we scale out computations across many distributed nodes, a major performance bottleneck is the latency incurred in waiting for the slowest nodes, or "stragglers", to finish their tasks [31]. The current approaches to mitigate the impact of stragglers involve creating some form of "computation redundancy". For example, replicating the straggling task on another available node is a common approach to deal with stragglers (e.g., [32]). However, there have been recent results demonstrating that coding can play a transformational role in creating and exploiting computation redundancy to effectively alleviate the impact of stragglers [1,17,26,33,34]. Our main result in this chapter is the development of optimal codes, named polynomial codes, to deal with stragglers in distributed high-dimensional matrix multiplication, which also provide order-wise improvement over the state of the art.

This chapter is based on [14].

More specifically, we consider a distributed matrix multiplication problem where we aim to compute $C=A^\top B$ from input matrices $A$ and $B$. As shown in Fig. 2.1, the computation is carried out using a distributed system with a master node and $N$ worker nodes that can each store a $\frac{1}{m}$ fraction of $A$ and a $\frac{1}{n}$ fraction of $B$, for some parameters $m,n\in\mathbb{N}^+$. In particular, the input matrices are column-wise evenly partitioned into $m$ and $n$ submatrices, and each worker can store a pair of matrices, each with a size equal to that of the corresponding submatrices.¹ We denote the stored matrices at each worker $i\in\{0,\ldots,N-1\}$ by $\tilde{A}_i$ and $\tilde{B}_i$, which can be designed as arbitrary functions of $A$ and $B$ respectively. Each worker $i$ then computes the product $\tilde{A}_i^\top\tilde{B}_i$ and returns the result to the master.
Figure 2.1: Overview of the distributed matrix multiplication framework. Coded data are initially stored distributedly at $N$ workers according to the data assignment. Each worker computes the product of the two stored matrices and returns it to the master. By carefully designing the computation strategy, the master can decode given the computing results from a subset of workers, without having to wait for the stragglers (worker 1 in this example).

By carefully designing the computation strategy at each worker (i.e., designing $\tilde{A}_i$ and $\tilde{B}_i$), the master only needs to wait for the fastest subset of workers before recovering the output $C$, hence mitigating the impact of stragglers. Given a computation strategy, we define its recovery threshold as the minimum number of workers that the master needs to wait for in order to compute $C$. In other words, if any subset of the workers with size no smaller than the recovery threshold finish their jobs, the master is able to compute $C$. Given this formulation, we are interested in the following main problem. What is the minimum possible recovery threshold for distributed matrix multiplication? Can we find an optimal computation strategy that achieves the minimum recovery threshold, while allowing efficient decoding of the final output at the master node?

¹This setting will be generalized in Chapter 4 for block-wise partitioning of the input matrices, to provide a flexible tradeoff between storage, computation, and communication.

There have been two computing schemes proposed earlier for this problem that leverage ideas from coding theory. The first one, introduced in [1] and extended in [2], injects redundancy into only one of the input matrices using maximum distance separable (MDS) codes [35].² We illustrate this approach, referred to as the one-dimensional MDS code (1D MDS code), using the example shown in Fig. 2.2a, where we aim to compute $C=A^\top B$ using 3 workers that can each store half of $A$ and the entire $B$.
The 1D MDS code evenly divides $A$ along the columns into two submatrices denoted by $A_0$ and $A_1$, encodes them into 3 coded matrices $A_0$, $A_1$, and $A_0+A_1$, and then assigns them to the 3 workers. This design allows the master to recover the final output given the results from any 2 of the 3 workers, hence achieving a recovery threshold of 2. More generally, one can show that the 1D MDS code achieves a recovery threshold of
$$K_{\text{1D-MDS}} \triangleq N-\frac{N}{n}+m = \Theta(N). \quad (2.1)$$
An alternative computing scheme was recently proposed in [2] for the case of $m=n$, referred to as the product code, which instead injects redundancy into both input matrices. This coding technique has also been proposed earlier in the context of fault-tolerant computing in [36,37]. As demonstrated in Fig. 2.2b, the product code aligns workers in a $\sqrt{N}$-by-$\sqrt{N}$ layout. $A$ is divided along the columns into $m$ submatrices, encoded using a $(\sqrt{N},m)$ MDS code into $\sqrt{N}$ coded matrices, and then assigned to the $\sqrt{N}$ columns of workers. Similarly, $\sqrt{N}$ coded matrices of $B$ are created and assigned to the $\sqrt{N}$ rows. Given the property of MDS codes, the master can decode an entire row after obtaining any $m$ results in that row; likewise for the columns. Consequently, the master can recover the final output using a peeling algorithm, iteratively decoding the MDS codes on rows and columns until the output $C$ is completely available. For example, if the 5 computing results $A_1^\top B_0$, $A_1^\top B_1$, $(A_0+A_1)^\top B_1$, $A_0^\top(B_0+B_1)$, and $A_1^\top(B_0+B_1)$ are received as demonstrated in Fig. 2.2b, the master can recover the needed results by computing $A_0^\top B_1 = (A_0+A_1)^\top B_1 - A_1^\top B_1$ and then $A_0^\top B_0 = A_0^\top(B_0+B_1) - A_0^\top B_1$. In general, one can show that the product code achieves a recovery threshold of
$$K_{\text{product}} \triangleq 2(m-1)\sqrt{N}-(m-1)^2+1 = \Theta(\sqrt{N}), \quad (2.2)$$
which significantly improves over $K_{\text{1D-MDS}}$.
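To make the 3-worker 1D MDS example concrete, the following minimal sketch (toy integer matrices; the helper names are illustrative, not from the thesis implementation) encodes $A$ as $A_0$, $A_1$, and $A_0+A_1$, and shows that a straggling worker's product can be reconstructed from the other two.

```python
# Minimal sketch of the 3-worker 1D MDS example (m=2, n=1): redundancy is
# injected only into A, and any 2 of the 3 returned products determine C.

A0 = [1, 2, 3]                 # first column block of A (s = 3)
A1 = [4, 5, 6]                 # second column block of A
B = [[1, 0], [0, 1], [1, 1]]   # B, with s = 3 rows and t = 2 columns

def col_T_mul(a, B):
    """Compute a^T B for a length-s column vector a."""
    return [sum(a[k] * B[k][j] for k in range(len(a))) for j in range(len(B[0]))]

# Encoded assignments: workers 0, 1, 2 store A0, A1, and A0 + A1 (all store B).
coded = [A0, A1, [x + y for x, y in zip(A0, A1)]]
results = [col_T_mul(a, B) for a in coded]

# Suppose worker 1 straggles: A1^T B = (A0 + A1)^T B - A0^T B.
recovered = [results[2][j] - results[0][j] for j in range(len(results[0]))]
assert recovered == col_T_mul(A1, B)
```

The same cancellation works for any single straggler, which is exactly the recovery threshold of 2 in this example.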
In this chapter, we show that, quite interestingly, the optimum recovery threshold can be far less than what the above two schemes achieve. In fact, we show that the minimum recovery threshold does not scale with the number of workers (i.e., it is $\Theta(1)$). We prove this fact by designing a novel coded computing strategy, referred to as the polynomial code, which achieves the optimum recovery threshold of $mn$ and significantly improves the state of the art. Hence, our main result is as follows. For a general matrix multiplication task $C=A^\top B$ using $N$ workers, where each worker can store a $\frac{1}{m}$ fraction of $A$ and a $\frac{1}{n}$ fraction of $B$, we propose polynomial codes that achieve the optimum recovery threshold of
$$K_{\text{poly}} \triangleq mn = \Theta(1). \quad (2.3)$$
Furthermore, the polynomial code only requires a decoding complexity that is almost linear in the input size.

²An $(n,k)$ MDS code is a linear code which transforms $k$ raw inputs into $n$ coded outputs, such that from any subset of $k$ of the outputs, the original $k$ inputs can be recovered.

Figure 2.2: Illustration of (a) the 1D MDS code [1], in an example with 3 workers that can each store half of $A$ and the entire $B$; and (b) the product code [2], in an example with 9 workers that can each store half of $A$ and half of $B$.

The main novelty and advantage of the proposed polynomial code is that, by carefully designing the algebraic structure of the encoded submatrices, we ensure that any $mn$ intermediate computations at the workers are sufficient for recovering the final matrix multiplication product at the master. This in a sense creates an MDS structure on the intermediate computations, instead of only on the encoded matrices as in prior works. Furthermore, by leveraging the algebraic structure of polynomial codes, we can map the reconstruction problem of the final output at the master to a polynomial interpolation problem (or equivalently, Reed-Solomon decoding [38]), which can be solved efficiently [39].
This mapping also bridges the rich theory of algebraic coding and distributed matrix multiplication. We prove the optimality of the polynomial code by showing that it achieves the information-theoretic lower bound on the recovery threshold, obtained by a cut-set argument (i.e., the master needs to receive at least $mn$ matrix blocks from the workers, since the final output consists of exactly $mn$ such blocks). Hence, the proposed polynomial code essentially enables a computing strategy such that, from any subset of workers that provide the minimum amount of information needed to recover $C$, the master can successfully decode the final output. As a by-product, we also prove the optimality of the polynomial code under several other performance metrics considered in previous literature: computation latency [1,2], probability of failure given a deadline [17], and communication load [22,23,27]. Finally, we implement and benchmark the polynomial code on an Amazon EC2 cluster. We measure the computation latency and empirically demonstrate its performance gain under straggler effects.

2.1 System Model, Problem Formulation, and Main Result

We consider a problem of matrix multiplication with two input matrices $A\in\mathbb{F}_q^{s\times r}$ and $B\in\mathbb{F}_q^{s\times t}$, for some integers $r$, $s$, $t$ and a sufficiently large finite field $\mathbb{F}_q$. We are interested in computing the product $C\triangleq A^\top B$ in a distributed computing environment with a master node and $N$ worker nodes, where each worker can store a $\frac{1}{m}$ fraction of $A$ and a $\frac{1}{n}$ fraction of $B$, for some parameters $m,n\in\mathbb{N}^+$ (see Fig. 2.1). We assume that at least one of the two input matrices $A$ and $B$ is tall (i.e., $s\geq r$ or $s\geq t$), because otherwise the output matrix $C$ would be rank-deficient and the problem degenerates. Specifically, each worker $i$ can store two matrices $\tilde{A}_i\in\mathbb{F}_q^{s\times\frac{r}{m}}$ and $\tilde{B}_i\in\mathbb{F}_q^{s\times\frac{t}{n}}$, computed based on arbitrary functions of $A$ and $B$ respectively. Each worker computes the product $\tilde{C}_i\triangleq\tilde{A}_i^\top\tilde{B}_i$ and returns it to the master.
The master waits only for the results from a subset of workers before proceeding to recover the final output $C$ from these products using certain decoding functions.³

2.1.1 Problem Formulation

Given the above system model, we formulate the distributed matrix multiplication problem based on the following terminology. We define the computation strategy as the $2N$ functions, denoted by
$$\boldsymbol{a}=(a_0,a_1,\ldots,a_{N-1}), \quad \boldsymbol{b}=(b_0,b_1,\ldots,b_{N-1}), \quad (2.4)$$
that are used to compute each $\tilde{A}_i$ and $\tilde{B}_i$. Specifically,
$$\tilde{A}_i=a_i(A), \quad \tilde{B}_i=b_i(B), \quad \forall\, i\in\{0,1,\ldots,N-1\}. \quad (2.5)$$
For any integer $k$, we say a computation strategy is $k$-recoverable if the master can recover $C$ given the computing results from any $k$ workers. We define the recovery threshold of a computation strategy, denoted by $k(\boldsymbol{a},\boldsymbol{b})$, as the minimum integer $k$ such that the computation strategy $(\boldsymbol{a},\boldsymbol{b})$ is $k$-recoverable.

³Note that we consider the most general model and do not impose any constraints on the decoding functions. However, any good decoding function should have relatively low computation complexity.

Using the above terminology, we define the following concept:

Definition 2.1. For a distributed matrix multiplication problem of computing $A^\top B$ using $N$ workers that can each store a $\frac{1}{m}$ fraction of $A$ and a $\frac{1}{n}$ fraction of $B$, we define the optimum recovery threshold, denoted by $K^*$, as the minimum achievable recovery threshold among all computation strategies, i.e.,
$$K^* \triangleq \min_{\boldsymbol{a},\boldsymbol{b}} k(\boldsymbol{a},\boldsymbol{b}). \quad (2.6)$$
The goal of this problem is to find the optimum recovery threshold $K^*$, as well as a computation strategy that achieves such an optimum threshold.

2.1.2 Main Result

Our main result is stated in the following theorem:

Theorem 2.1. For a distributed matrix multiplication problem of computing $A^\top B$ using $N$ workers that can each store a $\frac{1}{m}$ fraction of $A$ and a $\frac{1}{n}$ fraction of $B$, the minimum recovery threshold $K^*$ is
$$K^*=mn. \quad (2.7)$$
Furthermore, there is a computation strategy, referred to as the polynomial code, that achieves the above $K^*$ while allowing efficient decoding at the master node, i.e., with complexity equal to that of polynomial interpolation given $mn$ points.

Remark 2.1. Compared to the state of the art [1,2], the polynomial code provides order-wise improvement in terms of the recovery threshold. Specifically, the recovery thresholds achieved by the 1D MDS code [1,2] and the product code [2] scale linearly with $N$ and $\sqrt{N}$ respectively, while the proposed polynomial code achieves a recovery threshold that does not scale with $N$. Furthermore, the polynomial code achieves the optimal recovery threshold. To the best of our knowledge, this is the first optimal design proposed for the distributed matrix multiplication problem.

Remark 2.2. We prove the optimality of the polynomial code using a matching information-theoretic lower bound, which is obtained by applying a cut-set type argument around the master node. As a by-product, we can also prove that the polynomial code simultaneously achieves optimality in terms of several other performance metrics, including the computation latency [1,2], the probability of failure given a deadline [17], and the communication load [22,23,27], as discussed in Section 2.2.4.

Remark 2.3. The polynomial code not only improves the state of the art asymptotically, but also gives strict and significant improvement for any parameter values of $N$, $m$, and $n$ (see Fig. 2.3 for an example).

Figure 2.3: Comparison of the recovery thresholds achieved by the proposed polynomial code and the state of the art (1D MDS code [1] and product code [2]), where each worker can store a $\frac{1}{10}$ fraction of each input matrix. The polynomial code attains the optimum recovery threshold $K^*$, and significantly improves the state of the art.

Remark 2.4.
As we will discuss in Section 2.2.2, decoding the polynomial code can be mapped to a polynomial interpolation problem, which can be solved in time almost linear in the input size [39]. This is enabled by carefully designing the computing strategies at the workers, such that the computed products form a Reed-Solomon code [40], which can be decoded efficiently using any polynomial interpolation algorithm or Reed-Solomon decoding algorithm that provides the best performance depending on the problem scenario (e.g., [41]).

Remark 2.5. In this chapter we focus on designing optimal coding techniques to handle straggler issues. The same technique can also be applied to the fault-tolerant computing setting (e.g., within the algorithmic fault-tolerant computing framework of [36,37], where a module can produce arbitrarily erroneous results under failure), to improve robustness to failures in computing. Given that the polynomial code produces computing results that are coded by a Reed-Solomon code, which has the optimum Hamming distance, it allows detecting or correcting the maximum possible number of module errors. Specifically, the polynomial code can robustly detect up to $N-mn$ errors, and correct up to $\lfloor\frac{N-mn}{2}\rfloor$ errors. This provides the first optimum code for matrix multiplication under fault-tolerant computing.

2.2 Polynomial Code and Its Optimality

In this section, we formally describe the polynomial code and its decoding procedure. We then prove its optimality with an information-theoretic converse, which completes the proof of Theorem 2.1. Finally, we conclude this section with the optimality of the polynomial code under other settings.

2.2.1 Motivating Example

Figure 2.4: Example using the polynomial code, with 5 workers that can each store half of each input matrix. (a) Computation strategy: each worker $i$ stores $A_0+iA_1$ and $B_0+i^2B_1$, and computes their product.
(b) Decoding: the master waits for results from any 4 workers, and decodes the output using a fast polynomial interpolation algorithm.

We first demonstrate the main idea through a motivating example. Consider a distributed matrix multiplication task of computing $C=A^\top B$ using $N=5$ workers that can each store half of each matrix (see Fig. 2.4). We evenly divide each input matrix along the column side into 2 submatrices:
$$A=[A_0\ A_1], \quad B=[B_0\ B_1]. \quad (2.8)$$
Given this notation, we essentially want to compute the following 4 uncoded components:
$$C=A^\top B=\begin{bmatrix} A_0^\top B_0 & A_0^\top B_1 \\ A_1^\top B_0 & A_1^\top B_1 \end{bmatrix}. \quad (2.9)$$
Now we design a computation strategy to achieve the optimum recovery threshold of 4. Suppose the elements of $A$ and $B$ are in $\mathbb{F}_7$; let each worker $i\in\{0,1,\ldots,4\}$ store the following two coded submatrices:
$$\tilde{A}_i=A_0+iA_1, \quad \tilde{B}_i=B_0+i^2B_1. \quad (2.10)$$
To prove that this design gives a recovery threshold of 4, we need to design a valid decoding function for any subset of 4 workers. We demonstrate this decodability through a representative scenario, where the master receives the computation results from workers 1, 2, 3, and 4, as shown in Figure 2.4. The decodability for the other 4 possible scenarios can be proved similarly. According to the designed computation strategy, we have
$$\begin{bmatrix} \tilde{C}_1 \\ \tilde{C}_2 \\ \tilde{C}_3 \\ \tilde{C}_4 \end{bmatrix} = \begin{bmatrix} 1^0 & 1^1 & 1^2 & 1^3 \\ 2^0 & 2^1 & 2^2 & 2^3 \\ 3^0 & 3^1 & 3^2 & 3^3 \\ 4^0 & 4^1 & 4^2 & 4^3 \end{bmatrix} \begin{bmatrix} A_0^\top B_0 \\ A_1^\top B_0 \\ A_0^\top B_1 \\ A_1^\top B_1 \end{bmatrix}. \quad (2.11)$$
The coefficient matrix in the above equation is a Vandermonde matrix, which is invertible because its parameters 1, 2, 3, 4 are distinct in $\mathbb{F}_7$. So one way to recover $C$ is to directly invert equation (2.11), which proves decodability. However, directly computing this inverse using a classical inversion algorithm might be expensive in more general cases.
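As a numerical sanity check of this representative scenario, the sketch below uses scalar stand-ins for the submatrices (a simplification made purely for readability; the concrete values are illustrative) and solves the Vandermonde system of equation (2.11) over $\mathbb{F}_7$ by Gauss-Jordan elimination, with inverses computed via Fermat's little theorem.

```python
# Scalar sanity check of the decoding scenario above: workers 1-4 return
# C~_i = (A0 + i*A1)(B0 + i^2*B1) over F_7, and the master solves the 4x4
# Vandermonde system of (2.11) for the four components of C.

p = 7
A0, A1, B0, B1 = 3, 5, 2, 6                       # toy scalars in F_7
truth = [A0 * B0 % p, A1 * B0 % p, A0 * B1 % p, A1 * B1 % p]

def worker(i):
    return (A0 + i * A1) * (B0 + i * i * B1) % p  # C~_i, as in (2.10)

def solve_mod(M, y, p):
    """Solve M v = y over F_p by Gauss-Jordan elimination."""
    n = len(M)
    aug = [row[:] + [yi] for row, yi in zip(M, y)]
    for c in range(n):
        r = next(r for r in range(c, n) if aug[r][c] % p)   # pivot row
        aug[c], aug[r] = aug[r], aug[c]
        inv = pow(aug[c][c], p - 2, p)                      # Fermat inverse
        aug[c] = [x * inv % p for x in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] % p:
                aug[r] = [(x - aug[r][c] * z) % p for x, z in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

# The Vandermonde system of (2.11), for workers i = 1, 2, 3, 4.
V = [[pow(i, e, p) for e in range(4)] for i in (1, 2, 3, 4)]
y = [worker(i) for i in (1, 2, 3, 4)]
assert solve_mod(V, y, p) == truth   # recovers [A0B0, A1B0, A0B1, A1B1]
```

Direct inversion works here because the 4-by-4 system is tiny; the interpolation view discussed next avoids this cost in general.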
Quite interestingly, because of the algebraic structure we designed for the computation strategy (i.e., equation (2.10)), the decoding process can be viewed as a polynomial interpolation problem (or equivalently, decoding a Reed-Solomon code). Specifically, in this example each worker $i$ returns
$$\tilde{C}_i=\tilde{A}_i^\top\tilde{B}_i=A_0^\top B_0+iA_1^\top B_0+i^2A_0^\top B_1+i^3A_1^\top B_1, \quad (2.12)$$
which is essentially the value of the following polynomial at the point $x=i$:
$$h(x)\triangleq A_0^\top B_0+xA_1^\top B_0+x^2A_0^\top B_1+x^3A_1^\top B_1. \quad (2.13)$$
Hence, recovering $C$ using the computation results from 4 workers is equivalent to interpolating a 3rd-degree polynomial given its values at 4 points. Later in this section, we will show that by mapping the decoding process to polynomial interpolation, we can achieve almost-linear decoding complexity.

2.2.2 Polynomial Code

We now present the polynomial code in a general setting that achieves the optimum recovery threshold stated in Theorem 2.1 for any parameter values of $N$, $m$, and $n$. First of all, we evenly divide each input matrix along the column side into $m$ and $n$ submatrices respectively, i.e.,
$$A=[A_0\ A_1\ \ldots\ A_{m-1}], \quad B=[B_0\ B_1\ \ldots\ B_{n-1}]. \quad (2.14)$$
We then assign each worker $i\in\{0,1,\ldots,N-1\}$ a number in $\mathbb{F}_q$, denoted by $x_i$, and make sure that all $x_i$'s are distinct. Under this setting, we define the following class of computation strategies.

Definition 2.2. Given parameters $\alpha,\beta\in\mathbb{N}$, we define the $(\alpha,\beta)$-polynomial code as
$$\tilde{A}_i=\sum_{j=0}^{m-1}A_jx_i^{j\alpha}, \quad \tilde{B}_i=\sum_{j=0}^{n-1}B_jx_i^{j\beta}, \quad \forall\, i\in\{0,1,\ldots,N-1\}. \quad (2.15)$$
In an $(\alpha,\beta)$-polynomial code, each worker $i$ essentially computes
$$\tilde{C}_i=\tilde{A}_i^\top\tilde{B}_i=\sum_{j=0}^{m-1}\sum_{k=0}^{n-1}A_j^\top B_kx_i^{j\alpha+k\beta}. \quad (2.16)$$
In order for the master to recover the output given any $mn$ results (i.e., to achieve the optimum recovery threshold), we carefully select the design parameters $\alpha$ and $\beta$, while making sure that no two terms in the above formula have the same exponent of $x$.
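The distinctness of the exponents $j\alpha+k\beta$ is the crux of the parameter selection. For the choice $(\alpha,\beta)=(1,m)$ adopted below, the exponents $j+km$ sweep out exactly $\{0,\ldots,mn-1\}$, which can be checked numerically:

```python
# Quick check of the key design fact: with (alpha, beta) = (1, m), the
# exponents j + k*m for 0 <= j < m, 0 <= k < n are pairwise distinct --
# in fact, exactly {0, 1, ..., mn - 1} -- so h(x) has degree mn - 1 and
# each block A_j^T B_k sits at its own power of x.

def exponents(m, n):
    return [j + k * m for j in range(m) for k in range(n)]

for m in range(1, 7):
    for n in range(1, 7):
        assert sorted(exponents(m, n)) == list(range(m * n))
```

This is simply the base-$m$ representation of the integers $0,\ldots,mn-1$, with $j$ as the least significant digit.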
One such choice is $(\alpha,\beta)=(1,m)$, i.e., let
$$\tilde{A}_i=\sum_{j=0}^{m-1}A_jx_i^j, \quad \tilde{B}_i=\sum_{j=0}^{n-1}B_jx_i^{jm}. \quad (2.17)$$
Hence, each worker computes the value of the following degree-$(mn-1)$ polynomial at the point $x=x_i$:
$$h(x)\triangleq\sum_{j=0}^{m-1}\sum_{k=0}^{n-1}A_j^\top B_kx^{j+km}, \quad (2.18)$$
where the coefficients are exactly the $mn$ uncoded components of $C$. Since all $x_i$'s are selected to be distinct, recovering $C$ given the results from any $mn$ workers is essentially interpolating $h(x)$ using $mn$ distinct points. Since $h(x)$ has degree $mn-1$, the output $C$ can always be uniquely decoded.

In terms of complexity, this decoding process can be viewed as interpolating degree-$(mn-1)$ polynomials over $\mathbb{F}_q$ for $\frac{rt}{mn}$ times. It is well known that polynomial interpolation of degree $k$ has a complexity of $O(k\log^2 k\log\log k)$ [42]. Therefore, decoding the polynomial code only requires a complexity of $O(rt\log^2(mn)\log\log(mn))$. Furthermore, this complexity can be reduced by simply swapping in any faster polynomial interpolation algorithm or Reed-Solomon decoding algorithm.

Remark 2.6. We can naturally extend the polynomial code to the scenario where the input matrix elements are real or complex numbers. In a practical implementation, to avoid handling large elements in the coefficient matrix, we can first quantize the input values into numbers of finitely many digits, embed them into a finite field that covers the range of possible values of the output matrix elements, and then directly apply the polynomial code. By embedding into finite fields, we avoid large intermediate computing results, which effectively saves storage and computation time, and reduces numerical errors.

2.2.3 Optimality of Polynomial Code for Recovery Threshold

So far we have constructed a computing scheme that achieves a recovery threshold of $mn$, which upper bounds $K^*$. To complete the proof of Theorem 2.1, here we establish a matching lower bound through an information-theoretic converse.
We need to prove that for any computation strategy, the master needs to wait for at least $mn$ workers in order to recover the output. Recall that at least one of $A$ and $B$ is a tall matrix. Without loss of generality, assume $A$ is tall (i.e., $s\geq r$). Let $A$ be an arbitrary fixed full-rank matrix, and let $B$ be sampled from $\mathbb{F}_q^{s\times t}$ uniformly at random. It is easy to show that $C=A^\top B$ is uniformly distributed on $\mathbb{F}_q^{r\times t}$. This means that the master essentially needs to recover a random variable with entropy $H(C)=rt\log_2 q$ bits. Note that each worker returns $\frac{rt}{mn}$ elements of $\mathbb{F}_q$, providing at most $\frac{rt}{mn}\log_2 q$ bits of information. Consequently, using a cut-set bound around the master, we can show that results from at least $mn$ workers need to be collected, and thus we have $K^*\geq mn$.

Remark 2.7 (Random Linear Code). We conclude this subsection by noting that another computation design is to let each worker store two random linear combinations of the input submatrices. Although this design can achieve the optimal recovery threshold with high probability, it creates a large coding overhead and requires high decoding complexity (e.g., $O(m^3n^3+mnrt)$ using the classical inversion decoding algorithm). Compared to the random linear code, the proposed polynomial code achieves the optimum recovery threshold deterministically, with a significantly lower decoding complexity.

2.2.4 Optimality of Polynomial Code for Other Performance Metrics

In the previous subsection, we proved that the polynomial code is optimal in terms of the recovery threshold. As a by-product, we can prove that it is also optimal in terms of some other performance metrics. In particular, we consider the following 3 metrics considered in prior works, and formally establish the optimality of the polynomial code for each of them. Proofs can be found in Appendix A.1.
Computation latency is considered in models where the computation time $T_i$ of each worker $i$ is a random variable with a certain probability distribution (e.g., [1,2]). The computation latency is defined as the amount of time required for the master to collect enough information to decode $C$.

Theorem 2.2. For any computation strategy, the computation latency $T$ is always no less than the latency achieved by the polynomial code, denoted by $T_{\text{poly}}$. Namely,
$$T\geq T_{\text{poly}}. \quad (2.19)$$

Probability of failure given a deadline is defined as the probability that the master does not receive enough information to decode $C$ by time $t$ [17].

Corollary 2.1. For any computation strategy, let $T$ denote its computation latency, and let $T_{\text{poly}}$ denote the computation latency of the polynomial code. We have
$$\mathbb{P}(T>t)\geq\mathbb{P}(T_{\text{poly}}>t) \quad \forall\, t\geq 0. \quad (2.20)$$
Corollary 2.1 directly follows from Theorem 2.2, since (2.19) implies (2.20).

Communication load is another important metric in distributed computing (e.g., [22,23,27]), defined as the minimum number of bits that need to be communicated in order to complete the computation.

Theorem 2.3. The polynomial code achieves the minimum communication load for distributed matrix multiplication, which is given by
$$L^*=rt\log_2 q. \quad (2.21)$$

2.3 Experiment Results

To examine the efficiency of our proposed polynomial code, we implement the algorithm in Python using the mpi4py library and deploy it on an AWS EC2 cluster of 18 nodes, with the master running on a c1.medium instance and 17 workers running on m1.small instances. The input matrices are randomly generated as two numpy matrices of size 4000 by 4000, and then encoded and assigned to the workers in a preprocessing stage. Each worker stores a $\frac{1}{4}$ fraction of each input matrix. In the computation stage, each worker computes the product of its assigned matrices, and then returns the result using MPI.Comm.Isend().
The master actively listens to responses from the 17 worker nodes through MPI.Comm.Irecv(), and uses MPI.Request.Waitany() to keep polling for the earliest fulfilled request. Upon receiving 16 responses, the master stops listening and starts decoding the result. To achieve the best performance, we implement an FFT-based algorithm for the Reed-Solomon decoding.

We compare our results with distributed matrix multiplication without coding.⁴ The uncoded implementation is similar, except that only 16 out of the 17 workers participate in the computation, each of them storing and processing a $\frac{1}{4}$ fraction of uncoded rows from each input matrix. The master waits for all 16 workers to return, and does not need to perform any decoding algorithm to recover the result. To simulate straggler effects in large-scale systems, we compare the computation latency of the two schemes in a setting where a randomly picked worker runs a background thread that approximately doubles its computation time. As shown in Fig. 2.5, the polynomial code can reduce the tail latency by 37% in this setting, even taking the decoding overhead into account.

⁴Due to the EC2 instance request quota limit of 20, the 1D MDS code and the product code could not be implemented in this setting, as they require at least 21 and 26 nodes respectively.

Figure 2.5: Comparison of the polynomial code and the uncoded scheme. We implement both schemes using Python and the mpi4py library and deploy them on an Amazon EC2 cluster of 18 instances. We measure the computation latency of both algorithms and plot their CCDFs. The polynomial code can reduce the tail latency by 37%, even taking the decoding overhead into account.
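The qualitative effect behind this measurement can be reproduced with a toy order-statistics model (an illustrative simulation with assumed latency distributions, not the EC2 experiment itself): the coded master needs any 16 of 17 results and can ignore the slowest worker, while the uncoded master is tied to a fixed set of 16 workers.

```python
# Toy straggler model: N = 17 workers, one random worker ~2x slower.
# Coded master: waits for the R-th fastest of all N workers (R = 16).
# Uncoded master: waits for ALL of a fixed set of 16 workers.
# All latency values here are assumed for illustration.

import random

random.seed(0)
N, R = 17, 16

def one_trial():
    times = [1.0 + 0.1 * random.random() for _ in range(N)]  # base latencies
    times[random.randrange(N)] *= 2.0                        # one straggler
    coded = sorted(times)[R - 1]      # latency of the R-th fastest worker
    uncoded = max(times[:R])          # must wait for all 16 fixed workers
    return coded, uncoded

trials = [one_trial() for _ in range(5000)]
avg_coded = sum(c for c, _ in trials) / len(trials)
avg_uncoded = sum(u for _, u in trials) / len(trials)
assert avg_coded < avg_uncoded        # coding hides the straggler's tail
```

In this model the uncoded scheme is hit by the straggler with probability 16/17, while the coded scheme never is, mirroring the tail-latency gap in Fig. 2.5.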
2.4 Concluding Remarks and Polynomial Coded Computing

A key challenge in coded computing is to design coding functions that still ensure efficient recovery of the final results after non-linear computation is applied on the coded variables. We conclude this chapter by introducing a coding framework called Polynomial Coded Computing (PCC), which can be applied beyond matrix multiplication to general coded computing problems.

We formally introduce PCC as follows. Given a standard coded computing problem defined in Section 1.1, a general PCC design encodes the input dataset by assigning the workers evaluations of a carefully designed univariate polynomial. More specifically, the coding design is based on the following design parameters:

• A univariate polynomial $c(\cdot)$, where the coefficients are possibly random functions of the input variables.

• $N$ evaluation points, denoted $y_1,\ldots,y_N$, from the base field.

Then each worker $i$ obtains $c_i(X)=c(y_i)$ as the encoded variable. After the workers apply $f$ to their assignments, they essentially evaluate the composed polynomial $f(c)$ at the same points. Hence, if the evaluation points $y_1,\ldots,y_N$ are distinct, and the decoder receives results from sufficiently many workers (at least the degree of $f(c)$ plus one), the decoder can recover all information about the polynomial $f(c)$. Based on this observation, the design problem in the PCC framework is to construct a polynomial $c$ satisfying⁵

• Decodability: the final result can be computed from the coefficients of $f(c)$, while minimizing the degree of $f(c)$ to achieve the minimum recovery threshold.

In the following chapters, we demonstrate that PCC can be applied to construct optimal codes for large classes of functions, including block matrix multiplication, convolution, and polynomial evaluation.

⁵As well as other possible requirements, such as complexity constraints on the encoding and decoding functions (e.g., linear codes) [14,16], and data privacy [20].
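A toy instance of the framework makes the degree bookkeeping concrete (scalar data, $f(z)=z^2$, $K=2$ inputs, and a linear interpolating encoder; all concrete values and function names here are illustrative assumptions, not a construction from the thesis):

```python
# Toy PCC instance: K = 2 scalar inputs, f(z) = z^2, and the degree-1
# encoding polynomial c with c(0) = X1 and c(1) = X2. Then f(c) has degree
# deg(f) * deg(c) = 2, so any deg(f(c)) + 1 = 3 worker results determine
# f(c); evaluating the recovered polynomial at 0 and 1 yields f(X1), f(X2).

p = 17
X1, X2 = 4, 9

def c(y):                       # encoding polynomial (linear interpolant)
    return (X1 * (1 - y) + X2 * y) % p

def f(z):
    return z * z % p

ys = [2, 3, 5]                  # distinct evaluation points for 3 workers
results = [f(c(y)) for y in ys] # worker i computes f(c(y_i))

def lagrange_eval(xs, vals, x):
    """Evaluate at x the unique low-degree polynomial through (xs, vals) mod p."""
    total = 0
    for i, (xi, vi) in enumerate(zip(xs, vals)):
        num = den = 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + vi * num * pow(den, p - 2, p)) % p
    return total

# Decode: the master reads off f(X1) = f(c)(0) and f(X2) = f(c)(1).
assert lagrange_eval(ys, results, 0) == f(X1)
assert lagrange_eval(ys, results, 1) == f(X2)
```

With $N>3$ workers at distinct points, any $N-3$ of them may straggle, which is exactly the recovery-threshold statement of the framework for this tiny instance.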
Chapter 3 Polynomial Codes for Distributed Convolution

We extend the polynomial code to the problem of distributed convolution [17]. We show that by simply reducing the convolution problem to matrix multiplication and applying the polynomial code, we strictly and orderwise improve the state of the art. Furthermore, by exploiting the computing structure of convolution, we propose a variation of the polynomial code, which reduces the recovery threshold even further, achieving the optimum recovery threshold within a factor of 2, and the exact optimum recovery threshold among computation strategies with linear encoding functions.

This chapter is based on [14,16].

3.1 Problem Formulation and Main Results

We consider a convolution task with two input vectors

$a = [a_0\ a_1\ \cdots\ a_{m-1}], \quad b = [b_0\ b_1\ \cdots\ b_{n-1}]$, (3.1)

where all the $a_i$'s and $b_i$'s are vectors of length $s$ over a sufficiently large field $\mathbb{F}_q$. We want to compute $c \triangleq a * b$ using a master and $N$ workers. Each worker can store two vectors of length $s$, which are functions of $a$ and $b$ respectively. We refer to these functions as the computation strategy. Each worker computes the convolution of its stored vectors and returns it to the master. The master only waits for the fastest subset of workers before proceeding to decode $c$. Similar to distributed matrix multiplication, we define the recovery threshold for each computation strategy. We aim to characterize the optimum recovery threshold, denoted by $K^*_{\text{conv}}$, and to find computation strategies that closely achieve this optimum threshold while allowing efficient decoding at the master.

Distributed convolution has also been studied in [17], where the coded convolution scheme was proposed. The main idea of the coded convolution scheme is to inject redundancy into only one of the input vectors using MDS codes. The master waits for enough results such that all intermediate values $a_i * b_j$ can be recovered, which allows the final output to be computed.
One can show that this coded convolution scheme is in fact equivalent to the 1D MDS-coded scheme proposed in [2]. Consequently, it achieves a recovery threshold of

$K_{\text{1D-MDS}} = N - \frac{N}{n} + m$.

Note that by simply adapting our proposed polynomial code, designed for distributed matrix multiplication, to distributed convolution, the master can recover all intermediate values $a_i * b_j$ after receiving results from any $mn$ workers, and then decode the final output. Consequently, this achieves a recovery threshold of $K_{\text{poly}} = mn$, which already strictly and significantly improves the state of the art. In this chapter, we take one step further and propose an improved computation strategy, strictly reducing the recovery threshold on top of the naive polynomial code. The result is summarized as follows:

Theorem 3.1. For a distributed convolution problem of computing $a * b$ using $N$ workers that can each store a $\frac{1}{m}$ fraction of $a$ and a $\frac{1}{n}$ fraction of $b$, we can find a computation strategy that achieves a recovery threshold of

$K_{\text{conv-poly}} \triangleq m + n - 1$. (3.2)

Furthermore, this computation strategy allows efficient decoding, i.e., with complexity equal to that of polynomial interpolation given $m + n - 1$ points.

We prove Theorem 3.1 by proposing a variation of the polynomial code, which exploits the computation structure of convolution. This new computing scheme is formally presented in Section 3.2.

Remark 3.1. Similar to distributed matrix multiplication, our proposed computation strategy provides orderwise improvement compared to the state of the art [17] in various settings. Furthermore, it achieves almost-linear decoding complexity using the fastest polynomial interpolation algorithm or the Reed-Solomon decoding algorithm.

Moreover, we characterize $K^*_{\text{conv}}$ within a factor of 2, as stated in the following theorem and proved in Section 3.3.

Theorem 3.2.
For a distributed convolution problem, the minimum recovery threshold $K^*_{\text{conv}}$ can be characterized within a factor of 2, i.e.,

$\frac{1}{2} K_{\text{conv-poly}} < K^*_{\text{conv}} \le K_{\text{conv-poly}}$. (3.3)

Finally, we prove the exact optimality of polynomial codes for convolution among computing schemes with linear encoding functions. Specifically, we say the encoding functions are linear if the coded variables assigned to the workers are computed with linear functions, and we denote the optimum recovery threshold achievable by computation strategies with linear encoding functions by $K^*_{\text{conv-linear}}$. We present our result in the following theorem, which is proved in Section 3.4.

Theorem 3.3. For the distributed convolution problem of computing $a * b$ using $N$ workers that can each store a $\frac{1}{m}$ fraction of $a$ and a $\frac{1}{n}$ fraction of $b$, the optimum recovery threshold that can be achieved using linear codes, denoted by $K^*_{\text{conv-linear}}$, is exactly characterized by the following equation:

$K^*_{\text{conv-linear}} = K_{\text{conv-poly}} \triangleq m + n - 1$. (3.4)

3.2 Proof of Theorem 3.1

In this section, we formally describe a computation strategy which achieves the recovery threshold stated in Theorem 3.1. Consider a distributed convolution problem with two input vectors

$a = [a_0\ a_1\ \cdots\ a_{m-1}], \quad b = [b_0\ b_1\ \cdots\ b_{n-1}]$, (3.5)

where the $a_i$'s and $b_i$'s are vectors of length $s$. We aim to compute $c = a * b$ using $N$ workers. In the previous literature [17], the computation strategies were designed so that the master can recover all intermediate values $a_j * b_k$. This is essentially the same computing framework used in the distributed matrix multiplication problem, so by naively applying the polynomial code (specifically, the $(1,m)$-polynomial code using the notation in Definition 2), we can achieve the corresponding optimal recovery threshold for computing all the $a_j * b_k$'s. However, the master does not need to know each individual $a_j * b_k$ in order to recover the output $c$.
To customize the coding design so as to utilize this fact, we recall the general class of computation strategies stated in Definition 2: given design parameters $\alpha$ and $\beta$, the $(\alpha,\beta)$-polynomial code lets each worker $i$ store two vectors

$\tilde{a}_i = \sum_{j=0}^{m-1} a_j x_i^{j\alpha}, \quad \tilde{b}_i = \sum_{j=0}^{n-1} b_j x_i^{j\beta}$, (3.6)

where the $x_i$'s are $N$ distinct values assigned to the $N$ workers. Recall that in the polynomial code designed for matrix multiplication, we picked values of $\alpha, \beta$ such that, in the local product, all coefficients $a_j * b_k$ are preserved as individual terms with distinct exponents of $x_i$. The fact that no two terms are combined leaves enough information for the master, so that it can decode any individual coefficient from the intermediate results. Now that decoding all individual values is no longer required in the problem of convolution, we can design a new variation of the polynomial code to further improve the recovery threshold, using design parameters $\alpha = \beta = 1$. In other words, each worker stores two vectors

$\tilde{a}_i = \sum_{j=0}^{m-1} a_j x_i^j, \quad \tilde{b}_i = \sum_{j=0}^{n-1} b_j x_i^j$. (3.7)

After computing the convolution of the two locally stored vectors, each worker $i$ returns

$\tilde{a}_i * \tilde{b}_i = \sum_{j=0}^{m-1} \sum_{k=0}^{n-1} (a_j * b_k)\, x_i^{j+k}$, (3.8)

which is essentially the value of the following degree-$(m+n-2)$ polynomial at the point $x = x_i$:

$h(x) = \sum_{j=0}^{m+n-2} \left( \sum_{k=\max\{0,\, j-m+1\}}^{\min\{j,\, n-1\}} a_{j-k} * b_k \right) x^j$. (3.9)

Using this design, instead of recovering all the $a_j * b_k$'s, the master can only recover a subspace of their linear combinations. Interestingly, we can still recover $c$ from these linear combinations, because it is easy to show that, if two values are combined in the same term $\sum_{k=\max\{0,\, j-m+1\}}^{\min\{j,\, n-1\}} a_{j-k} * b_k$ of $h(x)$, then they are also combined in the same term of $c$.
Consequently, after receiving the computing results from any $m + n - 1$ workers, the master can recover all the coefficients of $h(x)$, which allows recovering $c$. This proves that the computation strategy achieves a recovery threshold of $m + n - 1$. Similar to distributed matrix multiplication, the decoding process can be viewed as interpolating degree-$(m+n-2)$ polynomials over $\mathbb{F}_q$, $s$ times. Consequently, the decoding complexity is $O(s(m+n) \log^2 (m+n) \log\log (m+n))$, which is almost linear in the input size $s(m+n)$.

Remark 3.2. Similar to distributed matrix multiplication, we can also extend this computation strategy to the scenario where the elements of the input vectors are real or complex numbers, by quantizing all input values, embedding them into a finite field, and then directly applying our distributed convolution algorithm.

3.3 Proof of Theorem 3.2

Now we characterize the optimum recovery threshold $K^*_{\text{conv}}$ within a factor of 2, as stated in Theorem 3.2. The upper bound $K^*_{\text{conv}} \le K_{\text{conv-poly}}$ directly follows from Theorem 3.1, hence we focus on proving the lower bound on $K^*_{\text{conv}}$. We first prove the following inequality:

$K^*_{\text{conv}} \ge \max\{m, n\}$. (3.10)

Let $a$ be any fixed non-zero vector, and let $b$ be sampled from $\mathbb{F}_q^{sn}$ uniformly at random. One can easily show that the operation of convolving with $a$ is invertible, and thus the entropy of $c \triangleq a * b$ equals that of $b$, which is $sn \log_2 q$. Note that each worker $i$ returns $\tilde{a}_i * \tilde{b}_i$; since $a$ is fixed, its entropy is at most $H(\tilde{b}_i) \le s \log_2 q$. Using a cut-set bound around the master, we can show that at least $n$ results from the workers need to be collected, and thus we have $K^*_{\text{conv}} \ge n$. Similarly, we have $K^*_{\text{conv}} \ge m$, hence $K^*_{\text{conv}} \ge \max\{m, n\}$. Thus the gap between the upper and lower bounds is no larger than 2:

$K^*_{\text{conv}} \ge \max\{m, n\} \ge \frac{m+n}{2} > \frac{m+n-1}{2} = \frac{K_{\text{conv-poly}}}{2}$.
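As a sanity check on the construction in Section 3.2, the encoding, worker computation, and interpolation-based decoding can be prototyped over the reals (a sketch; the parameters $m = 3$, $n = 2$, $s = 4$ and the evaluation points are arbitrary illustrative choices):

```python
import numpy as np

# Convolution polynomial code with alpha = beta = 1, over the reals.
m, n, s, N = 3, 2, 4, 6
rng = np.random.default_rng(1)
a = rng.standard_normal(m * s)
b = rng.standard_normal(n * s)
x = np.arange(1.0, N + 1)                      # distinct evaluation points

# Encoding: worker i stores sum_j a_j x_i^j and sum_k b_k x_i^k (block-wise).
A = a.reshape(m, s)
B = b.reshape(n, s)
enc_a = np.array([sum(A[j] * xi**j for j in range(m)) for xi in x])
enc_b = np.array([sum(B[k] * xi**k for k in range(n)) for xi in x])

# Each worker convolves its two stored length-s vectors.
results = np.array([np.convolve(enc_a[i], enc_b[i]) for i in range(N)])

# Decoding: any m + n - 1 responses suffice; interpolate the degree-(m+n-2)
# vector-valued polynomial h(x) via a Vandermonde system.
K = m + n - 1
V = np.vander(x[:K], K, increasing=True)
coeffs = np.linalg.solve(V, results[:K])       # row j = coefficient of x^j

# The coefficients of h, laid out with stride s, reassemble into c = a * b.
c_hat = np.zeros(m * s + n * s - 1)
for j in range(K):
    c_hat[j * s : j * s + 2 * s - 1] += coeffs[j]

assert np.allclose(c_hat, np.convolve(a, b))
```

The reassembly step is exactly the "combined in the same term of $c$" observation: overlapping blocks of $h$'s coefficients add up to the corresponding entries of $a * b$.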
3.4 Proof of Theorem 3.3

As we have shown in Section 3.2, the recovery threshold stated in Theorem 3.3 is achievable using a polynomial coded computing design, which proves the upper bound on $K^*_{\text{conv-linear}}$. Hence, in this section we focus on proving the matching converse. Specifically, we aim to prove that given any problem parameters $m$, $n$, and $N$, for any computation strategy, if the encoding functions are linear, then its recovery threshold is at least $m + n - 1$.

We prove this by contradiction. Assume the opposite; then the master can recover $c$ using results from a subset of at most $m + n - 2$ workers. We denote this subset by $\mathcal{K}$. Obviously, we can find a partition of $\mathcal{K}$ into two subsets, denoted by $\mathcal{K}_a$ and $\mathcal{K}_b$, such that $|\mathcal{K}_a| \le m - 1$ and $|\mathcal{K}_b| \le n - 1$. Note that the encoding functions of the workers in $\mathcal{K}_a$ collaboratively and linearly map $\mathbb{F}^{ms}$ to $\mathbb{F}^{(m-1)s}$, a map which has a non-zero kernel. Hence, we can find a non-zero input vector $a$ such that all workers in $\mathcal{K}_a$ return 0. Similarly, we can find a non-zero $b$ such that all workers in $\mathcal{K}_b$ return 0. Recall that $\mathcal{K}_a \cup \mathcal{K}_b = \mathcal{K}$. Consequently, when the master receives 0 from all workers in $\mathcal{K}$, the decoding function returns $a * b$. This convolution product must be the 0 vector, given that the workers return the same results under zero inputs. However, the convolution operator has no zero-divisors: either $a$ or $b$ has to be zero, which contradicts the non-zero assumptions. Hence, the recovery threshold of any computation strategy with linear encoding is at least $m + n - 1$. This concludes the proof of Theorem 3.3.

Remark 3.3. Note that in this proof we never assumed that the base field of the convolution is finite. Hence, the optimality result stated in Theorem 3.3 holds true for any field, including the real and complex numbers.

Chapter 4 Entangled Polynomial Codes: Optimal Codes for Block Matrix Multiplication

In this chapter, we consider a general formulation of distributed matrix multiplication where the inputs are block-partitioned before being coded and sent to the workers.
We study its information-theoretic limits, and develop optimal coding designs for straggler mitigation.

This chapter is based on [16].

We consider a standard master-worker distributed setting, where a group of $N$ workers aim to collaboratively compute the product of two large matrices $A$ and $B$, and return the result $C = A^\top B$ to the master. As shown in Figure 4.1, the two input matrices are partitioned (arbitrarily) into $p$-by-$m$ and $p$-by-$n$ grids of submatrices, respectively, where all submatrices within the same input are of equal size. Each worker has a local memory that can be used to store a coded function of each matrix, denoted by $\tilde{A}_i$ and $\tilde{B}_i$, each with a size equal to that of the corresponding submatrices. The workers then multiply their two stored (coded) submatrices and return the results to the master. By carefully designing the coding functions, the master can decode the final result without having to wait for the slowest workers, which provides robustness against stragglers.

Figure 4.1: Overview of the distributed matrix multiplication problem. Each worker computes the product of its two stored encoded submatrices ($\tilde{A}_i$ and $\tilde{B}_i$) and returns the result to the master. By carefully designing the coding strategy, the master can decode the multiplication result of the input matrices from a subset of workers, without having to wait for stragglers (worker 1 in this example).

Note that by allowing different values of the parameters $p$, $m$, and $n$, we allow flexible partitioning of the input matrices, which in turn enables different utilization of system resources, e.g., the required amount of storage at each worker and the amount of communication from each worker to the master (a more detailed discussion is provided in Remark 4.3). Hence, considering the system constraints on available storage and communication resources, one
We aim to find optimal coding and computation designs for any choice of parameters p, m and n, to provide optimum straggler effect mitigation for various situations. With a careful design of the coded submatrices ˜ A i and ˜ B i at each worker, the master only needs results from the fastest workers before it can recover the final output, which effectively mitigates straggler issues. To measure the robustness against straggler effects of a given coding strategy, we use the metric recovery threshold, defined earlier in Section 1.1, which is equal to the minimum number of workers that the master needs to wait for in order to compute the output C. Given this terminology, our main problem is as follows: What is the minimum possible recovery threshold and the corresponding coding scheme, for any choice of parameters p, m, n, and N? We propose a novel coding technique, referred to as entangled polynomial code, which achieves the recovery threshold of pmn +p− 1 for all possible parameter values. The construction of the entangled polynomial code is based on the observation that when multiplying an m-by-p matrix and a p-by-n matrix, we essentially evaluate a subspace of bilinear functions, spanned by the pairwise product of the elements from the two matrices. Although potentially there are a total of p 2 mn pairs of elements, at mostpmn pairs are directly related to the matrix product, which is an order of p less. The particular structure of the proposed code entangles the input matrices to the output such that the system almost avoids unnecessary multiplications and achieves a recovery threshold in the order of pmn, while allowing robust straggler mitigation for arbitrarily large systems. This allows orderwise improvement upon conventional uncoded approaches, random linear codes, and MDS-coding type approaches for straggler mitigation [1,3]. 
The entangled polynomial code generalizes our earlier presented polynomial code for distributed matrix multiplication [14], which was designed for the special case of $p = 1$ (i.e., allowing only column-wise partitioning of matrices $A$ and $B$). However, as we move to arbitrary partitionings of the input matrices (i.e., arbitrary values of $m$, $n$, and $p$), a key challenge is to design the coding strategy at each worker such that its computation best aligns with the final computation $C$. In particular, to recover the product $C$, the master needs $mn$ components, each of which involves summing $p$ products of submatrices of $A$ and $B$. The entangled polynomial code effectively aligns the workers' computations with the master's need, which is its key distinguishing feature from the polynomial code.

We show that the entangled polynomial code achieves the optimal recovery threshold among all linear coding strategies in the cases of $m = 1$ or $n = 1$. It also achieves the optimal recovery threshold among all possible schemes within a factor of 2 when $m = 1$ or $n = 1$. Furthermore, for all partitionings of the input matrices (i.e., all values of $p$, $m$, $n$, and $N$), we characterize the optimal recovery threshold among all linear coding strategies within a factor of 2 of $R(p,m,n)$, which denotes the bilinear complexity of multiplying an $m$-by-$p$ matrix by a $p$-by-$n$ matrix (see Definition 4.3 later in the chapter). While evaluating the bilinear complexity is a well-known challenging problem in the computer science literature (see [43]), we show that the optimal recovery threshold for linear coding strategies can be approximated within a factor of 2 of this fundamental quantity. We establish this result by developing an improved version of the entangled polynomial code, which achieves a recovery threshold of $2R(p,m,n) - 1$. Specifically, this coding construction exploits the fact that any matrix multiplication problem can be converted into a problem of computing the element-wise product of two arrays of length $R(p,m,n)$.
Then we show that this augmented computing task can be optimally handled using a variation of the entangled polynomial code, and the corresponding optimal code achieves the recovery threshold $2R(p,m,n) - 1$.

Finally, we show that the coding constructions and converse bounding techniques developed for proving the above results can also be directly extended to other problems, including fault-tolerant computing, which was first studied in [36] for matrix multiplication. We provide tight characterizations of the maximum number of detectable or correctable errors.

We note that recently, another computation design named PolyDot was also proposed for distributed block matrix multiplication, achieving a recovery threshold of $m^2(2p - 1)$ for $m = n$ [15]. Both the entangled polynomial code and PolyDot are developed by extending the polynomial codes presented in earlier chapters to allow arbitrary partitioning of the input matrices. Compared with PolyDot, the entangled polynomial code achieves a strictly smaller recovery threshold of $pmn + p - 1$, by a factor of 2. More importantly, we have developed a converse bounding technique that proves the optimality of the entangled polynomial code in several cases. We have also proposed an improved version of the entangled polynomial code and characterized the optimum recovery threshold within a factor of 2 for all parameter values.

4.1 System Model and Problem Formulation

We consider a problem of matrix multiplication with two input matrices $A \in \mathbb{F}^{s \times r}$ and $B \in \mathbb{F}^{s \times t}$, for some integers $r$, $s$, $t$ and a sufficiently large field $\mathbb{F}$. 2 We are interested in computing the product $C \triangleq A^\top B$ in a distributed computing environment with a master node and $N$ worker nodes, where each worker can store a $\frac{1}{pm}$ fraction of $A$ and a $\frac{1}{pn}$ fraction of $B$, based on some integer parameters $p$, $m$, and $n$ (see Fig. 4.1). Specifically, each worker $i$ can store two coded matrices $\tilde{A}_i \in \mathbb{F}^{\frac{s}{p} \times \frac{r}{m}}$ and $\tilde{B}_i \in \mathbb{F}^{\frac{s}{p} \times \frac{t}{n}}$, computed based on $A$ and $B$ respectively.

2 Here we consider the general class of fields, which includes finite fields, the field of real numbers, and the field of complex numbers.
Each worker can compute the product $\tilde{C}_i \triangleq \tilde{A}_i^\top \tilde{B}_i$, and return it to the master. The master waits only for the results from a subset of the workers before proceeding to recover the final output $C$ using certain decoding functions.

Given the above system model, we formulate the distributed matrix multiplication problem based on the following terminology. We define the computation strategy as a collection of $2N$ encoding functions, denoted by

$a = (a_0, a_1, \ldots, a_{N-1}), \quad b = (b_0, b_1, \ldots, b_{N-1})$, (4.1)

that are used by the workers to compute each $\tilde{A}_i$ and $\tilde{B}_i$, and a class of decoding functions, denoted by

$d = \{d_{\mathcal{K}}\}_{\mathcal{K} \subseteq \{0, 1, \ldots, N-1\}}$, (4.2)

that are used by the master to recover $C$ given the results from any subset $\mathcal{K}$ of the workers. Each worker $i$ stores the matrices

$\tilde{A}_i = a_i(A), \quad \tilde{B}_i = b_i(B)$, (4.3)

and the master can compute an estimate $\hat{C}$ of the matrix $C$ using the results from a subset $\mathcal{K}$ of the workers by computing

$\hat{C} = d_{\mathcal{K}}(\{\tilde{C}_i\}_{i \in \mathcal{K}})$. (4.4)

For any integer $k$, we say a computation strategy is $k$-recoverable if the master can recover $C$ given the computing results from any $k$ workers. Specifically, a computation strategy is $k$-recoverable if for any subset $\mathcal{K}$ of $k$ workers, the final output $\hat{C}$ at the master equals $C$ for all possible input values. We define the recovery threshold of a computation strategy, denoted by $K(a, b, d)$, as the minimum integer $k$ such that the computation strategy $(a, b, d)$ is $k$-recoverable.

We aim to find a computation strategy that requires the minimum possible recovery threshold and allows efficient decoding at the master. Among all possible computation strategies, we are particularly interested in a certain class of designs, referred to as linear codes and defined as follows:

Definition 4.1.
For a distributed matrix multiplication problem of computing $A^\top B$ using $N$ workers, we say a computation strategy is a linear code given parameters $p$, $m$, and $n$, if there is a partitioning of the input matrices $A$ and $B$ where each matrix is divided into submatrices of equal size as

$A = \begin{bmatrix} A_{0,0} & A_{0,1} & \cdots & A_{0,m-1} \\ A_{1,0} & A_{1,1} & \cdots & A_{1,m-1} \\ \vdots & \vdots & \ddots & \vdots \\ A_{p-1,0} & A_{p-1,1} & \cdots & A_{p-1,m-1} \end{bmatrix}$, (4.5)

$B = \begin{bmatrix} B_{0,0} & B_{0,1} & \cdots & B_{0,n-1} \\ B_{1,0} & B_{1,1} & \cdots & B_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ B_{p-1,0} & B_{p-1,1} & \cdots & B_{p-1,n-1} \end{bmatrix}$, (4.6)

such that the encoding functions of each worker $i$ can be written as

$\tilde{A}_i = \sum_{j,k} A_{j,k}\, a_{ijk}, \quad \tilde{B}_i = \sum_{j,k} B_{j,k}\, b_{ijk}$, (4.7)

for some tensors $a$ and $b$, and the decoding function given each subset $\mathcal{K}$ can be written as

$\hat{C}_{j,k} = \sum_{i \in \mathcal{K}} \tilde{C}_i\, c_{ijk}$, (4.8)

for some tensor $c$, where $\hat{C}_{j,k}$ denotes the master's estimate of the subblock of $C$ that corresponds to $\sum_{\ell} A_{\ell,j}^\top B_{\ell,k}$. For brevity, we denote the set of linear codes by $\mathcal{L}$.

The major advantage of linear codes is that they guarantee that both the encoding and the decoding complexities of the scheme scale linearly with respect to the size of the input matrices. Furthermore, as we have proved in [14], linear codes are optimal for $p = 1$. Given the above terminology, we define the following concept.

Definition 4.2. For a distributed matrix multiplication problem of computing $A^\top B$ using $N$ workers, we define the optimum linear recovery threshold as a function of the problem parameters $p$, $m$, $n$, and $N$, denoted by $K^*_{\text{linear}}$, as the minimum achievable recovery threshold among all linear codes. Specifically,

$K^*_{\text{linear}} \triangleq \min_{(a,b,d) \in \mathcal{L}} K(a,b,d)$. (4.9)

Our goal is to characterize the optimum linear recovery threshold $K^*_{\text{linear}}$, and to find computation strategies that achieve this optimum threshold. Note that if the number of workers $N$ is too small, obviously no valid computation strategy exists, even without requiring straggler tolerance.
Hence, in the rest of the chapter, we only consider the meaningful case where $N$ is large enough to support at least one valid computation strategy. More concretely, we show that the minimum possible number of workers is given by a fundamental quantity: the bilinear complexity of multiplying an $m$-by-$p$ matrix and a $p$-by-$n$ matrix, which is formally introduced in Section 4.2. We are also interested in characterizing the minimum recovery threshold achievable using general coding strategies (including non-linear codes). Similar to Chapter 2, we define this value as the optimum recovery threshold and denote it by $K^*$.

4.2 Main Results

We state our main results in the following theorems.

Theorem 4.1. For a distributed matrix multiplication problem of computing $A^\top B$ using $N$ workers, with parameters $p$, $m$, and $n$, the following recovery threshold can be achieved by a linear code, referred to as the entangled polynomial code:

$K_{\text{entangled-poly}} \triangleq pmn + p - 1$. (4.10)

4 For $N < pmn + p - 1$, we define $K_{\text{entangled-poly}} \triangleq N$.

Remark 4.1. Compared to some other possible approaches, our proposed entangled polynomial code provides orderwise improvement in the recovery threshold (see Fig. 4.2). One conventional approach (referred to as the uncoded repetition scheme) is to let each worker store and multiply uncoded submatrices. With the additional computation redundancy through repetition, this scheme can robustly tolerate some stragglers. However, its recovery threshold,

$K_{\text{uncoded}} \triangleq N - \lfloor \tfrac{N}{pmn} \rfloor + 1$,

grows linearly with respect to the number of workers. Another approach is to let each worker store two random linear combinations of the input submatrices (referred to as the random linear code). With high probability, this achieves a recovery threshold of $K_{\text{RL}} \triangleq p^2 mn$ (see footnote 5), which does not scale with $N$. However, to calculate $C$, we need the results of at most $pmn$ submatrix multiplications.
Indeed, the lack of structure in the random coding forces the system to wait $p$ times longer than what is essentially needed. One surprising aspect of the proposed entangled polynomial code is that, due to its particular structure which aligns the workers' computations with the master's need, it avoids unnecessary multiplications of submatrices. As a result, it achieves a recovery threshold that does not scale with $N$, and is orderwise smaller than that of the random linear code. Furthermore, it allows efficient decoding at the master, which requires at most an almost-linear complexity.

Figure 4.2: Comparison of the recovery thresholds achieved by the uncoded repetition scheme, the random linear code, the short-MDS (or short-dot) code [1,3], and our proposed entangled polynomial code, given problem parameters $p = m = 3$, $n = 1$. The entangled polynomial code orderwise improves upon all other approaches. It also achieves the optimum linear recovery threshold in this scenario.

Remark 4.2. There have been several works in the prior literature investigating the $p = 1$ case [1,2,14]. For this special case, the entangled polynomial code reduces to our previously proposed polynomial code, which achieves the optimum recovery threshold $mn$ and orderwise improves upon other designs. On the other hand, there has been some investigation of matrix-by-vector multiplication [1,3], which can be viewed as the special case of $m = 1$ or $n = 1$ in our proposed problem. The short-MDS code (or short-dot) has been proposed for this setting, achieving a recovery threshold of $N - \lfloor \frac{N}{p} \rfloor + m$, which scales linearly with $N$.

5 Intuitively, because each worker returns a random linear combination of all $p^2 mn$ possible pairwise products, with high probability, the final output can be recovered from any subset of $p^2 mn$ results.
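The curves in Fig. 4.2 follow directly from the closed-form thresholds quoted above; a quick script (parameters from the figure, with arbitrarily chosen values of $N$) makes the comparison concrete:

```python
# Recovery thresholds for p = m = 3, n = 1 (the setting of Fig. 4.2),
# using the formulas stated in Remarks 4.1 and 4.2 and Theorem 4.1.
p, m, n = 3, 3, 1

def k_uncoded(N):      # uncoded repetition: N - floor(N / pmn) + 1
    return N - N // (p * m * n) + 1

def k_short_mds(N):    # short-MDS / short-dot: N - floor(N / p) + m
    return N - N // p + m

k_random_linear = p * p * m * n        # random linear code (w.h.p.): p^2 * m * n
k_entangled = p * m * n + p - 1        # entangled polynomial code: pmn + p - 1

for N in (27, 54, 81):
    print(N, k_uncoded(N), k_short_mds(N), k_random_linear, k_entangled)
```

The first two thresholds grow with $N$, while the random linear code stays at 27 and the entangled polynomial code stays at 11, matching the orderwise separation claimed in the text.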
Our proposed entangled polynomial code also strictly and orderwise improves upon that (see Fig. 4.2).

Remark 4.3. By selecting different values of the parameters $p$, $m$, and $n$, the entangled polynomial code enables different utilization of the system resources, which allows for balancing the costs due to storage and communication. In particular, one can show that a distributed implementation for multiplying $A^\top \in \mathbb{F}^{r \times s}$ and $B \in \mathbb{F}^{s \times t}$ with parameters $p$, $m$, and $n$ requires:

• Computation load at each worker (normalized by the cost of a single field operation): $O(\frac{srt}{pmn})$;

• Communication required from each worker (normalized by the size of $C$): $L \triangleq \frac{1}{mn}$;

• Storage allocated for storing each coded matrix (normalized by the sizes of $A$ and $B$, respectively): $\mu_A \triangleq \frac{1}{pm}$, $\mu_B \triangleq \frac{1}{pn}$.

If we roughly fix the computation load (specifically, fixing $pmn$ for the cubic matrix multiplication algorithm), the computing scheme exhibits the following trade-off between storage and communication:

$L \mu_A \mu_B \sim \text{constant}$. (4.11)

By designing the values of $p$, $m$, and $n$, we can operate at different points on this trade-off to account for the system's requirements (see footnote 6), while the entangled polynomial code maintains almost the same recovery threshold.

Our second result is the optimality of the entangled polynomial code when $m = 1$ or $n = 1$. Specifically, we prove that the entangled polynomial code is optimal in this scenario among all linear codes. Furthermore, if the base field $\mathbb{F}$ is finite, it also achieves the optimum recovery threshold $K^*$ within a factor of 2, with non-linear coding strategies taken into account.

Theorem 4.2. For a distributed matrix multiplication problem of computing $A^\top B$ using $N$ workers, with parameters $p$, $m$, and $n$, if $m = 1$ or $n = 1$, we have

$K^*_{\text{linear}} = K_{\text{entangled-poly}}$. (4.12)

6 For example, letting $p = 1$ minimizes the communication load $L$, and letting $n = 1$ or $m = 1$ minimizes the storage cost for storing matrix $A$ or matrix $B$, respectively.
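The trade-off (4.11) can be verified by enumeration: for a fixed computation load $pmn$, every admissible partitioning yields the same product $L \mu_A \mu_B = 1/(pmn)^2$ (the load value 12 below is an arbitrary example):

```python
from itertools import product

target = 12                      # fixed computation load pmn (arbitrary example)
vals = []
for p, m, n in product(range(1, target + 1), repeat=3):
    if p * m * n == target:
        L = 1 / (m * n)          # normalized communication per worker
        mu_A = 1 / (p * m)       # normalized storage for the coded block of A
        mu_B = 1 / (p * n)       # normalized storage for the coded block of B
        vals.append(L * mu_A * mu_B)

# L * mu_A * mu_B = 1 / (p*m*n)^2 is the same for every (p, m, n) with pmn = 12.
print(len(vals), max(vals) - min(vals) < 1e-15)
```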
Our proposed entangled polynomial code achieves the optimum linear recovery threshold in all these cases. More generally, adjusting the value of $p$ trades communication for storage; adjusting the ratio between $m$ and $n$ then allows for minimizing the overall storage cost, to account for scenarios where the sizes of the input matrices are unbalanced. Finally, by scaling $p$, $m$, and $n$ without taking the computational constraint into account, we enable flexibility in the level of distribution.

Moreover, if the base field $\mathbb{F}$ is finite,

$\frac{1}{2} K_{\text{entangled-poly}} < K^* \le K_{\text{entangled-poly}}$. (4.13)

Remark 4.4. We prove Theorem 4.2 by first exploiting the algebraic structure of matrix multiplication to develop a linear-algebraic converse for equation (4.12), and then constructing an information-theoretic converse to prove inequality (4.13). The linear-algebraic converse relies on only two properties of the matrix multiplication operation: 1) bilinearity, and 2) uniqueness of the zero element. This technique can be extended to any other bilinear operation with similar properties, such as convolution (see Theorem 3.3). On the other hand, the information-theoretic converse is obtained through a cut-set type argument, which yields a lower bound on the recovery threshold even for non-linear codes.

Our final result on the main problem is a characterization of the optimum linear recovery threshold $K^*_{\text{linear}}$ within a factor of 2 for all possible $p$, $m$, $n$, and $N$, obtained by developing an improved version of the entangled polynomial code. This characterization involves the fundamental concept of bilinear complexity [43]:

Definition 4.3. The bilinear complexity of multiplying an $m$-by-$p$ matrix and a $p$-by-$n$ matrix, denoted by $R(p,m,n)$, is defined as the minimum number of element-wise multiplications required to complete such an operation.
Rigorously, $R(p,m,n)$ denotes the minimum integer $R$ such that we can find tensors $a \in \mathbb{F}^{R \times p \times m}$, $b \in \mathbb{F}^{R \times p \times n}$, and $c \in \mathbb{F}^{R \times m \times n}$, satisfying

$\sum_i c_{ijk} \left( \sum_{j', k'} A_{j'k'}\, a_{ij'k'} \right) \left( \sum_{j'', k''} B_{j''k''}\, b_{ij''k''} \right) = \sum_{\ell} A_{\ell j} B_{\ell k}$ (4.14)

for any input matrices $A \in \mathbb{F}^{p \times m}$ and $B \in \mathbb{F}^{p \times n}$.

Using this concept, we state our result as follows.

Theorem 4.3. For a distributed matrix multiplication problem of computing $A^\top B$ using $N$ workers, with parameters $p$, $m$, and $n$, the optimum linear recovery threshold is characterized by

$R(p,m,n) \le K^*_{\text{linear}} \le 2R(p,m,n) - 1$, (4.15)

where $R(p,m,n)$ denotes the bilinear complexity of multiplying an $m$-by-$p$ matrix and a $p$-by-$n$ matrix.

Remark 4.5. The key proof idea of Theorem 4.3 is twofold. We first demonstrate a one-to-one correspondence between linear computation strategies and upper bound constructions (formally defined in Section 4.5) for the bilinear complexity, which enables converting a matrix multiplication problem into computing the element-wise product of two vectors of length $R(p,m,n)$. Then we show that an optimal computation strategy can be developed for this augmented problem, which achieves the stated recovery threshold. Similarly to this result, a factor-of-2 characterization can also be obtained for non-linear codes, as discussed in Section 4.5.

Remark 4.6. The coding construction we developed for proving Theorem 4.3 provides an improved version of the entangled polynomial code. Explicitly, given any upper bound construction for $R(p,m,n)$ with rank $R$, the coding scheme achieves a recovery threshold of $2R - 1$, while tolerating arbitrarily many stragglers. This improved version further and orderwise reduces the needed recovery threshold on top of its basic version.
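A classical upper bound construction of this kind is Strassen's: seven block products suffice for a 2-by-2 partitioning, showing $R(2,2,2) \le 7$. A sketch over the reals (standard Strassen formulas, with scalar blocks for brevity), followed by the resulting recovery thresholds:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 2))
B = rng.standard_normal((2, 2))

# Strassen's 7 products for C = A B, with scalar blocks a..h.
(a, b), (c, d) = A
(e, f), (g, h) = B
m1 = (a + d) * (e + h)
m2 = (c + d) * e
m3 = a * (f - h)
m4 = d * (g - e)
m5 = (a + b) * h
m6 = (c - a) * (e + f)
m7 = (b - d) * (g + h)
C = np.array([[m1 + m4 - m5 + m7, m3 + m5],
              [m2 + m4, m1 - m2 + m3 + m6]])
assert np.allclose(C, A @ B)

# Recovery thresholds for p = m = n = 2^k: the improved scheme's 2*7^k - 1
# eventually beats the basic scheme's 8^k + 2^k - 1 (crossover near k = 6).
for k in (1, 6, 10):
    print(k, 2 * 7**k - 1, 8**k + 2**k - 1)
```

Note that the improvement is asymptotic: for small $k$ the basic threshold $8^k + 2^k - 1$ is still smaller, and $2 \cdot 7^k - 1$ takes over once $k$ is large enough.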
For example, by simply applying the well-known Strassen construction [44], which provides the upper bound $R(2^k, 2^k, 2^k)\leq 7^k$ for any $k\in\mathbb{N}$, the proposed coding scheme achieves a recovery threshold of $2\cdot 7^k - 1$, which orderwise improves upon $K_{\text{entangled-poly}} = 8^k + 2^k - 1$ achieved by the entangled polynomial code. Further improvements can be achieved by applying constructions with lower ranks, down to $2R(p,m,n)-1$. Remark 4.7. In parallel to this work, the Generalized PolyDot scheme was proposed in [45] to extend the PolyDot construction [15] to asymmetric matrix-vector multiplication. Generalized PolyDot achieves the same recovery threshold as the entangled polynomial code in the special case of m = 1 or n = 1. However, entangled polynomial codes achieve order-wise better recovery thresholds for general values of p, m, and n. The techniques we developed in this chapter can also be extended to several other problems, such as fault-tolerant computing [36,37], leading to tight characterizations. Unlike the straggler effects we studied in this paper, fault tolerance considers scenarios where arbitrary errors can be injected into the computation, and the master has no information about which subset of workers return errors. We show that the techniques we developed for straggler mitigation can also be applied in this setting to improve robustness against computing failures, and the optimality of any encoding function in terms of recovery threshold is also preserved in the fault-tolerant computing setting. As an example, we present the following theorem, demonstrating this connection. Theorem 4.4. For a distributed matrix multiplication problem of computing $A^\intercal B$ using N workers, with parameters p, m, and n, if m = 1 or n = 1, the entangled polynomial code can detect up to
\[
E^*_{\text{detect}} = N - K_{\text{entangled-poly}} \tag{4.16}
\]
errors, and correct up to
\[
E^*_{\text{correct}} = \left\lfloor \frac{N - K_{\text{entangled-poly}}}{2} \right\rfloor \tag{4.17}
\]
errors.
This cannot be improved upon by any other linear encoding strategy. Remark 4.8. The proof idea for Theorem 4.4 is to connect the straggler mitigation problem and the fault tolerance problem by extending the concept of Hamming distance to coded computing. Specifically, we map the straggler mitigation problem to the problem of correcting erasure errors, and the fault tolerance problem to the problem of correcting arbitrary errors. The solutions to these two communication problems are deeply connected through the Hamming distance, and we show that this result extends to coded computing (see Lemma B.1 in Appendix B.1). Since the concept of Hamming distance is not exclusively defined for linear codes, this connection also holds for arbitrary computation strategies. Furthermore, this approach can be easily extended to hybrid settings where both stragglers and computing errors exist, and similar results can be proved. The detailed formulation and proof can be found in Appendix B.1. In Section 4.3, we prove Theorem 4.1 by describing the (basic version of the) entangled polynomial code. Then in Section 4.4, we prove Theorem 4.2 by deriving the converses. Finally, we present the coding construction and converse for proving Theorem 4.3 in Section 4.5.

4.3 Entangled Polynomial Code

In this section, we prove Theorem 4.1 by formally describing the entangled polynomial code and its decoding procedure. We start with an illustrating example.

4.3.1 Illustrating Example

Consider a distributed matrix multiplication task of computing $A^\intercal B$ using N = 5 workers that can each store half of the rows (i.e., p = 2 and m = n = 1). We evenly divide each input matrix along the row side into 2 submatrices:
\[
A = \begin{bmatrix} A_0 \\ A_1 \end{bmatrix}, \qquad B = \begin{bmatrix} B_0 \\ B_1 \end{bmatrix}. \tag{4.18}
\]
Given this notation, we essentially want to compute
\[
C = A^\intercal B = A_0^\intercal B_0 + A_1^\intercal B_1. \tag{4.19}
\]
A naive computation strategy is to let the 5 workers compute each $A_i^\intercal B_i$ uncodedly with repetition.
Specifically, we can let 3 workers compute $A_0^\intercal B_0$ and 2 workers compute $A_1^\intercal B_1$. However, this approach can only robustly tolerate 1 straggler, achieving a recovery threshold of 4. Another naive approach is to use random linear codes, i.e., let each worker store a random linear combination of $A_0, A_1$ and a random linear combination of $B_0, B_1$. However, the computation result of each worker is then a random linear combination of the 4 variables $A_0^\intercal B_0$, $A_0^\intercal B_1$, $A_1^\intercal B_0$, and $A_1^\intercal B_1$, which also results in a recovery threshold of 4. Surprisingly, there is a simple computation strategy for this example that achieves the optimum linear recovery threshold of 3. The main idea is to instead inject structured redundancy tailored to the matrix multiplication operation. We present this proposed strategy as follows.

[Figure 4.3: Example using the entangled polynomial code, with 5 workers that can each store half of each input matrix. (a) Computation strategy: each worker i stores $A_0 + iA_1$ and $iB_0 + B_1$, and computes their product. (b) Decoding: the master waits for results from any 3 workers, and decodes the output using polynomial interpolation.]

Suppose the elements of A and B are in $\mathbb{R}$. Let each worker $i\in\{0,1,\dots,4\}$ store the following two coded submatrices:
\[
\tilde{A}_i = A_0 + iA_1, \qquad \tilde{B}_i = iB_0 + B_1. \tag{4.20}
\]
To prove that this design gives a recovery threshold of 3, we need to find a valid decoding function for any subset of 3 workers. We demonstrate this decodability through a representative scenario, where the master receives the computation results from workers 1, 2, and 4, as shown in Figure 4.3. The decodability for the other 9 possible scenarios can be proved similarly. According to the designed computation strategy, we have
\[
\begin{bmatrix} \tilde{C}_1 \\ \tilde{C}_2 \\ \tilde{C}_4 \end{bmatrix}
=
\begin{bmatrix} 1^0 & 1^1 & 1^2 \\ 2^0 & 2^1 & 2^2 \\ 4^0 & 4^1 & 4^2 \end{bmatrix}
\begin{bmatrix} A_0^\intercal B_1 \\ A_0^\intercal B_0 + A_1^\intercal B_1 \\ A_1^\intercal B_0 \end{bmatrix}. \tag{4.21}
\]
The coefficient matrix in the above equation is a Vandermonde matrix, which is invertible because its parameters 1, 2, 4 are distinct in $\mathbb{R}$.
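The decodability claimed in this example can be checked numerically. A minimal sketch over the reals, with illustrative block sizes (2-by-3 submatrices) and the master hearing from workers {1, 2, 4}:

```python
import numpy as np

rng = np.random.default_rng(0)
A0, A1 = rng.random((2, 2, 3))   # p = 2 row-blocks of A, each 2 x 3
B0, B1 = rng.random((2, 2, 3))   # p = 2 row-blocks of B, each 2 x 3

# Worker i stores A0 + i*A1 and i*B0 + B1, and returns their product (eq. 4.20)
C_tilde = {i: (A0 + i * A1).T @ (i * B0 + B1) for i in range(5)}

# Master hears from workers {1, 2, 4}: invert the Vandermonde system (eq. 4.21)
K = [1, 2, 4]
V = np.vander(np.array(K, dtype=float), 3, increasing=True)   # rows [1, i, i^2]
coeffs = np.tensordot(np.linalg.inv(V),
                      np.stack([C_tilde[i] for i in K]), axes=(1, 0))
# coeffs[1] is the linear-term coefficient: C = A0^T B0 + A1^T B1
assert np.allclose(coeffs[1], A0.T @ B0 + A1.T @ B1)
```

Any other 3-worker subset gives a different (still invertible) Vandermonde matrix, so the same decoder works for all 10 scenarios.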
One decoding approach is therefore to directly invert equation (4.21); the returned result includes the needed matrix $C = A_0^\intercal B_0 + A_1^\intercal B_1$. This proves the decodability. However, as we will explain in the general coding design, directly computing this inverse using the classical inversion algorithm might be expensive in more general cases. Quite interestingly, because of the algebraic structure we designed for the computation strategy (i.e., equation (4.20)), the decoding process can be viewed as a polynomial interpolation problem (or, equivalently, decoding a Reed-Solomon code). Specifically, in this example each worker i returns
\[
\tilde{C}_i = \tilde{A}_i^\intercal \tilde{B}_i = A_0^\intercal B_1 + i\,(A_0^\intercal B_0 + A_1^\intercal B_1) + i^2 A_1^\intercal B_0, \tag{4.22}
\]
which is essentially the value of the following polynomial at the point x = i:
\[
h(x) \triangleq A_0^\intercal B_1 + x\,(A_0^\intercal B_0 + A_1^\intercal B_1) + x^2 A_1^\intercal B_0. \tag{4.23}
\]
Hence, recovering C using the computation results from 3 workers is equivalent to recovering the linear-term coefficient of a quadratic function given its values at 3 points. Later in this section, we will show that by mapping the decoding process to polynomial interpolation, we can achieve almost-linear decoding complexity even for arbitrary parameter values.

4.3.2 General Coding Design

Now we present the entangled polynomial code, which achieves a recovery threshold of $pmn + p - 1$ for any p, m, n, and N, as stated in Theorem 4.1. (For $N < pmn + p - 1$, a recovery threshold of N is achievable by definition; hence we focus on the case where $N \geq pmn + p - 1$.) First, we evenly divide each input matrix into pm and pn submatrices according to equations (4.5) and (4.6). We then assign each worker $i\in\{0,1,\dots,N-1\}$ an element of $\mathbb{F}$, denoted by $x_i$, making sure that all $x_i$'s are distinct. Under this setting, we define the following class of computation strategies. Definition 4.4.
Given parameters $\alpha,\beta,\theta\in\mathbb{N}$, we define the $(\alpha,\beta,\theta)$-polynomial code as
\[
\tilde{A}_i = \sum_{j=0}^{p-1}\sum_{k=0}^{m-1} A_{j,k}\, x_i^{j\alpha + k\beta}, \qquad
\tilde{B}_i = \sum_{j=0}^{p-1}\sum_{k=0}^{n-1} B_{j,k}\, x_i^{(p-1-j)\alpha + k\theta}, \qquad \forall\, i\in\{0,1,\dots,N-1\}. \tag{4.24}
\]
In an $(\alpha,\beta,\theta)$-polynomial code, each worker essentially evaluates a polynomial whose coefficients are fixed linear combinations of the products $A_{j,k}^\intercal B_{j',k'}$. Specifically, each worker i returns
\[
\tilde{C}_i = \tilde{A}_i^\intercal \tilde{B}_i = \sum_{j=0}^{p-1}\sum_{k=0}^{m-1}\sum_{j'=0}^{p-1}\sum_{k'=0}^{n-1} A_{j,k}^\intercal B_{j',k'}\, x_i^{(p-1+j-j')\alpha + k\beta + k'\theta}. \tag{4.25}
\]
Consequently, when the master receives results from enough workers, it can recover all these linear combinations using polynomial interpolation. Recall that we aim to recover
\[
C = \begin{bmatrix}
C_{0,0} & C_{0,1} & \cdots & C_{0,n-1} \\
C_{1,0} & C_{1,1} & \cdots & C_{1,n-1} \\
\vdots & \vdots & \ddots & \vdots \\
C_{m-1,0} & C_{m-1,1} & \cdots & C_{m-1,n-1}
\end{bmatrix}, \tag{4.26}
\]
where each submatrix $C_{k,k'} \triangleq \sum_{j=0}^{p-1} A_{j,k}^\intercal B_{j,k'}$ is also a fixed linear combination of these products. We design the values of the parameters $(\alpha,\beta,\theta)$ such that all these linear combinations appear in (4.25) separately, as coefficients of terms of different degrees. Furthermore, we want to minimize the degree of the polynomial $\tilde{C}_i$ in order to reduce the recovery threshold. One design satisfying these properties is $(\alpha,\beta,\theta) = (1,p,pm)$, i.e.,
\[
\tilde{A}_i = \sum_{j=0}^{p-1}\sum_{k=0}^{m-1} A_{j,k}\, x_i^{j + kp}, \qquad
\tilde{B}_i = \sum_{j=0}^{p-1}\sum_{k=0}^{n-1} B_{j,k}\, x_i^{p-1-j + kpm}. \tag{4.27}
\]
Hence, each worker i returns the value $\tilde{C}_i = h(x_i)$ of the following degree-$(pmn+p-2)$ polynomial:
\[
h(x) \triangleq \sum_{j=0}^{p-1}\sum_{k=0}^{m-1}\sum_{j'=0}^{p-1}\sum_{k'=0}^{n-1} A_{j,k}^\intercal B_{j',k'}\, x^{(p-1+j-j') + kp + k'pm}, \tag{4.28}
\]
where each $C_{k,k'}$ is exactly the coefficient of the degree-$(p-1+kp+k'pm)$ term. Since all $x_i$'s are selected to be distinct, recovering C given results from any $pmn+p-1$ workers is essentially interpolating h(x) using $pmn+p-1$ distinct points. Because the degree of h(x) is $pmn+p-2$, the output C can always be uniquely decoded.
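The general construction with $(\alpha,\beta,\theta) = (1,p,pm)$ can likewise be sketched numerically; the parameter values, block sizes, and evaluation points below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, n, N = 2, 2, 2, 12                 # illustrative parameters
Ablk = rng.random((p, m, 3, 2))          # blocks A_{j,k}, each (s/p) x (r/m)
Bblk = rng.random((p, n, 3, 2))          # blocks B_{j,k}, each (s/p) x (t/n)

xs = np.linspace(-1.0, 1.0, N)           # distinct evaluation points x_i
def worker(i):                           # encoding (4.27) and local product
    x = xs[i]
    At = sum(Ablk[j, k] * x**(j + k*p) for j in range(p) for k in range(m))
    Bt = sum(Bblk[j, k] * x**(p - 1 - j + k*p*m) for j in range(p) for k in range(n))
    return At.T @ Bt                     # = h(x_i), eq. (4.28)

K = p*m*n + p - 1                        # recovery threshold of Theorem 4.1
V = np.vander(xs[:K], K, increasing=True)
coeffs = np.tensordot(np.linalg.inv(V),  # interpolate h from any K results
                      np.stack([worker(i) for i in range(K)]), axes=(1, 0))
for k in range(m):
    for k2 in range(n):
        C_kk2 = sum(Ablk[j, k].T @ Bblk[j, k2] for j in range(p))
        assert np.allclose(coeffs[p - 1 + k*p + k2*p*m], C_kk2)
```

The sketch inverts a Vandermonde matrix for clarity; as discussed next, fast polynomial interpolation achieves the same decoding in almost-linear time.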
4.3.3 Computational Complexities

In terms of complexity, the decoding process of the entangled polynomial code can be viewed as interpolating a degree-$(pmn+p-2)$ polynomial $\frac{rt}{mn}$ times. It is well known that polynomial interpolation of degree k has a complexity of $O(k\log^2 k \log\log k)$ [42] (when the base field supports FFT, this bound improves to $O(k\log^2 k)$). Therefore, decoding the entangled polynomial code requires a complexity of at most $O(prt\log^2(pmn)\log\log(pmn))$, which is almost linear in the input size of the decoder ($\Theta(prt)$ elements). This complexity can be reduced by simply swapping in any faster polynomial interpolation or Reed-Solomon decoding algorithm. In addition, the decoding complexity can be further improved by exploiting the fact that only a subset of the coefficients are needed for recovering the output matrix. Note that, given the presented computation framework, each worker is assigned to multiply two coded matrices of sizes $\frac{r}{m}\times\frac{s}{p}$ and $\frac{s}{p}\times\frac{t}{n}$, which requires a complexity of $O(\frac{srt}{pmn})$ (more precisely, the commonly used cubic algorithm achieves a complexity of $\Theta(\frac{srt}{pmn})$ in the general case; improved algorithms have been found in certain cases, e.g., [44,46-54], but all known approaches require super-quadratic complexity). This complexity is independent of the coding design, indicating that the entangled polynomial code strictly improves upon other designs without requiring extra computation at the workers. Recall that the decoding complexity of the entangled polynomial code grows linearly with the size of the output matrix. The decoding overhead thus becomes negligible compared to the workers' computational load in practical scenarios where the sizes of the coded matrices assigned to the workers are sufficiently large. Moreover, the fast decoding algorithms enabled by the polynomial coding approach further reduce this overhead, compared to general linear coding designs. The entangled polynomial code also enables improved performance for systems where the data has to be encoded online. For instance, if the input matrices are broadcast to the workers and encoded distributedly, the linearity of the entangled polynomial code allows for an in-place algorithm, which requires no additional storage or time complexity.
Alternatively, if centralized encoding is required, almost-linear-time algorithms can be developed similarly to decoding: a complexity of at most $O\big(\big(\frac{sr}{pm}\log^2(pm)\log\log(pm) + \frac{st}{pn}\log^2(pn)\log\log(pn)\big)N\big)$ suffices using fast polynomial evaluation, which is almost linear in the output size of the encoder ($\Theta\big(\big(\frac{sr}{pm}+\frac{st}{pn}\big)N\big)$ elements).

4.4 Converses

In this section, we provide the proof of Theorem 4.2. We first prove equation (4.12) by developing a linear-algebraic converse. Then we prove inequality (4.13) through an information-theoretic lower bound.

4.4.1 Matching Converses for Linear Codes

To prove equation (4.12), we start by developing a converse bound on the recovery threshold for general parameter values; we then specialize it to the settings where m = 1 or n = 1. We state this converse bound in the following lemma. Lemma 4.1. For a distributed matrix multiplication problem with parameters p, m, n, and N, we have
\[
K^*_{\text{linear}} \geq \min\{N,\; pm + pn - 1\}. \tag{4.29}
\]
When m = 1 or n = 1, the RHS of inequality (4.29) is exactly $K_{\text{entangled-poly}}$. Hence equation (4.12) directly follows from Lemma 4.1, so it suffices to prove Lemma 4.1, which we do as follows. Proof. To prove Lemma 4.1, we only need to consider the following two scenarios: (1) If $K^*_{\text{linear}} = N$, then (4.29) is trivial. (2) If $K^*_{\text{linear}} < N$, then we essentially need to show that for any parameter values p, m, n, and N satisfying this condition, we have $K^*_{\text{linear}} \geq pm + pn - 1$.
By definition, if such a linear recovery threshold is achievable, we can find a computation strategy, i.e., tensors a, b and a class of decoding functions $d \triangleq \{d_{\mathcal{K}}\}$, such that
\[
d_{\mathcal{K}}\Big(\Big\{\Big(\sum_{j',k'} A_{j',k'}^\intercal\, a_{ij'k'}\Big)\Big(\sum_{j'',k''} B_{j'',k''}\, b_{ij''k''}\Big)\Big\}_{i\in\mathcal{K}}\Big) = A^\intercal B \tag{4.30}
\]
for any input matrices A and B, and for any subset $\mathcal{K}$ of $K^*_{\text{linear}}$ workers. We choose the values of A and B such that each $A_{j,k}$ and $B_{j,k}$ satisfies
\[
A_{j,k} = \alpha_{jk} A_c, \tag{4.31}
\]
\[
B_{j,k} = \beta_{jk} B_c, \tag{4.32}
\]
for some matrices $\alpha\in\mathbb{F}^{p\times m}$, $\beta\in\mathbb{F}^{p\times n}$ and constant matrices $A_c\in\mathbb{F}^{\frac{s}{p}\times\frac{r}{m}}$, $B_c\in\mathbb{F}^{\frac{s}{p}\times\frac{t}{n}}$ satisfying $A_c^\intercal B_c \neq 0$. Consequently, we have
\[
d_{\mathcal{K}}\Big(\Big\{\Big(\sum_{j',k'}\alpha_{j'k'}\, a_{ij'k'}\Big)\Big(\sum_{j'',k''}\beta_{j''k''}\, b_{ij''k''}\Big) A_c^\intercal B_c\Big\}_{i\in\mathcal{K}}\Big) = A^\intercal B \tag{4.33}
\]
for all possible values of $\alpha$, $\beta$, and $\mathcal{K}$. Fixing the value of i, we can view each subtensor $a_{ijk}$ as a vector of length pm, and each subtensor $b_{ijk}$ as a vector of length pn; for brevity, we denote these vectors by $\mathbf{a}_i$ and $\mathbf{b}_i$, respectively. Similarly, we can view the matrices $\alpha$ and $\beta$ as vectors of length pm and pn, denoted by $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$. Furthermore, we define dot products within these vector spaces following the usual conventions. Using these notations, (4.33) can be written as
\[
d_{\mathcal{K}}\big(\{(\boldsymbol{\alpha}\cdot\mathbf{a}_i)\,(\boldsymbol{\beta}\cdot\mathbf{b}_i)\, A_c^\intercal B_c\}_{i\in\mathcal{K}}\big) = A^\intercal B. \tag{4.34}
\]
Given the above definitions, we now prove that within each subset $\mathcal{K}$ of size $K^*_{\text{linear}}$, the vectors $\{\mathbf{a}_i\}_{i\in\mathcal{K}}$ span the space $\mathbb{F}^{pm}$. Essentially, we need to prove that for any such subset $\mathcal{K}$, there does not exist a non-zero $\alpha\in\mathbb{F}^{p\times m}$ such that the corresponding vector $\boldsymbol{\alpha}\in\mathbb{F}^{pm}$ satisfies $\boldsymbol{\alpha}\cdot\mathbf{a}_i = 0$ for all $i\in\mathcal{K}$. Assume the opposite, that such an $\alpha$ exists, so that $\boldsymbol{\alpha}\cdot\mathbf{a}_i$ is always 0; then the LHS of (4.34) becomes a fixed value. On the other hand, since $\alpha$ is non-zero, we can always find different values of $\beta$ such that $\alpha^\intercal\beta$ varies. Recalling (4.31) and (4.32), the RHS of (4.34) cannot be fixed if $\alpha^\intercal\beta$ varies, which results in a contradiction. Now we use this conclusion to prove (4.29).
For any fixed $\mathcal{K}$ of size $K^*_{\text{linear}}$, let $\mathcal{B}$ be a subset of indices in $\mathcal{K}$ such that $\{\mathbf{a}_i\}_{i\in\mathcal{B}}$ forms a basis. Recall that we are considering the case where $K^*_{\text{linear}} < N$, meaning that we can find a worker $\tilde{k}\notin\mathcal{K}$. For convenience, we define $\mathcal{K}^+ = \mathcal{K}\cup\{\tilde{k}\}$ and $\mathcal{K}^- \triangleq \mathcal{K}^+\setminus\mathcal{B}$. Obviously, $|\mathcal{B}| = pm$ and $|\mathcal{K}^-| = |\mathcal{K}^+| - |\mathcal{B}| = K^*_{\text{linear}} + 1 - pm$. Hence, it suffices to prove that $|\mathcal{K}^-| \geq pn$, which only requires that $\{\mathbf{b}_i\}_{i\in\mathcal{K}^-}$ spans $\mathbb{F}^{pn}$. Equivalently, we only need to prove that any $\beta\in\mathbb{F}^{p\times n}$ whose vectorized version $\boldsymbol{\beta}\in\mathbb{F}^{pn}$ satisfies $\boldsymbol{\beta}\cdot\mathbf{b}_i = 0$ for all $i\in\mathcal{K}^-$ must be zero. For brevity, we let $\mathsf{B}$ denote the subspace containing all values of $\beta$ satisfying this condition. To prove this statement, we first construct a list of matrices $\{\alpha_i\}_{i\in\mathcal{B}}$ as follows. Recall that $\{\mathbf{a}_i\}_{i\in\mathcal{B}}$ forms a basis. We can find a matrix $\alpha_i\in\mathbb{F}^{p\times m}$ for each $i\in\mathcal{B}$ such that their vectorized versions $\{\boldsymbol{\alpha}_i\}_{i\in\mathcal{B}}$ satisfy $\boldsymbol{\alpha}_i\cdot\mathbf{a}_{i'} = \delta_{i,i'}$ (here $\delta_{i,j}$ denotes the discrete delta function, i.e., $\delta_{i,i} = 1$ and $\delta_{i,j} = 0$ for $i\neq j$). From elementary linear algebra, the vectors $\{\boldsymbol{\alpha}_i\}_{i\in\mathcal{B}}$ also form a basis of $\mathbb{F}^{pm}$; correspondingly, their matrix versions $\{\alpha_i\}_{i\in\mathcal{B}}$ form a basis of $\mathbb{F}^{p\times m}$. For any $k\in\mathcal{B}$, we define $\mathcal{K}_k = \mathcal{K}^+\setminus\{k\}$. Note that $|\mathcal{K}_k| = K^*_{\text{linear}}$, so equation (4.34) should also hold with $\mathcal{K}_k$ in place of $\mathcal{K}$. Moreover, note that if we fix $\alpha = \alpha_k$, then the corresponding LHS of (4.34) remains fixed for any $\beta\in\mathsf{B}$. As a result, $A^\intercal B$ must also be fixed. Similar to the above discussion, this requires that the value of $\alpha_k^\intercal\beta$ be fixed; this value has to be 0 because $\beta = 0$ satisfies the stated condition. We have now proved that any $\beta\in\mathsf{B}$ must also satisfy $\alpha_k^\intercal\beta = 0$ for all $k\in\mathcal{B}$. Because $\{\alpha_k\}_{k\in\mathcal{B}}$ form a basis of $\mathbb{F}^{p\times m}$, such a $\beta$ acting on $\mathbb{F}^{p\times m}$ through the matrix product has to be the zero operator, so $\beta = 0$. As discussed above, this results in $K^*_{\text{linear}} \geq pm + pn - 1$, which completes the proof of Lemma 4.1 and equation (4.12). Remark 4.9. Note that in the above proof, we never used the condition that the decoding functions are linear. Hence, the converse does not require linearity of the decoder.
This fact will be used later in our discussion of fault-tolerant computing in Appendix B.1.

4.4.2 Information-Theoretic Converse for Nonlinear Codes

Now we prove inequality (4.13) through an information-theoretic converse bound. Similar to the proof of equation (4.12), we start by proving a general converse. Lemma 4.2. For a distributed matrix multiplication problem with parameters p, m, n, and N, if the base field $\mathbb{F}$ is finite, we have
\[
K^* \geq \max\{pm,\; pn\}. \tag{4.35}
\]
When m = 1 or n = 1, the RHS of inequality (4.35) is greater than $\frac{1}{2}K_{\text{entangled-poly}}$. Hence inequality (4.13) directly results from Lemma 4.2, which we prove as follows. Proof. Without loss of generality, we assume $m\geq n$ and aim to prove $K^*\geq pm$. Specifically, we need to show that any computation strategy has a recovery threshold of at least pm, for any possible parameter values. Recall the definition of the recovery threshold: it suffices to prove that for any computation strategy (f, g, d) and any subset $\mathcal{K}$ of workers, if the master can recover C given results from the workers in $\mathcal{K}$ (i.e., the decoding function $d_{\mathcal{K}}$ returns C for any possible values of A and B), then we must have $|\mathcal{K}| \geq pm$. Suppose the condition in the above statement holds. Given each input A, the workers can compute $\{\tilde{A}_i\}_{i\in\mathcal{K}}$ using the encoding functions. On the other hand, for any fixed possible value of B, the workers can compute $\{\tilde{C}_i\}_{i\in\mathcal{K}}$ based on $\{\tilde{A}_i\}_{i\in\mathcal{K}}$. Hence, letting $\tilde{C}_{i,\text{func}}$ be the function that returns $\tilde{C}_i$ given B as input, $\{\tilde{C}_{i,\text{func}}\}_{i\in\mathcal{K}}$ is completely determined by $\{\tilde{A}_i\}_{i\in\mathcal{K}}$, without requiring additional information about the value of A. If we view A as a random variable, we have the following Markov chain: $A \to \{\tilde{A}_i\}_{i\in\mathcal{K}} \to \{\tilde{C}_{i,\text{func}}\}_{i\in\mathcal{K}}$.
(4.36) Because the master can decode C as a function of $\{\tilde{C}_i\}_{i\in\mathcal{K}}$, if we define $C_{\text{func}}$ similarly as the function that returns C given B as input, then $C_{\text{func}}$ is also completely determined by $\{\tilde{C}_{i,\text{func}}\}_{i\in\mathcal{K}}$, with no direct dependency on any other variables. Consequently, we have the following extended Markov chain:
\[
A \to \{\tilde{A}_i\}_{i\in\mathcal{K}} \to \{\tilde{C}_{i,\text{func}}\}_{i\in\mathcal{K}} \to C_{\text{func}}. \tag{4.37}
\]
Note that by definition, $C_{\text{func}}$ has to satisfy $C_{\text{func}}(B) = A^\intercal B$ for any $A\in\mathbb{F}^{s\times r}$ and $B\in\mathbb{F}^{s\times t}$. Hence, $C_{\text{func}}$ is essentially a linear operator uniquely determined by A, defined as multiplication by $A^\intercal$. Conversely, one can show that distinct values of A lead to distinct operators, which directly follows from the definition of matrix multiplication. Therefore, the input matrix A can be exactly determined from $C_{\text{func}}$, i.e., $H(A \mid C_{\text{func}}) = 0$. Using the data processing inequality, we have $H(A \mid \{\tilde{A}_i\}_{i\in\mathcal{K}}) = 0$. Now let A be uniformly randomly sampled from $\mathbb{F}^{s\times r}$, so that $H(A) = sr\log_2|\mathbb{F}|$ bits. On the other hand, each $\tilde{A}_i$ consists of $\frac{sr}{pm}$ elements, which have an entropy of at most $\frac{sr}{pm}\log_2|\mathbb{F}|$ bits. Consequently, we have
\[
|\mathcal{K}| \geq \frac{H(A)}{\max_{i\in\mathcal{K}} H(\tilde{A}_i)} \geq pm. \tag{4.38}
\]
This concludes the proof of Lemma 4.2 and inequality (4.13).

4.5 Factor-of-2 Characterization of the Optimum Linear Recovery Threshold

In this section, we provide the proof of Theorem 4.3. Specifically, we need to provide a computation strategy that achieves a recovery threshold of at most $2R(p,m,n)-1$ for all possible values of p, m, n, and N, as well as a converse showing that any linear computation strategy requires at least $N\geq R(p,m,n)$ workers for any p, m, and n. The proof is accomplished in two steps. In Step 1, we show that any linear code for matrix multiplication is equivalently an upper bound construction for the bilinear complexity R(p,m,n), and vice versa. This result establishes the equality between R(p,m,n) and the minimum required number of workers, which proves the needed converse.
It also converts any matrix multiplication into the computation of the element-wise product of two vectors of length R(p,m,n). Then in Step 2, we show that an optimal computation strategy can be found for this augmented computing task: we develop a variation of the entangled polynomial code, which achieves a recovery threshold of $2R(p,m,n)-1$. For Step 1, we first formally define upper bound constructions for bilinear complexity. Definition 4.5. Given parameters p, m, n, an upper bound construction for the bilinear complexity R(p,m,n) with rank R is a tuple of tensors $a\in\mathbb{F}^{R\times p\times m}$, $b\in\mathbb{F}^{R\times p\times n}$, and $c\in\mathbb{F}^{R\times m\times n}$ such that for any matrices $A\in\mathbb{F}^{p\times m}$, $B\in\mathbb{F}^{p\times n}$,
\[
\sum_{i} c_{ijk}\Big(\sum_{j',k'} A_{j'k'}\, a_{ij'k'}\Big)\Big(\sum_{j'',k''} B_{j''k''}\, b_{ij''k''}\Big) = \sum_{\ell} A_{\ell j} B_{\ell k}. \tag{4.39}
\]
Recall the definition of linear codes. One can verify that any upper bound construction with rank R is equivalently a linear computing design using R workers when the sizes of the input matrices are given by $A\in\mathbb{F}^{p\times m}$, $B\in\mathbb{F}^{p\times n}$. Since matrix multiplication follows the same rules for block matrices, this equivalence holds for any input sizes (rigorously, it also requires the linear independence of the $\tilde{A}_i^\intercal \tilde{B}_j$'s, which can be easily proved). Specifically, given an upper bound construction (a, b, c) with rank R, and for general inputs $A\in\mathbb{F}^{s\times r}$, $B\in\mathbb{F}^{s\times t}$, any block of the final output C can be computed as
\[
C_{j,k} = \sum_{i} c_{ijk}\, \tilde{A}_{i,\text{vec}}^\intercal \tilde{B}_{i,\text{vec}}, \tag{4.40}
\]
where $\tilde{A}_{i,\text{vec}}$ and $\tilde{B}_{i,\text{vec}}$ are linearly encoded matrices stored by the R workers, defined as
\[
\tilde{A}_{i,\text{vec}} \triangleq \sum_{j,k} A_{j,k}\, a_{ijk}, \qquad \tilde{B}_{i,\text{vec}} \triangleq \sum_{j,k} B_{j,k}\, b_{ijk}. \tag{4.41}
\]
Conversely, one can also show that any linear code using N workers is equivalently an upper bound construction with rank N. This equivalence relationship provides a one-to-one mapping between linear codes and upper bound constructions. Recall the definition of bilinear complexity (provided in Section 4.2), which essentially states that the minimum achievable rank R equals R(p,m,n).
We have shown that the minimum number of workers required by any linear code is given by the same quantity, which proves the converse. In terms of achievability, we have also proved the existence of a linear computing design using R(p,m,n) workers, where the encoding and decoding are characterized by tensors $a\in\mathbb{F}^{R(p,m,n)\times p\times m}$, $b\in\mathbb{F}^{R(p,m,n)\times p\times n}$, and $c\in\mathbb{F}^{R(p,m,n)\times m\times n}$ satisfying equation (4.14), following equations (4.40) and (4.41). This achievability scheme essentially converts matrix multiplication into the problem of computing the element-wise product of two "vectors" $\tilde{A}_{i,\text{vec}}$ and $\tilde{B}_{i,\text{vec}}$, each of length R(p,m,n); specifically, the master only needs the products $\tilde{A}_{i,\text{vec}}^\intercal \tilde{B}_{i,\text{vec}}$ for decoding the final output. Now in Step 2, we develop the optimal computation strategy for this augmented computation task. Given two arbitrary vectors $\tilde{A}_{i,\text{vec}}$ and $\tilde{B}_{i,\text{vec}}$ of length R(p,m,n), we want to achieve a recovery threshold of $2R(p,m,n)-1$ for computing their element-wise product using N workers, each of which can multiply two coded vectors of length 1. As explained in Section 4.3.2, a recovery threshold of N is always achievable, so we only need to consider the scenario where $N\geq 2R(p,m,n)-1$. The main coding idea is to view the elements of each vector as the values of a degree-$(R(p,m,n)-1)$ polynomial at R(p,m,n) different points. Specifically, given R(p,m,n) distinct elements of the field $\mathbb{F}$, denoted $x_0, x_1, \dots, x_{R(p,m,n)-1}$, we find polynomials $\tilde{f}$ and $\tilde{g}$ of degree $R(p,m,n)-1$, whose coefficients are matrices, such that
\[
\tilde{f}(x_i) = \tilde{A}_{i,\text{vec}}, \tag{4.42}
\]
\[
\tilde{g}(x_i) = \tilde{B}_{i,\text{vec}}. \tag{4.43}
\]
Recall that we want to recover the products $\tilde{A}_{i,\text{vec}}^\intercal \tilde{B}_{i,\text{vec}}$, which is essentially recovering the values of the degree-$(2R(p,m,n)-2)$ polynomial $\tilde{h} \triangleq \tilde{f}^\intercal \tilde{g}$ at these R(p,m,n) points. Earlier in this paper, we already developed a coding structure that allows us to recover polynomials of this form; we now reuse this idea.
Let $y_0, y_1, \dots, y_{N-1}$ be distinct elements of $\mathbb{F}$. We let each worker i store
\[
\tilde{A}_i = \tilde{f}(y_i), \tag{4.44}
\]
\[
\tilde{B}_i = \tilde{g}(y_i), \tag{4.45}
\]
which are linear combinations of the input submatrices. More specifically,
\[
\tilde{A}_i = \sum_{j} \tilde{A}_{j,\text{vec}} \prod_{k\neq j} \frac{y_i - x_k}{x_j - x_k}, \tag{4.46}
\]
\[
\tilde{B}_i = \sum_{j} \tilde{B}_{j,\text{vec}} \prod_{k\neq j} \frac{y_i - x_k}{x_j - x_k}. \tag{4.47}
\]
After computing the product, each worker essentially evaluates the polynomial $\tilde{h}$ at $y_i$. Hence, from the results of any $2R(p,m,n)-1$ workers, the master can recover $\tilde{h}$, which has degree $2R(p,m,n)-2$, and proceed with decoding the output matrix C. This construction achieves a recovery threshold of $2R(p,m,n)-1$, which proves the upper bound in Theorem 4.3. Remark 4.10. The computation strategy developed in Step 2 provides a tight characterization of the optimum linear recovery threshold for computing the element-wise product of two arbitrary vectors using N machines. Its optimality naturally follows from Theorem 4.2, given that the element-wise product of two vectors contains all the information needed to compute the dot product, which is a special case of matrix multiplication. We formally state this result in the following corollary. Corollary 4.1. Consider the problem of computing the element-wise product of two vectors of length R using N workers, each of which can store a linearly coded element of each vector and return their product to the master. The optimum linear recovery threshold, denoted $K^*_{\text{e-prod-linear}}$, is given by the following equation (obviously, we need $N\geq R$ to guarantee the existence of a valid computation strategy):
\[
K^*_{\text{e-prod-linear}} = \min\{N,\; 2R-1\}. \tag{4.48}
\]
Remark 4.11. Note that Step 2 of this proof does not require the computation strategy to be linear. Hence, using exactly the same coding approach, we can easily extend this result to non-linear codes and prove a similar factor-of-2 characterization for the optimum recovery threshold $K^*$, formally stated in the following corollary. Corollary 4.2.
For a distributed matrix multiplication problem with parameters p, m, and n, let $N^*(p,m,n)$ denote the minimum number of workers such that a valid (possibly non-linear) computation strategy exists. Then for all possible values of N, we have
\[
N^*(p,m,n) \leq K^* \leq 2N^*(p,m,n) - 1. \tag{4.49}
\]
Remark 4.12. Finally, note that the computing design provided in this section can be applied to any upper bound construction with rank R, achieving a recovery threshold of 2R - 1; its significance is thus two-fold. Using constructions that achieve the bilinear complexity, it proves the existence of a factor-of-2 optimal computing scheme, which achieves the stated recovery threshold while tolerating arbitrarily many stragglers. On the other hand, for cases where R(p,m,n) is not yet known, explicit coding constructions can still be obtained (e.g., using the well-known Strassen result [44], as well as any other known construction, such as the ones presented in [46-60]), which enables further improvements upon the basic entangled polynomial code.

4.5.1 Computational Complexities

Algorithmically, decoding the improved version of the entangled polynomial code can be completed in two steps. In Step 1, the master first recovers the element-wise products $\{\tilde{A}_{i,\text{vec}}^\intercal \tilde{B}_{i,\text{vec}}\}_{i=1}^{R(p,m,n)}$ by Lagrange-interpolating the degree-$(2R(p,m,n)-2)$ polynomial $\tilde{h}$ from $2R(p,m,n)-1$ evaluations, $\frac{rt}{mn}$ times. Similar to the entangled polynomial code, this requires a complexity of at most $O\big(\frac{rt}{mn} R(p,m,n)\log^2(R(p,m,n))\log\log(R(p,m,n))\big)$, which is almost linear in the input size of the decoder ($\Theta(\frac{rt}{mn}R(p,m,n))$ elements). Then in Step 2, the master recovers the final results by linearly combining these products, following equation (4.40). Note that, without even exploiting any algebraic properties of the tensor construction, the natural computing approach achieves a complexity of $\Theta(rt\, R(p,m,n))$ for this second step.
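The Step-2 scheme of Section 4.5, encoding via $\tilde{f}$ and $\tilde{g}$ and interpolating $\tilde{h}$ from any $2R-1$ results, can be sketched numerically over the reals; for brevity the length-R "vectors" here hold scalars rather than coded matrices, and the values of R, N, and the evaluation points are illustrative assumptions:

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(2)
R, N = 4, 10                          # rank R and worker count N (illustrative)
a, b = rng.random(R), rng.random(R)   # the two length-R encoded vectors

x = np.linspace(-1.0, 1.0, R)         # points defining f~ and g~ (eqs. 4.42-4.43)
y = np.linspace(-0.9, 0.9, N)         # distinct worker evaluation points y_i
f = P.polyfit(x, a, R - 1)            # degree R-1 polynomial with f~(x_i) = a_i
g = P.polyfit(x, b, R - 1)            # degree R-1 polynomial with g~(x_i) = b_i
results = P.polyval(y, f) * P.polyval(y, g)   # worker i returns h~(y_i)

K = 2 * R - 1                         # recovery threshold 2R - 1
h = P.polyfit(y[:K], results[:K], K - 1)      # interpolate h~ from any K results
assert np.allclose(P.polyval(x, h), a * b)    # element-wise products recovered
```

Evaluating the interpolated $\tilde{h}$ back at the points $x_i$ yields exactly the element-wise products needed for the final linear recombination in (4.40).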
The natural approach already achieves a strictly smaller decoding complexity than a general linear computing design, which could require inverting an R(p,m,n)-by-R(p,m,n) matrix (similar to matrix multiplication, inverting a k-by-k matrix requires a complexity of $O(k^3)$; faster algorithms have been developed, but all known results require super-quadratic complexity). Moreover, since most commonly used upper bound constructions are based on the sub-multiplicativity of R(p,m,n), further improved decoding algorithms can be designed when these constructions are used. As an example, consider Strassen's construction, which achieves a rank of $R = 7^k \geq R(2^k, 2^k, 2^k)$. The final outputs can essentially be recovered from the intermediate products $\{\tilde{A}_{i,\text{vec}}^\intercal \tilde{B}_{i,\text{vec}}\}_{i=1}^{R}$ by following the last few iterations of Strassen's algorithm, requiring only a linear complexity of $\Theta(\frac{rt}{mn}R)$. This approach achieves an overall decoding complexity of $O(\frac{rt}{mn}R\log^2 R\log\log R)$, which is almost linear in the input size of the decoder. Similar to the discussion in Section 4.3.3, the computational complexity at each worker is $O(\frac{srt}{pmn})$, which is independent of the coding design. Hence, the improved version of the entangled polynomial code also requires no extra computation at the workers, and the decoding overhead becomes negligible when the sizes of the coded submatrices are sufficiently large. Improved performance can also be obtained for systems that require online encoding, following approaches similar to those used in decoding.

4.6 Concluding Remarks

In this paper, we studied the coded distributed matrix multiplication problem and proposed entangled polynomial codes, which allow optimal straggler mitigation and orderwise improve upon the prior arts.
Based on our proposed coding idea, we proved a fundamental connection between the optimum linear recovery threshold and the bilinear complexity, which characterizes the optimum linear recovery threshold within a factor of 2 for all possible parameter values. The techniques developed in this paper can be directly applied to many other problems, including coded convolution and fault-tolerant computing, providing matching characterizations. By directly extending entangled polynomial codes to secure [61-75], private [67,69,74,76], and batch [72,77,78] distributed matrix multiplication, we can also order-wise improve upon all other block-partitioning based schemes [65,66,74,77,78], achieving subcubic recovery thresholds while enabling flexible resource tradeoffs (for details, see Chapter 8). Entangled polynomial codes have also inspired the recent development of coded computing schemes for general polynomial computations [20], secure/private computing [79], and secure sharding in blockchain systems [80]. One interesting follow-up direction is to find a better characterization of the optimum linear recovery threshold. Although this problem is completely solved for cases including m = 1, n = 1, or p = 1, there is room for improvement in the general case. Another interesting question is whether there exist non-linear coding strategies that strictly outperform linear codes, especially in the important case where the input matrices are large ($s, r, t \gg p, m, n$), while allowing for efficient decoding algorithms with almost-linear complexities. Finally, the main focus of this paper is to provide optimal algorithmic solutions for matrix multiplication over general fields. When the base field is infinite, one can instead embed the computation into finite fields to avoid practical issues such as numerical error and computation overheads (see the discussions in [14,81]); it is an interesting future direction to find new quantization and computation schemes and to study the optimal tradeoffs between these measures.
Footnote 15: For details, see Chapter 8.

Chapter 5
Coded Fourier Transform

This chapter is based on [18].

The discrete Fourier transform (DFT) is one of the fundamental operations that has been broadly used in many applications, including signal processing, data analysis, and machine learning algorithms. Due to the increasing size and dimension of data, many modern applications require massive amounts of computation and storage, which cannot be provided by a single machine. Thus, finding efficient designs of algorithms, including the DFT, in a distributed computing environment has gained considerable attention. For example, several distributed DFT implementations, such as FFTW [82] and PFFT [83], have been introduced and are used widely.

In this chapter, our focus is on mitigating the straggler effects for distributed DFT algorithms. Specifically, we consider a distributed Fourier transform problem where we aim to compute the discrete Fourier transform X = F{x} given an input vector x. As shown in Figure 5.1, the computation is carried out using a distributed system with a master node and N worker nodes that can each store and process a 1/m fraction of the input vector, for some parameter m ∈ ℕ*. The vector stored at each worker can be designed as an arbitrary function of the input vector x. Each worker can also compute an intermediate result of the same length based on an arbitrary function of its stored vector, and return it to the master. By designing the computation strategy at each worker (i.e., designing the functions to store the vector and to compute the intermediate result), the master only needs to wait for the fastest subset of workers before recovering the final output X, which mitigates the straggler effects.

Our main result in this paper is the development of an optimal computing strategy, referred to as the coded FFT. This computing design achieves the optimum recovery threshold m, while

Figure 5.1: Overview of the distributed Fourier transform framework.
Coded data are initially stored distributedly at the N workers according to the data assignment. Each worker computes an intermediate result based on its stored vector and returns it to the master. By designing the computation strategy, the master can decode given the computing results from a subset of workers, without having to wait for the stragglers (worker 1 in this example).

allowing the master to decode the final output with low complexity. Furthermore, we extend this technique to settings including computing multi-dimensional Fourier transforms, and propose the corresponding optimal computation strategies.

To develop coded FFT, we leverage two key algebraic properties of the Fourier transform operation. First, due to its recursive structure, we can decompose the DFT into multiple identical and simpler operations (i.e., DFTs over shorter vectors), which suits the distributed computing framework and can potentially be assigned to multiple worker nodes. Secondly, due to the linearity of the Fourier transform, we can apply linear codes on the input data, which commute with the DFT operation and translate to the computing results. These two properties allow us to develop a coded computing strategy where the outputs from the worker nodes have certain MDS properties, which can optimally mitigate straggler effects.

5.1 System Model and Main Results

We consider a problem of computing the discrete Fourier transform X = F{x} in a distributed computing environment with a master node and N worker nodes. The input x and the output X are vectors of length s over an arbitrary field F with a primitive sth root of unity, denoted by ω_s.^1

Footnote 1: When the base field F is finite, we assume it is sufficiently large.

We want to compute the elements of the output vector, denoted by X_0, ..., X_{s-1}, as a function of the elements of the input vector, denoted by x_0, ..., x_{s-1}, based on the following equation:

X_i ≜ Σ_{j=0}^{s-1} x_j ω_s^{ij}   for i ∈ {0, ..., s-1}.
(5.1)

Each one of the N workers can store and process a 1/m fraction of the vector. Specifically, given a parameter m ∈ ℕ* satisfying m | s, each worker i can store an arbitrary vector a_i ∈ F^{s/m} as a function of the input x, compute an intermediate result b_i ∈ F^{s/m} as a function of a_i, and return b_i to the server. The server only waits for the results from a subset of workers before recovering the final output X using certain decoding functions, given the intermediate results returned from the workers.

Given the above system model, we can design the functions that compute the a_i's and b_i's for the workers. We refer to these functions as the encoding functions and the computing functions. We say that a computation strategy consists of N encoding functions and N computing functions, denoted by

f = (f_0, f_1, ..., f_{N-1}),   (5.2)

and

g = (g_0, g_1, ..., g_{N-1}),   (5.3)

that are used to compute the a_i's and b_i's. Specifically, given a computation strategy, each worker i stores a_i and computes b_i according to the following equations:

a_i = f_i(x),   (5.4)
b_i = g_i(a_i).   (5.5)

For any integer k, we say a computation strategy is k-recoverable if the master can recover X given the computing results from any k workers using certain decoding functions. We define the recovery threshold of a computation strategy as the minimum integer k such that the computation strategy is k-recoverable.

The goal of this paper is to find the optimal computation strategy that achieves the minimum possible recovery threshold, while allowing efficient decoding at the master node. This essentially provides the computation strategy with the maximum robustness against the straggler effect, while requiring only a low additional computation overhead.

We summarize our main results in the following theorems.

Theorem 5.1.
In a distributed Fourier transform problem of computing X = F{x} using N workers that can each store and process a 1/m fraction of the input x, we can achieve the following recovery threshold:

K* = m.   (5.6)

Furthermore, the above recovery threshold can be achieved by a computation strategy, referred to as the coded FFT, which allows efficient decoding at the master node, i.e., with a complexity that scales linearly with respect to the size s of the input data.

Moreover, we can prove the optimality of coded FFT, which is formally stated in the following theorem.

Theorem 5.2. In a distributed Fourier transform environment with N workers that can each store and process a 1/m fraction of the input vector, the following recovery threshold

K* = m   (5.7)

is optimal when the base field F is finite.^2

Remark 5.1. The above converse demonstrates that our proposed coded FFT design is optimal in terms of the recovery threshold. Moreover, we can prove that coded FFT is also optimal in terms of the communication load (see Section 5.3).

Remark 5.2. While in the above results we focused on developing the optimal coding technique for the one-dimensional Fourier transform, the techniques developed in this paper can be easily generalized to n-dimensional Fourier transform operations. Specifically, we can show that in a general n-dimensional Fourier transform setting, the optimum recovery threshold K* = m can still be achieved, using a generalized version of the coded FFT strategy (see Section 5.4). Similarly, this also generalizes to the scenario where we aim to compute the Fourier transforms of multiple input vectors; the optimum recovery threshold K* = m can again be achieved (see Section 5.5).

Remark 5.3.
Although the coded FFT strategy was designed with a focus on optimally handling straggler issues, it can also be applied to the fault-tolerant computing setting (e.g., as considered in [84,85], where a module can produce arbitrarily erroneous results under failure), to improve robustness to failures in computing. Specifically, given that coded FFT produces computing results that are coded by an MDS code, it also enables detecting, or correcting, the maximum number of errors, even when the erroneous workers can produce arbitrary computing results.

Footnote 2: Similar results can be generalized to the case where the base field is infinite, by taking into account some practical implementation constraints (see Section 5.3).

5.2 Coded FFT: the Optimal Computation Strategy

In this section, we prove Theorem 5.1 by proposing an optimal computation strategy, referred to as coded FFT. We start by demonstrating this computation strategy and the corresponding decoding procedures through a motivating example.

5.2.1 Motivating Example

Consider a distributed Fourier transform problem with an input vector x = [x_0, x_1, x_2, x_3] ∈ ℂ^4, N = 3 workers, and a design parameter m = 2. We want to compute the Fourier transform X = F{x}, which is specified as follows:

[X_0]   [1    1    1    1  ] [x_0]
[X_1] = [1  -√-1  -1   √-1 ] [x_1]
[X_2]   [1   -1    1   -1  ] [x_2]   (5.8)
[X_3]   [1   √-1  -1  -√-1 ] [x_3]

We aim to design a computation strategy to achieve a recovery threshold of 2.

Figure 5.2: Example using coded FFT, with 3 workers that can each store and process half of the input. (a) Computation strategy: each worker i stores a linear combination of the interleaved versions of the input according to an MDS code, and computes its DFT. (b) Decoding: the master waits for the results from any 2 workers, and recovers the final output by first decoding the MDS code, then computing the transformed vector following steps similar to those in the Cooley-Tukey algorithm.

In order to design the optimal strategy, we exploit two key properties of the DFT operation.
Firstly, the DFT has the following recursive structure:

X_i = Σ_{j=0}^{3} x_j (-√-1)^{ij}   (5.9)
    = Σ_{k=0}^{1} c_{0,k} (-1)^{ik} + (-√-1)^i Σ_{k=0}^{1} c_{1,k} (-1)^{ik},   (5.10)

where the vectors c_0 and c_1 are the interleaved versions of the input vector:

c_0 = [x_0, x_2],   (5.11)
c_1 = [x_1, x_3].   (5.12)

This structure decomposes the Fourier transform into two identical and simpler operations: the Fourier transforms of c_0 and c_1, defined as follows:

C_{i,j} ≜ Σ_{k=0}^{1} c_{i,k} (-1)^{jk}.   (5.13)

Hence, computing the Fourier transform of a vector essentially amounts to computing the Fourier transforms of its sub-components. This property has been exploited in the context of single-machine algorithms and led to the famous Cooley-Tukey algorithm [86].

On the other hand, we exploit the linearity of the DFT operation to inject linear codes into the computation, providing robustness against stragglers. Specifically, given that the Fourier transform of any linearly coded vector equals the same linear combination of the Fourier transforms of the individual vectors, by applying an MDS code to the interleaved vectors c_0 and c_1 and computing their Fourier transforms, we obtain a coded version of the vectors C_0 and C_1. This provides the redundancy needed to mitigate the straggler effects.

Specifically, we encode c_0 and c_1 using a (3, 2)-MDS code, and let each worker store one of the coded vectors, i.e.,

a_0 = c_0,   (5.14)
a_1 = c_1,   (5.15)
a_2 = c_0 + c_1.   (5.16)

Each worker computes the Fourier transform b_i = F{a_i} of its assigned vector. Specifically, each worker i computes

[b_{i,0}]   [1   1] [a_{i,0}]
[b_{i,1}] = [1  -1] [a_{i,1}].   (5.17)

To prove that this computation strategy gives a recovery threshold of 2, we need to design a valid decoding function for any subset of 2 workers. We demonstrate this decodability through a representative scenario, where the master receives the computation results from worker 1 and worker 2, as shown in Figure 5.2. The decodability for the other 2 possible scenarios can be proved similarly.
According to the designed computation strategy, the server can first recover the computing result of worker 0 given the results from the other workers as follows:

b_0 = b_2 - b_1.   (5.18)

After recovering b_0, we can verify that the server can then recover the final output X using b_0 and b_1 as follows:

[X_0]   [b_{0,0} + b_{1,0}      ]
[X_1] = [b_{0,1} - √-1 · b_{1,1}]
[X_2]   [b_{0,0} - b_{1,0}      ]   (5.19)
[X_3]   [b_{0,1} + √-1 · b_{1,1}]

5.2.2 General Description of Coded FFT

Now we present an optimal computing strategy that achieves the optimum recovery threshold stated in Theorem 5.1, for any parameter values of N and m. First of all, we interleave the input vector x into m vectors of length s/m, denoted by c_0, ..., c_{m-1}. Specifically, we let the jth element of each c_i equal

c_{i,j} = x_{i+jm}.   (5.20)

We denote the discrete Fourier transform of each interleaved vector c_i, in the domain of ℤ_{s/m}, by C_i. Specifically,

C_{i,j} ≜ Σ_{k=0}^{s/m-1} c_{i,k} ω_s^{jkm}   for j ∈ {0, ..., s/m - 1}.   (5.21)

Note that if the master node can recover all the above Fourier transforms C_i of the interleaved vectors, the final output can be computed based on the following identities:

X_i = Σ_{j=0}^{m-1} Σ_{k=0}^{s/m-1} c_{j,k} ω_s^{i(j+km)}   (5.22)
    = Σ_{j=0}^{m-1} C_{j, mod(i, s/m)} ω_s^{ij},   (5.23)

where mod(i, s/m) denotes the remainder of i divided by s/m.

Based on this observation, we can naturally view the distributed Fourier transform problem as a problem of distributedly computing a list of linear transformations, i.e., computing the Fourier transforms of the c_i's. We inject redundancy as follows to provide robustness to the computation: we first encode c_0, c_1, ..., c_{m-1} using an arbitrary (N, m)-MDS code, where the coded vectors are denoted a_0, ..., a_{N-1} and are assigned to the workers correspondingly. Then each worker i computes the Fourier transform of a_i and returns it to the master.
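The motivating example of Section 5.2.1 can be checked numerically. The sketch below is our illustration (not the thesis's code); it uses NumPy's FFT, whose sign convention e^{-2πijk/s} matches ω = -√-1 here, and treats worker 0 as the straggler:

```python
import numpy as np

# Motivating example: s = 4, m = 2, N = 3, worker 0 straggles.
x = np.array([1.0, 2.0, 3.0, 4.0], dtype=complex)
c0, c1 = x[0::2], x[1::2]            # interleaved vectors (5.11)-(5.12)

# (3, 2)-MDS encoding (5.14)-(5.16); each worker computes a 2-point DFT (5.17)
b1 = np.fft.fft(c1)                  # worker 1's result
b2 = np.fft.fft(c0 + c1)             # worker 2's result

# Decoding from workers 1 and 2 only: recover worker 0's result (5.18),
# then combine via the butterfly step (5.19), writing sqrt(-1) as 1j
b0 = b2 - b1
X = np.array([b0[0] + b1[0],
              b0[1] - 1j * b1[1],
              b0[0] - b1[0],
              b0[1] + 1j * b1[1]])

assert np.allclose(X, np.fft.fft(x))  # matches the direct 4-point DFT
```

The recovered b_0 equals F{c_0} exactly, so the final combination is the standard decimation-in-time butterfly.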
Given the linearity of the Fourier transform, the computing results b_0, ..., b_{N-1} are essentially linear combinations of the Fourier transforms C_i, coded by the same MDS code. Hence, after the master receives any m computing results, it can decode the messages C_i, and proceed to recover the final result. This achieves the recovery threshold of m.

Remark 5.4. The recovery threshold K* = m achieved by coded FFT cannot be achieved using computation strategies that were developed for generic matrix-by-vector multiplication in the literature [1,3]. Specifically, the conventional uncoded repetition strategy requires a recovery threshold of N - N/m² + 1, and the short-dot (or short-MDS) strategy provided in [1,3] requires N - N/m + m. Hence, by developing a coding strategy for the specific purpose of computing the Fourier transform, we achieve an order-wise improvement in the recovery threshold.

5.2.3 Decoding Complexity of Coded FFT

Now we show that coded FFT allows an efficient decoding algorithm at the master for recovering the output. After receiving the computing results, the master recovers the output in two steps: decoding the MDS code, and then computing X from the intermediate values C_i.

For the first step, the master needs to decode an (N, m)-MDS code s/m times. This can be computed efficiently by selecting an MDS code with low decoding complexity for the coded FFT design. There have been various works on finding efficiently decodable MDS codes (e.g., [38,87]). In general, an upper bound on the decoding complexity of an (N, m)-MDS code is O(m log²m log log m), which can be attained by Reed-Solomon codes [40] using fast polynomial interpolation [39] as the decoding algorithm. Consequently, the first step of the decoding algorithm has a complexity of at most O(s log²m log log m), which scales linearly with respect to s.

For the second step, the master node needs to evaluate equation (5.23) to recover the final result.
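The general strategy and its two-step decoding can be sketched end-to-end. The implementation below is our illustration (not the thesis's code): it instantiates the (N, m)-MDS code as a complex Vandermonde code, so that MDS decoding amounts to solving an m-by-m Vandermonde system per coordinate, and the second step evaluates (5.23) directly.

```python
import numpy as np

def coded_fft(x, N, m, stragglers=()):
    """Illustrative coded FFT over C, using a Vandermonde (N, m)-MDS code."""
    s = len(x)
    c = [x[i::m] for i in range(m)]                  # interleave: c_i[j] = x_{i+jm} (5.20)
    alpha = np.arange(N)                             # distinct evaluation points
    # Encoding: worker w stores a_w = sum_j alpha_w^j c_j and returns its DFT.
    b = {w: np.fft.fft(sum(alpha[w] ** j * c[j] for j in range(m)))
         for w in range(N) if w not in stragglers}
    # Step 1: decode the MDS code from any m surviving workers.
    avail = sorted(b)[:m]
    V = np.vander(alpha[avail].astype(complex), m, increasing=True)
    C = np.linalg.solve(V, np.array([b[w] for w in avail]))  # rows: C_0..C_{m-1} (5.21)
    # Step 2: recombine via X_i = sum_j C_{j, i mod s/m} * omega_s^{ij} (5.23).
    i = np.arange(s)
    omega = np.exp(-2j * np.pi / s)
    return sum(C[j, i % (s // m)] * omega ** (i * j) for j in range(m))
```

For instance, with s = 6, m = 3, N = 5, and workers 0 and 4 straggling, the output agrees with `np.fft.fft(x)`. A production design would replace the Vandermonde solve with a fast Reed-Solomon decoder, as discussed above.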
Equivalently, the master needs to compute

X_{i + j·s/m} = Σ_{k=0}^{m-1} C_{k,i} ω_s^{ik + jk·s/m}   (5.24)

for any i ∈ {0, 1, ..., s/m - 1} and j ∈ {0, ..., m-1}. This is essentially the Fourier transform of s/m vectors of length m, where the kth element of the ith vector equals C_{k,i} ω_s^{ik}. In most cases (e.g., F = ℂ), the Fourier transform of a length-m vector can be computed efficiently with a complexity of O(m log m), which is faster than the corresponding MDS decoding procedure used in the first step. In general, the computational complexity of the Fourier transform is upper bounded by O(m log m log log m), which can be achieved by a combination of Bluestein's algorithm and fast polynomial multiplication [88]. Hence, the complexity of the second step is at most O(s log m log log m).

To conclude, our proposed coded FFT strategy allows efficient decoding with a complexity of at most O(s log²m log log m), which is linear in the input size s. The decoding computation is bottlenecked by the first step of the algorithm, which is essentially decoding an (N, m)-MDS code s/m times. To achieve the best performance, one can pick any MDS code with a decoding algorithm that requires the minimum amount of computation for the problem scenario [41].

5.3 Optimality of Coded FFT

In this section, we prove Theorem 5.2 through a matching information-theoretic converse. Specifically, we need to prove that for any computation strategy, the master needs to wait for at least m workers in order to recover the final output.

Recall that Theorem 5.2 is stated for finite fields; we can let the input x be uniformly randomly sampled from F^s. Given the invertibility of the discrete Fourier transform, the output vector X under this input distribution must also be uniformly random on F^s. This means that the master node essentially needs to recover a random variable with entropy H(X) = s log₂|F| bits. Note that each worker returns s/m elements of F, providing at most (s/m) log₂|F| bits of information.
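In symbols, this counting argument (our restatement) reads:

```latex
K \cdot \frac{s}{m}\log_2|\mathbb{F}|
\;\ge\; H(X) \;=\; s\log_2|\mathbb{F}|
\quad\Longrightarrow\quad
K \;\ge\; m .
```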
By applying a cut-set bound around the master, we can show that the results from at least m workers need to be collected. Thus, the recovery threshold K* = m is optimal.

Remark 5.5. Besides the recovery threshold, the communication load is also an important metric in distributed computing. The above cut-set converse in fact directly bounds the communication load needed for computing the Fourier transform, proving that at least s log₂|F| bits of communication are needed. Note that our proposed coded FFT uses exactly this amount of communication to deliver the intermediate results to the server. Hence, it is also optimal in terms of communication.

Remark 5.6. Although Theorem 5.2 focuses on the scenario where the base field F is finite, similar results can be obtained when the base field is infinite (e.g., F = ℂ), by taking into account practical implementation constraints. For example, any computing device can only keep variables reliably with finite precision. This quantization requirement in fact allows applying the cut-set bound to the distributed Fourier transform problem even when F is infinite, and enables proving the optimality of coded FFT in those scenarios.

5.4 n-dimensional Coded FFT

The Fourier transform in higher-dimensional spaces is a frequently used operation in image processing and machine learning applications. In this section, we consider the problem of designing optimal codes for this operation. We show that the coded FFT strategy can be naturally extended to this scenario and achieves the optimum performance. We start by formulating the system model and stating the main results.

5.4.1 System Model and Main Results

We consider a problem of computing an n-dimensional discrete Fourier transform T = F{t} in a distributed computing environment with a master node and N worker nodes. The input t and the output T are tensors of order n, with dimension s_0 × s_1 × ... × s_{n-1}.
For brevity, we denote the total number of elements in each tensor by s, i.e., s ≜ s_0 s_1 ··· s_{n-1}. The elements of the tensors belong to a field F with a primitive s_k-th root of unity for each k ∈ {0, ..., n-1}, denoted by ω_{s_k}. We want to compute the elements of the output tensor T, denoted by {T_{i_0 i_1 ... i_{n-1}}} for i_ℓ ∈ {0, ..., s_ℓ - 1} and all ℓ ∈ {0, ..., n-1}, as a function of the elements of the input tensor, denoted by {t_{i_0 i_1 ... i_{n-1}}}, based on the following equation:

T_{i_0 i_1 ... i_{n-1}} ≜ Σ_{j_ℓ ∈ {0,...,s_ℓ-1}, ∀ℓ ∈ {0,...,n-1}} t_{j_0 j_1 ... j_{n-1}} Π_{k=0}^{n-1} ω_{s_k}^{i_k j_k}.   (5.25)

Each one of the N workers can store and process a 1/m fraction of the tensor. Specifically, given a parameter m ∈ ℕ* satisfying m | s, each worker i can store an arbitrary vector a_i ∈ F^{s/m} as a function of the input t, compute an intermediate result b_i ∈ F^{s/m} as a function of a_i, and return b_i to the server. The server only waits for the results from a subset of workers before recovering the final output T using certain decoding functions, given the intermediate results returned from the workers.

Similar to the one-dimensional Fourier transform problem, we design the functions that compute the a_i's and b_i's for the workers, and refer to them as the computation strategy. We aim to find an optimal computation strategy that achieves the minimum possible recovery threshold, while allowing efficient decoding at the master node. Our main results are summarized in the following theorems.

Theorem 5.3. In an n-dimensional distributed Fourier transform problem of computing T = F{t} using N workers that can each store and process a 1/m fraction of the input t, we can achieve the following recovery threshold:

K* = m.
(5.26)

Furthermore, the above recovery threshold can be achieved by a computation strategy, referred to as the n-dimensional coded FFT, which allows efficient decoding at the master node, i.e., with a complexity that scales linearly with respect to the size s of the input data.

Moreover, we can prove the optimality of the n-dimensional coded FFT, which is formally stated in the following theorem.

Theorem 5.4. In an n-dimensional distributed Fourier transform environment with N workers that can each store and process a 1/m fraction of the input tensor from a finite field F, the following recovery threshold

K* = m   (5.27)

is optimal.^3

Footnote 3: Similar to the 1-dimensional case, this optimality can be generalized to base fields with infinite cardinality, by taking into account some practical implementation constraints.

5.4.2 General Description of n-dimensional Coded FFT

We first prove Theorem 5.3 by proposing an optimal computation strategy, referred to as the n-dimensional coded FFT, that achieves the recovery threshold K* = m for any parameter values of N and m. First of all, we interleave the input tensor t into m smaller tensors, each with a total size of s/m. Specifically, given that m | s, we can find integers m_0, m_1, ..., m_{n-1} ∈ ℕ with m_0 m_1 ··· m_{n-1} = m, such that m_k | s_k for each k ∈ {0, 1, ..., n-1}; for each tuple (i_0, i_1, ..., i_{n-1}) satisfying i_k ∈ {0, 1, ..., m_k - 1}, we define a tensor c_{i_0 i_1 ... i_{n-1}} with dimension (s_0/m_0) × (s_1/m_1) × ... × (s_{n-1}/m_{n-1}), with the following elements:

c_{i_0 i_1 ... i_{n-1}, j_0 j_1 ... j_{n-1}} = t_{(i_0 + j_0 m_0)(i_1 + j_1 m_1)...(i_{n-1} + j_{n-1} m_{n-1})}.   (5.28)

We denote the discrete Fourier transform of each interleaved tensor c_{i_0 i_1 ... i_{n-1}} by C_{i_0 i_1 ... i_{n-1}}. Specifically,

C_{i_0 i_1 ... i_{n-1}, j_0 j_1 ... j_{n-1}} ≜ Σ_{j'_ℓ ∈ {0,...,s_ℓ/m_ℓ-1}, ∀ℓ ∈ {0,...,n-1}} c_{i_0 i_1 ... i_{n-1}, j'_0 j'_1 ... j'_{n-1}} Π_{k=0}^{n-1} ω_{s_k}^{j_k j'_k m_k}   (5.29)-(5.30)

for any j_ℓ ∈ {0, ..., s_ℓ/m_ℓ - 1}.
Note that if the master node can recover all the above Fourier transforms C_{i_0 i_1 ... i_{n-1}} of the interleaved tensors, the final output can be computed based on the following identity:

T_{i_0 i_1 ... i_{n-1}} = Σ_{j_ℓ ∈ {0,...,m_ℓ-1}, ∀ℓ ∈ {0,...,n-1}} C_{j_0 j_1 ... j_{n-1}, i'_0 i'_1 ... i'_{n-1}} Π_{k=0}^{n-1} ω_{s_k}^{i_k j_k},   (5.31)

where i'_ℓ = mod(i_ℓ, s_ℓ/m_ℓ). Hence, we can view this distributed Fourier transform problem as a problem of computing a list of linear transformations, and we inject redundancy using an MDS code, similar to the one-dimensional coded FFT strategy. Specifically, we encode the c_{i_0 i_1 ... i_{n-1}}'s using an arbitrary (N, m)-MDS code, where the coded tensors are denoted a_0, ..., a_{N-1} and are assigned to the workers correspondingly. Then each worker i computes the Fourier transform of the tensor a_i and returns it to the master. Given the linearity of the Fourier transform, the computing results b_0, ..., b_{N-1} are essentially linear combinations of the Fourier transforms C_{i_0 i_1 ... i_{n-1}}, coded by the same MDS code. Hence, after the master receives any m computing results, it can decode the messages C_{i_0 i_1 ... i_{n-1}}, and proceed to recover the final result. This achieves the recovery threshold of m.

In terms of the decoding complexity, the n-dimensional coded FFT also requires first decoding an MDS code, and then recovering the final result by computing Fourier transforms of tensors of lower dimension. Similar to the one-dimensional FFT, the bottleneck of the decoding algorithm is also the first step, which requires decoding an (N, m)-MDS code s/m times. This decoding complexity is upper bounded by O(s log²m log log m), which is linear with respect to the input size s. It can be further improved in practice by using MDS codes or MDS decoding algorithms with better computational performance.

5.4.3 Optimality of n-dimensional Coded FFT

The optimality of the n-dimensional coded FFT (i.e., Theorem 5.4) can be proved as follows.
When the base field F is finite, let the input t be uniformly randomly sampled from F^s. Given the invertibility of the n-dimensional discrete Fourier transform, the output tensor T under this input distribution must also be uniformly random on F^s. Hence, the master node needs to collect at least H(T) = s log₂|F| bits of information, while each worker can provide at most (s/m) log₂|F| bits. By applying the cut-set bound around the master, we can prove that at least m workers need to return their results to finish the computation. Moreover, the above converse can also be extended to prove that the n-dimensional coded FFT is optimal in terms of communication.

5.5 Coded FFT with Multiple Inputs

Coded FFT can also be extended to optimally handle computation tasks with multiple inputs. In this section, we consider the problem of designing optimal codes for this scenario.

5.5.1 System Model and Main Results

We consider a problem of computing the n-dimensional discrete Fourier transforms of q input tensors, in a distributed computing environment with a master node and N worker nodes. The inputs, denoted by t_0, t_1, ..., t_{q-1}, are q tensors of order n and dimension s_0 × s_1 × ... × s_{n-1}. For brevity, we denote the total number of elements in each tensor by s, i.e., s ≜ s_0 s_1 ··· s_{n-1}. The elements of the tensors belong to a field F with a primitive s_k-th root of unity for each k ∈ {0, ..., n-1}, denoted by ω_{s_k}. We aim to compute the Fourier transforms of the input tensors, which are denoted by T_0, T_1, ..., T_{q-1}. Specifically, we want to compute the elements of the output tensors according to the following equation:

T_{h, i_0 i_1 ... i_{n-1}} ≜ Σ_{j_ℓ ∈ {0,...,s_ℓ-1}, ∀ℓ ∈ {0,...,n-1}} t_{h, j_0 j_1 ... j_{n-1}} Π_{k=0}^{n-1} ω_{s_k}^{i_k j_k}.   (5.32)

Each one of the N workers can store and process a 1/m fraction of the entire input.
Specifically, given a parameter m ∈ ℕ* satisfying m | qs, each worker i can store an arbitrary vector a_i ∈ F^{qs/m} as a function of the input tensors, compute an intermediate result b_i ∈ F^{qs/m} as a function of a_i, and return b_i to the server. The server only waits for the results from a subset of workers, before recovering the final output using certain decoding functions.

For this problem, we can find an optimal computation strategy that achieves the minimum possible recovery threshold, while allowing efficient decoding at the master node. We summarize this result in the following theorems.

Theorem 5.5. For an n-dimensional distributed Fourier transform problem using N workers, if each worker can store and process a 1/m fraction of the q inputs, we can achieve the following recovery threshold:

K* = m.   (5.33)

Furthermore, the above recovery threshold can be achieved by a computation strategy which allows efficient decoding at the master node, i.e., with a complexity that scales linearly with respect to the size s of the input data.

Moreover, we prove the optimality of our proposed computation strategy, which is formally stated in the following theorem.

Theorem 5.6. In an n-dimensional distributed Fourier transform environment with N workers that can each store and process a 1/m fraction of the input, the following recovery threshold

K* = m   (5.34)

is optimal when the base field F is finite.^4

5.5.2 General Description of Coded FFT with Multiple Inputs

We prove Theorem 5.5 by proposing an optimal computation strategy that achieves the recovery threshold K* = m. First of all, we interleave the q inputs into smaller tensors. Specifically, given that m | qs, we can find integers m̃, m_0, m_1, ..., m_{n-1} ∈ ℕ with m̃ · m_0 m_1 ··· m_{n-1} = m, such that m̃ | q and m_k | s_k for each k ∈ {0, 1, ..., n-1}.
For each input tensor t_h and each tuple (i_0, i_1, ..., i_{n-1}) satisfying i_k ∈ {0, 1, ..., m_k - 1}, we define a tensor c_{h, i_0 i_1 ... i_{n-1}} with dimension (s_0/m_0) × (s_1/m_1) × ... × (s_{n-1}/m_{n-1}), with the following elements:

c_{h, i_0 i_1 ... i_{n-1}, j_0 j_1 ... j_{n-1}} = t_{h, (i_0 + j_0 m_0)(i_1 + j_1 m_1)...(i_{n-1} + j_{n-1} m_{n-1})}.   (5.35)

Footnote 4: Similar to the single-input case, this optimality can be generalized to base fields with infinite cardinality, by taking into account some practical implementation constraints.

As explained in Section 5.4.2, if the master node can obtain the Fourier transforms of all the interleaved tensors, then the final outputs can be computed efficiently. Hence, we can view this distributed Fourier transform problem as a problem of computing a list of linear transformations, and we inject redundancy using an MDS code, similar to the single-input coded FFT strategy. Specifically, we first bundle the q input tensors into m̃ disjoint subsets of the same size. For convenience, we denote the set of indices for the ith subset by S_i. Within each subset, we view all interleaved tensors with the same index parameter (i_0, i_1, ..., i_{n-1}) as one message symbol, and we encode all the symbols using an arbitrary (N, m)-MDS code. More precisely, for each g ∈ {0, 1, ..., m̃ - 1} and each index parameter (i_0, i_1, ..., i_{n-1}), we create the symbol {c_{h, i_0 i_1 ... i_{n-1}}}_{h ∈ S_g}. There are m symbols in total, and we encode them using an (N, m)-MDS code. We assign the N coded symbols to the N workers, and each worker computes the Fourier transforms of all coded tensors contained in its symbol. Given the linearity of the Fourier transform, the computing results b_0, ..., b_{N-1} are essentially linear combinations of the Fourier transforms of the interleaved tensors, coded by the same MDS code. Hence, after the master receives any m computing results, it can decode all the needed intermediate values, and proceed to recover the final result.
This achieves the recovery threshold of m.

In terms of the decoding complexity, one can show that the bottleneck of the decoding algorithm is the decoding of the (N, m)-MDS code s/m times, using arguments similar to those in Section 5.4. This decoding complexity is upper bounded by O(s log²m log log m), which is linear with respect to the input size s. It can be further improved in practice by using MDS codes or MDS decoding algorithms with better computational performance.

5.5.3 Optimality of Coded FFT with Multiple Inputs

The optimality of our proposed coded FFT strategy for multiple inputs (i.e., Theorem 5.6) can be proved as follows. When the base field F is finite, let the input tensors be uniformly randomly sampled from F^{q×s}. Given the invertibility of the discrete Fourier transform, the output tensors must also be uniformly random on F^{q×s}. Hence, the master node needs to collect at least qs log₂|F| bits of information, while each worker can provide at most (qs/m) log₂|F| bits. By applying the cut-set bound around the master, we can prove that at least m workers need to return their results to finish the computation. Moreover, the above converse also applies for proving the optimality of coded FFT in terms of communication.

5.6 Concluding Remarks

We considered the problem of computing the Fourier transform of high-dimensional vectors distributedly over a cluster of machines. We proposed a computation strategy, referred to as coded FFT, which achieves the optimal recovery threshold, defined as the minimum number of workers that the master node needs to wait for in order to compute the output. We also extended coded FFT to settings including computing general n-dimensional Fourier transforms, and provided the optimal computing strategies for those settings.
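As a complement, the n-dimensional scheme of Section 5.4 admits the same implementation pattern as the one-dimensional case. The sketch below is our illustration (not the thesis's code): it takes n = 2, interleaves only the first axis (i.e., m_0 = m and m_1 = 1), and again instantiates the MDS code as a complex Vandermonde code.

```python
import numpy as np

def coded_fft_2d(t, N, m, stragglers=()):
    """Illustrative 2-D coded FFT: interleave axis 0 only (m_0 = m, m_1 = 1)."""
    s0, s1 = t.shape
    c = [t[i::m, :] for i in range(m)]               # interleaved tensors, cf. (5.28)
    alpha = np.arange(N)                             # distinct evaluation points
    # Encoding: worker w stores sum_j alpha_w^j c_j and returns its 2-D DFT.
    b = {w: np.fft.fftn(sum(alpha[w] ** j * c[j] for j in range(m)))
         for w in range(N) if w not in stragglers}
    # Step 1: decode the Vandermonde MDS code from any m surviving workers.
    avail = sorted(b)[:m]
    V = np.vander(alpha[avail].astype(complex), m, increasing=True)
    B = np.array([b[w].ravel() for w in avail])
    C = np.linalg.solve(V, B).reshape(m, s0 // m, s1)  # decoded transforms, cf. (5.29)
    # Step 2: recombine, cf. (5.31): T[i0, i1] = sum_j C[j, i0 mod s0/m, i1] * w0^{i0 j}
    rows = np.arange(s0) % (s0 // m)
    omega0 = np.exp(-2j * np.pi / s0)
    i0 = np.arange(s0)[:, None]
    return sum(C[j][rows, :] * omega0 ** (i0 * j) for j in range(m))
```

For a 6-by-4 input with m = 3 and N = 5, the result agrees with `np.fft.fftn(t)` even with a straggling worker.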
There are several interesting future directions, including the practical demonstration of coded FFT over distributed clusters, generalization of coded FFT to more general master-less architectures, and extension of coded FFT to other computing architectures (e.g., edge and fog computing architectures [24,25,89]).

Part II
Optimal Codes for Secure and Private Computation

Chapter 6
Lagrange Coded Computing: Optimal Design for Resiliency, Security, and Privacy

This chapter is based on [20].

The massive size of modern datasets necessitates computational tasks to be performed in a distributed fashion, where the data is dispersed among many servers that operate in parallel [90]. As we "scale out" computations across many servers, several fundamental challenges arise. Besides the straggler effect [31,91,92], distributed computing systems are also much more susceptible to adversarial servers, making security and privacy a major concern [93–95]. We consider a general scenario (see Figure 6.1) in which the computation is carried out distributively across several workers, and propose Lagrange Coded Computing (LCC), a new framework to simultaneously provide

1. resiliency against straggler workers that may prolong computations;
2. security against Byzantine (or malicious, adversarial) workers, with no computational restriction, that deliberately send erroneous data in order to affect the computation for their benefit; and
3. (information-theoretic) privacy of the dataset amidst possible collusion of workers.

LCC can be applied to any computation scenario in which the function of interest is an arbitrary multivariate polynomial of the input dataset. This covers many computations of interest in machine learning, such as various gradient and loss-function computations in learning algorithms and tensor
Figure 6.1: An overview of the problem considered in this chapter, where the goal is to evaluate a not necessarily linear function f on a given dataset X = (X_1, X_2, ..., X_K) using N workers. Each worker applies f on a possibly coded version of the inputs (denoted by X̃_i's). By carefully designing the coding strategy, the master can decode all the required results from a subset of workers, in the presence of stragglers (workers s_1, ..., s_S) and Byzantine workers (workers m_1, ..., m_A), while keeping the dataset private to colluding workers (workers c_1, ..., c_T).

algebraic operations (e.g., low-rank tensor approximation). The key idea of LCC is to encode the input dataset using the well-known Lagrange polynomial, in order to create computational redundancy in a novel coded form across the workers. This redundancy can then be exploited to provide resiliency to stragglers, security against malicious servers, and privacy of the dataset.

Specifically, as illustrated in Fig. 6.1, using a master-worker distributed computing architecture with N workers, the goal is to compute f(X_i) for every X_i in a large dataset X = (X_1, X_2, ..., X_K), where f is a given multivariate polynomial with degree deg f. To do so, N coded versions of the input dataset, denoted by X̃_1, X̃_2, ..., X̃_N, are created, and the workers then compute f over the coded data, as if no coding were taking place. For a given N and f, we say that the tuple (S, A, T) is achievable if there exists an encoding and decoding scheme that can complete the computations in the presence of up to S stragglers and up to A adversarial workers, whilst keeping the dataset private against sets of up to T colluding workers.
Our main result is that by carefully encoding the dataset, the proposed LCC achieves (S, A, T) if (K + T − 1) deg f + S + 2A + 1 ≤ N. The significance of this result is that with one additional worker (i.e., increasing N by 1), LCC can increase the resiliency to stragglers by 1 or increase the robustness to malicious servers by 1/2, while maintaining the privacy constraint. Hence, this result essentially extends the well-known optimal scaling of error-correcting codes (i.e., adding one parity can provide robustness against one erasure or 1/2 error in optimal maximum distance separable codes) to the distributed secure computing paradigm.

We prove the optimality of LCC by showing that it achieves the optimal tradeoff between resiliency, security, and privacy. In other words, any computing scheme (under certain complexity constraints on the encoding and decoding designs) can achieve (S, A, T) if and only if (K + T − 1) deg f + S + 2A + 1 ≤ N.¹ This result further extends the scaling law in coding theory to private computing, showing that each additional worker enables data privacy against 1/deg f additional colluding workers.

Our general theoretical guarantees for LCC are specialized in [20] in the context of least-squares linear regression, which is one of the elemental learning tasks. It is shown via experiments on Amazon EC2 that LCC speeds up the conventional uncoded implementation of distributed least-squares linear regression by up to 13.43×, and also achieves a 2.36×–12.65× speedup over the state-of-the-art straggler mitigation strategies.

Related works. There has recently been a surge of interest in using coding-theoretic approaches to alleviate key bottlenecks (e.g., stragglers, bandwidth, and security) in distributed machine learning applications (e.g., [1,3,14,22,23,96–102]).
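The scaling just described can be read off directly from the achievability condition by solving for S; a small sketch (the helper name and parameter values are ours, for illustration only):

```python
# Maximum number of stragglers S tolerated by LCC, rearranged from
# (K + T - 1) * deg_f + S + 2A + 1 <= N.
def max_stragglers(N, K, T, A, deg_f):
    return N - (K + T - 1) * deg_f - 2 * A - 1

K, T, A, d = 10, 1, 0, 2
base = max_stragglers(40, K, T, A, d)            # 40 - 20 - 0 - 1 = 19
# One extra worker buys exactly one more straggler...
assert max_stragglers(41, K, T, A, d) == base + 1
# ...or, equivalently, two extra workers buy one more adversary.
assert max_stragglers(42, K, T, A + 1, d) == base
```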
As we discuss in more detail in Section 6.2.1, the proposed LCC scheme significantly advances prior works in this area by 1) generalizing coded computing to arbitrary multivariate polynomial computations, which are of particular importance in learning applications; 2) extending the application of coded computing to secure and private computing; 3) reducing the computation/communication load in distributed computing (and distributed learning) by factors that scale with the problem size, without compromising security and privacy guarantees; and 4) enabling a 2.36×–12.65× speedup over the state-of-the-art in distributed least-squares linear regression in cloud networks.

Secure multiparty computing (MPC) and secure/private machine learning (e.g., [103,104]) are also extensively studied topics that address a problem setting similar to LCC. As we elaborate in Section 6.2.1, compared with conventional methods in this area (e.g., the celebrated BGW scheme for secure and private MPC [103]), LCC achieves a substantial reduction in the amount of randomness, storage overhead, and computation complexity.

1 More accurately, when N < K deg f − 1, we prove that the optimal tradeoff is instead given by K(S + 2A + deg f · T + 1) ≤ N, which can be achieved by a variation of the LCC scheme, as described in Appendix C.3.

6.1 Problem Formulation and Examples

We consider the problem of evaluating a multivariate polynomial f : V → U over a dataset X = (X_1, ..., X_K),² where V and U are vector spaces of dimensions M and L, respectively, over the field F. We assume a distributed computing environment with a master and N workers (Figure 6.1), in which the goal is to compute Y_1 ≜ f(X_1), ..., Y_K ≜ f(X_K). We denote the total degree³ of the polynomial f by deg f. In this setting, each worker has already stored a fraction of the dataset prior to computation, in a possibly coded manner.
Specifically, for i ∈ [N] (where [N] ≜ {1, ..., N}), worker i stores X̃_i ≜ g_i(X_1, ..., X_K), where g_i is a (possibly random) function, referred to as the encoding function of that worker. We restrict our attention to linear encoding schemes⁴, which guarantee low encoding complexity and simple implementation. Each worker i ∈ [N] computes Ỹ_i ≜ f(X̃_i) and returns the result to the master. The master waits for a subset of the fastest workers and then decodes Y_1, ..., Y_K. This procedure must satisfy several additional requirements:

• Resiliency, i.e., robustness against stragglers. Formally, the master must be able to obtain the correct values of Y_1, ..., Y_K even if up to S workers fail to respond (or respond after the master executes the decoding algorithm), where S is the resiliency parameter of the system. A scheme that guarantees resiliency against S stragglers is called S-resilient.

• Security, i.e., robustness against adversaries. That is, the master must be able to obtain the correct values of Y_1, ..., Y_K even if up to A workers return arbitrarily erroneous results, where A is the security parameter of the system. A scheme that guarantees security against A adversaries is called A-secure.

• Privacy, i.e., the workers must remain oblivious to the content of the dataset, even if up to T of them collude, where T is the privacy parameter of the system. Formally, for every 𝒯 ⊆ [N] of size at most T, we must have I(X; X̃_𝒯) = 0, where I denotes mutual information, X̃_𝒯 represents the collection of the encoded dataset stored at the workers in 𝒯, and X is seen as chosen uniformly at random.⁵ A scheme that guarantees privacy against T colluding workers is called T-private.⁶

2 We focus on the non-trivial case where K > 0 and f is not constant.
3 The total degree of a polynomial f is the maximum among the total degrees of its monomials.
When discussing finite F, we resort to the canonical representation of polynomials, in which the individual degree within each term is no more than |F| − 1.
4 A formal definition is provided in Section 6.4.
5 Equivalently, it requires that X̃_𝒯 and X are independent. Under this condition, the input data X still appears uniformly random after the colluding workers learn X̃_𝒯, which guarantees the privacy.
6 To guarantee that the privacy requirement is well defined, we assume that F and V are finite whenever T > 0.

More concretely, given any subset of workers that return the computing results (denoted by 𝒦), the master computes (Ŷ_1, ..., Ŷ_K) = h_𝒦({Ỹ_i}_{i∈𝒦}), where each h_𝒦 is a deterministic function (or is random but independent of both the encoding functions and the input data). We refer to the h_𝒦's as decoding functions.⁷ We say that a scheme is S-resilient, A-secure, and T-private if the master always returns the correct results (i.e., each Y_i = Ŷ_i), and all⁸ of the above requirements are satisfied.

Given the above framework, we aim to characterize the region of (S, A, T) such that an S-resilient, A-secure, and T-private scheme can be found, given parameters N, K, and function f, for any sufficiently large field F. This framework encapsulates many computation tasks of interest, which we highlight as follows.

Linear computation. Consider a scenario where the goal is to compute A·b for some dataset A = {A_i}_{i=1}^K and vector b, which naturally arises in many machine learning algorithms, such as each iteration of linear regression. Our formulation covers this by letting V be the space of matrices of certain dimensions over F, U be the space of vectors of a certain length over F, X_i be A_i, and f(X_i) = X_i · b for all i ∈ [K]. Coded computing for such linear computations has also been studied in [1,3,105–107].

Bilinear computation.
Another computation task of interest is to evaluate the element-wise products {A_i · B_i}_{i=1}^K of two lists of matrices {A_i}_{i=1}^K and {B_i}_{i=1}^K. This is the key building block for various algorithms, such as fast distributed matrix multiplication [108]. Our formulation covers this by letting V be the space of pairs of matrices of certain dimensions, U be the space of matrices whose dimension equals that of the product of each pair, X_i = (A_i, B_i), and f(X_i) = A_i · B_i for all i ∈ [K].

General tensor algebra. Beyond bilinear operations, distributed computations of multivariate polynomials of larger degree, such as general tensor algebraic functions (i.e., functions composed of inner products, outer products, and tensor contractions) [109], also arise in practice. A specific example is to compute the coordinate transformation of a third-order tensor field at K locations: given a list of matrices {Q^(i)}_{i=1}^K and a list of third-order tensors {T^(i)}_{i=1}^K with matching dimension on each index, the goal is to compute another list of tensors, denoted by {T′^(i)}_{i=1}^K, each entry of which is defined as T′^(i)_{j′k′ℓ′} ≜ Σ_{j,k,ℓ} T^(i)_{jkℓ} Q^(i)_{jj′} Q^(i)_{kk′} Q^(i)_{ℓℓ′}. Our formulation covers all functions within this class by letting V be the space of input tensors, U be the space of output tensors, X_i be the inputs, and f be the tensor function. These computations are not studied by state-of-the-art coded computing frameworks.

7 Similar to encoding, we also require the decoding functions to have low complexity. When there is no adversary (A = 0), we restrict our attention to linear decoding schemes.
8 In particular, we require that the scheme can operate even if stragglers and malicious workers appear at the same time.

Gradient computation. Another general class of functions arises from gradient descent algorithms and their variants, which are the workhorse of today's learning tasks [110].
The computation task for this class of functions is to consider one iteration of the gradient descent algorithm, and to evaluate the gradient of the empirical risk ∇L_S(h) ≜ avg_{z∈S} ∇ℓ_h(z), given a hypothesis h : ℝ^d → ℝ, a respective loss function ℓ_h : ℝ^{d+1} → ℝ, and a training set S ⊆ ℝ^{d+1}, where d is the number of features. In practice, this computation is carried out by partitioning S into K subsets {S_i}_{i=1}^K of equal size, evaluating the partial gradients {∇L_{S_i}(h)}_{i=1}^K distributedly, and computing the final result using ∇L_S(h) = avg_{i∈[K]} ∇L_{S_i}(h). A specific example of applying this computing model to least-squares regression problems is presented in [111].

6.2 Main Results and Prior Works

We now state our main results and discuss their connections with prior works. Our first theorem characterizes the region of (S, A, T) that LCC achieves (i.e., the set of all feasible S-resilient, A-secure, and T-private schemes via LCC as defined in the previous section).

Theorem 6.1. Given a number of workers N and a dataset X = (X_1, ..., X_K), LCC provides an S-resilient, A-secure, and T-private scheme for computing {f(X_i)}_{i=1}^K for any polynomial f, as long as

(K + T − 1) deg f + S + 2A + 1 ≤ N.   (6.1)

Remark 6.1. To prove Theorem 6.1, we formally present LCC in Section 6.3, which achieves the stated resiliency, security, and privacy. The key idea is to encode the input dataset using the well-known Lagrange polynomial. In particular, the encoding functions (i.e., the g_i's) in LCC amount to evaluations of a Lagrange polynomial of degree K − 1 at N distinct points. Hence, computations at the workers amount to evaluations of a composition of that polynomial with the desired function f. Therefore, inequality (6.1) may simply be seen as the number of evaluations that are necessary and sufficient in order to interpolate the composed polynomial, which is later evaluated at a certain point to finalize the computation. LCC also has a number of additional properties of interest.
First, the proposed encoding is identical for all computations f, which allows pre-encoding of the data without knowing the identity of the computing task (i.e., universality). Second, decoding and encoding rely on polynomial interpolation and evaluation, and hence efficient off-the-shelf subroutines can be used.⁹

Remark 6.2. Besides the coding approach presented to achieve Theorem 6.1, a variation of LCC can be used to achieve any (S, A, T) as long as K(S + 2A + deg f · T + 1) ≤ N. This scheme (presented in Appendix C.3) achieves an improved region when N < K deg f − 1 and T = 0, where it recovers the uncoded repetition scheme. For brevity, we refer to the better of these two schemes as LCC when presenting optimality results (i.e., Theorem 6.2).

Remark 6.3. Note that the left-hand side of inequality (6.1) is independent of the number of workers N; hence, the key property of LCC is that adding 1 worker can increase its resiliency to stragglers by 1 or its security against malicious servers by 1/2, while keeping the privacy constraint T the same. Note that using an uncoded replication-based approach, to increase the resiliency to stragglers by 1, one needs to essentially repeat each computation once more (i.e., requiring K more machines, as opposed to 1 machine in LCC). This result essentially extends the well-known optimal scaling of error-correcting codes (i.e., adding one parity can provide robustness against one erasure or 1/2 error in optimal maximum distance separable codes) to the distributed computing paradigm.

Our next theorem demonstrates the optimality of LCC.

Theorem 6.2. LCC achieves the optimal tradeoff between resiliency, security, and privacy (i.e., achieving the largest region of (S, A, T)) for any multilinear function f among all computing schemes that use linear encoding, for all problem scenarios.

9 A more detailed discussion on the coding complexities of LCC can be found in Appendix C.2.
Moreover, when focusing on the case where no security constraint is imposed, LCC is optimal for any polynomial f among all schemes with the additional constraints of linear decoding and sufficiently large (or zero) characteristic of F.

Remark 6.4. Theorem 6.2 is proved in Section 6.4. The main proof idea is to show that any computing strategy that outperforms LCC would violate the decodability requirement, by finding two instances of the computation process where the same intermediate computing results correspond to different output values.

Remark 6.5. In addition to the result we show in Theorem 6.2, we can also prove that LCC achieves optimality in terms of the amount of randomness used in data encoding. Specifically, we show in [20] that LCC requires injecting the minimum amount of randomness, among all computing schemes that universally achieve the same resiliency-security-privacy tradeoff for all linear functions f.

We conclude this section by discussing several lines of related work in the literature and contrasting them with LCC.

6.2.1 LCC vs. Prior Works

The study of coding-theoretic techniques for accelerating large-scale distributed tasks (a.k.a. coded computing) was initiated in [1,22,23]. Following works focused largely on matrix-vector and matrix-matrix multiplication (e.g., [3,14–16]), gradient computation in gradient descent algorithms (e.g., [97,99,112]), communication reduction via coding (e.g., [113–116]), and secure and private computing (e.g., [101,102]).

LCC recovers several previously studied results as special cases. For example, setting f to be the identity function and V = U reduces to the well-studied case of distributed storage, in which Theorem 6.1 is well known (e.g., the Singleton bound [40, Thm. 4.1]). Further, as previously mentioned, f can correspond to matrix-vector and matrix-matrix multiplication, for which the special cases of Theorem 6.1 are known as well [1,16].
More importantly, LCC improves upon and generalizes these works on coded computing in several aspects:

Generality–LCC significantly generalizes prior works to go beyond the linear and bilinear computations that have so far been the main focus in this area, and can be applied to arbitrary multivariate polynomial computations that arise in machine learning applications. In fact, many specific computations considered in the past can be seen as special cases of polynomial computation. This includes matrix-vector multiplication, matrix-matrix multiplication, and gradient computation whenever the loss function at hand is a polynomial, or is approximated by one.

Universality–once the data has been coded, any polynomial up to a certain degree can be computed distributedly via LCC. In other words, the data encoding of LCC can be universally used for any polynomial computation. This is in stark contrast to previous task-specific coding techniques in the literature. Furthermore, workers apply the same computation as if no coding took place; a feature that reduces computational costs, and prevents ordinary servers from carrying the burden of outliers.

Security and Privacy–other than a handful of works discussed above, straggler mitigation (i.e., resiliency) has been the primary focus of the coded computing literature. This work extends the application of coded computing to secure and private computing for general polynomial computations.

Providing security and privacy for multiparty computing (MPC) and machine learning systems is an extensively studied topic which addresses a problem setting similar to LCC. To illustrate the significant role of LCC in secure and private computing, let us consider the celebrated BGW MPC scheme [103].¹⁰ Given inputs {X_i}_{i=1}^K, BGW first uses Shamir's scheme [117] to encode the dataset in a privacy-preserving manner as P_i(z) = X_i + Z_{i,1} z + ... + Z_{i,T} z^T for every i ∈ [K], where the Z_{i,j}'s are i.i.d. uniformly random variables and T is the number of colluding workers that should be tolerated. The key distinction between the data encoding of the BGW scheme and that of LCC is that we instead use Lagrange polynomials to encode the data. This results in a significant reduction in the amount of randomness needed in data encoding: BGW needs KT random variables Z_{i,j}, while, as we describe in the next section, LCC only needs T.

The BGW scheme then stores {P_i(α_ℓ)}_{i∈[K]} at worker ℓ for every ℓ ∈ [N], given some distinct values α_1, ..., α_N. The computation is then carried out by evaluating f over all stored coded data at the nodes. In the LCC scheme, on the other hand, each worker ℓ only needs to store one encoded data point (X̃_ℓ) and compute f(X̃_ℓ). This gives rise to the second key advantage of LCC, which is a factor-of-K saving in storage overhead and computation complexity at each worker.

After computation, each worker ℓ in the BGW scheme has essentially evaluated the polynomials {f(P_i(z))}_{i=1}^K at z = α_ℓ, whose degree is at most deg(f) · T. Hence, if no straggler or adversary appears (i.e., S = A = 0), the master can recover all required results f(P_i(0)) through polynomial interpolation, as long as N ≥ deg(f) · T + 1 workers participated in the computation.¹¹ Note that under the same condition, the LCC scheme requires N ≥ deg(f) · (K + T − 1) + 1 workers, which is larger than that of the BGW scheme. Hence, in an overall comparison with the BGW scheme, LCC results in a factor-of-K reduction in the amount of randomness, storage overhead, and computation complexity, while requiring more workers to guarantee the same level of privacy.

10 Conventionally, the BGW scheme operates in a multi-round fashion, requiring significantly more communication overhead than one-shot approaches. For simplicity, we present a modified one-shot version of BGW.
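To make the randomness and storage comparison concrete, the following toy Python sketch (our own construction, with illustrative parameters over F_101) implements the one-shot BGW-style encoding just described, where each of the K inputs consumes its own T random masks and each worker stores K shares:

```python
import random

# Shamir/BGW-style masking: P_i(z) = X_i + Z_{i,1} z + ... + Z_{i,T} z^T over F_p.
# Encoding K inputs consumes K * T random field elements (vs. T for LCC).
p, K, T, N = 101, 3, 2, 7
X = [17, 42, 6]
Z = [[random.randrange(p) for _ in range(T)] for _ in range(K)]  # K*T randoms

def P(i, z):  # evaluate the i-th masking polynomial at point z
    return (X[i] + sum(c * pow(z, t + 1, p) for t, c in enumerate(Z[i]))) % p

alphas = list(range(1, N + 1))
# Worker l stores {P_i(alpha_l)} for all i: K shares per worker (vs. 1 in LCC).
shares = [[P(i, a) for i in range(K)] for a in alphas]

def reconstruct(i, pts):
    # Lagrange-interpolate P_i at z = 0 from T+1 shares; P_i(0) = X_i.
    val = 0
    for a in pts:
        lj = 1
        for b in pts:
            if b != a:
                lj = lj * (-b) % p * pow(a - b, p - 2, p) % p
        val = (val + P(i, a) * lj) % p
    return val

assert [reconstruct(i, alphas[:T + 1]) for i in range(K)] == X
```

Any T shares of a single input are jointly uniform (the masks are i.i.d. uniform), which is the T-privacy property; the sketch only checks the reconstruction path.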
This, as well as a comparison with the more conventional multi-round BGW scheme, is summarized in Table 6.1.¹²

Table 6.1: A comparison between BGW-based designs and LCC. The computational complexity is normalized by that of a single evaluation of f; randomness, which refers to the number of random entries used in the encoding functions, is normalized by the length of each input X_i.

                          Multi-round BGW [103]   BGW             LCC
  Complexity per worker   ≥ K                     K               1
  Frac. data per worker   1                       1               1/K
  Rounds of Comm.         Ω(log deg(f))           1               1
  Randomness              ≥ KT                    KT              T
  Min. num. of workers    2T + 1                  deg(f)·T + 1    deg(f)·(K+T−1) + 1

Recently, [101] has also combined ideas from the BGW scheme and [14] to form polynomial sharing, a private coded computation scheme for arbitrary matrix polynomials. However, polynomial sharing inherits the undesired BGW property of performing a communication round for every bilinear operation in the polynomial; a feature that drastically increases the communication overhead, and is circumvented by the one-shot approach of LCC.

DRACO [102] has also recently been proposed as a secure computation scheme for gradients. Yet, DRACO employs a blackbox approach, i.e., the resulting gradients are encoded rather than the data itself, and the inherent algebraic structure of the gradients is ignored. For this approach, [102] shows that a 2A + 1 multiplicative factor of redundant computations is necessary. In LCC, however, the blackbox approach is disregarded in favor of an algebraic one, and consequently, a 2A additive factor suffices.

11 It is also possible to use the conventional multi-round BGW, which only requires N ≥ 2T + 1 workers to ensure T-privacy. However, multiple rounds of computation and communication (Ω(log deg(f)) rounds) are needed, which further increases its communication overhead.
12 A BGW scheme was also proposed in [103] for secure MPC, however for a substantially different setting.
Similarly, a comparison can be made by adapting it to our setting, leading to similar results, which we omit for brevity.

LCC has also recently been applied to several applications in which security and privacy in computations are critical. For example, in [80], LCC has been applied to enable a scalable and secure approach to sharding in blockchain systems. Also, in [79], a privacy-preserving approach for machine learning has been developed that leverages LCC to provide substantial speedups over cryptographic approaches that rely on MPC.

6.3 Lagrange Coded Computing

In this section we prove Theorem 6.1 by presenting LCC and characterizing the region of (S, A, T) that it achieves.¹³ We start with an example to illustrate the key components of LCC.

6.3.1 Illustrating Example

Consider the function f(X_i) = X_i², where the inputs X_i are √M × √M square matrices for some square integer M. We demonstrate LCC in the scenario where the input data X is partitioned into K = 2 batches X_1 and X_2, and the computing system has N = 8 workers. In addition, the suggested scheme is 1-resilient, 1-secure, and 1-private (i.e., it achieves (S, A, T) = (1, 1, 1)).

The gist of LCC is picking a uniformly random matrix Z, and encoding (X_1, X_2, Z) using a Lagrange interpolation polynomial:¹⁴

u(z) ≜ X_1 · (z − 2)(z − 3)/((1 − 2)(1 − 3)) + X_2 · (z − 1)(z − 3)/((2 − 1)(2 − 3)) + Z · (z − 1)(z − 2)/((3 − 1)(3 − 2)).

We then fix distinct {α_i}_{i=1}^8 in F such that {α_i}_{i=1}^8 ∩ [2] = ∅, and let workers 1, ..., 8 store u(α_1), ..., u(α_8). First, note that for every j ∈ [8], worker j sees X̃_j, a linear combination of X_1 and X_2 that is masked by the addition of λ·Z for some nonzero λ ∈ F_11; since Z is uniformly random, this guarantees perfect privacy for T = 1. Next, note that worker j computes f(X̃_j) = f(u(α_j)), which is an evaluation of the composition polynomial f(u(z)), whose degree is at most 4, at α_j. Normally, a polynomial of degree 4 can be interpolated from 5 evaluations at distinct points.

13 For an algorithmic illustration, see Appendix C.1.
14 Assume that F is a finite field with 11 elements.
However, the presence of A = 1 adversary and S = 1 straggler requires the master to employ a Reed-Solomon decoder, and to have three additional evaluations at distinct points (in general, two additional evaluations for every adversary and one for every straggler). Finally, after decoding the polynomial f(u(z)), the master can obtain f(X_1) and f(X_2) by evaluating it at z = 1 and z = 2.

6.3.2 General Description

Similar to Subsection 6.3.1, we select any K + T distinct elements β_1, ..., β_{K+T} from F, and find a polynomial u : F → V of degree at most K + T − 1 such that u(β_i) = X_i for any i ∈ [K], and u(β_i) = Z_i for i ∈ {K + 1, ..., K + T}, where all Z_i's are chosen uniformly at random from V. This is simply accomplished by letting u be the Lagrange interpolation polynomial

u(z) ≜ Σ_{j∈[K]} X_j · Π_{k∈[K+T]\{j}} (z − β_k)/(β_j − β_k) + Σ_{j=K+1}^{K+T} Z_j · Π_{k∈[K+T]\{j}} (z − β_k)/(β_j − β_k).

We then select N distinct elements {α_i}_{i∈[N]} from F such that {α_i}_{i∈[N]} ∩ {β_j}_{j∈[K]} = ∅ (this requirement is alleviated if T = 0), and let X̃_i = u(α_i) for any i ∈ [N]. That is, the input variables are encoded as

X̃_i = u(α_i) = (X_1, ..., X_K, Z_{K+1}, ..., Z_{K+T}) · U_i,   (6.2)

where U ∈ F^{(K+T)×N} is the encoding matrix with entries U_{i,j} ≜ Π_{ℓ∈[K+T]\{i}} (α_j − β_ℓ)/(β_i − β_ℓ), and U_i is its i-th column.¹⁵ This encoding scheme guarantees T-privacy, and a proof can be found in [20].

Following the above encoding, each worker i applies f on X̃_i and sends the result back to the master. Hence, the master obtains N − S evaluations, at most A of which are incorrect, of the polynomial f(u(z)). Since deg(f(u(z))) ≤ deg(f)·(K + T − 1) and N ≥ (K + T − 1) deg(f) + S + 2A + 1, the master can obtain all coefficients of f(u(z)) by applying Reed-Solomon decoding. Having this polynomial, the master evaluates it at β_i for every i ∈ [K] to obtain f(u(β_i)) = f(X_i), and hence we have shown that the above scheme is S-resilient and A-secure.

15 By selecting the values of the α_i's differently, we can recover the uncoded repetition scheme; see Appendix C.3.
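The encoding and decoding steps above can be traced end-to-end in a minimal Python sketch (our own toy instance: scalar inputs over F_11, f(x) = x², K = 2, T = 1, and S = A = 0, so N = (K + T − 1) deg f + 1 = 5 workers suffice and plain interpolation stands in for Reed-Solomon decoding):

```python
# Toy LCC instance over F_11: encode (X_1, X_2, Z) on a Lagrange polynomial,
# let 5 workers square their coded inputs, and recover f(X_1), f(X_2).
p = 11
f = lambda x: x * x % p

def lagrange_eval(xs, ys, z):
    """Evaluate the polynomial interpolating points (xs, ys) at z, mod p."""
    total = 0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        num = den = 1
        for k, xk in enumerate(xs):
            if k != j:
                num = num * (z - xk) % p
                den = den * (xj - xk) % p
        total = (total + yj * num * pow(den, p - 2, p)) % p
    return total

X = [3, 7]                 # dataset (X_1, X_2)
Z = [5]                    # one uniformly random mask gives T = 1 privacy
betas = [1, 2, 3]          # u(1) = X_1, u(2) = X_2, u(3) = Z
alphas = [4, 5, 6, 7, 8]   # worker evaluation points, disjoint from betas

# Encoding: worker i stores the coded input u(alpha_i).
shares = [lagrange_eval(betas, X + Z, a) for a in alphas]
# Workers compute f on coded data; deg f(u(z)) <= 4, so 5 results determine it.
results = [f(s) for s in shares]
# Decoding: interpolate f(u(z)) from the results, then evaluate at beta_1, beta_2.
decoded = [lagrange_eval(alphas, results, b) for b in betas[:2]]
assert decoded == [f(x) for x in X]   # recovers [9, 5]
```

With S stragglers or A adversaries, the same decoding step would instead run a Reed-Solomon decoder on the (partially erroneous) evaluations, as described in the text.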
6.4 Optimality of LCC

In this section, we provide a layout for the proof of the optimality of LCC (i.e., Theorem 6.2). Formally, we define a linear encoding function as one that computes a linear combination of the input variables (and possibly a list of independent uniformly random keys when privacy is taken into account¹⁶), while a linear decoding function computes a linear combination of the workers' outputs.

We essentially need to prove that (a) given any multilinear f, any linear encoding scheme that achieves any (S, A, T) requires at least N ≥ (K + T − 1) deg f + S + 2A + 1 workers when T > 0 or N ≥ K deg f − 1, and N ≥ K(S + 2A + 1) workers in the other cases; and (b) for a general polynomial f, any scheme that uses linear encoding and decoding requires at least the same number of workers, if the characteristic of F is 0 or greater than deg f.

The proof relies on the following key lemma, which characterizes the recovery threshold of any encoding scheme, defined as the minimum number of workers that the master needs to wait for to guarantee decodability.

Lemma 6.1. Given any multilinear f, the recovery threshold of any valid linear encoding scheme, denoted by R, satisfies

R ≥ R_LCC(N, K, f) ≜ min{(K − 1) deg f + 1, N − ⌊N/K⌋ + 1}.   (6.3)

Moreover, if the encoding scheme is T-private, we have R ≥ R_LCC(N, K, f) + T · deg f.

The proof of Lemma 6.1 can be found in Appendix C.4; it proceeds by constructing, for any assumed scheme that achieves a smaller recovery threshold, instances of the computation process in which that scheme fails to achieve decodability. Intuitively, noting that the recovery threshold is exactly the difference between N and the number of stragglers that can be tolerated, inequality (6.3) in fact proves that LCC (described in Section 6.3 and Appendix C.6) achieves the optimum resiliency, as it exactly achieves the stated recovery threshold.

16 This is well defined as we assumed that V is finite when T > 0.
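For quick sanity checks, the threshold in Lemma 6.1 can be computed directly (a small sketch; the function name and example parameters are ours):

```python
# Recovery threshold lower bound from Lemma 6.1, with deg f abbreviated as d,
# plus the extra T * d term when T-privacy is required.
def r_lcc(N, K, d, T=0):
    return min((K - 1) * d + 1, N - N // K + 1) + T * d

assert r_lcc(N=10, K=3, d=2) == 5        # (K - 1) d + 1 = 5 binds
assert r_lcc(N=4, K=3, d=2) == 4         # N - floor(N/K) + 1 = 4 binds
assert r_lcc(N=10, K=3, d=2, T=1) == 7   # privacy adds T * d
```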
Similarly, one can verify that Lemma 6.1 essentially states that LCC achieves the optimal tradeoff between resiliency and privacy.

Assuming the correctness of Lemma 6.1, the two parts of Theorem 6.2 can be proved as follows. To prove part (a) of the converses, we need to extend Lemma 6.1 to also take adversaries into account. This is achieved by using an extended concept of Hamming distance, defined in [16] for coded computing. Part (b) requires generalizing Lemma 6.1 to arbitrary polynomial functions, which is proved by showing that for any f that achieves any (S, T) pair, there exists a multilinear function with the same degree for which a computation scheme can be found that achieves the same requirement. The detailed proofs can be found in Appendices C.5 and C.6, respectively.

Chapter 7
Harmonic Coding: An Optimal Linear Code for Privacy-Preserving Gradient-Type Computation

Gradient computation is the key building block in many optimization and machine learning algorithms. This computation can be simply described as computing the sum of some "partial gradients", which are defined as evaluations of a certain function over disjoint subsets of the input data. This computation structure also broadly appears in various frameworks such as MapReduce [29] and tensor algebra [109]. We refer to it in general as gradient-type computation.

Modern applications that use gradient-type computation often require handling massive amounts of data, and distributing the storage and computation onto multiple machines has become a common approach. However, as more participants come into play, ensuring the privacy of datasets against potentially "curious" workers becomes a fundamental challenge. This critical problem has created a surge of interest in privacy-preserving machine learning (e.g., [104,118–120]).
Motivated as such, we consider a master-worker computing framework, where the goal is to compute

f(X_1, ..., X_K) := g(X_1) + ... + g(X_K),   (7.1)

given a large input dataset X = (X_1, ..., X_K), where g could be any fixed multivariate polynomial with degree deg g (see Fig. 7.1). (This chapter is based on [81].)

[Figure 7.1: An overview of the framework considered in this chapter. The goal is to design a privacy-preserving coded computing scheme for any gradient-type function, using the minimum possible number of workers. The master aims to recover f(X_1, ..., X_K) := g(X_1) + ... + g(X_K) given an input dataset X = (X_1, ..., X_K), for a not necessarily linear function g. Each worker i takes a coded version of the inputs (denoted by X̃_i) and computes g(X̃_i). By carefully designing the coding strategy, the master can decode the function given computing results from the workers, while keeping the dataset private from any curious worker (workers 3 and N in this example).]

Each worker i can take a coded version of the input variables, denoted X̃_i, and then return g(X̃_i) to the master. We aim to find an optimum encoding design, which uses the minimum number of worker machines, such that the master can recover f(X_1, ..., X_K) given all computing results, while the input dataset remains information-theoretically private from any worker. We present a novel coded computing design, called "Harmonic Coding", which universally computes any gradient-type function while enabling the privacy of input data. Our main result is that by carefully designing the encoding strategy, the proposed Harmonic Coding computes any gradient-type function while providing the required data privacy using only N = K(deg g − 1) + 2 workers.
This design strictly improves over the state-of-the-art cryptographic approaches that are based on Shamir's secret sharing scheme [117], as well as the recently proposed Lagrange Coded Computing (LCC) [20], which can be applied to general polynomial computations. These schemes would respectively require N_Shamir := K(deg g + 1) and N_LCC := K deg g + 1 workers. Moreover, Harmonic Coding universally computes any gradient-type function with any given degree, using identical encoding designs. This property allows pre-computing and storing the encoded data well before the identity of the computing task is revealed, reducing the required computation time. The main idea of Harmonic Coding is to design the encoding of the dataset so that the computation result from each worker can be viewed as a linear combination of a partial gradient and some predefined intermediate variables. Using harmonic progression in the coding design, we can make both the intermediate variables and their coefficients redundant, which enables cancellation of the unneeded results in the decoding process. Moreover, Harmonic Coding has a simple recursive structure, which allows efficient (linear complexity) algorithms for both encoding and decoding. We prove the optimality of Harmonic Coding through a matching converse. We show that any linear scheme that computes any gradient-type function with data privacy must use at least the same number of workers required by Harmonic Coding, when the characteristic of the base field is sufficiently large. In the other case, where the characteristic of the base field could be small, we show that improved schemes can be developed for certain functions, while Harmonic Coding remains optimal whenever the partial gradient function g is multilinear. As a side consequence, this converse result also provides a sharp characterization of the condition on the characteristic of the base field for the existence of universally optimal schemes.
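The three worker counts being compared can be written down directly. The sketch below (helper names are ours) tabulates them for the quadratic case deg g = 2, matching the linear-regression comparison made later in this chapter:

```python
# number of workers as a function of K and deg g
def n_shamir(K, d):   return K * (d + 1)      # Shamir-based MPC baseline
def n_lcc(K, d):      return K * d + 1        # Lagrange Coded Computing
def n_harmonic(K, d): return K * (d - 1) + 2  # Harmonic Coding (Theorem 7.1)

K, d = 1000, 2                                # e.g., quadratic partial gradients
assert n_shamir(K, d) == 3 * K                # ~3K workers
assert n_lcc(K, d) == 2 * K + 1               # ~2K workers
assert n_harmonic(K, d) == K + 2              # ~K workers: a two-fold saving
```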
7.1 Problem Formulation

We consider the problem of evaluating a gradient-type function f: V^K → U, characterized by a multivariate polynomial g: V → U according to equation (7.1), given an input dataset X_1, ..., X_K ∈ V, where V and U are some vector spaces over a finite field F. [Footnote 1: We focus on the non-trivial case, where g is not a constant. We also assume that F is sufficiently large.] The computation is carried out in a distributed system with a master and N workers, where each worker i can take a coded variable, denoted by X̃_i, compute g(X̃_i), and return the result to the master. The master aims to recover f(X_1, ..., X_K) using all computing results from the workers. More specifically, using some possibly random encoding functions h_1, ..., h_N, each worker i stores X̃_i = h_i(X_1, ..., X_K) prior to computation. Then, after all workers return the results, the master uses a decoding function ℓ (which is also possibly random, but is independent of the encoding functions) to recover the final output, by computing ℓ(g(X̃_1), ..., g(X̃_N)). We restrict our attention to linear coding schemes [Footnote 2: A formal definition is provided in Section 7.4.], which ensure low coding complexities and are easy to implement. We say a computing scheme is valid if the master always recovers f(X_1, ..., X_K) for any possible values of the input dataset. Moreover, we require that the encoding scheme be data-private, in the sense that none of the workers can infer any information regarding the input dataset. Formally, we require that

I(X_1, ..., X_K; X̃_i) = 0   (7.2)

for any worker i, if the input variables are randomly sampled from any distribution. We aim to characterize the fundamental limit of this problem: finding the minimum possible number of workers among all valid data-private encoding-decoding designs, as well as finding an explicit construction which achieves this optimum.

7.2 Main Results

We summarize our main results in the following theorem.
Our main result is twofold: we first characterize the number of workers required by the proposed Harmonic Coding scheme, and then prove that it achieves the fundamental limit, i.e., it uses the minimum number of workers.

Theorem 7.1. For any gradient-type function characterized by a polynomial g, Harmonic Coding provides a data-private scheme using K(deg g − 1) + 2 workers, where deg g denotes the total degree of g. Moreover, Harmonic Coding requires the optimum number of workers among all linear coding schemes, when the characteristic of the base field is greater than deg g.

Remark 7.1. To prove the first result in Theorem 7.1, we present Harmonic Coding in Section 7.3, which uses exactly the stated optimum number of workers. Compared to Harmonic Coding, one conventional approach used in multiparty computing (MPC) is to first encode each input variable separately using Shamir's secret sharing scheme [117], and then apply the computation on top of the shared pieces. This MPC-based approach requires deg g + 1 evaluations of the function g to compute every single g(X_i), and thus uses K(deg g + 1) workers in total. More recently, we proposed Lagrange Coded Computing [20], which enables evaluating all the g(X_i)'s through a joint computing design. Lagrange Coded Computing encodes the data by constructing a degree-K polynomial whose evaluations at K + 1 points are the input variables and the padded random keys, and then assigns each worker its evaluation at one of N other distinct points. After computation, the decoding process reduces to interpolating a polynomial of degree K deg g, which requires K deg g + 1 workers. Harmonic Coding strictly improves upon both of these approaches. Moreover, Harmonic Coding also enables a significant reduction in the required number of workers.
As an example, consider the linear regression model presented in [20], where the computational bottleneck is to evaluate a gradient-type function characterized by g(X_i) = X_i^⊤ X_i w for some fixed matrix w. [Footnote 3: Formally, deg g is defined based on the canonical representation of g, in which the individual degree within each term is no more than |F| − 1.] As the size of the dataset (K) increases, the MPC- and LCC-based designs require approximately 3K and 2K workers, respectively. On the other hand, Harmonic Coding only requires about K workers, which is a two-fold improvement upon the state of the art. Furthermore, it even approaches the fundamental limit for simply storing the data, where no computation is required. Unlike prior works, our main coding idea is instead to carefully design the encoding so that the workers compute the sum of the g(X_i)'s "in the air". Specifically, we interpret the workers' computing results as the sum of a "partial gradient" (i.e., a single g(X_i)) and some intermediate variables. By encoding the inputs using harmonic progression, all intermediate variables cancel out in the decoding process, and the master directly obtains the sum of all partial gradients.

Remark 7.2. Harmonic Coding also has several additional properties of interest. First, Harmonic Coding is identical for any function f with a given degree. Hence, it enables pre-encoding the data without knowing the identity of the function, and universally computes any function with an upper-bounded degree. Second, Harmonic Coding enables linear complexity algorithms for both encoding and decoding, hence requiring negligible computational coding overheads for most applications. Finally, to provide data privacy against every single worker, Harmonic Coding only uses one single random key throughout the entire process. This achieves the minimum amount of randomness required by any linear scheme.

Remark 7.3. Harmonic Coding reduces to Lagrange Coded Computing in several basic cases.
For example, when K = 1, the master only needs a single evaluation of the function g, which can be optimally computed by LCC. On the other hand, when the computation task is linear, due to commutativity between g and the sum in function f, the master essentially wants to recover g((X_1 + ... + X_K)/K), which can also be optimally computed using LCC by pre-encoding the dataset into the single variable (X_1 + ... + X_K)/K. However, for all other cases, Harmonic Coding achieves the optimum number of workers, which was previously unknown. We complete the proof of Theorem 7.1 by providing a matching converse, which is presented in Section 7.4.

7.3 Achievability Scheme

In this section, we prove the achievability part of Theorem 7.1 by presenting Harmonic Coding. We start with a motivating example for the first non-trivial case, where the master aims to recover the sum of K = 2 quadratic functions.

7.3.1 Example for K = 2, deg g = 2

Consider a gradient-type function given input variables X_1, X_2 ∈ F_5^{m×m} for some integer m, characterized by a quadratic polynomial g(X_i) = A X_i^⊤ X_i + B X_i + C with some constant matrices A, B, and C. We aim to find a data-private computing scheme which only uses 4 workers. To achieve the privacy requirement, we pick a uniformly random matrix Z ∈ F_5^{m×m}, and assign the coded variables by linearly combining X_1, X_2, and Z. One can verify that this requirement is satisfied as long as the variable Z is encoded with a non-zero coefficient in every X̃_i. Hence, it remains to design the code for validity. The main idea of Harmonic Coding is to carefully design the linear combinations such that, after applying g to the coded variables, each computing result equals the sum of a "partial gradient" and some intermediate values that can be canceled out in the decoding process. This property is achieved by encoding the variables using harmonic progression. Specifically, we first define some parameters, letting c = 4 and β = 4.
These values are selected such that c ∉ {0, 1, 2} and β ∉ {0, 1, c/(c−1), c/(c−2)}. We then define some intermediate variables, which are coded using harmonic progression:

P_0 := (c/(c−0)) Z = Z,
P_1 := (c/(c−1)) Z − (1/(c−1)) X_1 = 3X_1 + 3Z,
P_2 := (c/(c−2)) Z − (1/(c−2)) (X_1 + X_2) = 2X_2 + 2X_1 + 2Z.

Given these definitions, the input data is encoded as follows:

X̃_1 = P_0 = Z,
X̃_2 = (1 − β) X_1 + β P_0 = 2X_1 + 4Z,
X̃_3 = (1 − ((c−1)/c) β) X_2 + ((c−1)/c) β P_1 = 3X_2 + 4X_1 + 4Z,
X̃_4 = P_2 = 2X_2 + 2X_1 + 2Z.

Using this encoding design, the master can decode the final result by computing 2g(X̃_1) + g(X̃_2) + 3g(X̃_3) + g(X̃_4), which exactly recovers g(X_1) + g(X_2). As mentioned earlier, the intuition for this decodability is that g(X̃_2) and g(X̃_3) can be represented as the sum of some intermediate values and g(X_1), g(X_2), respectively. This representation is constructed using Lagrange's interpolation formula, by viewing each of them as a quadratic function of β or ((c−1)/c)β.

[Footnote 4: Explicitly, using the sequence {1/(c−i)}_{i∈ℕ₊} as encoding coefficients.]

For instance, by viewing X̃_2 as a linear function of β, after applying the polynomial g we obtain a quadratic function. Re-evaluating this function at points 0 and 1 gives g(X_1) and g(P_0). Moreover, Harmonic Coding provides the recursive relation P_1 = (1 − c/(c−1)) X_1 + (c/(c−1)) P_0, which indicates that g(P_1) can be viewed as evaluating the same polynomial at the point c/(c−1). Hence, Lagrange's interpolation formula gives

g(X_1) = [1 · (c/(c−1))] / [(1 − β)(c/(c−1) − β)] · g(X̃_2) + [β · (c/(c−1))] / [(β − 1)(c/(c−1) − 1)] · g(P_0) + [1 · β] / [(1 − c/(c−1))(β − c/(c−1))] · g(P_1)
       = g(X̃_2) + 2g(P_0) + 3g(P_1).   (7.3)
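The decoding identity above can be verified exhaustively. The sketch below checks the scalar case m = 1 over F_5 (so g reduces to an ordinary quadratic; the particular coefficients 3, 1, 2 of g are arbitrary choices of ours), running over all inputs X_1, X_2 and keys Z:

```python
# brute-force check of the K = 2, deg g = 2 example over F_5 (scalar case m = 1)
p = 5
a, b, c0 = 3, 1, 2                       # arbitrary coefficients of g
def g(x):                                # g(x) = a x^2 + b x + c0 over F_5
    return (a * x * x + b * x + c0) % p

for x1 in range(p):
    for x2 in range(p):
        for z in range(p):               # z plays the role of the random key Z
            enc = [z,                                # X~_1 = P_0 = Z
                   (2 * x1 + 4 * z) % p,             # X~_2
                   (3 * x2 + 4 * x1 + 4 * z) % p,    # X~_3
                   (2 * x2 + 2 * x1 + 2 * z) % p]    # X~_4 = P_2
            dec = (2 * g(enc[0]) + g(enc[1]) + 3 * g(enc[2]) + g(enc[3])) % p
            assert dec == (g(x1) + g(x2)) % p        # recovers g(X_1) + g(X_2)
```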
Similarly, by viewing g(X̃_3) as a quadratic function of ((c−1)/c)β, and viewing g(X_2), g(P_1), and g(P_2) as its evaluations at the points 0, 1, and (c−1)/(c−2), we have

g(X_2) = [1 · ((c−1)/(c−2))] / [(1 − ((c−1)/c)β)((c−1)/(c−2) − ((c−1)/c)β)] · g(X̃_3) + [((c−1)/c)β · ((c−1)/(c−2))] / [(((c−1)/c)β − 1)((c−1)/(c−2) − 1)] · g(P_1) + [1 · ((c−1)/c)β] / [(1 − (c−1)/(c−2))(((c−1)/c)β − (c−1)/(c−2))] · g(P_2)
       = 3g(X̃_3) + 2g(P_1) + g(P_2).   (7.4)

Note that the LHS of equations (7.3) and (7.4) are exactly the two needed "partial gradients", while the sum of the coefficients of g(P_1) on the RHS is zero (which is also due to the Harmonic Coding structure). Thus, by adding these two equations, the intermediate value g(P_1) is canceled, and we have shown that f(X_1, X_2) can be recovered from g(X̃_2), g(X̃_3), g(P_0), and g(P_2). Moreover, note that g(P_0) and g(P_2) are directly computed by worker 1 and worker 4. This completes the intuition for the validity of the proposed design.

7.3.2 General Scheme

Now we present Harmonic Coding for any gradient-type function f with degree d, and for any parameter value of K. We first partition the workers into K + 2 groups, where the first K groups each contain d − 1 workers, and the remaining two groups each contain 1 worker. For brevity, we refer to the workers in the first K groups as "worker i in group j" for i ∈ [d−1], j ∈ [K], and denote the assigned coded variable by X̃_(i,j); we refer to the remaining 2 workers as worker 1 and worker N. Since the base field is assumed to be sufficiently large, we can find a parameter c ∈ F that is not in {0, 1, ..., K}. Moreover, we find parameters β_1, ..., β_{d−1} ∈ F with distinct values that are not in {0} ∪ {c/(c−i)}_{i=0}^{K}. Similarly to Section 7.3.1, we use a uniformly random variable Z ∈ V, and define intermediate variables P_0, ..., P_K as follows:

P_j := (c/(c−j)) Z − (1/(c−j)) Σ_{k=1}^{j} X_k.   (7.5)

Then the input data is encoded based on the following equations.
X̃_1 = P_0,   (7.6)
X̃_(i,j) = (1 − β_i (c−j+1)/c) X_j + (β_i (c−j+1)/c) P_{j−1},   (7.7)
X̃_N = P_K.   (7.8)

Using the above encoding scheme, one can verify that all coded variables are masked by the variable Z, which guarantees data privacy. Hence, it remains to prove decodability, i.e., that f(X_1, ..., X_K) can be recovered by linearly combining the results from the workers. The proof relies on the following lemma, which is proved in Appendix A.

Lemma 7.1. For any gradient-type function with degree at most d, using Harmonic Coding, the master can compute

Q_j := g(X_j) − g(P_{j−1}) (c−j+1) ∏_{i=1}^{d−1} [β_i (c−j+1)] / [β_i (c−j+1) − c] + g(P_j) (c−j) ∏_{i=1}^{d−1} [β_i (c−j)] / [β_i (c−j) − c]   (7.9)

for any j ∈ [K], by linearly combining the computing results from the workers in group j.

Similarly to the motivating example, the proof idea of Lemma 7.1 is to view {g(X̃_(i,j))}_{i∈[d−1]}, g(X_j), g(P_{j−1}), and g(P_j) as evaluations of a degree-d polynomial at d + 2 different points, and to derive equation (7.9) using Lagrange's interpolation formula. Assuming the correctness of Lemma 7.1, the master can first decode the Q_j's given the computing results from the workers. Then, note that in equation (7.9) the sum of the coefficients of each g(P_j) in Q_j and Q_{j+1} is zero for any j ∈ [K−1]. By adding all the variables Q_1, ..., Q_K, the master can obtain a linear combination of f(X_1, ..., X_K), g(P_0), and g(P_K). Finally, because g(P_0) and g(P_K) are computed by worker 1 and worker N, f(X_1, ..., X_K) can be computed by subtracting the corresponding terms. This proves the validity of Harmonic Coding.

Remark 7.4. Harmonic Coding enables efficient encoding and decoding algorithms, both with linear complexity. A linear complexity encoding algorithm can be designed by exploiting the recursive structure relating the intermediate variables and the coded variables.
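As an end-to-end sanity check of equations (7.5)–(7.9), the following sketch (all helper names are ours) instantiates the general scheme for scalar inputs, i.e., V = F_p, with p = 31, K = 3, d = 2, and an arbitrary quadratic g. It decodes by forming the Q_j's of Lemma 7.1 through Lagrange extrapolation to the point 0, letting the inner g(P_j) terms telescope:

```python
# scalar instantiation of Harmonic Coding (V = F_p); N = K(d-1) + 2 workers
p = 31                 # prime with char F > deg g
K, d = 3, 2            # K inputs, deg g = d
def g(x): return (2 * x * x + 3 * x + 4) % p
inv = lambda a: pow(a, p - 2, p)

c = 5                                            # c not in {0, 1, ..., K}
banned = {0} | {c * inv(c - i) % p for i in range(K + 1)}
betas = [b for b in range(1, p) if b not in banned][:d - 1]

X = [7, 11, 23]        # input dataset
Z = 19                 # uniform random key (fixed here for reproducibility)

# intermediate variables, eq (7.5)
P = [(c * inv(c - j) * Z - inv(c - j) * sum(X[:j])) % p for j in range(K + 1)]

# encoding, eqs (7.6)-(7.8)
coded = [P[0]]
for j in range(1, K + 1):
    for b in betas:
        t = b * (c - j + 1) * inv(c) % p
        coded.append(((1 - t) * X[j - 1] + t * P[j - 1]) % p)
coded.append(P[K])
Y = [g(x) for x in coded]                        # the workers' computations

def extrapolate_to_zero(xs):
    # Lagrange coefficients L_k with sum_k L_k h(x_k) = h(0) for deg <= d
    out = []
    for k, xk in enumerate(xs):
        num = den = 1
        for m, xm in enumerate(xs):
            if m != k:
                num, den = num * (-xm) % p, den * (xk - xm) % p
        out.append(num * inv(den) % p)
    return out

# decoding: g((1-t) X_j + t P_{j-1}) is a degree-d polynomial in t, known at
# the worker points, at t = 1 (P_{j-1}), and at t = (c-j+1)/(c-j) (P_j);
# Q_j keeps only the worker terms, and the g(P_j)'s telescope across groups
total = 0
for j in range(1, K + 1):
    pts = [b * (c - j + 1) * inv(c) % p for b in betas]
    pts += [1, (c - j + 1) * inv(c - j) % p]
    L = extrapolate_to_zero(pts)
    total += sum(L[i] * Y[1 + (j - 1) * (d - 1) + i] for i in range(d - 1))
    if j == 1: a1 = L[d - 1]                     # coefficient of g(P_0)
    if j == K: bK = L[d]                         # coefficient of g(P_K)
total = (total + a1 * Y[0] + bK * Y[-1]) % p     # workers 1 and N supply these
assert total == sum(g(x) for x in X) % p         # master recovers f(X_1,...,X_K)
```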
Specifically, the encoder can first compute all the P_i's recursively, each P_i by linearly combining P_{i−1} and X_i; every other coded variable can then be computed directly using equation (7.7). This encoding algorithm requires computing O(N) linear combinations of two variables in V, which is linear with respect to the output size of the encoder. On the other hand, the decoding process simply linearly combines the outputs from all workers, and the natural algorithm achieves linear complexity with respect to the input size of the decoder.

Remark 7.5. Harmonic Coding can also be extended to scenarios where the base field F is infinite. Note that any practical (digital) implementation of such computing tasks requires quantizing the variables into discrete values. We can thus embed them into a finite field, and then directly apply the finite field version of Harmonic Coding. For instance, if the input variables and the coefficients of g are quantized into k-bit integers, the length of the output values is bounded by Õ(k · deg g), which only scales logarithmically with respect to parameters such as the number of workers (N). We can always find a finite field F_p with a prime p = Θ̃(k · deg g) that enables computing f with zero numerical error. This approach also avoids potentially large intermediate computing results, which can save storage and computation time.

7.4 Converse

In this section, we prove the converse part of Theorem 7.1, which shows the optimality of Harmonic Coding. Formally, we define linear coding schemes as ones that use linear encoding functions and linear decoding functions. A linear encoding function computes a linear combination of the input variables and a list of independent uniformly random keys; a linear decoding function computes a linear combination of the workers' outputs.
We need to prove that for any gradient-type function characterized by a polynomial g, any linear coding scheme requires at least K(deg g − 1) + 2 workers, if the characteristic of F is greater than deg g. The proof relies on the following key lemma, which essentially states the optimality of Harmonic Coding among all schemes that use linear encoding functions, when g is multilinear.

Lemma 7.2. For any gradient-type computation where g is a multilinear function, any valid data-private scheme that uses linear encoding functions requires at least K(deg g − 1) + 2 workers.

The proof of Lemma 7.2 can be found in Appendix D.2; the main idea is to construct, for any assumed scheme that uses a smaller number of workers, instances of input values for which validity does not hold. Assuming the correctness of Lemma 7.2, to prove the converse part of Theorem 7.1 we need to generalize this converse to arbitrary polynomial functions g, using the extra assumptions of linear decoding and large characteristic of F. Note that the minimum number of workers stated in Theorem 7.1 only depends on the degree of the computation task. We can generalize Lemma 7.2 by showing that for any gradient-type function f that can be computed with N_f workers, there exists a gradient-type function f′ with the same degree, characterized by a multilinear function, which can also be computed with N_f workers. Specifically, given any function f characterized by a polynomial g with degree d, we provide an explicit construction for such an f′, which is characterized by a multilinear map g′: V^d → U, defined as

g′(X_{i,1}, ..., X_{i,d}) = Σ_{S⊆[d]} (−1)^{|S|} g(Σ_{j∈S} X_{i,j})   (7.10)

for any (X_{i,1}, ..., X_{i,d}) ∈ V^d. As we have proved in an earlier work (see Lemma 4 in [20]), g′ is multilinear with respect to the d inputs. Moreover, if the characteristic of F is greater than d, then g′ is non-zero.
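Construction (7.10) can be checked on a toy instance. Taking g(x) = x³ (so d = 3) over F_7, a direct expansion (specific to this choice of g, and done by us for illustration) shows the sum collapses to the single multilinear monomial −6·x₁x₂x₃, which is non-zero precisely because char F = 7 > 3:

```python
from itertools import combinations

p, d = 7, 3                       # char F = 7 > d = 3
def g(x):                         # toy partial gradient g(x) = x^3
    return pow(x, d, p)

def g_prime(xs):                  # eq (7.10): inclusion-exclusion over S ⊆ [d]
    total = 0
    for size in range(d + 1):
        for S in combinations(range(d), size):
            total += (-1) ** size * g(sum(xs[j] for j in S) % p)
    return total % p

# for g(x) = x^3 the construction evaluates to -6 x1 x2 x3: multilinear in
# each argument, and non-zero since 6 is invertible when char F > 3
for xs in [(1, 1, 1), (2, 3, 4), (5, 6, 1), (0, 4, 2)]:
    assert g_prime(list(xs)) == (-6 * xs[0] * xs[1] * xs[2]) % p
```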
Given this construction, it suffices to prove that f′ admits computation designs that use at most the same number of workers as f. We prove this fact by constructing such a computing scheme for f′ given any design for f, as follows. Let X′_1, ..., X′_K ∈ V^d denote the input variables of f′. We use a uniformly random key, denoted Z′ ∈ V^d, and encode these variables linearly using the same encoding matrix used in the scheme for f. Then, similarly, in the decoding process we let the master compute the final result using the same coefficients for the linear combination. Because the same encoding matrix is used, the new scheme constructed for f′ is also data-private. On the other hand, note that g′ is defined as a linear combination of functions g(Σ_{j∈S} X_{i,j}), each of which is a composition of a linear map and g. Given the linearity of the encoding design, for any subset S, the variables {Σ_{j∈S} X_{i,j}}_{i∈[K]} are encoded as if using the scheme for f. Hence, the master would recover Σ_{i∈[K]} g(Σ_{j∈S} X_{i,j}) if the workers only evaluated the term corresponding to S. Now recall that the decoding function is also assumed to be linear. The same scheme is thus also valid for any linear combination of these terms, which includes f′. Hence, the same number of workers N_f achievable for f can also be achieved for f′. This concludes the converse proof.

7.5 Conclusion

In this chapter, we characterized the fundamental limit of computing gradient-type functions distributedly with data privacy. We proposed Harmonic Coding, which uses the optimum number of workers, and proved a matching converse. However, note that by relaxing the assumptions we made in the system model and the converse theorem (e.g., random key access, coding complexity, characteristic of F), improved schemes can be found. We present the following two examples.
7.5.1 Random Key Access and Extra Computing Power at the Master

Recall that in the system model, we assumed that the decoding function is independent of the encoding functions. This essentially states that the master does not have access to the random keys when decoding the final results. Moreover, we assumed linear decoding, which restricts the computational power of the master. However, if both of these assumptions are removed, and the master has knowledge of the function f, an improved yet practical scheme based on Harmonic Coding can be obtained by letting the master compute g(Z) in parallel with the workers. In this way, the required number of workers can be reduced by 1. Alternatively, if the master has access to the input data, it can compute any g(X_i), effectively reducing K by 1, and reducing the required number of workers by deg g − 1.

7.5.2 An Optimum Scheme for Char F = deg g

The converse presented in Theorem 7.1 is only stated for the case where the characteristic of the base field F is greater than the degree of the function g. In fact, when this condition does not hold, improved coding designs can be found for certain functions f. For example, consider a gradient-type function characterized by g: F^m → F^n defined as

g(X_i) = A · (X_{i,1}^d, X_{i,2}^d, ..., X_{i,m}^d)^⊤,   (7.11)

where A is a fixed non-zero n-by-m matrix, and d equals the characteristic of F. By exploiting the "Freshman's dream" formula (i.e., Σ_i X_{i,j}^d = (Σ_i X_{i,j})^d), one can actually design an optimal data-private scheme that uses only 2 workers for any possible d, instead of using Harmonic Coding, which requires K(d − 1) + 2 workers. Recall that Lemma 7.2 states that Harmonic Coding does achieve optimality for any multilinear g and for any characteristic of F. [Footnote 5: Similar examples and discussions can also be made for the polynomial evaluation problem we considered in Chapter 6.]
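One plausible 2-worker construction behind this claim (our own sketch; the text does not spell one out) is the following: since x ↦ x^d is additive when d = char F, so is g; hence worker 1 can compute g(S + Z) for S = X_1 + ... + X_K and a uniform key Z, worker 2 computes g(Z), and the master subtracts. Below is an exhaustive check for the scalar case m = n = 1, A = 1, over F_9 = F_3[t]/(t² + 1), where char F = 3 = d yet x ↦ x³ is a non-trivial map:

```python
# toy instance over F_9 = F_3[t]/(t^2 + 1), with char F = 3 and g(x) = x^3
import itertools

P = 3
def add(x, y): return ((x[0] + y[0]) % P, (x[1] + y[1]) % P)
def sub(x, y): return ((x[0] - y[0]) % P, (x[1] - y[1]) % P)
def mul(x, y):  # (a + b t)(c + d t) with t^2 = -1
    a, b = x; c, d = y
    return ((a * c - b * d) % P, (a * d + b * c) % P)
def g(x):       # partial gradient g(x) = x^3, degree d = char F = 3
    return mul(x, mul(x, x))

F9 = [(a, b) for a in range(P) for b in range(P)]
K = 3
# the 2-worker scheme: worker 1 computes g(S + Z), worker 2 computes g(Z);
# both inputs are uniformly masked, and the master subtracts the results
for X in itertools.product(F9, repeat=K):
    S = (0, 0)
    for x in X: S = add(S, x)
    for Z in F9:
        decoded = sub(g(add(S, Z)), g(Z))   # = g(S), since g is additive
        expected = (0, 0)
        for x in X: expected = add(expected, g(x))
        assert decoded == expected          # Freshman's dream: g(S) = sum g(X_i)
```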
This implies that when the characteristic of the base field equals deg g, one cannot find a coding scheme that is universally optimal for all gradient-type functions of the same degree. Equivalently, we have shown that the Char F > deg g requirement in the converse statement of Theorem 7.1 provides a sharp lower bound on the characteristic of the base field for guaranteeing the existence of universally optimal schemes.

Chapter 8
Entangled Polynomial Codes for Secure, Private, and Batch Distributed Matrix Multiplication: Breaking the "Cubic" Barrier

In this chapter, we revisit the block matrix multiplication setup considered earlier in Chapter 4, for which we have presented the best known achievability result for straggler mitigation, achieved by two versions of the entangled polynomial code, which characterize the optimum recovery threshold within a factor of 2. [Footnote 1: For brevity, we refer to the collection of them as entangled polynomial codes.] We have shown that the coding gain achieved by entangled polynomial codes extends to fault-tolerant computing, and it is shown in Chapter 6 that security against Byzantine adversaries can be provided in the same way. Our goal is to demonstrate that entangled polynomial codes can be further extended to cover three additional important settings: secure, private, and batch distributed matrix multiplication, providing unified solutions that order-wise improve the state of the art. In secure distributed matrix multiplication [61–75], the goal is to compute a single matrix product while preserving the privacy of the input matrices against eavesdropping adversaries; in private distributed matrix multiplication [67,69,74,76], the goal is to multiply a single pair from two lists of matrices while keeping the request (indices) private; batch distributed matrix multiplication [72,77,78] considers a scenario where more than one pair of matrices are to be multiplied. (This chapter is based on [12].)
There are works on each of these problems that consider general block-wise partitioning of the input matrices [65,66,74,77,78]. However, all results presented in prior works are limited by a "cubic" barrier. Explicitly, when the input matrices to be multiplied are partitioned into m-by-p and p-by-n subblocks, all state-of-the-art schemes require the workers to compute at least pmn products of coded submatrices per multiplication task. In particular, by dividing two input matrices into m-by-p and p-by-n subblocks, a single multiplication task can be viewed as computing linear combinations of pmn submatrix products, which can be assigned to pmn workers. Entangled polynomial codes provide a powerful method for breaking this cubic barrier. They achieve a subcubic recovery threshold, meaning that the final product can be recovered from any subset of multiplication results of size order-wise smaller than pmn. One significance of entangled polynomial codes is that they map non-straggler-mitigating linear coded computing schemes to bilinear-complexity decompositions, which bridges the areas of fast matrix multiplication and coded computation and enables utilizing techniques developed in the rich literature (e.g., [44,46–60,121–124]). Moreover, this connection reduces block-wise matrix multiplication to computing element-wise products, for which we developed the optimal strategy for straggler mitigation. We demonstrate how entangled polynomial codes can be extended to break the cubic barrier in all three problems. We show that the coding ideas of entangled polynomial codes and PCC can be applied to provide unified solutions with the needed security and privacy, as well as to efficiently handle batch evaluation. Moreover, we achieve order-wise improvements upon the state of the art with explicit coding constructions.
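The cubic baseline just described can be made concrete with a toy sketch (our construction, with p = m = n = 2 and scalar "subblocks" for brevity): each of the pmn = 8 products A_{k,i}·B_{k,j} is assigned to one worker, and every block of C = AᵀB is a plain linear combination (here, a sum over k) of those products:

```python
# the uncoded cubic baseline: pmn workers, one submatrix product each
p, m, n = 2, 2, 2
A = [[1, 2], [3, 4]]          # s-by-t with s = p rows, t = m columns of "blocks"
B = [[5, 6], [7, 8]]          # s-by-r with s = p rows, r = n columns of "blocks"

# the pmn = 8 products, one per worker
products = {(k, i, j): A[k][i] * B[k][j]
            for k in range(p) for i in range(m) for j in range(n)}

# each block of C = A^T B is a linear combination of the workers' results
C = [[sum(products[(k, i, j)] for k in range(p)) for j in range(n)]
     for i in range(m)]
assert C == [[1*5 + 3*7, 1*6 + 3*8],
             [2*5 + 4*7, 2*6 + 4*8]]   # A^T B computed directly
```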
8.1 Preliminaries

Recall that for block-partitioned coded matrix multiplication, the goal is to distributedly multiply matrices of sizes F^{s×t} and F^{s×r}, for a sufficiently large field F, with a set of N workers, each of which can multiply a pair of possibly coded matrices of sizes F^{(s/p)×(t/m)} and F^{(s/p)×(r/n)}. Explicitly, in the basic setting, given a pair of input matrices A ∈ F^{s×t} and B ∈ F^{s×r}, each worker i is assigned a pair of possibly coded matrices Ã_i ∈ F^{(s/p)×(t/m)} and B̃_i ∈ F^{(s/p)×(r/n)}, which are encoded based on some (possibly random) functions of the respective input matrices. The workers each compute C̃_i := Ã_i^⊤ B̃_i and return it to the master. The master tries to recover the final product C := A^⊤ B based on the results from a possibly proper subset of the workers, using some decoding functions. The best known recovery threshold for block-partitioned matrix multiplication is achieved by entangled polynomial codes. In particular, entangled polynomial codes achieve a recovery threshold of min{pmn + p − 1, 2R(p,m,n) − 1} for any p, m, and n [16]. Here R(p,m,n) denotes the bilinear complexity [43] of multiplying two matrices of sizes m-by-p and p-by-n, which is well known to be subcubic, i.e., R(p,m,n) = o(pmn) when p, m, and n are large.

Remark 8.1. The bilinear complexity R(p,m,n) should not be confused with a closely related concept: the computational complexity of matrix multiplication. The computational complexity captures the costs of all operations needed to compute a function. As the relative costs of the basic operations can vary across computing systems, the computational complexity is often stated in an approximate form (using big-O notation), and most related works focus on studying its asymptotic behaviour. On the other hand, the bilinear complexity R(p,m,n) is a well-defined integer given any p, m, n and the base field F, which can be accurately stated and characterized.
For example, R(2,2,2) = 7, which indicates that even in a basic scenario where the inputs are partitioned into 2-by-2 subblocks and no straggler mitigation is required, one should consider using linear codes to reduce the number of workers from 8 to 7, as long as the input matrices are large enough that coding overheads are negligible [44,108]. Note that even for cases where R(p,m,n) is not yet known, one can still obtain explicit coding constructions by swapping in any upper bound construction (e.g., [46–60]). Subcubic recovery thresholds can still be achieved for any sufficiently large p, m, and n even if one only applies the well-known Strassen construction [46]. Hence, for simplicity, in this chapter we present all results in terms of R(p,m,n); explicit subcubic constructions can be obtained in the same way. We focus on linear codes, defined similarly as in Chapter 4 and Chapter 6, which guarantee linear coding complexities with respect to the sizes of the input matrices, and are dimension independent. Precisely, in a linear coding design, the input matrix A (or each input A for more general settings) is partitioned into p-by-m subblocks of equal sizes (and possibly padded with a list of i.i.d. uniformly random matrices of the same sizes, referred to as random keys). The matrix (or matrices) B are partitioned similarly. Each worker is then assigned, as coded inputs, a pair of linear combinations from these two lists of submatrices. Moreover, the master uses decoding functions that compute linear combinations of the received computing results. All results presented in this chapter for distributed matrix multiplication directly extend to general codes with possibly non-linear constructions, by swapping any upper bound on R(p,m,n) into the number of workers required by any computing scheme, as we illustrated in [108].

[Footnote 2: More generally, R(p,m,n) < pmn for any p, m, n > 1.]
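The fact R(2,2,2) = 7 is witnessed by Strassen's classical construction [46]: seven products of (sums of) subblocks suffice. The sketch below checks it with scalar entries; the identities hold verbatim when a11, ..., b22 are matrix blocks, since no commutativity is used:

```python
# Strassen's seven-product construction, witnessing R(2, 2, 2) <= 7
def strassen_2x2(A, B):
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
direct = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
assert strassen_2x2(A, B) == direct   # 7 products instead of 8
```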
[Footnote 3: To make sure the setting is well defined, we assume F is finite whenever data-security or privacy is taken into account.]
[Footnote 4: Note that by relaxing certain assumptions made in this chapter, such as allowing the decoder to access inputs and random keys and allowing extra computational cost at workers or the master, one can further reduce the recovery threshold (e.g., see discussions in [72,81]).]

8.2 Secure, Private, and Batch Distributed Matrix Multiplication and Main Results

We show that entangled polynomial codes can be adapted to the settings of secure, private, and batch distributed matrix multiplication, achieving order-wise improvements with subcubic recovery thresholds while meeting the systems' requirements. To demonstrate the coding gain, we focus on applying the second version of the entangled polynomial code, the one that achieves a recovery threshold of 2R(p,m,n) − 1 for straggler mitigation.

8.2.1 Secure Distributed Matrix Multiplication

Secure distributed matrix multiplication follows a setup similar to that discussed in Section 8.1, where the goal is to multiply a single pair of matrices, with the additional constraint that one or both of the input matrices remain information-theoretically private to the workers, even if up to a certain number of them collude. In particular, we say an encoding scheme is one-sided T-secure if

I({Ã_i}_{i∈𝒯}; A) = 0   (8.1)

for any subset 𝒯 of size at most T, where A is generated uniformly at random. Similarly, we say an encoding scheme is fully T-secure if instead

I({Ã_i, B̃_i}_{i∈𝒯}; A, B) = 0   (8.2)

is satisfied for any |𝒯| ≤ T, for uniformly randomly generated A and B. Secure distributed matrix multiplication has been studied in [61–75]. In particular, [65,66,74] presented coded computing designs for general block-wise partitioning of the input matrices, all requiring at least pmn workers' computation.
Entangled polynomial codes achieve subcubic recovery thresholds for both the one-sided and the fully secure settings, as formally stated in the following theorem.

Theorem 8.1. For secure distributed matrix multiplication, there are one-sided T-secure linear coding schemes that achieve a recovery threshold of 2R(p,m,n) + T − 1, and fully T-secure linear coding schemes that achieve a recovery threshold of 2R(p,m,n) + 2T − 1.

Remark 8.2. Entangled polynomial codes order-wise improve the state of the art for general block-wise partitioning [65,66,74] by providing explicit constructions that require a subcubic number of workers. Moreover, entangled polynomial codes simultaneously handle data security and straggler issues by tolerating arbitrarily many stragglers while maintaining the same recovery threshold and privacy level.

Footnote 5: In addition, at least T extra workers are needed per input matrix that is required to be stored securely.

Remark 8.3. Following converse proof steps similar to those we developed in [20,108], one can show that any linear code that is either one-sided T-secure or fully T-secure requires at least R(p,m,n) + T workers. Hence, entangled polynomial codes achieve recovery thresholds that are optimal within a factor of 2 for both settings.

8.2.2 Private Distributed Matrix Multiplication

Private distributed matrix multiplication has been studied in [67,69,74,76], where the goal is instead to multiply a matrix A by one of the matrices B^(D) from B = (B^(1), ..., B^(M)) while keeping the request D private from the workers. In particular, the master sends a (possibly random) query Q_i to each worker i based on the request D. The matrices B are then encoded by each worker i into a coded submatrix B̃_i ∈ F^{(s/p) × (r/n)} based on Q_i. The matrix A is encoded the same way as in the basic setting, and each worker computes the product of its coded matrices.
The index D should be kept private from any single worker, in the sense that

I(D; Q_i, Ã_i, B) = 0    (8.3)

for any i ∈ [N], where A, B, D are sampled uniformly at random. The master can decode the final output based on the returned results, the request D, and the queries Q_i. Moreover, in some related works [67,69,74], the encoding of A is also required to be secure against any single curious worker, i.e.,

I(Ã_i; A) = 0    (8.4)

for any i ∈ [N], where A is sampled uniformly at random. This setting is referred to as private and secure distributed matrix multiplication. The state-of-the-art general block-partitioning based design for private and secure distributed matrix multiplication was proposed in [74], which requires at least pmn workers. Entangled polynomial codes achieve subcubic recovery thresholds, as formally stated in the following theorem.

Footnote 6: Note that a stronger privacy condition I(D; Q_i, Ã_i, A, B) = 0 can still be achieved if one uses the scheme for private and secure distributed matrix multiplication presented later in this paper.

Theorem 8.2. For private coded matrix multiplication, there are linear coding schemes that achieve a recovery threshold of 2R(p,m,n). For private and secure distributed matrix multiplication, linear coding schemes can achieve a recovery threshold of 2R(p,m,n) + 1.

Remark 8.4. Entangled polynomial codes order-wise improve the state of the art for general block-wise partitioning [74] by providing explicit constructions that achieve subcubic recovery thresholds, while simultaneously providing straggler resiliency, data security, and privacy.

Remark 8.5. Similar to the discussion in Remark 8.2, one can show that any linear code requires at least R(p,m,n) workers for private coded matrix multiplication and R(p,m,n) + 1 workers for private and secure distributed matrix multiplication, even if one ignores the privacy requirement. This indicates a factor-of-2 optimality of entangled polynomial codes for both settings.
Entangled polynomial codes also apply to a more general scenario where the encoding functions for both input matrices are assigned to the workers, which we refer to as fully private coded matrix multiplication and formulate as follows. In fully private coded matrix multiplication, we have two lists of input matrices A = (A^(1), ..., A^(M)) and B = (B^(1), ..., B^(M)), and the master aims to compute A^(D)⊤ B^(D) given an index D. We assume M > 1, because otherwise the privacy requirement is trivial. We aim to find computation designs such that D is private against any single worker. Explicitly, the master sends a (possibly random) query Q_i to each worker i based on the demand D. Then worker i encodes both A and B based on Q_i. We require the requests to be private in the sense that

I(D; Q_i, A, B) = 0    (8.5)

for any i ∈ [N], where A, B, D are sampled uniformly at random. We summarize the performance of entangled polynomial codes for fully private coded matrix multiplication as follows.

Theorem 8.3. For fully private coded matrix multiplication, there are linear coding schemes that achieve a recovery threshold of 2R(p,m,n) + 1.

Remark 8.6. Similar to earlier discussions, entangled polynomial codes provide coding constructions for fully private coded matrix multiplication with subcubic recovery thresholds. One can prove that any fully private linear code requires at least R(p,m,n) + 1 workers. Hence, the factor-of-2 optimality of entangled polynomial codes also holds for fully private coded matrix multiplication.

8.2.3 Batch Distributed Matrix Multiplication

The authors of [72,77,78] considered a scenario where the goal is to compute L copies of the matrix multiplication task in one round of communication. Formally, a basic setting for batch distributed matrix multiplication is that we have two lists of input matrices A = (A^(1), ..., A^(L)) and B = (B^(1), ..., B^(L)), and the master aims to compute their element-wise product C = (A^(1)⊤ B^(1), ..., A^(L)⊤ B^(L)).
Given partitioning parameters p, m, and n, each worker still computes a single multiplication of coded submatrices of sizes (t/m) × (s/p) and (s/p) × (r/n). For general block-partitioning based schemes, the state-of-the-art design is provided in [77,78], where the focus is to reduce the recovery threshold and no security or privacy is required. All known coding constructions presented for batch distributed matrix multiplication require a cubic number of workers per multiplication task even when no stragglers are present (i.e., requiring at least Lpmn workers in total). We show that entangled polynomial codes offer a unified coding framework for batch matrix multiplication, achieving subcubic recovery thresholds while simultaneously handling all security and privacy requirements discussed earlier in this section. We present this result in the following theorem. The proofs and detailed formulations can be found in Section 8.5.

Theorem 8.4. For coded distributed batch matrix multiplication with parameters p, m, n, and L, there are linear coding schemes that achieve a recovery threshold of 2LR(p,m,n) − 1. Moreover, for extended settings in batch matrix multiplication, linear coding schemes achieve the following recovery thresholds:
• One-sided T-security: 2LR(p,m,n) + T − 1,
• Full T-security: 2LR(p,m,n) + 2T − 1,
• Privacy of request: 2LR(p,m,n),
• Security and privacy: 2LR(p,m,n) + 1,
• Full privacy: 2LR(p,m,n) + 1.

Remark 8.7. Entangled polynomial codes provide coding schemes that order-wise improve the state-of-the-art schemes in [77,78] for batch matrix multiplication when the matrices are block-wise partitioned.

Remark 8.8. The main proof idea is to note that batch-multiplying L pairs of matrices is still computing a bilinear function, so one can use similar bilinear decomposition bounds for this operation as in [108], and all earlier achievability and converse results extend to batch computation.
However, to better demonstrate the achievability of subcubic recovery thresholds, we present our results based on a subadditivity upper bound. Specifically, letting R(L,p,m,n) denote the bilinear complexity of batch-multiplying L pairs of m-by-p and p-by-n matrices, it satisfies R(L,p,m,n) ≤ L·R(p,m,n). More generally, one can obtain achievability and converse results for batch matrix multiplication by simply substituting R(L,p,m,n) for the quantity R(p,m,n) in the results for single matrix multiplication. One can similarly prove the factor-of-2 optimality of the general entangled polynomial codes framework for all settings we presented for batch matrix multiplication.

Footnote 7: Similar to [108], in the most basic scenario with no requirements on resiliency, security, and privacy (i.e., requiring a recovery threshold of N, with T = 0 and M = 1), one can directly apply any upper bound construction of the bilinear complexity of batch matrix multiplication to further reduce the number of workers by a factor of 2. However, here we focus on demonstrating the coding gain and present the results for general scenarios.

8.3 Achievability Schemes for Secure Distributed Matrix Multiplication

In this section, we present coding schemes for the simple scenario where the only additional requirement for distributed matrix multiplication is to maintain the security of the input matrices. This provides a proof of Theorem 8.1. Given parameters p, m, and n, we denote the partitioned uncoded input matrices by {A_{i,j}}_{i∈[p],j∈[m]} and {B_{i,j}}_{i∈[p],j∈[n]}. The encoding consists of two steps. In Step 1, given any upper bound construction of R(p,m,n) (e.g., Strassen's construction) with rank R and tensor tuples a ∈ F^{R×p×m}, b ∈ F^{R×p×n}, and c ∈ F^{R×m×n}, we pre-encode the inputs into two lists of R coded submatrices each:

Ã_{i,vec} ≜ Σ_{j,k} A_{j,k} a_{ijk},    B̃_{i,vec} ≜ Σ_{j,k} B_{j,k} b_{ijk}.
(8.6)

As we explained in [108], this pre-encoding essentially provides a linear coding scheme with R workers that provides neither straggler resiliency nor data security, both of which must be accounted for in the second step of the encoding. In Step 2, note that it suffices to recover the element-wise products Ã⊤_{1,vec} B̃_{1,vec}, ..., Ã⊤_{R,vec} B̃_{R,vec}. We can build upon optimal coding constructions for element-wise multiplication, first presented in [16] for straggler mitigation and then extended in [20] to also provide data privacy. We first pad the two vectors {Ã_{i,vec}}_{i∈[R]} and {B̃_{i,vec}}_{i∈[R]} with uniformly random keys. If matrix A needs to be stored securely against up to T colluding workers, we pad the pre-encoded matrices of A with T uniformly random matrices Z_1, ..., Z_T ∈ F^{(s/p) × (t/m)}. Explicitly, we define

A′_vec ≜ (Ã_{1,vec}, ..., Ã_{R,vec}, Z_1, ..., Z_T)    (8.7)

if A needs to be stored securely; otherwise, we define

A′_vec ≜ (Ã_{1,vec}, ..., Ã_{R,vec}).    (8.8)

Similarly, we define the vector B′_vec for matrix B in the same way. For brevity, we denote the lengths of A′_vec and B′_vec by L_A and L_B. Then we arbitrarily select R + T distinct elements of F, denoted x_1, ..., x_{R+T}, and N distinct elements of F \ {x_1, ..., x_R}, denoted y_1, ..., y_N. We encode the inputs for each worker i as follows:

Ã_i = Σ_{j∈[L_A]} Ã′_{j,vec} · Π_{k∈[L_A]\{j}} (y_i − x_k)/(x_j − x_k),    (8.9)

B̃_i = Σ_{j∈[L_B]} B̃′_{j,vec} · Π_{k∈[L_B]\{j}} (y_i − x_k)/(x_j − x_k).    (8.10)

As proved in [20], the above encoding scheme satisfies the requirements for both the one-sided and the fully T-secure settings. According to the PCC framework, we have encoded the input matrices using polynomials of degrees L_A − 1 and L_B − 1, where each worker i is assigned their evaluations at y_i.

Footnote 8: For detailed definitions of bilinear complexity and upper bound constructions, see [108].
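The Step-2 encoding can be sketched numerically. The sketch below is a deliberate simplification: scalar entries stand in for the pre-encoded submatrices, and a fixed prime field replaces the general field F. It implements the one-sided T-secure Lagrange encoding of (8.9)-(8.10) and verifies recovery from L_A + L_B − 1 = 2R + T − 1 worker results; the function names are hypothetical.

```python
import random

Q = 2_147_483_647  # a prime modulus standing in for the finite field F (assumption)

def lagrange_eval(points, values, x):
    """Evaluate the unique polynomial through (points[i], values[i]) at x, mod Q."""
    total = 0
    for i, xi in enumerate(points):
        num, den = 1, 1
        for j, xj in enumerate(points):
            if j != i:
                num = num * (x - xj) % Q
                den = den * (xi - xj) % Q
        total = (total + values[i] * num * pow(den, -1, Q)) % Q
    return total

def secure_elementwise_product(a_vec, b_vec, T, N):
    """One-sided T-secure coded element-wise product with scalar 'submatrices'."""
    R = len(a_vec)
    a_pad = a_vec + [random.randrange(Q) for _ in range(T)]  # pad A with T random keys
    b_pad = list(b_vec)                                       # one-sided: B unpadded
    LA, LB = len(a_pad), len(b_pad)
    xs = list(range(1, R + T + 1))               # encoding points x_1, ..., x_{R+T}
    ys = list(range(R + T + 1, R + T + 1 + N))   # worker points, disjoint from xs
    # Worker i multiplies its two coded (scalar) inputs A~(y_i) * B~(y_i).
    results = [lagrange_eval(xs[:LA], a_pad, y) * lagrange_eval(xs[:LB], b_pad, y) % Q
               for y in ys]
    # Master: interpolate the degree-(LA+LB-2) product polynomial from the first
    # LA + LB - 1 responses (the recovery threshold), then read off its values at
    # x_1, ..., x_R, which equal the needed element-wise products.
    K = LA + LB - 1
    return [lagrange_eval(ys[:K], results[:K], x) for x in xs[:R]]
```

The random keys change the encoded polynomial but not its values at x_1, ..., x_R, which is why the decoded products are unaffected by the padding.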
Hence, after the workers multiply their coded matrices, they obtain evaluations of the product of these polynomials, which has degree L_A + L_B − 2. Note that the evaluations of this product polynomial at x_1, ..., x_R recover the needed element-wise products, so the decodability requirement of PCC is satisfied. Consequently, the master can recover the final output by interpolating the product polynomial after sufficiently many results are received from the workers, achieving a recovery threshold of L_A + L_B − 1. Recall that for the one-sided T-secure setting we have L_A = R + T and L_B = R, while for the fully T-secure setting we have L_A = L_B = R + T. Hence, we have obtained linear coding schemes with recovery thresholds of 2R + T − 1 and 2R + 2T − 1 for the two settings, respectively, given any upper bound construction of R(p,m,n) with rank R. Fundamentally, there exist constructions that exactly achieve the rank R(p,m,n), which proves the existence of the coding schemes stated in Theorem 8.1.

Remark 8.9. The coding scheme we presented for computing element-wise products with one-sided privacy naturally extends to provide optimal codes for the scenario of batch computation of multilinear functions where each of the input entries is coded to satisfy possibly different security requirements.

Footnote 9: Such a property is referred to as T-privacy in [20,125].

8.4 Achievability Schemes for Private Distributed Matrix Multiplication

In this section, we present the coding schemes proving Theorems 8.2 and 8.3. We start with the setting of Theorem 8.2, where the goal is to multiply matrix A by one of the matrices B^(1), ..., B^(M). Similar to Section 8.3, we first pre-encode the input matrices into lists of vectors of length R, given any upper bound construction of R(p,m,n) with rank R and tensor tuples a ∈ F^{R×p×m}, b ∈ F^{R×p×n}, and c ∈ F^{R×m×n}.
In particular, given parameters p, m, and n, we denote the partitioned uncoded input matrices by {A_{i,j}}_{i∈[p],j∈[m]} and {B^(ℓ)_{i,j}}_{i∈[p],j∈[n],ℓ∈[M]}. We define

Ã_{i,vec} ≜ Σ_{j,k} A_{j,k} a_{ijk},    B̃^(ℓ)_{i,vec} ≜ Σ_{j,k} B^(ℓ)_{j,k} b_{ijk},    (8.11)

for each i ∈ [R] and ℓ ∈ [M]. Then, given any request D ∈ [M], it suffices to compute the element-wise products {Ã⊤_{i,vec} B̃^(D)_{i,vec}}_{i∈[R]} while keeping D private. For the second part of the encoding scheme, we present a new coding construction for computing element-wise products with privacy, motivated by coding ideas developed in [16,67] and in earlier sections. In particular, we first pad the pre-encoded vector of A with a random key for security. We define

A′_vec ≜ (Ã_{1,vec}, ..., Ã_{R,vec}, Z)    (8.12)

if A needs to be stored securely, where Z is a random key sampled uniformly from F^{(s/p) × (t/m)}; otherwise,

A′_vec ≜ (Ã_{1,vec}, ..., Ã_{R,vec}).    (8.13)

For brevity, we denote the length of A′_vec by L_A. We arbitrarily select R + 1 distinct elements of F, denoted x_1, ..., x_{R+1}, and encode matrix A by defining the following Lagrange polynomial:

Ã(x) ≜ Σ_{i∈[L_A]} Ã′_{i,vec} · Π_{j∈[L_A]\{i}} (x − x_j)/(x_i − x_j).    (8.14)

We then arbitrarily select a finite subset Y of F \ {x_1, ..., x_R} with at least N elements, and let the master generate, uniformly at random, N distinct elements of Y, denoted y_1, ..., y_N. The master sends Ã_i = Ã(y_i) to each worker i, which satisfies the security of A when required. Given a request D, we similarly define

B̃(x) ≜ Σ_{i∈[R+1]} B̃′_{i,vec} · Π_{j∈[R+1]\{i}} (x − x_j)/(x_i − x_j),    (8.15)

where

B′_vec ≜ (B̃^(D)_{1,vec}, ..., B̃^(D)_{R,vec}, Y),    (8.16)

and Y ∈ F^{(s/p) × (r/n)} is a quantity to be specified later. If the encoding can be designed such that each worker i essentially computes Ã⊤(y_i) B̃(y_i), then we can achieve the recovery thresholds stated in Theorem 8.2.
To construct a private computing scheme where B̃_i is equivalent to B̃(y_i), we divide B̃(x) by the scalar

c(x) ≜ Π_{j∈[R]} (x − x_j)/(x_{R+1} − x_j),    (8.17)

so that the result can be expressed as the unweighted sum of Y and B̃^(D)_Norm(x), with the function B̃^(·)_Norm(x) defined as follows:

B̃^(k)_Norm(x) ≜ − Σ_{i∈[R]} B̃^(k)_{i,vec} Π_{j∈[R]\{i}} ((x_{R+1} − x_j)/(x_i − x_j)) · ((x − x_{R+1})/(x − x_i)).

We let the master generate i.i.d. uniformly random variables {z_i}_{i∈[M]\{D}} from Y, independent of the y_i's. The master sends a query Q_i = (q_{i1}, ..., q_{iM}) to each worker i, with q_{ij} = y_i for j = D and q_{ij} = z_j for j ≠ D. Because each Q_i appears uniformly random to worker i, the presented coding scheme satisfies the privacy requirement. We let each worker i encode B by computing Σ_j B̃^(j)_Norm(q_{ij}). Consequently, each encoded variable can be re-expressed as

B̃_i = B̃(y_i)/c(y_i)    (8.18)

with Y = Σ_{j≠D} B̃^(j)_Norm(q_{ij}), independent of y_i. After the workers multiply their coded matrices, each worker i essentially returns Ã⊤(y_i) B̃(y_i)/c(y_i). Because y_i is available at the decoder, the master can recover Ã⊤(y_i) B̃(y_i) from each worker i's returned result by multiplying it by c(y_i). Hence, after receiving results from sufficiently many workers, the master can recover the needed element-wise products by Lagrange-interpolating the polynomial Ã⊤(x) B̃(x), and proceed to compute the final output. Because the degree of Ã⊤(x) B̃(x) equals L_A − 1 + R, the presented coding scheme achieves a recovery threshold of L_A + R. Note that L_A = R when no security is required and L_A = R + 1 when A is stored securely.

Footnote 10: Note that here we are exploiting the fact that each worker computes a multilinear function. For more general scenarios (e.g., the general polynomial evaluations we considered in [20]), scaling the coded variables could affect decodability.
We have thus obtained linear coding schemes with recovery thresholds of 2R − 1 for private coded matrix multiplication and 2R for private and secure distributed matrix multiplication, for any upper bound construction of R(p,m,n), which completes the proof of Theorem 8.2.

Remark 8.10. This coding scheme naturally extends to the scenario where the encoding of A is required to be T-secure. A recovery threshold of 2R(p,m,n) + T can be achieved, which is optimal within a factor of 2.

We now present the coding scheme for the fully private setting. The matrices are pre-encoded the same way, and we denote the corresponding vectors by {Ã^(ℓ)_{i,vec}, B̃^(ℓ)_{i,vec}}_{i∈[R],ℓ∈[M]}. To recover the final output, it suffices to compute {Ã^(D)⊤_{i,vec} B̃^(D)_{i,vec}}_{i∈[R]}. We arbitrarily select R + 1 distinct elements of F, denoted x_1, ..., x_{R+1}, and define the following functions:

Ã^(k)_Norm(x) ≜ − Σ_{i∈[R]} Ã^(k)_{i,vec} Π_{j∈[R]\{i}} ((x_{R+1} − x_j)/(x_i − x_j)) · ((x − x_{R+1})/(x − x_i)),

B̃^(k)_Norm(x) ≜ − Σ_{i∈[R]} B̃^(k)_{i,vec} Π_{j∈[R]\{i}} ((x_{R+1} − x_j)/(x_i − x_j)) · ((x − x_{R+1})/(x − x_i)).

We then arbitrarily select a finite subset Y of F \ {x_1, ..., x_R} with at least N elements. Let the master generate, uniformly at random, N distinct elements of Y, denoted y_1, ..., y_N, together with i.i.d. uniformly random variables {z_i}_{i∈[M]\{D}} from Y, independent of the y_i's. The master sends a query Q_i = (q_{i1}, ..., q_{iM}) to each worker i, with q_{ij} = y_i for j = D and q_{ij} = z_j for j ≠ D. This query is fully private because, for each worker i, the entries q_{i1}, ..., q_{iM} appear i.i.d. uniformly random in Y. Each worker i encodes the input matrices as follows:

Ã_i = Σ_j Ã^(j)_Norm(q_{ij}),    (8.19)
By interpolating this polynomial and re-evaluating it at x i ’s, the master can recover all needed element-wise products. This provides a coding scheme that proves Theorem 8.3. 8.5 Achievability Schemes for Batch Distributed Matrix Multipli- cation In this section, we present the coding scheme for proving Theorem 8.4. We start with the basic setting where no security or privacy is required. As mentioned in Section 8.2.3, one can directly decompose the tensor characterizing the L-batch matrix multiplication, and all earlier results as well as Theorem 3 in [16] extends to batch distributed matrix multiplication. However, we instead present one certain class of upper bounds based on subadditivity of tensor rank. Explicitly, we denote the partitioned uncoded input matrices by{A (k) i,j } i∈[p],j∈[m],k∈[L] and {B (k) i,j } i∈[p],j∈[n],k∈[L] . Given any upper bound construction of R(p,m,n) with rank R and tensor tuples a∈F R×p×m , b∈F R×p×n , and c∈F R×m×n , we define ˜ A i,`,vec , X j,k A (`) j,k a ijk , ˜ B i,`,vec , X j,k B (`) j,k b ijk . (8.21) for each i∈ [R] and `∈ [L]. Note that the batch product can be recovered from the element-wise product{ ˜ A | i,`,vec ˜ B i,`,vec } i∈[R],`∈[L] . One can directly apply the optimal coding scheme presented in [16], which encodes the pre-encoded vectors using Lagrange polynomials. According to Corollary 1 in [16], the resulting scheme achieves a recovery threshold of 2LR− 1, which proves the based scenario for Theorem 8.4. Remark 8.11. In [100], Lagrange encoding is also applied to compute inner product (sum of element- wise products) to achieve the same recovery threshold. Remarkably, [100] pointed out that the encoding can be made systematic as Lagrange polynomials passes through all uncoded inputs, as stated in [126]. 
It is mentioned in [100] that the main benefit of systematic encoding designs is to enable recovery from the results of a certain smaller subset of "systematic" workers, which provides backward compatibility and potentially reduces computation and decoding latency. Based on this observation, entangled polynomial codes can be adapted to a "systematic" version that goes beyond inner products and handles general block-wise partitioned matrices, by choosing the same evaluation points as in [126], so that a subset of R(p,m,n) workers computes all needed "uncoded" products of the pre-encoded matrices, and all major benefits of systematic encoding are provided. This construction gives a practical solution to an open problem stated in [127], in the sense of achieving all major benefits of systematic encoding and improving recovery thresholds for any sufficiently large values of p, m, and n.

We now formally state the settings with security and privacy requirements. Similar to Section 8.2, for batch matrix multiplication with a security requirement, the formulation is the same as the basic setup for batch distributed matrix multiplication, except that the inputs need to be stored information-theoretically privately even if up to T workers collude. We say a coding scheme is one-sided T-secure if

I({Ã_i}_{i∈T}; A) = 0    (8.22)

for any subset T of size at most T, where A is generated uniformly at random. We say an encoding scheme is fully T-secure if instead

I({Ã_i, B̃_i}_{i∈T}; A, B) = 0    (8.23)

is satisfied for any |T| ≤ T, for uniformly randomly generated A and B. When privacy is taken into account, the goal is instead to batch-multiply a list of L matrices by an unknown subset of L matrices {B^(i,D)}_{i∈[L]} from a set B = {B^(i,j)}_{i∈[L],j∈[M]}, while keeping the request D private from the workers.
The master sends a query and a coded version of A of size (s/p) × (t/m) to each worker; then each worker encodes the matrices B into a coded submatrix of size (s/p) × (r/n) based on the query, the same as in private distributed matrix multiplication. We say a computing scheme for batch matrix multiplication is private if

I(D; Q_i, Ã_i, B) = 0    (8.24)

for any i ∈ [N], where A, B, D are sampled uniformly at random. Furthermore, we say the computing scheme is private and secure if we also have

I(Ã_i; A) = 0    (8.25)

for any i ∈ [N] when A is sampled uniformly at random. Finally, for fully private batch matrix multiplication, the goal is to batch-multiply L pairs of matrices given two lists of inputs A = {A^(i,j)}_{i∈[L],j∈[M]} and B = {B^(i,j)}_{i∈[L],j∈[M]}. The master aims to compute {A^(i,D)⊤ B^(i,D)}_{i∈[L]} given an index D, while keeping D private. The rest of the computation follows similarly to the fully private and the batch distributed matrix multiplication frameworks. Explicitly, we require that

I(D; Q_i, A, B) = 0    (8.26)

for any i ∈ [N], where A, B, D are sampled uniformly at random and Q_i denotes the query the master sends to worker i. The achievability schemes for all these settings can be built upon the coding ideas presented earlier in this paper. In particular, by first pre-encoding each of the input matrices using any upper bound construction of R(p,m,n), the task of batch-multiplying L pairs of matrices is reduced to computing the element-wise product of two vectors of length at most LR(p,m,n). Then observe that in the second steps of all the coding schemes we presented in earlier sections for non-batch matrix multiplication, we essentially provided linear codes that compute element-wise products of vectors of any length. By directly applying those designs to the extended pre-encoded vectors for batch multiplication, we obtain the computing schemes needed to prove Theorem 8.4, where the achieved recovery threshold upper bounds are stated by substituting LR(p,m,n) for R(p,m,n).
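The recovery thresholds listed in Theorem 8.4 all follow the same pattern after substituting the subadditivity bound R(L,p,m,n) ≤ L·R(p,m,n). A minimal helper tabulating them (the function name and dictionary keys are hypothetical; R is any rank upper bound on R(p,m,n)):

```python
def batch_thresholds(R, L, T=0):
    """Recovery thresholds from Theorem 8.4, using the subadditivity bound
    R(L,p,m,n) <= L * R(p,m,n); R is a rank upper bound on R(p,m,n)."""
    base = L * R  # stands in for R(L,p,m,n)
    return {
        "no security/privacy": 2 * base - 1,
        "one-sided T-security": 2 * base + T - 1,
        "fully T-security": 2 * base + 2 * T - 1,
        "privacy of request": 2 * base,
        "security and privacy": 2 * base + 1,
        "full privacy": 2 * base + 1,
    }
```

For instance, with Strassen's rank R = 7 for p = m = n = 2 and a batch of L = 2, the basic threshold is 2·2·7 − 1 = 27, well below the 2·Lpmn = 32 workers a cubic construction would need just to mirror the same comparison.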
Part III
Optimal Codes for Communication

Chapter 9
Fundamental Limits of Communication for Coded Distributed Computing

In recent years, distributed systems like Apache Spark [30] and computational primitives like MapReduce [29], Dryad [128], and CIEL [129] have gained significant traction, as they enable the execution of production-scale computation tasks on data sizes on the order of tens of terabytes and more. The design of these modern distributed computing platforms is driven by scaling out computations across clusters consisting of as many as tens or hundreds of thousands of machines. As a result, there is an abundance of computing resources that can be utilized for distributed processing of computation tasks. However, as we allocate more and more computing resources to a computation task and further distribute the computations, a large amount of (partially) computed data must be moved between consecutive stages of the computation among the nodes, and hence the communication load can become the bottleneck. In this chapter, we study a recently proposed coding framework for distributed computing, namely Coded Distributed Computing [23], which allows for designing computation and communication schemes that optimally trade computation load for communication load in distributed computing. Formally, we consider a general MapReduce-type framework for distributed computing (see, e.g., [29,30]), in which the overall computation is decomposed into three stages, Map, Shuffle, and Reduce, that are executed distributedly across several computing nodes. In the Map phase, each input file is processed locally, at one (or more) of the nodes, to generate intermediate values. In the Shuffle phase, for every output function to be calculated, all intermediate values corresponding to that function are transferred to one of the nodes for reduction.

This chapter is based on [23,96].
Finally, in the Reduce phase, all intermediate values of a function are reduced to the final result. In Coded Distributed Computing, we allow redundant execution of Map tasks at the nodes, since it can result in significant reductions in the data shuffling load by enabling in-network coding. In fact, in [22,23] it has been shown that by assigning the computation of each Map task to r carefully chosen nodes, we can enable novel coding opportunities that reduce the communication load by exactly a multiplicative factor equal to the computation load r. For example, the communication load can be reduced by more than 50% when each Map task is computed at only one other node (i.e., r = 2). We develop information-theoretic converse bounds to prove the exact optimality of this tradeoff, which establishes a fundamental inversely proportional relationship between computation and communication. To do so, we first derive a lower bound on the communication load that applies to the most general setup, where each intermediate value can be stored at, and requested by, any subset of nodes.

9.1 Problem Formulation

We consider a problem of computing Q output functions from N input files, for some system parameters Q, N ∈ ℕ. More specifically, given N input files w_1, ..., w_N ∈ F_{2^F}, for some F ∈ ℕ, the goal is to compute Q output functions φ_1, ..., φ_Q, where φ_q : (F_{2^F})^N → F_{2^B}, q ∈ {1, ..., Q}, maps all input files to a B-bit output value u_q = φ_q(w_1, ..., w_N) ∈ F_{2^B}, for some B ∈ ℕ. We employ a MapReduce-type distributed computing structure and decompose the computation of the output function φ_q, q ∈ {1, ..., Q}, as follows:

φ_q(w_1, ..., w_N) = h_q(g_{q,1}(w_1), ..., g_{q,N}(w_N)),    (9.1)

where, as illustrated in Fig. 9.1:
• The "Map" functions g_n = (g_{1,n}, ..., g_{Q,n}) : F_{2^F} → (F_{2^T})^Q, n ∈ {1, ..., N}, map the input file w_n into Q length-T intermediate values v_{q,n} = g_{q,n}(w_n) ∈ F_{2^T}, q ∈ {1, ..., Q}, for some T ∈ ℕ.
• The "Reduce" functions h_q : (F_{2^T})^N → F_{2^B}, q ∈ {1, ..., Q}, map the intermediate values of the output function φ_q across all input files into the output value u_q = h_q(v_{q,1}, ..., v_{q,N}) = φ_q(w_1, ..., w_N).

We perform the above computation using K distributed computing servers, labelled Server 1, ..., Server K. The K servers carry out the computation in three phases: Map, Shuffle, and Reduce.

Figure 9.1: Illustration of a two-stage distributed computing framework. The overall computation is decomposed into computing a set of Map and Reduce functions.

Map Phase. In the Map phase, each server maps a subset of the input files. For each k ∈ {1, ..., K}, we denote the indices of the files mapped by Server k by M_k, which is a design parameter. Each file is mapped by at least one server, i.e., ∪_{k=1,...,K} M_k = {1, ..., N}. For each n ∈ M_k, Server k computes the Map function g_n(w_n) = (v_{1,n}, ..., v_{Q,n}).

Definition 9.1 (Peak Computation Load). We define the peak computation load, denoted by p, 0 ≤ p ≤ 1, as the maximum number of files mapped at one server, normalized by the number of files N, i.e., p ≜ max_{k=1,...,K} |M_k| / N.

Shuffle Phase. We assign the tasks of computing the Q output functions across the K servers, and denote the indices of the output functions computed by Server k, k = 1, ..., K, by W_k. Each output function is computed exactly once at some server, i.e., 1) ∪_{k=1,...,K} W_k = {1, ..., Q}, and 2) W_j ∩ W_k = ∅ for j ≠ k. To compute the output value u_q for some q ∈ W_k, Server k needs the intermediate values that are not computed locally in the Map phase, i.e., {v_{q,n} : q ∈ W_k, n ∉ M_k}. After the Map phase, the K servers proceed to exchange the needed intermediate values for reduction.
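As a concrete instance of such a coded exchange, consider K = 3 servers, N = 6 files each mapped at r = 2 servers, and Q = 3 functions. The file and function assignments below are chosen for illustration only (they are not taken from the text): three XOR multicasts deliver all six missing intermediate values, instead of six uncoded unicasts.

```python
import random

random.seed(0)
# Intermediate values v[q][n] for Q = 3 functions and N = 6 files (32-bit strings).
v = [[random.getrandbits(32) for _ in range(6)] for _ in range(3)]

# Map assignment (0-indexed): each file is mapped at r = 2 servers.
M = [{0, 1, 2, 3}, {0, 1, 4, 5}, {2, 3, 4, 5}]
# Reduce assignment: server k computes function k, so it misses v[k][n] for n not in M[k].

# Each server multicasts one XOR of two values it computed locally; each of the
# other two servers cancels the term it already knows and recovers its own value.
msg0 = v[1][2] ^ v[2][0]   # server 0 serves servers 1 and 2 (files 2 and 0 are in M[0])
msg1 = v[2][1] ^ v[0][4]   # server 1 serves servers 2 and 0 (files 1 and 4 are in M[1])
msg2 = v[0][5] ^ v[1][3]   # server 2 serves servers 0 and 1 (files 5 and 3 are in M[2])

# Server 0 recovers its two missing values:
rec_0_4 = msg1 ^ v[2][1]   # file 1 is in M[0], so v[2][1] is known locally
rec_0_5 = msg2 ^ v[1][3]   # file 3 is in M[0], so v[1][3] is known locally
# Server 1 recovers v[1][2] and v[1][3]; server 2 recovers v[2][0] and v[2][1]:
rec_1_2 = msg0 ^ v[2][0]
rec_1_3 = msg2 ^ v[0][5]
rec_2_0 = msg0 ^ v[1][2]
rec_2_1 = msg1 ^ v[0][4]
```

Here 3 multicast messages replace 6 unicasts, i.e., the shuffle load drops from (1 − p) = 1/3 to (1 − p)/(Kp) = 1/6 at p = r/K = 2/3, matching the factor-of-r reduction described above.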
We formally define a shuffling scheme as follows:
• Each server k, k ∈ {1, ..., K}, creates a message X_k as a function of the intermediate values computed locally in the Map phase, i.e., X_k = ψ_k({g_n : n ∈ M_k}), and multicasts it to a subset of 1 ≤ j ≤ K − 1 nodes.

Definition 9.2 (Communication Load). We define the communication load, denoted by L, 0 ≤ L ≤ 1, as the total number of bits communicated by all servers in the Shuffle phase, normalized by QNT (which equals the total number of bits in all intermediate values {v_{q,n} : q ∈ {1, ..., Q}, n ∈ {1, ..., N}}).

Footnote 1: In this paper, we assume that the cost of multicasting to multiple servers is the same as unicasting to one server.

Reduce Phase. Server k, k ∈ {1, ..., K}, uses the local Map results {g_n : n ∈ M_k} and the messages X_1, ..., X_K received in the Shuffle phase to construct the inputs to the assigned Reduce functions in W_k, and computes the output value u_q = h_q(v_{q,1}, ..., v_{q,N}) for all q ∈ W_k.

Consider a computing task with parameters Q and N. Given a certain number of servers K, a Map task assignment M, and a Reduce task assignment W on these servers, we say a shuffling scheme is valid if, for any possible outcomes of the intermediate values v_{q,n}, each server can decode all its needed intermediate values based on the values it computed locally in the Map phase and the messages received during the Shuffle phase. In this setting, we are interested in designing distributed computing schemes, which include the selection of K, the assignment of the Map tasks M ≜ (M_1, ..., M_K), the assignment of the Reduce tasks W ≜ (W_1, ..., W_K), and the design of the data shuffling scheme, in order to minimize the communication load while satisfying given constraints [23-25,96] on computation.

9.2 Main Results

We first state the following key lemma, proved in Appendix E.1, which applies to the most general setting where each intermediate value can be stored at, and needed by, any subsets of nodes.
Lemma 9.1 (Converse Bound for Communication Load). Consider a distributed computing task. For any integers s, d, let a_{s,d} denote the number of intermediate values that are available at s nodes, and required by (but not available at) d nodes. The following lower bound on the communication load holds:

L ≥ (1 / QN) · Σ_{s=1}^{K} Σ_{d=1}^{K−s} a_{s,d} · d / (s + d − 1).    (9.2)

Remark 9.1. Several bounding techniques have been proposed for coded distributed computing and coded caching with uncoded prefetching [11,23–25,130,131]. All of them can be derived as special cases of the above simple lemma.

Remark 9.2. Although we assume that each server sends its message independently during the Shuffle phase, the above lemma directly generalizes to computing models where the data shuffling process is carried out in multiple rounds and dependency between messages is allowed. We can prove that even when multi-round communication is allowed, exactly the same lower bound stated in Lemma 9.1 still holds. Consequently, requiring the servers to communicate independently does not increase the total communication load.

Now we illustrate how this key lemma is applied to derive matching lower bounds on the communication load. We consider an example setup formulated in [23], where each Reduce function is assigned to one server and the Reduce jobs are evenly distributed over all servers. Under these assumptions, we characterize the optimal tradeoff between the peak computation load and the communication load, which provides the converse proof for the following theorem.²

Theorem 9.1 (Theorem 1, [23]). In a coded MapReduce framework where each worker is assigned at most a fraction p of the files in the Map phase and a disjoint subset of Q/K Reduce functions, the optimum communication load, denoted L*(p), is given by

L*(p) = L_coded(p) ≜ (1 / (Kp)) · (1 − p),   p ∈ {1/K, 2/K, ..., 1},    (9.3)

for large N. For general 1/K ≤ p ≤ 1, L*(p) is the lower convex envelope of the points {(p, (1/(Kp)) · (1 − p)) : p ∈ {1/K, 2/K, ..., 1}}.
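The bound in Lemma 9.1 is easy to evaluate numerically. The sketch below (our own code, not the dissertation's) specializes it to the setting of Theorem 9.1, where every file is mapped by exactly r = pK servers and the Q Reduce functions are split evenly, and checks that it recovers L_coded(p) = (1 − p)/(Kp):

```python
def lemma_9_1_bound(a, Q, N, K):
    """a[(s, d)] = number of intermediate values available at s nodes and
    required by (but not available at) d other nodes."""
    return sum(cnt * d / (s + d - 1) for (s, d), cnt in a.items()) / (Q * N)

def coded_mapreduce_bound(K, r, Q, N):
    # With every file mapped by exactly r servers and Q/K Reduce functions
    # per server, each file contributes Q*(K - r)/K intermediate values that
    # are available at s = r nodes and needed by d = 1 node.
    a = {(r, 1): N * Q * (K - r) // K}
    return lemma_9_1_bound(a, Q, N, K)

K, Q, N = 10, 10, 100
for r in range(1, K + 1):
    p = r / K
    # Lemma 9.1 reproduces (1 - p)/(Kp) from Theorem 9.1 for every p = r/K.
    assert abs(coded_mapreduce_bound(K, r, Q, N) - (1 - p) / (K * p)) < 1e-12
```

Plugging s = r and d = 1 into the lemma gives the weight 1/r per value, which is exactly the algebra carried out in the converse proof of Section 9.3.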
9.3 Converse of Theorem 9.1

We prove the lower bound on L*(p) in Theorem 9.1. Recall that for k ∈ {1,...,K}, we denote the set of indices of the files mapped by Node k by M_k. As the first step, we consider the communication load for a given file assignment M ≜ (M_1, M_2, ..., M_K) in the Map phase. We denote the minimum communication load under the file assignment M by L*_M. We denote the number of files that are mapped at exactly j nodes under a file assignment M by a^j_M, for all j ∈ {1,...,K}:

a^j_M = Σ_{J ⊆ {1,...,K} : |J| = j} |(∩_{k∈J} M_k) \ (∪_{i∉J} M_i)|.    (9.4)

For each file that is mapped by s = j nodes, a fraction (K − j)/K of its intermediate values are each needed by d = 1 server. We apply Lemma 9.1 and obtain a lower bound on L*_M as follows:

L*_M ≥ Σ_{j=1}^{K} (a^j_M / N) · (K − j)/(Kj).    (9.5)

It is clear that the minimum communication load L*(p) is lower bounded by the minimum value of L*_M over all possible file assignments with a peak computation load no greater than p:

L*(p) ≥ inf_{M : max{|M_1|,...,|M_K|} ≤ pN} L*_M ≥ inf_{M : |M_1|+...+|M_K| ≤ pKN} L*_M.    (9.6)

Then we have

L*(p) ≥ inf_{M : |M_1|+...+|M_K| ≤ pKN} Σ_{j=1}^{K} (a^j_M / N) · (K − j)/(Kj).    (9.7)

For every valid file assignment M such that |M_1| + ... + |M_K| ≤ pKN, the coefficients {a^j_M}_{j=1}^{K} satisfy

a^j_M ≥ 0,  j ∈ {1,...,K},    (9.8)

Σ_{j=1}^{K} a^j_M = N,    (9.9)

Σ_{j=1}^{K} j · a^j_M ≤ pKN.    (9.10)

Then, since the function (K − j)/(Kj) in (9.7) is convex and non-increasing in j, and since Σ_{j=1}^{K} a^j_M / N = 1 by (9.9), (9.7) becomes

L*(p) ≥ inf_{M : |M_1|+...+|M_K| ≤ pKN} ( K − Σ_{j=1}^{K} j·(a^j_M / N) ) / ( K · Σ_{j=1}^{K} j·(a^j_M / N) ) (a)= (1 − p)/(Kp),    (9.11)

where (a) is due to the requirement imposed by the computation load in (9.10). The lower bound on L*(p) in (9.11) holds for general 1/K ≤ p ≤ 1. We can further improve the lower bound for non-integer values of r ≜ pK, as follows.

² A specialized version of Lemma 9.1 was presented in [23] to prove the needed lower bound. Here we directly proceed from the generalized version.
For a particular r ∉ ℕ, we first find the line c + dj, as a function of j ∈ [1, K], connecting the two points (⌊r⌋, (K − ⌊r⌋)/(K⌊r⌋)) and (⌈r⌉, (K − ⌈r⌉)/(K⌈r⌉)). More specifically, we find c, d ∈ ℝ such that

c + dj |_{j=⌊r⌋} = (K − ⌊r⌋)/(K⌊r⌋),    (9.12)

c + dj |_{j=⌈r⌉} = (K − ⌈r⌉)/(K⌈r⌉).    (9.13)

Then, by the convexity of the function (K − j)/(Kj) in j, we have, for integer-valued j,

(K − j)/(Kj) ≥ c + dj,  j = 1, ..., K.    (9.14)

Then (9.7) reduces to

L*(r) ≥ inf_{M : |M_1|+...+|M_K| ≤ pKN} Σ_{j=1}^{K} (a^j_M / N) · (c + dj)    (9.15)

= inf_{M : |M_1|+...+|M_K| ≤ pKN} [ c · Σ_{j=1}^{K} a^j_M / N + d · Σ_{j=1}^{K} j·(a^j_M / N) ]    (9.16)

(b)= c + dr,    (9.17)

where (b) is due to the constraints on {a^j_M}_{j=1}^{K} in (9.9) and (9.10), and the fact that (K − j)/(Kj) is non-increasing (so that the slope d is non-positive and the infimum is attained at Σ_{j} j·a^j_M / N = r).

Therefore, L*(p) is lower bounded by the lower convex envelope of the points {(p, (1 − p)/(Kp)) : p ∈ {1/K, 2/K, ..., 1}}. This completes the proof of the converse part of Theorem 9.1.

9.4 Conclusion and Future Directions

The converse bounding technique presented in this chapter can be extended and applied to several interesting problems to improve the state-of-the-art results. For instance, we can consider a heterogeneous setting, where the processing speeds of the computing nodes vary significantly, and characterize the minimum overall latency achievable by any computing design. This technique can also be applied to communication problems where no computation is involved. Similar proof ideas will be utilized for coded caching in the following chapters.

Chapter 10

The Exact Rate-Memory Tradeoff for Caching with Uncoded Prefetching

Caching is a commonly used approach to reduce the traffic rate in a network system during peak-traffic times, by duplicating part of the content in the memories distributed across the network.
In its basic form, a caching system operates in two phases: (1) a placement phase, where each cache is populated up to its size, and (2) a delivery phase, where the users reveal their requests for content and the server has to deliver the requested content. During the delivery phase, the server exploits the content of the caches to reduce network traffic.

Conventionally, caching systems have been based on uncoded unicast delivery, where the objective is mainly to maximize the hit rate, i.e., the chance that the requested content can be delivered locally [132–139]. While this approach achieves optimal performance in systems with a single cache memory, it has recently been shown in [140] that the optimality no longer holds for multi-cache systems. In [140], an information-theoretic framework for multi-cache systems was introduced, and it was shown that coding can offer a significant gain that scales with the size of the network. Several coded caching schemes have been proposed since then [4,5,131,141–143].

The caching problem has also been extended in various directions, including decentralized caching [144], online caching [145], caching with nonuniform demands [146–149], hierarchical caching [150–152], device-to-device caching [153], cache-aided interference channels [154–158], caching on file selection networks [159–161], caching on broadcast channels [162–165], and caching for channels with delayed feedback and channel state information [166,167]. The same idea is also useful in the context of distributed computing, in order to take advantage of extra computation to reduce the communication load [22,24,25,96,168]. This chapter is based on [11].

Characterizing the exact rate-memory tradeoff in the above caching scenarios is an active line of research. Besides developing better achievability schemes, there have been efforts in tightening the outer bound of the rate-memory tradeoff [6–8,161,169,170].
Nevertheless, in almost all scenarios, there is still a gap between the state-of-the-art communication load and the converse, leaving the exact rate-memory tradeoff an open problem. In this paper, we focus on an important class of caching schemes, where the prefetching scheme is required to be uncoded. In fact, almost all caching schemes proposed for the above mentioned problems use uncoded prefetching. As a major advantage, uncoded prefetching allows us to handle asynchronous demands without increasing the communication rates, by dividing files into smaller subfiles [144]. Within this class of caching schemes, we characterize the exact rate-memory tradeoff for both the average rate for uniform file popularity and the peak rate, in both centralized and decentralized settings, for all possible parameter values. In particular, we first propose a novel caching strategy for the centralized setting (i.e., where the users can coordinate in designing the caching mechanism, as considered in [140]), which strictly improves the state of the art, reducing both the average rate and the peak rate. We exploit commonality among user demands by showing that the scheme in [140] may introduce redundancy in the delivery phase, and proposing a new scheme that effectively removes all such redundancies in a systematic way. In addition, we demonstrate the exact optimality of the proposed scheme through a matching converse. The main idea is to divide the set of all demands into smaller subsets (referred to as types), and derive tight lower bounds for the minimum peak rate and the minimum average rate on each type separately. We show that, when the prefetching is uncoded, the rate-memory tradeoff can be completely characterized using this technique, and the placement phase in the proposed caching scheme universally achieves those minimum rates on all types. 
Moreover, we extend the techniques we developed for the centralized caching problem to characterize the exact rate-memory tradeoff in the decentralized setting (i.e., where the users cache the contents independently without any coordination, as considered in [144]). Based on the proposed centralized caching scheme, we develop a new decentralized caching scheme that strictly improves the state of the art [143,144]. In addition, we formally define the framework of decentralized caching, and prove matching converses given the framework, showing that the proposed scheme is optimal. To summarize, the main results of this chapter are as follows:

• Characterizing the rate-memory tradeoff for average rate, by developing a novel caching design and proving a matching information-theoretic converse.

• Characterizing the rate-memory tradeoff for peak rate, by extending the achievability and converse proofs to account for the worst-case demands.

• Characterizing the rate-memory tradeoff for both average rate and peak rate in a decentralized setting, where the users cache the contents independently without coordination.

Furthermore, we will show in Chapter 11 that the presented achievability scheme also leads to the tightest known characterization (within a factor of 2) of the general problem with coded prefetching, for both average rate and peak rate, in both centralized and decentralized settings.

The problem of caching with uncoded prefetching was initiated in [130,131], which showed that the scheme in [140] is optimal when considering the peak rate and centralized caching, if there are at least as many files as users. Although not stated in [130,131], the converse bound in this chapter for the special case of peak rate in the centralized setting could also have been derived using their approach.
In this chapter, however, we present the novel idea of demand types, which allows us to go beyond and characterize the rate-memory tradeoff for both peak rate and average rate for all possible parameter values, in both centralized and decentralized settings. Our result covers the peak-rate centralized setting, and strictly improves the bounds in all other cases. More importantly, we introduce a new achievability scheme, which strictly improves the scheme in [140].

The rest of this chapter is organized as follows. Section 10.1 formally establishes a centralized caching framework, and defines the main problem studied in this chapter. Section 10.2 summarizes the main result of this chapter for the centralized setting. Section 10.3 describes and demonstrates the optimal centralized caching scheme that achieves the minimum expected rate and the minimum peak rate. Section 10.4 proves matching converses that show the optimality of the proposed centralized caching scheme. Section 10.5 extends the techniques we developed for the centralized caching problem to characterize the exact rate-memory tradeoff in the decentralized setting.

10.1 System Model and Problem Definition

In this section, we formally introduce the system model for the centralized caching problem. Then, we define the rate-memory tradeoff based on the introduced framework, and state the main problem studied in this chapter.

10.1.1 System Model

We consider a system with one server connected to K users through a shared, error-free link (see Fig. 10.1). The server has access to a database of N files W_1, ..., W_N, each of size F bits.¹ We denote the j-th bit in file i by B_{i,j}, and we assume that all bits in the database are i.i.d. Bernoulli random variables with p = 0.5. Each user has an isolated cache memory of size MF bits, where M ∈ [0, N]. For convenience, we define the parameter t = KM/N.

Figure 10.1: Caching system considered in this chapter.
The figure illustrates the case where K = N = 3 and M = 1.

The system operates in two phases, a placement phase and a delivery phase. In the placement phase, users are given access to the entire database, and each user can fill their cache using the database. However, instead of allowing coding in prefetching [140], we focus on an important class of prefetching schemes, referred to as uncoded prefetching schemes:

Definition 10.1. An uncoded prefetching scheme is one where each user k selects no more than MF bits from the database and stores them in its own cache, without coding. Let M_k denote the set of indices of the bits chosen by user k; then we denote the prefetching by M = (M_1, ..., M_K).

In the delivery phase, only the server has access to the database. Each user k requests one of the files in the database. To characterize user requests, we define the demand d = (d_1, ..., d_K), where d_k is the index of the file requested by user k. We denote the number of distinct requested files in d by N_e(d), and denote the set of all possible demands by D, i.e., D = {1,...,N}^K.

The server is informed of the demand and proceeds by generating a signal X of size RF bits as a function of W_1, ..., W_N, and transmits the signal over the shared link. R is a fixed real number given the demand d.

¹ Although we only focus on binary files, the same techniques developed in this chapter can also be used for the cases of q-ary files and files using a mixture of different alphabets, to prove that the same rate-memory tradeoff holds.

The values RF and R are referred to as the load and the rate of the shared
Given a prefetchingM = (M 1 ,...,M K ), we say a communication rate R is -achievable for demandd if and only if there exists a message X of length RF such that every active user k is able to recover its desired file W d k with a probability of error of at most . This is rigorously defined as follows: Definition 10.2. R is -achievable given a prefetchingM and a demandd if and only if we can find an encoding function ψ :{0, 1} NF →{0, 1} RF that maps the N files to the message: X =ψ(W 1 ,...,W N ), and K decoding functions μ k :{0, 1} RF ×{0, 1} |M k | →{0, 1} F that each map the signal X and the cached content of user k to an estimate of the requested file W d k , denoted by ˆ W d,k : ˆ W d,k =μ k (X,{B i,j | (i,j)∈M k }), such that P( ˆ W d,k 6=W d k )≤. We denote R ∗ (d,M) as the minimum -achievable rate givend andM. Assuming that all users are making requests independently, and that all files are equally likely to be requested by each user, the probability distribution of the demandd is uniform onD. We define the average rate R ∗ (M) as the expected minimum achievable rate given a prefetchingM under uniformly random demand, i.e., R ∗ (M) =E d [R ∗ (d,M)]. The rate-memory tradeoff for the average rate is essentially finding the minimum average rate R ∗ , for any given memory constraint M, that can be achieved by prefetchings satisfying this constraint with vanishing error probability for sufficiently large file size. Rigorously, we want to find R ∗ = sup >0 lim sup F→+∞ min M R ∗ (M). 118 as a function of N, K, and M. Similarly, the rate-memory tradeoff for peak rate is essentially finding the minimum peak rate, denoted by R ∗ peak , which is formally defined in Appendix F.2. 10.2 Main Results We state the main result of this chapter in the following theorem. Theorem 10.1. 
For a caching problem with K users, a database of N files, a local cache size of M files at each user, and parameter t = KM/N, we have

R* = E_d[ ( C(K, t+1) − C(K − N_e(d), t+1) ) / C(K, t) ],    (10.1)

for t ∈ {0, 1, ..., K}, where C(n, k) denotes the binomial coefficient,² d is uniformly random on D = {1,...,N}^K, and N_e(d) denotes the number of distinct requests in d. Furthermore, for t ∉ {0, 1, ..., K}, R* equals the lower convex envelope of its values at t ∈ {0, 1, ..., K}.

Remark 10.1. To prove Theorem 10.1, we propose a new caching scheme that strictly improves the state of the art [140], which was relied on by all prior works considering the minimum average rate for the caching problem [146–148,161]. In particular, the rate achieved by the previous best known caching scheme equals the lower convex envelope of min{(K − t)/(t + 1), E_d[N_e(d)(1 − t/K)]} at t ∈ {0, 1, ..., K}, which is strictly larger than R* when N > 1 and t < K − 1. For example, when K = 30, N = 30, and t = 1, the state-of-the-art scheme requires a communication rate of 14.12, while the proposed scheme achieves the rate 12.67, both rounded to two decimal places.

The improvement of our proposed scheme over the state of the art can be interpreted intuitively as follows. The caching scheme proposed in [140] essentially decomposes the problem into two cases: in one case, the redundancy of user demands is ignored, and the information is delivered by satisfying different demands using a single coded multicast transmission; in the other case, random coding is used to deliver the same request to multiple receivers. Our result demonstrates that the decomposition of the caching problem into these two cases is suboptimal, and our proposed caching scheme precisely accounts for the effect of redundant user demands.

Remark 10.2. The technique for finding the minimum average rate in the centralized setting can be straightforwardly extended to find the minimum peak rate, which was previously solved for N ≥ K [130].
Here we show that we not only recover their result, but also fully characterize the rate for all possible values of N and K, resulting in the following corollary, which is proved in Appendix F.2.

² We define C(n, k) = 0 when k > n.

Corollary 10.1. For a caching problem with K users, a database of N files, a local cache size of M files at each user, and parameter t = KM/N, we have

R*_peak = ( C(K, t+1) − C(K − min{K, N}, t+1) ) / C(K, t)    (10.2)

for t ∈ {0, 1, ..., K}. Furthermore, for t ∉ {0, 1, ..., K}, R*_peak equals the lower convex envelope of its values at t ∈ {0, 1, ..., K}.

Remark 10.3. As we will discuss in Section 10.5, we can also extend the techniques that we developed for proving Theorem 10.1 to the decentralized setting. The exact rate-memory tradeoff for both the average rate and the peak rate can be fully characterized using these techniques. Moreover, the newly proposed decentralized caching scheme for achieving the minimum rates strictly improves the state of the art [143,144].

Remark 10.4. Prior to this result, there have been several other works on this coded caching problem. Both centralized and decentralized settings have been considered, and many caching schemes using uncoded prefetching were proposed. Several caching schemes have been proposed focusing on minimizing the average communication rate [146–149]. However, in the case of uniform file popularity, the achievable rates provided in these works reduce to the results of [140] or [144], while our proposal strictly improves the state of the art in both [140] and [144] by developing a novel delivery strategy that exploits the commonality of the user demands. There have also been several proposed schemes that aim to minimize the peak rate [131,143]. The main novelty of our work compared to their results is that we not only propose an optimal design that strictly improves upon all these works through a leader-based strategy, but also provide an intuitive proof for its decodability.
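Theorem 10.1 and Corollary 10.1 can be evaluated exactly for integer t. The sketch below (our own code, not the dissertation's) computes the distribution of N_e(d) by counting surjections via inclusion-exclusion, reproduces the K = N = 30, t = 1 example from Remark 10.1, and checks that the peak-rate formula reduces to the classic (K − t)/(t + 1) when N ≥ K:

```python
from math import comb
from fractions import Fraction

def peak_rate(K, N, t):
    """Corollary 10.1 for integer t (comb() returns 0 when k > n)."""
    return Fraction(comb(K, t + 1) - comb(K - min(K, N), t + 1), comb(K, t))

def avg_rate(K, N, t):
    """Theorem 10.1 for integer t: expectation over Ne(d) for uniform d."""
    def surj(k, m):
        # Number of surjections from k users onto m files (inclusion-exclusion).
        return sum((-1) ** i * comb(m, i) * (m - i) ** k for i in range(m + 1))
    total = Fraction(0)
    for m in range(1, min(K, N) + 1):
        p = Fraction(comb(N, m) * surj(K, m), N ** K)   # P(Ne(d) = m)
        total += p * Fraction(comb(K, t + 1) - comb(K - m, t + 1), comb(K, t))
    return total

# For N >= K the peak rate reduces to (K - t)/(t + 1).
assert peak_rate(30, 30, 1) == Fraction(30 - 1, 1 + 1)
# The K = N = 30, t = 1 example from Remark 10.1: average rate ~ 12.67.
assert abs(float(avg_rate(30, 30, 1)) - 12.67) < 0.01
```

The exact-rational arithmetic avoids any rounding in the surjection counts; only the final comparison with the two-decimal figure from Remark 10.1 uses a tolerance.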
The decodability proof is based on the observation that the caching schemes proposed in [140] and [144] may introduce redundancy in the delivery phase, while our proposed scheme provides a systematic way to optimally remove all the redundancy, which allows delivering the same amount of information at strictly improved communication rates.

Remark 10.5. We numerically compare our results with the state-of-the-art schemes and the converses for the centralized setting. As shown in Fig. 10.2, both the achievability scheme and the converse provided in our paper strictly improve the prior arts, for both average rate and peak rate. Similar results can be shown for the decentralized setting, and a numerical comparison is provided in Section 10.5.

Remark 10.6. There have also been several prior works considering caching designs with coded prefetching [4,5,140–142]. They focused on the centralized setting and showed that the peak communication rate achieved by uncoded prefetching schemes can be improved in some low-capacity regimes. Even taking coded prefetching schemes into account, our work strictly improves the prior art in most cases (see Section 10.6 for numerical results). More importantly, the caching schemes presented in this chapter are within a factor of 2 of optimal in the general coded prefetching setting, for both average and peak rates, in centralized and decentralized settings [9].

Figure 10.2: Numerical comparison between the optimal tradeoff and the state of the art for the centralized setting. Our results strictly improve the prior arts in both achievability and converse, for both average rate and peak rate. (a) Average rates for N = K = 30. For this scenario, the best communication rate stated in prior works is achieved by memory-sharing between the conventional uncoded scheme [140] and the Maddah-Ali–Niesen scheme [140]; the tightest prior converse bound in this scenario is provided by Wang et al. [169]. (b) Peak rates for N = 20, K = 40. For this scenario, the best communication rate stated in prior works is achieved by memory-sharing among the conventional uncoded scheme [140], the Maddah-Ali–Niesen scheme [140], and the Amiri et al. scheme [143]; the tightest prior converse bounds in this scenario were provided by [6–8].

In the following sections, we prove Theorem 10.1 by first describing a caching scheme that achieves the minimum average rate (see Section 10.3), and then deriving tight lower bounds on the expected rates for any uncoded prefetching scheme (see Section 10.4).

10.3 The Optimal Caching Scheme

In this section, we provide a caching scheme (i.e., a prefetching scheme and a delivery scheme) that achieves the rate R* stated in Theorem 10.1. Before introducing the proposed caching scheme, we demonstrate the main ideas of the proposed scheme through a motivating example.

10.3.1 Motivating Example

Consider a caching system with 3 files (denoted by A, B, and C), 6 users, and a cache size of 1 file for each user. To develop a caching scheme, we need to design an uncoded prefetching scheme, independent of the demands, and develop delivery strategies for each of the 3^6 possible demands.
For the prefetching strategy, we break file A into 15 subfiles of equal size, one for each 2-subset of the 6 users, and denote their values by A_{1,2}, A_{1,3}, A_{1,4}, A_{1,5}, A_{1,6}, A_{2,3}, A_{2,4}, A_{2,5}, A_{2,6}, A_{3,4}, A_{3,5}, A_{3,6}, A_{4,5}, A_{4,6}, and A_{5,6}. Each user k caches the subfiles whose index includes k; e.g., user 1 caches A_{1,2}, A_{1,3}, A_{1,4}, A_{1,5}, and A_{1,6}. The same goes for files B and C. This prefetching scheme was originally proposed in [140].

Given the above prefetching scheme, we now need to develop an optimal delivery strategy for each of the possible demands. In this subsection, we demonstrate the key idea of our proposed delivery scheme through a representative demand scenario, namely, one in which each file is requested by 2 users, as shown in Figure 10.3.

We first consider the subset of 3 users {1, 2, 3}. User 1 requires subfile A_{2,3}, which is only available at users 2 and 3. User 2 requires subfile A_{1,3}, which is only available at users 1 and 3. User 3 requires subfile B_{1,2}, which is only available at users 1 and 2. In other words, the three users would like to exchange the subfiles A_{2,3}, A_{1,3}, and B_{1,2}, which can be enabled by transmitting the message A_{2,3} ⊕ A_{1,3} ⊕ B_{1,2} over the shared link. Similarly, we can create and broadcast such a message for any subset A of 3 users, allowing the exchange of 3 subfiles among those users.

Figure 10.3: A caching system with 6 users, 3 files, a local cache size of 1 file at each user, and a demand where each file is requested by 2 users.

As a shorthand notation, we denote the corresponding message by
According to the delivery scheme proposed in [140], if we broadcast all 6 3 = 20 messages that could be created in this way, all users will be able to decode their requested files. However, in this paper we propose a delivery scheme where, instead of broadcasting all those 20 messages, only 19 of them are computed and broadcasted, omitting the messageY {2,4,6} . Specifically, we broadcast the following 19 values: Y {1,2,3} =B {1,2} ⊕A {1,3} ⊕A {2,3} Y {1,2,4} =B {1,2} ⊕A {1,4} ⊕A {2,4} Y {1,2,5} =C {1,2} ⊕A {1,5} ⊕A {2,5} Y {1,2,6} =C {1,2} ⊕A {1,6} ⊕A {2,6} Y {1,3,4} =B {1,3} ⊕B {1,4} ⊕A {3,4} Y {1,3,5} =C {1,3} ⊕B {1,5} ⊕A {3,5} Y {1,3,6} =C {1,3} ⊕B {1,6} ⊕A {3,6} Y {1,4,5} =C {1,4} ⊕B {1,5} ⊕A {4,5} Y {1,4,6} =C {1,4} ⊕B {1,6} ⊕A {4,6} Y {1,5,6} =C {1,5} ⊕C {1,6} ⊕A {5,6} Y {2,3,4} =B {2,3} ⊕B {2,4} ⊕A {3,4} Y {2,3,5} =C {2,3} ⊕B {2,5} ⊕A {3,5} Y {2,3,6} =C {2,3} ⊕B {2,6} ⊕A {3,6} Y {2,4,5} =C {2,4} ⊕B {2,5} ⊕A {4,5} Y {2,5,6} =C {2,5} ⊕C {2,6} ⊕A {5,6} Y {3,4,5} =C {3,4} ⊕B {3,5} ⊕B {4,5} Y {3,4,6} =C {3,4} ⊕B {3,6} ⊕B {4,6} 123 Y {3,5,6} =C {3,5} ⊕C {3,6} ⊕B {5,6} Y {4,5,6} =C {4,5} ⊕C {4,6} ⊕B {5,6} Surprisingly, even after taking out the extra message, all users are still able to decode the requested files. The reason is as follows: User 1 is able to decode file A, because every subfile A {i,j} that is not cached by user 1 can be computed with the help of Y {1,i,j} , which is directly broadcasted. The above is the same decoding procedure used in [140]. User 2 can easily decode all subfiles in A except A {4,6} in a similar way, although decoding A {4,6} is more challenging since the value Y {2,4,6} , which is needed in the above decoding procedure for decodingA {4,6} , is not directly broadcasted. However, user 2 can still decode A {4,6} by adding Y {1,4,6} , Y {1,4,5} , Y {1,3,6} , and Y {1,3,5} , which gives the binary sum of A {4,6} , A {4,5} , A {3,6} , and A {3,5} . 
Because A_{4,5}, A_{3,6}, and A_{3,5} are easily decodable, A_{4,6} can be obtained consequently. Due to symmetry, all other users can decode their requested files in the same manner. This completes the decoding tasks for the given demand.

10.3.2 General Schemes

Now we present a general caching scheme that achieves the rate R* stated in Theorem 10.1. We focus on presenting prefetching schemes and delivery schemes for t ∈ {0, 1, ..., K}, since for general t, the minimum rate R* can be achieved by memory sharing.

Remark 10.7. Note that the rates stated in equation (10.1) for t ∈ {0, 1, ..., K} form a convex sequence, and are consequently on their own lower convex envelope. Thus those rates cannot be further improved using memory sharing.

To prove the achievability of R*, we need to provide an optimal prefetching scheme M, an optimal delivery scheme for every possible user demand d whose average rate achieves R*, and a valid decoding algorithm for the users. The main idea of our proposed achievability scheme is to first design a prefetching scheme that enables multicast coding opportunities, and then, in the delivery phase, to optimally deliver the messages by effectively solving an index coding problem.

We consider the following optimal prefetching: we partition each file i into C(K, t) non-overlapping subfiles of approximately equal size. We assign the C(K, t) subfiles to the C(K, t) different subsets of {1,...,K} of size t, and denote the value of the subfile assigned to subset A by W_{i,A}. Given this partition, each user k caches all bits of all subfiles W_{i,A} such that k ∈ A. Because each user caches C(K−1, t−1) · N subfiles, and each subfile has F / C(K, t) bits, the caching load of each user equals NtF/K = MF bits, which satisfies the memory constraint. This prefetching was originally proposed in [140]. In the rest of this chapter, we refer to this prefetching as symmetric batch prefetching.
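The motivating example of Section 10.3.1 can be checked end-to-end with a short simulation. In the sketch below (our own code, not the dissertation's; subfiles are modeled as random 32-bit integers and the binary sum is bitwise XOR), we build the symmetric batch prefetching for K = 6, t = 2, broadcast only the messages Y_A with A ∩ {1, 3, 5} ≠ ∅, and verify both the message count C(6,3) − C(3,3) = 19 and user 2's recovery step for A_{4,6}:

```python
import random
from itertools import combinations

random.seed(0)
K, t = 6, 2
demand = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "C"}
leaders = {1, 3, 5}   # one user per distinct requested file

# Symmetric batch prefetching: one subfile of each file per t-subset of users.
subfile = {(f, s): random.getrandbits(32)
           for f in "ABC" for s in combinations(range(1, K + 1), t)}

def Y(A):
    """Coded message for a (t+1)-subset A: XOR over k in A of the subfile
    of user k's requested file indexed by A \\ {k}."""
    out = 0
    for k in A:
        out ^= subfile[(demand[k], tuple(sorted(set(A) - {k})))]
    return out

broadcast = {A: Y(A) for A in combinations(range(1, K + 1), t + 1)
             if set(A) & leaders}
assert len(broadcast) == 19       # C(6,3) - C(3,3): only Y_{2,4,6} is omitted

# User 2's workaround from the text: XORing four broadcast messages yields
# A_{4,6} + A_{4,5} + A_{3,6} + A_{3,5}; the last three are decodable, so
# A_{4,6} follows.
s = (broadcast[(1, 4, 6)] ^ broadcast[(1, 4, 5)]
     ^ broadcast[(1, 3, 6)] ^ broadcast[(1, 3, 5)])
expected = (subfile[("A", (4, 6))] ^ subfile[("A", (4, 5))]
            ^ subfile[("A", (3, 6))] ^ subfile[("A", (3, 5))])
assert s == expected
```

The four C- and B-subfiles cached by user 1 appear twice across the four messages and cancel under XOR, which is exactly the cancellation argued in the text.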
Given this prefetching (denoted by M_batch), our goal is to show that for any demand d, we can find a delivery scheme that achieves the following optimal rate with zero error probability:³

R*_{ε=0}(d, M_batch) = ( C(K, t+1) − C(K − N_e(d), t+1) ) / C(K, t).    (10.3)

Hence, by taking the expectation over the demand d, the rate R* stated in Theorem 10.1 can be achieved.

Remark 10.8. Note that, in the special case where all users request different files (i.e., N_e(d) = K), the above rate equals (K − t)/(t + 1), which can already be achieved by the delivery scheme proposed in [140]. Our proposed scheme aims to achieve this optimal rate in more general circumstances, where some users may share common demands.

Remark 10.9. Finding the minimum communication load given a prefetching M can be viewed as a special case of the index coding problem. Theorem 10.1 indicates the optimality of the delivery scheme given the symmetric batch prefetching, which implies that (10.3) gives the solution to a special class of non-symmetric index coding problems.

The optimal delivery scheme is designed as follows. For each demand d, recall that N_e(d) denotes the number of distinct files requested by all users. The server arbitrarily selects a subset of N_e(d) users, denoted by U = {u_1, ..., u_{N_e(d)}}, that request N_e(d) different files. We refer to these users as leaders.

Given an arbitrary subset A of t + 1 users, each user k ∈ A needs the subfile W_{d_k, A\{k}}, which is known by all other users in A. In other words, all users in the set A would like to exchange the subfiles W_{d_k, A\{k}} for all k ∈ A. This exchange can be carried out if the binary sum of all those subfiles, i.e., ⊕_{x∈A} W_{d_x, A\{x}}, is available from the broadcast message. To simplify the description of the delivery scheme, for each subset A of users we define the following shorthand notation:

Y_A = ⊕_{x∈A} W_{d_x, A\{x}}.    (10.4)

To achieve the rate stated in (10.3), the server only greedily broadcasts the binary sums that directly help at least one leader.
Rigorously, the server computes and broadcasts $Y_{\mathcal{A}}$ for all subsets $\mathcal{A}$ of size $t+1$ that satisfy $\mathcal{A}\cap\mathcal{U}\neq\emptyset$. (Rigorously, we prove equation (10.3) for $\binom{K}{t}\mid F$. In other cases, the resulting extra communication overhead is negligible for large $F$.) The length of the message equals $\binom{K}{t+1}-\binom{K-N_e(\mathbf{d})}{t+1}$ times the size of a subfile, which matches the stated rate.

We now prove that each user is able to decode its requested file upon receiving the messages. For any leader $k\in\mathcal{U}$ and any subfile $W_{d_k,\mathcal{A}}$ that is requested but not cached by user $k$, the message $Y_{\{k\}\cup\mathcal{A}}$ is directly available from the broadcast. Thus, $k$ is able to obtain all requested subfiles by decoding each subfile $W_{d_k,\mathcal{A}}$ from the message $Y_{\{k\}\cup\mathcal{A}}$ using the following equation, which directly follows from equation (10.4):

$$W_{d_k,\mathcal{A}} = Y_{\{k\}\cup\mathcal{A}} \oplus \Big(\oplus_{x\in\mathcal{A}} W_{d_x,\{k\}\cup\mathcal{A}\setminus\{x\}}\Big). \quad (10.5)$$

The decoding procedure for a non-leader user $k$ is less straightforward, because not all messages $Y_{\{k\}\cup\mathcal{A}}$ for the required subfiles $W_{d_k,\mathcal{A}}$ are directly broadcast. However, user $k$ can generate these messages from the received messages alone, and can thus decode all required subfiles. We prove this fact as follows. First we prove the following simple lemma:

Lemma 10.1. Given a demand $\mathbf{d}$ and a set of leaders $\mathcal{U}$, for any subset $\mathcal{B}\subseteq\{1,\dots,K\}$ that includes $\mathcal{U}$, let $\mathcal{V}_F$ be the family of all subsets $\mathcal{V}$ of $\mathcal{B}$ such that each requested file in $\mathbf{d}$ is requested by exactly one user in $\mathcal{V}$. The following equation holds, where each $Y_{\mathcal{B}\setminus\mathcal{V}}$ is defined in (10.4):

$$\oplus_{\mathcal{V}\in\mathcal{V}_F} Y_{\mathcal{B}\setminus\mathcal{V}} = 0. \quad (10.6)$$

Proof. To prove Lemma 10.1, we essentially need to show that, after expanding the LHS of equation (10.6) into a binary sum of subfiles using the definition in (10.4), each subfile is counted an even number of times. This ensures that the net sum equals 0. To rigorously prove this fact, we start with the following definition. For each $u\in\mathcal{U}$ we define $\mathcal{B}_u$ as

$$\mathcal{B}_u = \{x\in\mathcal{B}\mid d_x = d_u\}. \quad (10.7)$$
Then all sets $\mathcal{B}_u$ disjointly cover the set $\mathcal{B}$, and the following equations hold:

$$\oplus_{\mathcal{V}\in\mathcal{V}_F} Y_{\mathcal{B}\setminus\mathcal{V}} = \oplus_{\mathcal{V}\in\mathcal{V}_F}\oplus_{x\in\mathcal{B}\setminus\mathcal{V}} W_{d_x,\mathcal{B}\setminus(\mathcal{V}\cup\{x\})} \quad (10.8)$$
$$= \oplus_{u\in\mathcal{U}}\oplus_{\mathcal{V}\in\mathcal{V}_F}\oplus_{x\in(\mathcal{B}\setminus\mathcal{V})\cap\mathcal{B}_u} W_{d_u,\mathcal{B}\setminus(\mathcal{V}\cup\{x\})} \quad (10.9)$$
$$= \oplus_{u\in\mathcal{U}}\oplus_{\mathcal{V}\in\mathcal{V}_F}\oplus_{x\in\mathcal{B}_u\setminus\mathcal{V}} W_{d_u,\mathcal{B}\setminus(\mathcal{V}\cup\{x\})}. \quad (10.10)$$

For each $u\in\mathcal{U}$, we let $\mathcal{V}_u$ be the family of all subsets $\mathcal{V}'$ of $\mathcal{B}\setminus\mathcal{B}_u$ such that each requested file in $\mathbf{d}$, except $d_u$, is requested by exactly one user in $\mathcal{V}'$. Then $\mathcal{V}_F$ can be represented as follows:

$$\mathcal{V}_F = \{\{y\}\cup\mathcal{V}' \mid y\in\mathcal{B}_u,\ \mathcal{V}'\in\mathcal{V}_u\}. \quad (10.11)$$

Consequently, the following equations hold for each $u\in\mathcal{U}$:

$$\oplus_{\mathcal{V}\in\mathcal{V}_F}\oplus_{x\in\mathcal{B}_u\setminus\mathcal{V}} W_{d_u,\mathcal{B}\setminus(\mathcal{V}\cup\{x\})} = \oplus_{\mathcal{V}'\in\mathcal{V}_u}\oplus_{y\in\mathcal{B}_u}\oplus_{x\in\mathcal{B}_u\setminus\{y\}} W_{d_u,\mathcal{B}\setminus(\mathcal{V}'\cup\{x,y\})} \quad (10.12)$$
$$= \oplus_{\mathcal{V}'\in\mathcal{V}_u}\oplus_{(x,y)\in\mathcal{B}_u^2,\ x\neq y} W_{d_u,\mathcal{B}\setminus(\mathcal{V}'\cup\{x,y\})}. \quad (10.13)$$

Note that $W_{d_u,\mathcal{B}\setminus(\mathcal{V}'\cup\{x,y\})}$ and $W_{d_u,\mathcal{B}\setminus(\mathcal{V}'\cup\{y,x\})}$ are the same subfile. Hence, every subfile in the above sum is counted exactly twice, so the terms sum to 0. Consequently, the LHS of equation (10.6) also equals 0.

Consider any subset $\mathcal{A}$ of $t+1$ non-leader users. From Lemma 10.1, the message $Y_{\mathcal{A}}$ can be directly computed from the broadcast messages using the following equation, where $\mathcal{B}=\mathcal{A}\cup\mathcal{U}$:

$$Y_{\mathcal{A}} = \oplus_{\mathcal{V}\in\mathcal{V}_F\setminus\{\mathcal{U}\}} Y_{\mathcal{B}\setminus\mathcal{V}}, \quad (10.14)$$

given the fact that all messages on the RHS of the above equation are broadcast, because each $\mathcal{B}\setminus\mathcal{V}$ has a size of $t+1$ and contains at least one leader. Hence, each user $k$ can obtain the value $Y_{\mathcal{A}}$ for any subset $\mathcal{A}$ of $t+1$ users, and can subsequently decode its requested file as previously discussed.

Remark 10.10. An interesting follow-up direction is to find more efficient decoding algorithms for the proposed optimal caching scheme. The decoding algorithm proposed in this chapter imposes extra computation at the non-leader users, since they have to solve for the missing messages to recover all needed subfiles, although the required complexities for non-leader users can be made the same as that of the standard decoding approach for leader users (by simply reusing intermediate computing results, see Appendix F.5).
There are ideas one may explore to further improve this decoding strategy, e.g., designing a smarter approach for non-leader users instead of naively recovering all required messages before decoding the subfiles (see the decoding approach provided in the motivating example in Section 10.3.1).

10.4 Converse

In this section, we derive a tight lower bound on the minimum expected rate $R^*$, which shows the optimality of the caching scheme presented in this chapter. To derive the corresponding lower bound on the average rate over all demands, we divide the set $\mathcal{D}$ into smaller subsets, and lower bound the average rate within each subset individually. We refer to these smaller subsets as types, defined as follows.

[Figure 10.4: Dividing $\mathcal{D}$ into 5 types ($\mathcal{D}_{(4,0,0,0)}$, $\mathcal{D}_{(3,1,0,0)}$, $\mathcal{D}_{(2,2,0,0)}$, $\mathcal{D}_{(2,1,1,0)}$, and $\mathcal{D}_{(1,1,1,1)}$), for a caching problem with 4 files and 4 users.]

Given an arbitrary demand $\mathbf{d}$, we define its statistics, denoted by $\mathbf{s}(\mathbf{d})$, as a sorted array of length $N$ such that $s_i(\mathbf{d})$ equals the number of users that request the $i$th most requested file. We denote the set of all possible statistics by $\mathcal{S}$. Grouping by identical statistics, the set of all demands $\mathcal{D}$ can be broken into many small subsets. For any statistics $\mathbf{s}\in\mathcal{S}$, we define the type $\mathcal{D}_{\mathbf{s}}$ as the set of queries with statistics $\mathbf{s}$. For example, consider a caching problem with 4 files (denoted by A, B, C, and D) and 4 users. The statistics of the demand $\mathbf{d}=(A,A,B,C)$ equal $\mathbf{s}(\mathbf{d})=(2,1,1,0)$. More generally, the set of all possible statistics for this problem is $\mathcal{S}=\{(4,0,0,0), (3,1,0,0), (2,2,0,0), (2,1,1,0), (1,1,1,1)\}$, and $\mathcal{D}$ can be divided into 5 types accordingly, as shown in Fig. 10.4. Note that for each demand $\mathbf{d}$, the value $N_e(\mathbf{d})$ depends only on its statistics $\mathbf{s}(\mathbf{d})$, and is thus identical across all demands in $\mathcal{D}_{\mathbf{s}}$. For convenience, we denote that value by $N_e(\mathbf{s})$.
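The type decomposition is easy to reproduce computationally. The sketch below (with our own helper name `stats`) enumerates all $4^4$ demands of the 4-file, 4-user example and recovers the 5 types of Fig. 10.4:

```python
# Compute the statistics s(d) of each demand and group all demands of a
# 4-file, 4-user system into types, reproducing Fig. 10.4.
from itertools import product
from collections import Counter

N, K = 4, 4

def stats(d):
    """Sorted (descending) request counts, zero-padded to length N."""
    counts = sorted(Counter(d).values(), reverse=True)
    return tuple(counts + [0] * (N - len(counts)))

types = {stats(d) for d in product(range(1, N + 1), repeat=K)}
assert types == {(4, 0, 0, 0), (3, 1, 0, 0), (2, 2, 0, 0),
                 (2, 1, 1, 0), (1, 1, 1, 1)}

# N_e(d) depends only on s(d): it is the number of nonzero entries.
assert all(stats(d).count(0) == N - len(set(d))
           for d in product(range(1, N + 1), repeat=K))
```

For instance, `stats((1, 1, 2, 3))` returns `(2, 1, 1, 0)`, matching the demand $(A,A,B,C)$ in the text.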
(The notion of type was also recently introduced in [171] in order to simplify the LP for finding better converse bounds for the coded caching problem.)

Given a prefetching $\mathcal{M}$, we denote the average rate within each type $\mathcal{D}_{\mathbf{s}}$ by $R^*_{\epsilon}(\mathbf{s},\mathcal{M})$. Rigorously,

$$R^*_{\epsilon}(\mathbf{s},\mathcal{M}) = \frac{1}{|\mathcal{D}_{\mathbf{s}}|}\sum_{\mathbf{d}\in\mathcal{D}_{\mathbf{s}}} R^*_{\epsilon}(\mathbf{d},\mathcal{M}). \quad (10.15)$$

Recall that all demands are equally likely, so we have

$$R^* = \sup_{\epsilon>0}\limsup_{F\to+\infty}\min_{\mathcal{M}} \mathbb{E}_{\mathbf{s}}[R^*_{\epsilon}(\mathbf{s},\mathcal{M})] \quad (10.16)$$
$$\geq \sup_{\epsilon>0}\limsup_{F\to+\infty} \mathbb{E}_{\mathbf{s}}[\min_{\mathcal{M}} R^*_{\epsilon}(\mathbf{s},\mathcal{M})]. \quad (10.17)$$

Hence, in order to lower bound $R^*$, it is sufficient to bound the minimum value of $R^*_{\epsilon}(\mathbf{s},\mathcal{M})$ for each type $\mathcal{D}_{\mathbf{s}}$ individually. We show that, when the prefetching is uncoded, the minimum average rate within a type can be tightly bounded (when $F$ is large and $\epsilon$ is small), so the rate-memory tradeoff can be completely characterized using this technique. The lower bounds on the minimum average rates within each type are presented in the following lemma:

Lemma 10.2. Consider a caching problem with $N$ files, $K$ users, and a local cache size of $M$ files for each user. For any type $\mathcal{D}_{\mathbf{s}}$, the minimum value of $R^*_{\epsilon}(\mathbf{s},\mathcal{M})$ is lower bounded by

$$\min_{\mathcal{M}} R^*_{\epsilon}(\mathbf{s},\mathcal{M}) \geq \mathrm{Conv}\left(\frac{\binom{K}{t+1}-\binom{K-N_e(\mathbf{s})}{t+1}}{\binom{K}{t}}\right) - \left(\frac{1}{F}+\epsilon\, N_e^2(\mathbf{s})\right), \quad (10.18)$$

where $\mathrm{Conv}(f(t))$ denotes the lower convex envelope of the points $\{(t,f(t)) \mid t\in\{0,1,\dots,K\}\}$.

Remark 10.11 (Universal Optimality of Symmetric Batch Prefetching). The above lemma characterizes the minimum average rate for a given type $\mathcal{D}_{\mathbf{s}}$, if the prefetching $\mathcal{M}$ can be designed based on $\mathbf{s}$. However, for (10.17) to be tight, the average rates for all the different types have to be minimized by the same prefetching. Surprisingly, such an optimal prefetching exists, one example being the symmetric batch prefetching described in Section 10.3. This indicates that the symmetric batch prefetching is universally optimal for all types in terms of the average rates.

We postpone the proof of Lemma 10.2 to Appendix F.1 and first prove the converse using the lemma.
From (10.17) and Lemma 10.2, $R^*$ can be lower bounded as follows:

$$R^* \geq \sup_{\epsilon>0}\limsup_{F\to+\infty} \mathbb{E}_{\mathbf{s}}\left[\min_{\mathcal{M}} R^*_{\epsilon}(\mathbf{s},\mathcal{M})\right] \quad (10.19)$$
$$\geq \mathbb{E}_{\mathbf{s}}\left[\mathrm{Conv}\left(\frac{\binom{K}{t+1}-\binom{K-N_e(\mathbf{s})}{t+1}}{\binom{K}{t}}\right)\right]. \quad (10.20)$$

Because the sequence

$$c_n = \frac{\binom{K}{n+1}-\binom{K-N_e(\mathbf{s})}{n+1}}{\binom{K}{n}} \quad (10.21)$$

is convex, we can switch the order of the expectation and $\mathrm{Conv}$ in (10.20). Therefore, $R^*$ is lower bounded by the rate defined in Theorem 10.1.

10.5 Extension to the Decentralized Setting

In the sections above, we introduced a new centralized caching scheme and a new bounding technique that completely characterize the minimum average communication rate and the minimum peak rate when the prefetching is required to be uncoded. Interestingly, these techniques can also be extended to fully characterize the rate-memory tradeoff for decentralized caching. In this section, we formally establish a system model for decentralized caching systems, and state the exact rate-memory tradeoff as main results for both the average rate and the peak rate.

10.5.1 System Model and Problem Formulation

In many practical systems, out of the large number of users that may potentially request files from the server through the shared error-free link, only a random, unknown subset are connected to the link and making requests at any given time. To handle this situation, the concept of a decentralized prefetching scheme was introduced in [144], where each user fills its cache randomly and independently, based on the same probability distribution. The goal in the decentralized setting is to find a decentralized prefetching scheme, without knowledge of the number and identities of the users making requests, that minimizes the required communication rates in an arbitrarily large caching system. Based on the above framework, we formally define decentralized caching as follows:

Definition 10.3.
In a decentralized caching scheme, instead of following a deterministic caching scheme, each user $k$ caches a subset $\mathcal{M}_k$ of size no more than $MF$ bits randomly and independently, based on the same probability distribution, denoted by $P_{\mathcal{M}}$. Rigorously, when $K$ users are making requests, the probability distribution of the prefetching $\mathcal{M}$ is given by

$$P(\mathcal{M} = (\mathcal{M}_1,\dots,\mathcal{M}_K)) = \prod_{i=1}^{K} P_{\mathcal{M}}(\mathcal{M}_i).$$

We define a decentralized caching scheme, denoted by $P_{\mathcal{M};F}$, as a distribution parameterized by the file size $F$, which specifies the prefetching distribution $P_{\mathcal{M}}$ for all possible values of $F$.

(Recall, regarding the converse above: as noted in Remark 10.7, the rate $R^*$ stated in equation (10.1) for $t\in\{0,1,\dots,K\}$ is convex, so it is sufficient to prove that $R^*$ is lower bounded by the lower convex envelope of its values at $t\in\{0,1,\dots,K\}$.)

Similar to the centralized setting, when $K$ users are making requests, we say that a rate $R$ is $\epsilon$-achievable given a prefetching distribution $P_{\mathcal{M}}$ and a demand $\mathbf{d}$ if and only if there exists a message $X$ of length $RF$ such that every active user $k$ is able to recover its desired file $W_{d_k}$ with a probability of error of at most $\epsilon$. This is rigorously defined as follows:

Definition 10.4. When $K$ users are making requests, $R$ is $\epsilon$-achievable given a prefetching distribution $P_{\mathcal{M}}$ and a demand $\mathbf{d}$ if and only if, for every possible realization of the prefetching $\mathcal{M}$, we can find a real number $\epsilon_{\mathcal{M}}$ such that $R$ is $\epsilon_{\mathcal{M}}$-achievable given $\mathcal{M}$ and $\mathbf{d}$, and $\mathbb{E}[\epsilon_{\mathcal{M}}]\leq\epsilon$.

We denote by $R^*_{\epsilon,K}(\mathbf{d},P_{\mathcal{M}})$ the minimum $\epsilon$-achievable rate given $K$, $\mathbf{d}$, and $P_{\mathcal{M}}$, and we define the rate-memory tradeoff for the average rate based on this notation. For each $K\in\mathbb{N}$ and each prefetching scheme $P_{\mathcal{M};F}$, we define the minimum average rate $R^*_K(P_{\mathcal{M};F})$ as the minimum expected rate under uniformly random demands that can be achieved with vanishing error probability for sufficiently large file size. Specifically,

$$R^*_K(P_{\mathcal{M};F}) = \sup_{\epsilon>0}\limsup_{F'\to+\infty} \mathbb{E}_{\mathbf{d}}[R^*_{\epsilon,K}(\mathbf{d},P_{\mathcal{M};F}(F=F'))],$$

where the demand $\mathbf{d}$ is uniformly distributed on $\{1,\dots,N\}^K$.
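One concrete instance of such a distribution $P_{\mathcal{M}}$ (the one later shown optimal in Remark 10.12) has each user independently cache $\frac{MF}{N}$ uniformly chosen bits of every file. A minimal sketch, with illustrative parameters and our own names such as `sample_cache`:

```python
# Sketch of an i.i.d. decentralized prefetching distribution P_M: each
# user independently caches M*F/N uniformly chosen bits of every file.
import random

N, K, F, M = 3, 4, 30, 1          # chosen so M*F/N is an integer
bits_per_file = M * F // N

def sample_cache(rng):
    # One independent draw from P_M: a uniform subset of each file's bits.
    return [frozenset(rng.sample(range(F), bits_per_file))
            for _ in range(N)]

rng = random.Random(0)
prefetching = [sample_cache(rng) for _ in range(K)]  # i.i.d. across users

# Every realization respects the memory constraint of at most M*F bits.
for cache in prefetching:
    assert sum(len(s) for s in cache) <= M * F
```

Each user's cache is a fresh independent draw, matching the product form of $P(\mathcal{M}=(\mathcal{M}_1,\dots,\mathcal{M}_K))$ above.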
Given that a decentralized prefetching scheme is designed without knowledge of the number of active users $K$, we characterize the rate-memory tradeoff using an infinite-dimensional vector, denoted by $\{R_K\}_{K\in\mathbb{N}}$, where each term $R_K$ corresponds to the communication rate needed when $K$ users are making requests. We aim to find the region in this infinite-dimensional vector space that can be achieved by some decentralized prefetching scheme, and we denote this region by $\mathcal{R}$. Rigorously, we aim to find

$$\mathcal{R} = \bigcup_{P_{\mathcal{M};F}} \{\{R_K\}_{K\in\mathbb{N}} \mid \forall K\in\mathbb{N},\ R_K\geq R^*_K(P_{\mathcal{M};F})\},$$

which is a function of $N$ and $M$. Similarly, we define the rate-memory tradeoff for the peak rate as follows: For each $K\in\mathbb{N}$ and each prefetching scheme $P_{\mathcal{M};F}$, we define the minimum peak rate $R^*_{K,\text{peak}}(P_{\mathcal{M};F})$ as the minimum communication rate that can be achieved with vanishing error probability for sufficiently large file size, for the worst-case demand. Specifically,

$$R^*_{K,\text{peak}}(P_{\mathcal{M};F}) = \sup_{\epsilon>0}\limsup_{F'\to+\infty} \max_{\mathbf{d}\in\mathcal{D}} R^*_{\epsilon,K}(\mathbf{d},P_{\mathcal{M};F}(F=F')).$$

We aim to find the region in the infinite-dimensional vector space that can be achieved by any decentralized prefetching scheme in terms of the peak rate, and we denote this region by $\mathcal{R}_{\text{peak}}$. Rigorously, we aim to find

$$\mathcal{R}_{\text{peak}} = \bigcup_{P_{\mathcal{M};F}} \{\{R_K\}_{K\in\mathbb{N}} \mid \forall K\in\mathbb{N},\ R_K\geq R^*_{K,\text{peak}}(P_{\mathcal{M};F})\},$$

as a function of $N$ and $M$.

10.5.2 Exact Rate-Memory Tradeoff for the Decentralized Setting

The following theorem completely characterizes the rate-memory tradeoff for the average rate in the decentralized setting:

Theorem 10.2. For a decentralized caching problem with parameters $N$ and $M$, $\mathcal{R}$ is completely characterized by the following equation:

$$\mathcal{R} = \left\{\{R_K\}_{K\in\mathbb{N}} \,\middle|\, R_K \geq \mathbb{E}_{\mathbf{d}}\left[\frac{N-M}{M}\left(1-\left(\frac{N-M}{N}\right)^{N_e(\mathbf{d})}\right)\right]\right\}, \quad (10.22)$$

where the demand $\mathbf{d}$ given each $K$ is uniformly distributed on $\{1,\dots,N\}^K$ and $N_e(\mathbf{d})$ denotes the number of distinct requests in $\mathbf{d}$. (If $M=0$, then $\mathcal{R}=\{\{R_K\}_{K\in\mathbb{N}} \mid R_K\geq \mathbb{E}_{\mathbf{d}}[N_e(\mathbf{d})]\}$.)

The proof of the above theorem is provided in Appendix F.3.

Remark 10.12.
Theorem 10.2 demonstrates that $\mathcal{R}$ has a very simple shape with one dominating point: $\{R_K=\mathbb{E}_{\mathbf{d}}[\frac{N-M}{M}(1-(\frac{N-M}{N})^{N_e(\mathbf{d})})]\}_{K\in\mathbb{N}}$. In other words, we can find a decentralized prefetching scheme that simultaneously achieves the minimum expected rates for all possible numbers of active users. Therefore, there is no tension among the expected rates for different numbers of active users. In Appendix F.3, we show that one example of an optimal prefetching scheme is to let each user cache $\frac{MF}{N}$ bits of each file uniformly and independently.

Remark 10.13. To prove Theorem 10.2, we propose a decentralized caching scheme that strictly improves the state of the art [143,144] (see Appendix F.3.1), for both the average rate and the peak rate. In particular, for the average rate, the state-of-the-art scheme proposed in [144] achieves the rate $\frac{N-M}{N}\cdot\min\{\frac{N}{M}(1-(1-\frac{M}{N})^K),\ \mathbb{E}_{\mathbf{d}}[N_e(\mathbf{d})]\}$, which is strictly larger than the rate $\mathbb{E}_{\mathbf{d}}[\frac{N-M}{M}(1-(\frac{N-M}{N})^{N_e(\mathbf{d})})]$ achieved by our proposed scheme in most cases. Similarly, one can show that our scheme strictly improves [143]; we omit the details for brevity.

Remark 10.14. We also prove a matching information-theoretic outer bound on $\mathcal{R}$, by showing that the achievable rate of any decentralized caching scheme can be lower bounded by the achievable rate of a caching scheme with centralized prefetching, used in a system where a large number of users may potentially request a file, but only a subset of $K$ users are actually making requests. Interestingly, the tightness of this bound indicates that, in a system where the number of potential users is significantly larger than the number of active users, our proposed decentralized caching scheme is optimal, even compared to schemes where the users do not cache according to an i.i.d. distribution.
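The gap claimed in Remark 10.13 can be checked by exact enumeration over all demands for small parameters. A sketch (the parameters $N=3$, $M=1$, $K=4$ are our own illustrative choice):

```python
# Compare the optimal average rate of Theorem 10.2 with the prior
# decentralized rate quoted in Remark 10.13, by exact enumeration.
from itertools import product

N, M, K = 3, 1, 4

def N_e(d):
    return len(set(d))

demands = list(product(range(N), repeat=K))
avg = lambda f: sum(f(d) for d in demands) / len(demands)

ours = avg(lambda d: (N - M) / M * (1 - ((N - M) / N) ** N_e(d)))
prior = (N - M) / N * min(N / M * (1 - (1 - M / N) ** K),
                          avg(N_e))

assert ours < prior   # the proposed scheme is strictly better here
```

For these parameters `ours` is about 1.226 files while `prior` is about 1.605 files, consistent with the strict improvement stated above.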
Using the proposed decentralized caching scheme and the same converse bounding technique, the following corollary, which completely characterizes the rate-memory tradeoff for the peak rate in the decentralized setting, directly follows:

Corollary 10.2. For a decentralized caching problem with parameters $N$ and $M$, the achievable region $\mathcal{R}_{\text{peak}}$ is completely characterized by the following equation:

$$\mathcal{R}_{\text{peak}} = \left\{\{R_K\}_{K\in\mathbb{N}} \,\middle|\, R_K \geq \frac{N-M}{M}\left(1-\left(\frac{N-M}{N}\right)^{\min\{N,K\}}\right)\right\}. \quad (10.23)$$

(If $M=0$, then $\mathcal{R}_{\text{peak}}=\{\{R_K\}_{K\in\mathbb{N}} \mid R_K\geq \min\{N,K\}\}$.)

The proof of the above corollary is provided in Appendix F.4.

Remark 10.15. Corollary 10.2 demonstrates that $\mathcal{R}_{\text{peak}}$ has a very simple shape with one dominating point: $\{R_K=\frac{N-M}{M}(1-(\frac{N-M}{N})^{\min\{N,K\}})\}_{K\in\mathbb{N}}$. In other words, we can find a decentralized prefetching scheme that simultaneously achieves the minimum peak rates for all possible numbers of active users. Therefore, there is no tension among the peak rates for different numbers of active users. In Appendix F.4, we show that one example of an optimal prefetching scheme is to let each user cache $\frac{MF}{N}$ bits of each file uniformly and independently.

Remark 10.16. Similar to the average-rate case, a matching converse can be proved by deriving the minimum achievable rates of centralized caching schemes in a system where only a subset of the users are actually making requests. Consequently, in a caching system where the number of potential users is significantly larger than the number of active users, our proposed decentralized scheme is also optimal in terms of peak rate, even compared to schemes where the users do not cache according to an i.i.d. distribution.

[Figure 10.5(a): Average rates vs. local cache size $M$ for $N=K=30$, comparing the Maddah-Ali-Niesen scheme [144], the conventional uncoded scheme [140], memory sharing, the Wang et al. converse [169], and the proposed optimal tradeoff.]
For this scenario, the best communication rate stated in prior works is achieved by memory sharing between the conventional uncoded scheme [140] and the Maddah-Ali-Niesen scheme [144]. The tightest prior converse bound in this scenario is provided by [169].

[Figure 10.5(b): Peak rates vs. local cache size $M$ for $N=20$, $K=40$, comparing the Amiri et al. scheme [143], the Ghasemi-Ramamoorthy converse [6], the Sengupta et al. converse [7], the N. et al. converse [8], and the proposed optimal tradeoff.] For this scenario, the best communication rate stated in prior works is achieved by the Amiri et al. scheme [143]. The tightest prior converse bound in this scenario is provided by [6-8].

Figure 10.5: Numerical comparison between the optimal tradeoff and the state of the art for the decentralized setting. Our results strictly improve the prior art in both achievability and converse, for both average rate and peak rate.

Remark 10.17. We numerically compare our results with the state-of-the-art schemes and converses for the decentralized setting. As shown in Fig. 10.5, both the achievability scheme and the converse provided in our paper strictly improve the prior art, for both average rate and peak rate.

10.6 Concluding Remarks

[Figure 10.6: Achievable peak communication rates vs. local cache size $M$ for centralized schemes that allow coded prefetching. For $N=20$, $K=40$, we compare our proposed achievability scheme (optimal uncoded prefetching) with prior-art coded-prefetching schemes (Tian-Chen [4], Amiri-Gunduz [5]), prior-art converse bounds (Ghasemi-Ramamoorthy [6], Sengupta et al. [7], N. et al. [8]), and two recent results (Yu et al. converse [9], Gomez-Vilardebo scheme [10]).]
The achievability scheme presented in this chapter achieves the best performance to date in most cases, and is within a factor of 2 of optimal as shown in [9], even compared with schemes that allow coded prefetching.

In this chapter, we characterized the rate-memory tradeoff for the coded caching problem with uncoded prefetching. To that end, we proposed optimal caching schemes for both the centralized and the decentralized setting, and proved their exact optimality for both average rate and peak rate. The techniques presented in this chapter can be directly applied to many other problems, immediately improving their state of the art. For instance, the achievability scheme presented in this chapter has already been applied in various other settings, yielding improved results [9,172,173]. Beyond these works, the techniques can also be applied in directions such as online caching [145], caching with non-uniform demands [146], and hierarchical caching [151], where improvements can be obtained immediately by directly plugging in our results.

One interesting follow-up direction is to consider the coded caching problem with coded placement. In this scenario, it has been shown that in the centralized setting, coded prefetching schemes can achieve better peak communication rates. For example, Figure 10.6 shows that coded placement can improve the peak communication rate when the cache size is small. As we show in the following chapter, through a new converse bounding technique, the achievability scheme presented in this chapter is optimal within a factor of 2. However, finding the exact optimal solution in this regime remains an open problem.

Chapter 11

Characterizing the Rate-Memory Tradeoff in Cache Networks within a Factor of 2

In this chapter, we focus on the basic bottleneck caching network considered in [140] and presented in Chapter 10. For this case, the peak rate vs.
memory tradeoff (the tradeoff between the maximum $R$ over all possible user demands and $M$) was formulated and characterized within a factor of 12 in [140]. (This chapter is based on [9].) This caching framework has been extended to many scenarios, as mentioned in Chapter 10. Many of these extensions share similar ideas in terms of the achievability and converse bounds. Therefore, if we can improve the results for the basic bottleneck caching network, the ideas can be used to improve the results in the other cases as well.

In the literature, various approaches have been proposed to improve the bounds on the rate-memory tradeoff for the bottleneck network. When the prefetching is uncoded, the exact rate-memory tradeoff for both peak and average rate (under uniform file popularity), and for both centralized and decentralized settings, was established in Chapter 10. However, for the general case, where the cached content can be an arbitrary function of the files in the database, the exact characterization of the tradeoff remains open. In this case, the state of the art is an approximation within a factor of 4 for the peak rate [6] and 4.7 for the average rate under uniform file popularity [161].

In this chapter, we improve the approximation of the rate-memory tradeoff by proving new information-theoretic converse bounds, achieving a characterization within a factor of 2.00884 for both the peak rate and the average rate under uniform file popularity. These converse bounds hold in the general information-theoretic framework, in the sense that there is no constraint on the caching or delivery process; in particular, they are not limited to linear coding or uncoded prefetching. This improved characterization is approximately a two-fold improvement over the state of the art in the current literature [6,161].
Furthermore, for the practically important case where the number of files is large, we exactly characterize the rate-memory tradeoff for systems with no more than 5 users. In this case, we also characterize the rate-memory tradeoff within a factor of 2 for networks with an arbitrary number of users, slightly improving our factor-of-2.00884 characterization of the general case. In prior works, despite various attempts, this tradeoff had only been exactly characterized in two instances: the single-user case [140] and, more recently, the two-user case [170].

To prove these results, we develop two new converse bounds for cache networks. The first converse is based on the idea of enhancing the cutset bound to effectively capture the isolation of the cache contents of users that belong to the same side of the cut. This approach strictly improves the compound cutset bound, which was used in most prior works. Furthermore, using this converse, we are able to characterize both the peak rate and the average rate within a factor of 2.00884. To prove this result, we essentially demonstrate that our new converse is within a factor of 2.00884 of the achievable scheme developed in [21], for all possible parameter values.

Moreover, we develop a second converse bound, which is proved by carefully dividing the set of all user demands into certain subsets and lower bounding the communication rate within each subset separately. Unlike the first converse, it exploits scenarios where users may have common demands. This enables improvement upon the first converse, and allows exact characterization of the rate-memory tradeoff for systems with up to 5 users.

The rest of this chapter is organized as follows. In Section 11.1, we formally define the caching framework and the rate-memory tradeoff. Then in Section 11.2 we summarize our main results.
Section 11.3 proves our first main result, which characterizes the peak rate-memory tradeoff within a constant factor of 2.00884 for all possible parameter values, and within a factor of 2 when the number of files is large, and proves the converse bound needed to establish this characterization. For brevity, we prove the rest of the results in the appendices.

11.1 System Model and Problem Formulation

In this section, we introduce the system model, define the rate-memory tradeoff for both peak rate and average rate based on the introduced framework, and state the main problems studied in this chapter.

11.1.1 System Model

We consider the same caching system described in Section 10.1, where a server is connected to $K$ users through a shared, error-free link. The server has access to $N$ files, each of size $F$ bits, denoted by $W_1,\dots,W_N$. Each user $k$ has an isolated cache memory of size $MF$ bits, where $M\in[0,N]$. For convenience, we define the parameter $r=\frac{KM}{N}$.

In the placement phase, each user can fill the contents of its cache using the database without knowledge of its future demands. However, unlike in Chapter 10, here the content stored by each user can be computed with an arbitrary function. We denote the cached content of each user $k$ by $Z_k$. In the following delivery phase, given a demand $\mathbf{d}=(d_1,\dots,d_K)$, each user $k$ requests the content of file $d_k$. The server is informed of the demand and sends a message of size $RF$ bits, denoted by $X_{\mathbf{d}}$. Using the contents $Z_k$ of its cache and the message $X_{\mathbf{d}}$ received over the shared link, each user $k$ aims to reconstruct its requested file $W_{d_k}$.

11.1.2 Problem Definition

Based on the above framework, we define the rate-memory tradeoff using the following terminology. We characterize a prefetching scheme by its $K$ caching functions $\boldsymbol{\phi}=(\phi_1,\dots,\phi_K)$, each of which maps the file contents to the cache content of a specific user:

$$Z_k = \phi_k(W_1,\dots,W_N) \quad \forall k\in\{1,\dots,K\}. \quad (11.1)$$
Given a prefetching scheme, we say that a communication rate $R$ is $\epsilon$-achievable if and only if, for every demand $\mathbf{d}$, there exists a message $X_{\mathbf{d}}$ of length $RF$ that allows every user $k$ to recover its desired file $W_{d_k}$ with a probability of error of at most $\epsilon$. Given parameters $N$, $K$, and $M$, we define the minimum peak rate, denoted by $R^*$, as the minimum rate that is $\epsilon$-achievable over all prefetching schemes for large $F$ and any $\epsilon>0$. Rigorously,

$$R^* = \sup_{\epsilon>0}\limsup_{F\to\infty} \min_{\boldsymbol{\phi}} \{R \mid R \text{ is } \epsilon\text{-achievable given prefetching } \boldsymbol{\phi}\}. \quad (11.2)$$

(Recall that the placement is designed without knowledge of the demands; this is due to the fact that in most caching systems the caching phase happens during off-peak hours, in order to improve performance during the peak hours when the actual user demands are revealed.)

Similarly, for the average rate, we say that a communication rate $R$ is $\epsilon$-achievable for demand $\mathbf{d}$, given a prefetching scheme, if and only if we can create a message $X_{\mathbf{d}}$ of length $RF$ that allows every user $k$ to recover its desired file $W_{d_k}$ with a probability of error of at most $\epsilon$. Given parameters $N$, $K$, and $M$, we define the minimum average rate, denoted by $R^*_{\text{ave}}$, as the minimum rate over all prefetching schemes such that we can find a function $R(\mathbf{d})$ that is $\epsilon$-achievable for every demand $\mathbf{d}$ and satisfies $R^*_{\text{ave}}=\mathbb{E}_{\mathbf{d}}[R(\mathbf{d})]$, where $\mathbf{d}$ is uniformly random in $\mathcal{D}=\{1,\dots,N\}^K$, for large $F$ and any $\epsilon>0$.

Finding the rate-memory tradeoff essentially amounts to finding the values of $R^*$ and $R^*_{\text{ave}}$ as functions of $N$, $K$, and $M$. In this chapter, we aim to find converse bounds that characterize $R^*$ and $R^*_{\text{ave}}$ within a constant factor. Moreover, we aim to better characterize $R^*$ and $R^*_{\text{ave}}$ in the important case where $N$ is large, with $K$ and $\frac{M}{N}$ fixed.

11.1.3 Related Works

Coded caching was originally proposed in [140], where the peak rate vs. memory tradeoff was characterized within a factor of 12. This result was later extended in [146], where the minimum average rate under uniform file popularity was characterized within a factor of 72.
Since then, various efforts have been made to improve these characterizations [6,7,161,169]. The state of the art is an approximation within a factor of 4 for the peak rate [6] and 4.7 for the average rate [161]. In this chapter, we characterize both the peak rate and the average rate within a factor of 2.00884, which is about a two-fold improvement upon the prior art. This improvement is achieved by improving both the achievability scheme and the converse. Specifically, we use the achievability scheme we proposed in [11] (presented in Chapter 10) to upper bound the communication rates. This upper bound strictly improves upon the communication rates achieved by [140] (and its relaxed version in [144]), on which all of the above works (i.e., [6,7,146,161,169]) relied. It also achieves the exact optimum communication rates among all caching schemes with uncoded prefetching, for all possible values of $N$, $K$, and $M$. As a shorthand notation, we denote the peak and average rates achieved in [11] by $R_u(N,K,r)$ and $R_{u,\text{ave}}(N,K,r)$, respectively. (Recall that $r\triangleq\frac{KM}{N}$. The letter "u" in the subscripts represents "upper bound" and "uncoded prefetching".) More precisely, we define these functions as follows.

Definition 11.1. Given problem parameters $N$, $K$, $M$, and $r=\frac{KM}{N}$, we define

$$R_u(N,K,r) = \frac{\binom{K}{r+1}-\binom{K-\min\{K,N\}}{r+1}}{\binom{K}{r}}, \quad (11.3)$$

$$R_{u,\text{ave}}(N,K,r) = \mathbb{E}_{\mathbf{d}}\left[\frac{\binom{K}{r+1}-\binom{K-N_e(\mathbf{d})}{r+1}}{\binom{K}{r}}\right] \quad (11.4)$$

for $r\in\{0,\dots,K\}$, where $\mathbf{d}$ is uniformly random in $\mathcal{D}=\{1,\dots,N\}^K$ and $N_e(\mathbf{d})$ denotes the number of distinct requests in $\mathbf{d}$. Furthermore, for general (non-integer) $r\in[0,K]$, $R_u(N,K,r)$ and $R_{u,\text{ave}}(N,K,r)$ are defined as the lower convex envelopes of their values at $r\in\{0,1,\dots,K\}$. Specifically, for any non-integer $r\in[0,K]$, we have

$$R_u(N,K,r) = (r-\lfloor r\rfloor)\,R_u(N,K,\lceil r\rceil) + (\lceil r\rceil-r)\,R_u(N,K,\lfloor r\rfloor), \quad (11.5)$$
$$R_{u,\text{ave}}(N,K,r) = (r-\lfloor r\rfloor)\,R_{u,\text{ave}}(N,K,\lceil r\rceil) + (\lceil r\rceil-r)\,R_{u,\text{ave}}(N,K,\lfloor r\rfloor). \quad (11.6)$$
(Here the letter "e" in the subscript of $N_e$ represents "effective", given that $N_e(\mathbf{d})$ can also be interpreted as the effective number of files for a demand $\mathbf{d}$: the communication rate stated in equation (11.4) for demand $\mathbf{d}$ is exactly the peak rate stated in equation (11.3) for a caching system with $N=N_e(\mathbf{d})$ files. Rigorously, the fact that equations (11.5) and (11.6) define lower convex envelopes is due to the convexity of $R_u(N,K,r)$ and $R_{u,\text{ave}}(N,K,r)$ on $r\in\{0,1,\dots,K\}$; this convexity was observed in [11] and can be proved using elementary combinatorics, with a short proof given in Appendix G.10.)

Given the above upper bounds, we develop improved converse bounds in this chapter, which provide better characterizations for both the peak rate and the average rate.

11.2 Main Results

We summarize our main results in the following theorems.

Theorem 11.1. For a caching system with $K$ users, a database of $N$ files, and a local cache size of $M$ files at each user, we have

$$\frac{R_u(N,K,r)}{2.00884} \leq R^* \leq R_u(N,K,r), \quad (11.7)$$
$$\frac{R_{u,\text{ave}}(N,K,r)}{2.00884} \leq R^*_{\text{ave}} \leq R_{u,\text{ave}}(N,K,r), \quad (11.8)$$

where $R_u(N,K,r)$ and $R_{u,\text{ave}}(N,K,r)$ are defined in Definition 11.1. Furthermore, if $N$ is sufficiently large (specifically, $N\geq\frac{K(K+1)}{2}$), we have

$$\frac{R_u(N,K,r)}{2} \leq R^* \leq R_u(N,K,r), \quad (11.9)$$
$$\frac{R_{u,\text{ave}}(N,K,r)}{2} \leq R^*_{\text{ave}} \leq R_{u,\text{ave}}(N,K,r). \quad (11.10)$$

Remark 11.1. The above theorem characterizes $R^*$ and $R^*_{\text{ave}}$ within a constant factor of 2.00884 for all possible values of the parameters $K$, $N$, and $M$. To the best of our knowledge, this gives the best characterization to date. Prior to this work, the best proved constant factors were 4 for the peak rate [6] and 4.7 for the average rate (under uniform file popularity) [161]. Furthermore, Theorem 11.1 characterizes $R^*$ and $R^*_{\text{ave}}$ for large $N$ within a constant factor of 2.

Remark 11.2.
The converse bound that we develop for proving Theorem 11.1 also immediately yields better approximations of the rate-memory tradeoff in other scenarios, such as online caching [145], caching with non-uniform demands [146], and hierarchical caching [151]. For example, in the case of online caching [145], where the current approximation result is within a multiplicative factor of 24, it can be easily shown that this factor can be reduced to 4.01768 using our proposed bounding techniques.

Remark 11.3. R_u(N,K,r) and R_{u,ave}(N,K,r), as defined in Definition 11.1, are the optimum peak rate and the optimum average rate that can be achieved using uncoded prefetching, as we proved in [11]. This indicates that for the coded caching problem, using uncoded prefetching schemes is optimal within a factor of 2.00884 for both the peak rate and the average rate. More interestingly, we can show that even for the improved decentralized scheme we proposed in [11], where each user fills their cache independently without coordination but the delivery scheme is designed to fully exploit the commonality of user demands, the optimum rate is still achieved within a factor of 2.00884 in general, and within a factor of 2 for large N.^5

Remark 11.4. Based on the proof idea of Theorem 11.1, we can completely characterize the rate-memory tradeoff for the two-user case, for any possible values of N and M, for both the peak rate and the average rate. Prior to this work, the peak rate vs. memory tradeoff for the two-user case was characterized in [140] for N ≤ 2, and very recently in [170] for N ≥ 3. However, the average rate vs. memory tradeoff had never been completely characterized for any non-trivial case. In this chapter, we prove that the exact optimal tradeoff for the average rate in the two-user case can be achieved using the caching scheme we provided in [11] (see Appendix G.8).

To prove Theorem 11.1, we derive new converse bounds on R^* and R^*_{ave} for all possible values of K, N, and M.
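The achievable rates of Definition 11.1 are simple enough to evaluate exactly in a few lines of code. The following sketch (the function names are ours, not from the dissertation) computes R_u via the binomial formula (11.3) with the convex interpolation (11.5), and R_{u,ave} by exhaustive enumeration of all N^K demands for small N and K:

```python
from fractions import Fraction
from itertools import product
from math import comb, floor, ceil

def R_u(N, K, r):
    """Peak rate R_u(N, K, r) of Definition 11.1; r = KM/N may be fractional."""
    def at_integer(t):
        # Equation (11.3); math.comb returns 0 when the lower index exceeds the upper one.
        return Fraction(comb(K, t + 1) - comb(K - min(K, N), t + 1), comb(K, t))
    lo, hi = floor(r), ceil(r)
    if lo == hi:
        return at_integer(lo)
    # Lower convex envelope (11.5): linear interpolation between integer points.
    return (r - lo) * at_integer(hi) + (hi - r) * at_integer(lo)

def R_u_ave(N, K, r):
    """Average rate of (11.4), by enumerating all demands d in {1,...,N}^K."""
    def at_integer(t):
        total = Fraction(0)
        for d in product(range(N), repeat=K):
            n_e = len(set(d))  # N_e(d): number of distinct requests in d
            total += Fraction(comb(K, t + 1) - comb(K - n_e, t + 1), comb(K, t))
        return total / N**K
    lo, hi = floor(r), ceil(r)
    if lo == hi:
        return at_integer(lo)
    return (r - lo) * at_integer(hi) + (hi - r) * at_integer(lo)
```

For instance, R_u(6, 3, 0) = 3, R_u(6, 3, 1) = 1, and R_u(6, 3, 1/2) = 2 by interpolation; with N = K = 2 and r = 0 the average rate is 3/2, reflecting that a repeated demand only requires one file to be delivered.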
We highlight the converse bound on R^* in the following theorem:

Theorem 11.2. For a caching system with K users, a database of N files, and a local cache size of M files at each user, R^* is lower bounded by

R^* \ge s - 1 + \alpha - \frac{s(s-1) - \ell(\ell-1) + 2\alpha s}{2(N - \ell + 1)} M    (11.11)

for any s ∈ {1,...,min{N,K}} and α ∈ [0,1], where ℓ ∈ {1,...,s} is the minimum value such that^6

\frac{s(s-1) - \ell(\ell-1)}{2} + \alpha s \le (N - \ell + 1)\ell.    (11.12)

^5 This can be proved based on the fact that, in the proof of Theorem 11.1, we showed as intermediate steps that the communication rates of the decentralized caching scheme we proposed in [11] (e.g., R_dec(M) for the peak rate) are within a constant factor of optimal.

Remark 11.5. The above theorem improves the state of the art in various scenarios. For example, when N is sufficiently large (i.e., N ≥ K(K+1)/2), the above theorem gives a tight converse bound for KM/N ≤ 1, as shown in (11.23). This matching converse cannot be proved directly using the converse bounds provided in [6–8,161,169,170] (e.g., for K = 4, N = 10, and M = 1, none of these bounds gives R^* ≥ 3).

Remark 11.6. Although Theorem 11.2 gives infinitely many linear converse bounds on R^*, the region of the memory-rate pair (M, R^*) characterized by Theorem 11.2 has a simple shape with finitely many corner points. Specifically, by applying the arguments used in the proof of Theorem 11.1, one can show that the region given by Theorem 11.2 is exactly bounded by the lower convex envelope of the points

\left\{ \left( \frac{N-\ell+1}{s},\ \frac{s-1}{2} + \frac{\ell(\ell-1)}{2s} \right) \,\middle|\, s \in \{1,\dots,J\},\ \ell \in \{1,\dots,s\} \right\} \cup \{(0, J)\},

where J = min{N,K}.

For the case of large N, we can exactly characterize the values of R^* and R^*_{ave} for K ≤ 5. We formally state this result in the following theorem:

Theorem 11.3. For a caching system with K users, a database of N files, and a local cache size of M files at each user, we have

R^* = R^*_{\mathrm{ave}} = R_u(N,K,r)    (11.13)

for large N (i.e., N → +∞) when K ≤ 5, where R_u(N,K,r) is defined in Definition 11.1.^7

Remark 11.7.
As discussed in [144], the special case of large N is important for handling asynchronous demands. More specifically, [144] showed that asynchronous demands can be handled by splitting each file into many subfiles, and delivering concurrent subfile requests using the optimum caching schemes. In this case, we essentially need to solve the caching problem when the number of files (i.e., the subfiles) is large, but the fraction of files that can be stored at each user is fixed. In this chapter, we completely characterize this tradeoff for systems with up to 5 users, for both the peak rate and the average rate, while in prior works this tradeoff had only been exactly characterized in two instances: the single-user case [140] and, more recently, the two-user case [170].

Remark 11.8. Although Theorem 11.3 only considers systems with up to 5 users, the converse bounds used in its proof also tightly characterize the minimum communication rate in many cases even for systems with more than 5 users. For both the peak rate and the average rate, we can show that more than half of the convex envelope achieved by [11] is optimal for large N (e.g., see Lemma G.1 for the peak rate).

^6 Such ℓ always exists, because when ℓ = s, (11.12) can be written as αs ≤ (N − s + 1)s, which always holds true.

^7 Rigorously, we show that the maximum possible gap between R^*, R^*_{ave}, and R_u(N,K,r) over M ∈ [0,N] approaches 0 as N goes to infinity.

To prove Theorem 11.3, we state the following theorem, which provides tighter converse bounds on R^* for certain values of N, K, and M.

Theorem 11.4. For a caching system with K users, a database of N files, and a local cache size of M files at each user, R^* is lower bounded by

R^* \ge \begin{cases} \dfrac{2K-n+1}{n+1} - \dfrac{K(K+1)}{n(n+1)} \cdot \dfrac{M}{N} & \text{if } \beta + \alpha \dfrac{K-2n-1}{2} \le 0, \\[2mm] \dfrac{2K-n+1}{n+1} - \dfrac{2K(K-n)}{n(n+1)} \cdot \dfrac{M}{N-\beta} & \text{otherwise,} \end{cases}    (11.14)

for any n ∈ {max{1, K−N+1}, ..., K−1}, where α = ⌊(N−1)/(K−n)⌋ and β = N − α(K−n).

Remark 11.9. The above theorem improves Theorem 11.2 and the state of the art in many cases.
For example, when r \in \left[ K - 1 - \frac{N-1}{\lceil 2N/(K+1) \rceil},\ K-1 \right], the converse bound (11.14) given by n = ⌊r + 1⌋ is tight and we have R^* = R_u(N,K,r). This result cannot be proved in general using the converse bounds provided in [6–8,161,169,170] (e.g., for K = 4, N = 10, and M = 4, none of these bounds gives R^* ≥ 1).

Remark 11.10. We numerically compare our two converse bounds (i.e., Theorem 11.2 and Theorem 11.4), benchmarked against the upper bound R_u(N,K,r) we achieved in [11], under three different settings (see Fig. 11.1). In all these cases, the two converse bounds together provide a tight characterization: Theorem 11.2 is tight for r ≤ 1 and r ≥ K − 1, and Theorem 11.4 is tight for 1 ≤ r ≤ K − 1. The same holds true in the proof of Theorem 11.3, where the number of users is no more than 5 but the number of files is large.

In the rest of this chapter, we prove Theorem 11.1 for the peak rate in Section 11.3, and we prove Theorem 11.2 in Section 11.4. For brevity, we prove the rest of the results in the appendices. Specifically, Appendix G.1 proves Theorem 11.3 for the peak rate, Appendix G.2 proves Theorem 11.4, Appendix G.7 proves Theorem 11.1 for the average rate, and Appendix G.9 proves Theorem 11.3 for the average rate.

11.3 Proof of Theorem 11.1 for peak rate

In this section, we prove Theorem 11.1 assuming the correctness of Theorem 11.2. The proof of Theorem 11.2 can be found in Section 11.4. For brevity, we only prove Theorem 11.1 for the peak

[Figure 11.1 appears here, three panels plotting the peak rate R against r, each showing the achievability bound R_u(N,K,r) and the converse bounds of Theorems 11.2 and 11.4: (a) rate-memory tradeoff for K = 3, N = 6; (b) rate-memory tradeoff for K = 4, N = 10; (c) rate-memory tradeoff for K = 5, N → +∞.]
Figure 11.1: Numerical comparison between the two converse bounds presented in Theorem 11.2 and Theorem 11.4, and the upper bound achieved in [11]. Our converse bounds tightly characterize the peak rate-memory tradeoff in all three presented scenarios.

rate (i.e., inequalities (11.7) and (11.9)) within this section. The proof for the average rate (i.e., inequalities (11.8) and (11.10)) can be found in Appendix G.7. We start by proving the general factor-of-2.00884 characterization of inequality (11.7). Then we focus on the special case of N ≥ K(K+1)/2 and prove inequality (11.9). As mentioned in Remark 11.3, the upper bounds on R^* stated in Theorem 11.1 can be proved using the caching scheme provided in [11]. Hence, it suffices to prove the lower bounds in (11.7) and (11.9).

11.3.1 Proof of inequality (11.7)

The proof of inequality (11.7) consists of two steps. In Step 1, we first prove, assuming the correctness of Theorem 11.2, that the memory-rate pair (M, R^*) is lower bounded by the lower convex envelope of the set of points S_Lower ∪ {(0, J)}, where

\mathcal{S}_{\mathrm{Lower}} = \left\{ (M, R) = \left( \frac{N-\ell+1}{s},\ \frac{s-1}{2} + \frac{\ell(\ell-1)}{2s} \right) \,\middle|\, s \in \{1,\dots,J\},\ \ell \in \{1,\dots,s\} \right\}    (11.15)

and J = min{N,K}, given parameters N and K. Then in Step 2, we exploit the convexity of the upper bound R_u(N,K,r), and prove that it is within a factor of 2.00884 of the above converse by checking all the corner points of the envelope.

For Step 1, we first prove that R^* is lower bounded by the convex envelope. To prove this statement, it is sufficient to show that any linear function that lower bounds all points in S_Lower ∪ {(0, J)} also lower bounds the point (M, R^*). We prove this for any such linear function, denoted by A + BM, by first finding a converse bound on R^* using Theorem 11.2 with certain parameters s and α, and then proving that this converse bound is lower bounded by the linear function.
We consider the following two possible cases:

If A ≥ 0: note that (0, J) must be lower bounded by the linear function, so we have A ≤ J. Thus, we can choose s = ⌈A⌉ and α = A − s + 1, and let ℓ be the minimum value in {1,...,s} such that (11.12) holds. Because \left( \frac{N-\ell+1}{s},\ \frac{s-1}{2} + \frac{\ell(\ell-1)}{2s} \right) \in \mathcal{S}_{\mathrm{Lower}}, we have

A + B \cdot \frac{N-\ell+1}{s} \le \frac{s-1}{2} + \frac{\ell(\ell-1)}{2s}.    (11.16)

By the definition of α, we have A = s − 1 + α. Consequently, the slope B can be upper bounded as follows:

B \le \frac{s(s-1) + \ell(\ell-1) - 2As}{2(N-\ell+1)} = -\frac{s(s-1) - \ell(\ell-1) + 2\alpha s}{2(N-\ell+1)}.    (11.17)

Thus, for any M ≥ 0, we have

A + BM \le s - 1 + \alpha - \frac{s(s-1) - \ell(\ell-1) + 2\alpha s}{2(N-\ell+1)} M.    (11.18)

Note that the RHS of the above inequality is exactly the lower bound provided in Theorem 11.2. Hence, A + BM ≤ R^*.

If A < 0: letting s = ℓ = 1, we have (N, 0) ∈ S_Lower from (11.15). Hence, A + BN ≤ 0, and for any M ∈ [0, N] we have

A + BM = \frac{A(N-M) + (A+BN)M}{N} \le 0.    (11.19)

Obviously R^* ≥ 0; hence we have A + BM ≤ R^*.

Combining the above two cases, we have proved that the memory-rate pair (M, R^*) is lower bounded by the lower convex envelope of S_Lower ∪ {(0, J)}. This completes the proof of Step 1.

For Step 2, we only need to prove that the ratio of R_u(N,K,r) to the lower convex envelope of S_Lower ∪ {(0, J)} is at most 2.00884. As mentioned at the beginning of this proof, given that the upper bound R_u(N,K,r) is convex,^8 this ratio can only be maximized at the corner points of the envelope, which form a subset of S_Lower ∪ {(0, J)}. Hence, we only need to check that R_u(N,K,r) ≤ 2.00884 R holds for any (M, R) ∈ S_Lower ∪ {(0, J)}. To further simplify the problem, we upper bound R_u(N,K,r) using the following inequality, which can be easily proved using the results of [11]:^9

R_u(N,K,r) \le R_{\mathrm{dec}}(M) \triangleq \frac{N-M}{M} \left( 1 - \left( 1 - \frac{M}{N} \right)^J \right).    (11.20)

Consequently, to prove inequality (11.7), it suffices to prove the following lemma.

Lemma 11.1. For any (M, R) ∈ S_Lower ∪ {(0, J)}, we have R_dec(M) ≤ 2.00884 R.

The proof of Lemma 11.1 can be found in Appendix G.3.
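Lemma 11.1 also lends itself to a direct numerical check, since both R_dec and the corner points of S_Lower have closed forms. A small sketch (function names are ours) that enumerates the corner points for given N and K and records the worst ratio R_dec(M)/R:

```python
from fractions import Fraction

def r_dec(N, M, J):
    """R_dec(M) = (N-M)/M * (1 - (1 - M/N)^J) from (11.20), with R_dec(0) := J."""
    if M == 0:
        return Fraction(J)
    M = Fraction(M)
    return (N - M) / M * (1 - (1 - M / N) ** J)

def max_ratio(N, K):
    """Worst R_dec(M)/R over the corner points of S_Lower, equation (11.15).
    The single point with R = 0 (s = ell = 1) is skipped; the extra point
    (0, J) contributes ratio exactly 1."""
    J = min(N, K)
    worst = Fraction(1)
    for s in range(1, J + 1):
        for ell in range(1, s + 1):
            M = Fraction(N - ell + 1, s)
            R = Fraction(s - 1, 2) + Fraction(ell * (ell - 1), 2 * s)
            if R > 0:
                worst = max(worst, r_dec(N, M, J) / R)
    return worst
```

For N = K = 4 the worst corner-point ratio is 15/8 = 1.875 (attained at s = 2, ℓ = 1), and for larger systems the ratio creeps toward, but per Lemma 11.1 stays below, the constant 2.00884.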
Assuming its correctness, we have R_u(N,K,r) ≤ 2.00884 R^* for all possible parameter values of N, K, and M. This completes the proof of inequality (11.7).

^8 A short proof can be found in Appendix G.10.

^9 Here the upper bound R_dec(M) is the exact minimum communication rate needed for decentralized caching with uncoded prefetching, as proved in [11]. When M = 0, we define R_dec(M) ≜ J.

11.3.2 Proof of inequality (11.9)

Now we prove that R^* ≥ R_u(N,K,r)/2 holds for any N ≥ K(K+1)/2. In this case, we can verify that inequality (11.12) holds for any s ∈ {1,...,K} with α = 1 and ℓ = 1. Consequently, from Theorem 11.2, R^* can be bounded as follows:

R^* \ge s - 1 + 1 - \frac{s(s-1) + 2s}{2(N-1+1)} M = s - \frac{s^2+s}{2} \cdot \frac{M}{N}.    (11.21)

Then we prove R^* ≥ R_u(N,K,r)/2 by considering the following two possible cases:

If KM/N ≤ 1, we have

R_u(N,K,r) = K - \frac{K^2+K}{2} \cdot \frac{M}{N}    (11.22)

as defined in Definition 11.1. Letting s = K, we obtain from (11.21) the following bound, which tightly characterizes R_u(N,K,r):

R^* \ge K - \frac{K^2+K}{2} \cdot \frac{M}{N} = R_u(N,K,r) \ge \frac{R_u(N,K,r)}{2}.    (11.23)

If KM/N > 1, let s = ⌊N/M⌋, so that M/N ∈ [1/(s+1), 1/s]. Consequently, we can derive the following lower bound on R^*:

R^* \ge s - \frac{s^2+s}{2} \cdot \frac{M}{N} = \frac{N-M}{2M} + \frac{s^2+s}{2} \cdot \frac{N}{M} \cdot \left( \frac{M}{N} - \frac{1}{s+1} \right) \left( \frac{1}{s} - \frac{M}{N} \right) \ge \frac{N-M}{2M}.    (11.24)

As mentioned earlier in this section, the following upper bound can be easily proved using the results of [11]:

R_u(N,K,r) \le \frac{N-M}{M} \left( 1 - \left( 1 - \frac{M}{N} \right)^K \right).    (11.25)

Consequently, we have R_u(N,K,r) ≤ (N−M)/M ≤ 2R^*.

To conclude, we have proved R^* ≥ R_u(N,K,r)/2 in both cases. Hence, inequality (11.9) holds for large N for any possible values of K and M.

11.4 Proof of Theorem 11.2

Before proving the converse bound stated in Theorem 11.2, we first present the following key lemma, which gives a lower bound on any ε-achievable rate given any prefetching scheme.

Lemma 11.2. Consider a coded caching problem with parameters N and K.
Given a certain prefetching scheme, for any demand d, any ε-achievable rate R is lower bounded by^{10}

R \ge \frac{1}{F} \sum_{k=1}^{\min\{N,K\}} H(W_{d_k} \mid Z_{\{1,\dots,k\}}, W_{\{d_1,\dots,d_{k-1}\}}) - \min\{N,K\} \left( \frac{1}{F} + \epsilon \right).    (11.26)

The above lemma is developed based on the idea of enhancing the cutset bound, which is further explained in the proof of this lemma in Appendix G.4. One can show that this approach strictly improves upon the compound cutset bound, which was used in most of the prior works.

We now continue to prove Theorem 11.2 assuming the correctness of Lemma 11.2. The rest of the proof consists of two steps. In Step 1, we exploit the homogeneity of the problem and derive a symmetrized version of the converse presented in Lemma 11.2. Then in Step 2, we derive the converse bound in Theorem 11.2, which is independent of the prefetching scheme, by essentially minimizing the symmetrized converse over all possible designs.

For Step 1, we observe that the caching problem considered in this chapter assumes that all users have the same cache size, and that all files are of the same size. To fully utilize this homogeneity, we define the following useful notations. For any positive integer i, we denote the set of all permutations of {1,...,i} by P_i. For any set S ⊆ {1,...,i} and any permutation p ∈ P_i, we define pS = {p(s) | s ∈ S}. For any subsets A ⊆ {1,...,N} and B ⊆ {1,...,K}, we define

H^*(W_{\mathcal{A}}, Z_{\mathcal{B}}) \triangleq \frac{1}{N!\,K!} \sum_{p \in \mathcal{P}_N,\ q \in \mathcal{P}_K} H(W_{p\mathcal{A}}, Z_{q\mathcal{B}}).    (11.27)

We define the corresponding notation for conditional entropy in the same way. One can verify that the functions defined above satisfy all Shannon inequalities; i.e., for any sets of random variables A, B, and C, we have

H^*(\mathcal{A} \mid \mathcal{B}) \ge H^*(\mathcal{A} \mid \mathcal{B}, \mathcal{C}).    (11.28)

^{10} By an abuse of notation, we denote a sub-array by using a set of indices as the subscript. Besides, we define {d_1,...,d_{k-1}} = ∅ for k = 1. Similar conventions will be used throughout this chapter.
Note that by the homogeneity of the problem, for any ε-achievable rate R, Lemma 11.2 holds for any demand, under any possible relabeling of the users. Thus, by considering the class of demands in which at least min{N,K} files are requested, we have

R \ge \frac{1}{F} \left( \sum_{k=1}^{\min\{N,K\}} H(W_{q(k)} \mid Z_{\{p(1),\dots,p(k)\}}, W_{\{q(1),\dots,q(k-1)\}}) \right) - \min\{N,K\} \left( \frac{1}{F} + \epsilon \right)    (11.29)

for any p ∈ P_K and q ∈ P_N. Averaging the above bound over all possible p and q, we have

R \ge \frac{1}{F} \left( \sum_{k=1}^{\min\{N,K\}} H^*(W_k \mid Z_{\{1,\dots,k\}}, W_{\{1,\dots,k-1\}}) \right) - \min\{N,K\} \left( \frac{1}{F} + \epsilon \right).    (11.30)

Recalling that R^* is defined as the minimum ε-achievable rate over all prefetching schemes φ for large F and any ε > 0, we have

R^* \ge \sup_{\epsilon > 0} \limsup_{F \to \infty} \min_{\phi} \left\{ \frac{1}{F} \left( \sum_{k=1}^{\min\{N,K\}} H^*(W_k \mid Z_{\{1,\dots,k\}}, W_{\{1,\dots,k-1\}}) \right) - \min\{N,K\} \left( \frac{1}{F} + \epsilon \right) \right\}
= \sup_{\epsilon > 0} \limsup_{F \to \infty} \min_{\phi} \left\{ \frac{1}{F} \sum_{k=1}^{\min\{N,K\}} H^*(W_k \mid Z_{\{1,\dots,k\}}, W_{\{1,\dots,k-1\}}) \right\}
\ge \inf_{F \in \mathbb{N}^+} \min_{\phi} \left\{ \frac{1}{F} \sum_{k=1}^{\min\{N,K\}} H^*(W_k \mid Z_{\{1,\dots,k\}}, W_{\{1,\dots,k-1\}}) \right\}.    (11.31)

Now we have derived a symmetrized version of the converse bound. To simplify the discussion, we define

R_A(F, \phi) = \frac{1}{F} \sum_{k=1}^{\min\{N,K\}} H^*(W_k \mid Z_{\{1,\dots,k\}}, W_{\{1,\dots,k-1\}}).

Consequently,

R^* \ge \inf_{F \in \mathbb{N}^+} \min_{\phi} R_A(F, \phi).    (11.32)

For Step 2, as mentioned previously in this proof, to derive the converse bound presented in Theorem 11.2 we aim to minimize the symmetrized converse R_A(F, φ) over all prefetching schemes φ. Moreover, we need to prove that it is no less than the RHS of (11.11) for any parameters s and α. We present the following lemma, which essentially solves this problem.

Lemma 11.3. For any parameters s ∈ {1,...,min{N,K}}, α ∈ [0,1], and any prefetching scheme φ, we have

R_A(F, \phi) \ge s - 1 + \alpha - \frac{s(s-1) - \ell(\ell-1) + 2\alpha s}{2(N-\ell+1)} M,    (11.33)

where ℓ ∈ {1,...,s} is the minimum value such that

\frac{s(s-1) - \ell(\ell-1)}{2} + \alpha s \le (N-\ell+1)\ell.    (11.34)

The proof of Lemma 11.3 can be found in Appendix G.5. Note that the lower bound in the above lemma is identical to the converse in Theorem 11.2.
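As a numerical aside, both converse bounds are straightforward to evaluate, which is how the spot checks in Remarks 11.5 and 11.9 can be reproduced. The following sketch (the identifiers are ours) evaluates (11.11) over a grid of (s, α) pairs and the piecewise bound (11.14) over n:

```python
from fractions import Fraction

def thm_11_2(N, K, M, s, alpha):
    """Theorem 11.2: pick the minimal ell satisfying (11.12), then evaluate (11.11)."""
    for ell in range(1, s + 1):
        if Fraction(s * (s - 1) - ell * (ell - 1), 2) + alpha * s <= (N - ell + 1) * ell:
            break
    num = s * (s - 1) - ell * (ell - 1) + 2 * alpha * s
    return s - 1 + alpha - num / (2 * (N - ell + 1)) * M

def thm_11_4(N, K, M, n):
    """Theorem 11.4, equation (11.14), for n in {max(1, K-N+1), ..., K-1}."""
    alpha = (N - 1) // (K - n)
    beta = N - alpha * (K - n)
    if 2 * beta + alpha * (K - 2 * n - 1) <= 0:  # i.e., beta + alpha*(K-2n-1)/2 <= 0
        return Fraction(2 * K - n + 1, n + 1) - Fraction(K * (K + 1) * M, n * (n + 1) * N)
    return Fraction(2 * K - n + 1, n + 1) - Fraction(2 * K * (K - n) * M, n * (n + 1) * (N - beta))

# Remark 11.5: K = 4, N = 10, M = 1 -> R* >= 3 via Theorem 11.2 (s = 4, alpha = 1).
b2 = max(thm_11_2(10, 4, 1, s, Fraction(i, 4)) for s in range(1, 5) for i in range(5))
# Remark 11.9: K = 4, N = 10, M = 4 -> R* >= 1 via Theorem 11.4 (n = 2).
b4 = max(thm_11_4(10, 4, 4, n) for n in range(1, 4))
```

Both values match the achievable rate R_u(10, 4, KM/N) at these memory points, consistent with the tightness claimed in those remarks.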
Assuming its correctness, given any s and α, we can bound R^* as follows:

R^* \ge \inf_{F \in \mathbb{N}^+} \min_{\phi} R_A(F, \phi) \ge (s - 1 + \alpha) - \frac{s(s-1) - \ell(\ell-1) + 2\alpha s}{2(N-\ell+1)} M.    (11.35)

This completes the proof of Theorem 11.2.

11.5 Conclusion

In this chapter, we developed novel converse bounding techniques for caching networks, and characterized the rate-memory tradeoff of the basic bottleneck caching network within a factor of 2.00884 for both the peak rate and the average rate. This is approximately a two-fold improvement with respect to the state of the art. We also provided a tight characterization of the rate-memory tradeoff for systems with no more than 5 users when the number of files is large. The results of this chapter can also be used to improve the approximation of the rate-memory tradeoff in several other settings, such as online caching, caching with non-uniform demands, and hierarchical caching.

Bibliography

[1] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, pp. 1514–1529, March 2018.
[2] K. Lee, C. Suh, and K. Ramchandran, “High-dimensional coded matrix multiplication,” in 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2418–2422, June 2017.
[3] S. Dutta, V. Cadambe, and P. Grover, “Short-dot: Computing large linear transforms distributedly using coded short dot products,” in Advances In Neural Information Processing Systems, pp. 2092–2100, 2016.
[4] C. Tian and J. Chen, “Caching and delivery via interference elimination,” in 2016 IEEE International Symposium on Information Theory (ISIT), (Barcelona, Spain), pp. 830–834, July 2016.
[5] M. M. Amiri and D. Gunduz, “Fundamental limits of coded caching: Improved delivery rate-cache capacity tradeoff,” IEEE Transactions on Communications, vol. 65, pp. 806–815, Feb 2017.
[6] H. Ghasemi and A.
Ramamoorthy, “Improved lower bounds for coded caching,” IEEE Transactions on Information Theory, vol. 63, pp. 4388–4413, July 2017.
[7] A. Sengupta, R. Tandon, and T. C. Clancy, “Improved approximation of storage-rate tradeoff for caching via new outer bounds,” in 2015 IEEE International Symposium on Information Theory (ISIT), (Hong Kong), pp. 1691–1695, June 2015.
[8] A. N., N. S. Prem, V. M. Prabhakaran, and R. Vaze, “Critical database size for effective caching,” in 2015 Twenty First National Conference on Communications (NCC), (Mumbai, India), pp. 1–6, Feb 2015.
[9] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Characterizing the rate-memory tradeoff in cache networks within a factor of 2,” IEEE Transactions on Information Theory, vol. 65, pp. 647–663, Jan 2019.
[10] J. Gómez-Vilardebó, “Fundamental limits of caching: improved bounds with coded prefetching,” arXiv preprint arXiv:1612.09071, 2016.
[11] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “The exact rate-memory tradeoff for caching with uncoded prefetching,” IEEE Transactions on Information Theory, vol. 64, pp. 1281–1296, Feb 2018.
[12] Q. Yu and A. S. Avestimehr, “Entangled polynomial codes for secure, private, and batch distributed matrix multiplication: Breaking the “cubic” barrier,” in 2020 IEEE International Symposium on Information Theory (ISIT), 2020. arXiv:2001.05101.
[13] Q. Yu and A. S. Avestimehr, “Coded computing for resilient, secure, and privacy-preserving distributed matrix multiplication,” arXiv:2001.05101, 2020.
[14] Q. Yu, M. Maddah-Ali, and S. Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” in Advances in Neural Information Processing Systems 30, pp. 4406–4416, Curran Associates, Inc., 2017. arXiv:1705.10464.
[15] M. Fahim, H. Jeong, F. Haddadpour, S. Dutta, V. Cadambe, and P.
Grover, “On the optimal recovery threshold of coded matrix multiplication,” in 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1264–1270, Oct 2017.
[16] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding,” in 2018 IEEE International Symposium on Information Theory (ISIT), pp. 2022–2026, June 2018. arXiv:1801.07487v1.
[17] S. Dutta, V. Cadambe, and P. Grover, “Coded convolution for parallel and distributed computing within a deadline,” in 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2403–2407, June 2017.
[18] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded fourier transform,” in 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 494–501, Oct 2017.
[19] H. Jeong, T. M. Low, and P. Grover, “Coded fft and its communication overhead,” arXiv preprint arXiv:1805.09891, 2018.
[20] Q. Yu, S. Li, N. Raviv, S. M. M. Kalan, M. Soltanolkotabi, and S. A. Avestimehr, “Lagrange coded computing: Optimal design for resiliency, security, and privacy,” in Proceedings of Machine Learning Research (K. Chaudhuri and M. Sugiyama, eds.), vol. 89 of Proceedings of Machine Learning Research, pp. 1215–1225, PMLR, 16–18 Apr 2019. arXiv:1806.00939.
[21] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” in Advances in Neural Information Processing Systems, pp. 19–27, 2014.
[22] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded MapReduce,” 53rd Annual Allerton Conference on Communication, Control, and Computing, Sept. 2015.
[23] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, “A fundamental tradeoff between computation and communication in distributed computing,” IEEE Transactions on Information Theory, vol. 64, pp. 109–128, Jan 2018.
[24] S. Li, Q. Yu, M. A. Maddah-Ali, and A.
S. Avestimehr, “Edge-facilitated wireless distributed computing,” in Global Communications Conference (GLOBECOM), 2016 IEEE, pp. 1–7, IEEE, 2016.
[25] S. Li, Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “A scalable framework for wireless distributed computing,” IEEE/ACM Transactions on Networking, 2017.
[26] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “A unified coding framework for distributed computing with straggling servers,” arXiv preprint arXiv:1609.01690, 2016.
[27] Q. Yu, S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “How to optimally allocate resources for coded distributed computing?,” in 2017 IEEE International Conference on Communications (ICC), pp. 1–7, May 2017. arXiv:1702.07297.
[28] M. A. Maddah-Ali and U. Niesen, “Fundamental limits of caching,” IEEE Transactions on Information Theory, vol. 60, pp. 2856–2867, Mar. 2014.
[29] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Sixth USENIX Symposium on Operating System Design and Implementation, Dec. 2004.
[30] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in Proceedings of the 2nd USENIX HotCloud, vol. 10, p. 10, June 2010.
[31] J. Dean and L. A. Barroso, “The tail at scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
[32] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving MapReduce performance in heterogeneous environments,” OSDI, vol. 8, p. 7, Dec. 2008.
[33] A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avestimehr, “Coded computation over heterogeneous clusters,” IEEE Transactions on Information Theory, vol. 65, pp. 4227–4242, July 2019.
[34] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding,” arXiv preprint arXiv:1612.03301, 2016.
[35] R. Singleton, “Maximum distance q-nary codes,” IEEE Transactions on Information Theory, vol. 10, no. 2, pp. 116–118, 1964.
[36] K.-H. Huang and J. A.
Abraham, “Algorithm-based fault tolerance for matrix operations,” IEEE Transactions on Computers, vol. C-33, pp. 518–528, June 1984.
[37] J.-Y. Jou and J. A. Abraham, “Fault-tolerant matrix arithmetic and signal processing on highly concurrent computing structures,” Proceedings of the IEEE, vol. 74, pp. 732–741, May 1986.
[38] F. Didier, “Efficient erasure decoding of reed-solomon codes,” arXiv preprint arXiv:0901.1886, 2009.
[39] K. S. Kedlaya and C. Umans, “Fast polynomial factorization and modular composition,” SIAM Journal on Computing, vol. 40, no. 6, pp. 1767–1802, 2011.
[40] R. Roth, Introduction to coding theory. Cambridge University Press, 2006.
[41] S. Baktir and B. Sunar, “Achieving efficient polynomial multiplication in fermat fields using the fast fourier transform,” in Proceedings of the 44th annual Southeast regional conference, pp. 549–554, ACM, 2006.
[42] J. Von Zur Gathen and J. Gerhard, Modern computer algebra. Cambridge University Press, 2013.
[43] M. Bläser, Fast Matrix Multiplication. No. 5 in Graduate Surveys, Theory of Computing Library, 2013.
[44] V. Strassen, “Gaussian elimination is not optimal,” Numerische Mathematik, vol. 13, pp. 354–356, Aug 1969.
[45] S. Dutta, Z. Bai, H. Jeong, T. M. Low, and P. Grover, “A unified coded deep neural network training strategy based on generalized polydot codes,” in 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1585–1589, June 2018.
[46] V. Y. Pan, “Strassen’s algorithm is not optimal: trilinear technique of aggregating, uniting and canceling for constructing fast algorithms for matrix operations,” in 19th Annual Symposium on Foundations of Computer Science (SFCS 1978), pp. 166–176, Oct 1978.
[47] D. Bini, “Relations between exact and approximate bilinear algorithms. applications,” CALCOLO, vol. 17, pp. 87–97, Jan 1980.
[48] A. Schönhage, “Partial and total matrix multiplication,” SIAM Journal on Computing, vol. 10, pp. 434–455, Aug 1981.
[49] F.
Romani, “Some properties of disjoint sums of tensors related to matrix multiplication,” SIAM Journal on Computing, vol. 11, no. 2, pp. 263–267, 1982.
[50] D. Coppersmith and S. Winograd, “On the asymptotic complexity of matrix multiplication,” in Proceedings of the 22nd Annual Symposium on Foundations of Computer Science, SFCS ’81, (Washington, DC, USA), pp. 82–90, IEEE Computer Society, 1981.
[51] V. Strassen, “The asymptotic spectrum of tensors and the exponent of matrix multiplication,” in Proceedings of the 27th Annual Symposium on Foundations of Computer Science, SFCS ’86, (Washington, DC, USA), pp. 49–54, IEEE Computer Society, 1986.
[52] D. Coppersmith and S. Winograd, “Matrix multiplication via arithmetic progressions,” Journal of Symbolic Computation, vol. 9, no. 3, pp. 251–280, 1990. Computational algebraic complexity editorial.
[53] A. J. Stothers, On the complexity of matrix multiplication. PhD thesis, University of Edinburgh, 2010.
[54] V. V. Williams, “Multiplying matrices faster than coppersmith-winograd,” in Proc. 44th ACM Symposium on Theory of Computation, pp. 887–898, 2012.
[55] J. Hopcroft and L. Kerr, “On minimizing the number of multiplications necessary for matrix multiplication,” SIAM Journal on Applied Mathematics, vol. 20, no. 1, pp. 30–36, 1971.
[56] J. D. Laderman, “A noncommutative algorithm for multiplying 3×3 matrices using 23 multiplications,” Bulletin of the American Mathematical Society, vol. 82, no. 1, pp. 126–128, 1976.
[57] C.-E. Drevet, M. Nazrul Islam, and E. Schost, “Optimization techniques for small matrix multiplication,” Theoretical Computer Science, vol. 412, no. 22, pp. 2219–2236, 2011.
[58] A. V. Smirnov, “The bilinear complexity and practical algorithms for matrix multiplication,” Computational Mathematics and Mathematical Physics, vol. 53, pp. 1781–1795, Dec 2013.
[59] A. Sedoglavic, “A non-commutative algorithm for multiplying 5×5 matrices using 99 multiplications,” arXiv preprint arXiv:1707.06860, 2017.
[60] A. Sedoglavic, “A non-commutative algorithm for multiplying (7×7) matrices using 250 multiplications,” arXiv preprint arXiv:1712.07935, 2017.
[61] W.-T. Chang and R. Tandon, “On the capacity of secure distributed matrix multiplication,” arXiv preprint arXiv:1806.00469, 2018.
[62] H. Yang and J. Lee, “Secure distributed computing with straggling servers using polynomial codes,” IEEE Transactions on Information Forensics and Security, vol. 14, pp. 141–150, Jan 2019.
[63] J. Kakar, S. Ebadifar, and A. Sezgin, “Rate-efficiency and straggler-robustness through partition in distributed two-sided secure matrix computation,” arXiv preprint arXiv:1810.13006, 2018.
[64] R. G. L. D’Oliveira, S. El Rouayheb, and D. Karpuk, “Gasp codes for secure distributed matrix multiplication,” in 2019 IEEE International Symposium on Information Theory (ISIT), pp. 1107–1111, July 2019.
[65] H. A. Nodehi, S. R. H. Najarkolaei, and M. A. Maddah-Ali, “Entangled polynomial coding in limited-sharing multi-party computation,” in 2018 IEEE Information Theory Workshop (ITW), pp. 1–5, Nov 2018.
[66] M. Aliasgari, O. Simeone, and J. Kliewer, “Distributed and private coded matrix computation with flexible communication load,” arXiv preprint arXiv:1901.07705, 2019.
[67] M. Kim and J. Lee, “Private secure coded computation,” arXiv preprint arXiv:1902.00167, 2019.
[68] J. Kakar, S. Ebadifar, and A. Sezgin, “On the capacity and straggler-robustness of distributed secure matrix multiplication,” IEEE Access, vol. 7, pp. 45783–45799, 2019.
[69] W.-T. Chang and R. Tandon, “On the upload versus download cost for secure and private matrix multiplication,” arXiv preprint arXiv:1906.10684, 2019.
[70] S. Ebadifar, J. Kakar, and A. Sezgin, “The need for alignment in rate-efficient distributed two-sided secure matrix computation,” in ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pp. 1–6, May 2019.
[71] H. A. Nodehi and M. A.
Maddah-Ali, “Secure coded multi-party computation for massive matrix operations,” arXiv preprint arXiv:1908.04255, 2019.
[72] Z. Jia and S. A. Jafar, “On the capacity of secure distributed matrix multiplication,” arXiv preprint arXiv:1908.06957, 2019.
[73] J. Kakar, A. Khristoforov, S. Ebadifar, and A. Sezgin, “Uplink-downlink tradeoff in secure distributed matrix multiplication,” arXiv preprint arXiv:1910.13849, 2019.
[74] M. Aliasgari, O. Simeone, and J. Kliewer, “Private and secure distributed matrix multiplication with flexible communication load,” arXiv preprint arXiv:1909.00407, 2019.
[75] R. G. D’Oliveira, S. El Rouayheb, D. Heinlein, and D. Karpuk, “Degree tables for secure distributed matrix multiplication,” 2019.
[76] M. Kim, H. Yang, and J. Lee, “Private coded matrix multiplication,” IEEE Transactions on Information Forensics and Security, pp. 1–1, 2019.
[77] Z. Jia and S. A. Jafar, “Cross subspace alignment codes for coded distributed batch matrix multiplication,” arXiv preprint arXiv:1909.13873, 2019.
[78] Z. Jia and S. A. Jafar, “Generalized cross subspace alignment codes for coded distributed batch matrix multiplication,” 2019.
[79] J. So, B. Guler, A. S. Avestimehr, and P. Mohassel, “Codedprivateml: A fast and privacy-preserving framework for distributed machine learning,” arXiv preprint arXiv:1902.00641, 2019.
[80] S. Li, M. Yu, S. Avestimehr, S. Kannan, and P. Viswanath, “Polyshard: Coded sharding achieves linearly scaling efficiency and security simultaneously,” arXiv preprint arXiv:1809.10361, 2018.
[81] Q. Yu and A. S. Avestimehr, “Harmonic coding: An optimal linear code for privacy-preserving gradient-type computation,” in 2019 IEEE International Symposium on Information Theory (ISIT), pp. 1102–1106, July 2019.
[82] M. Frigo and S. G. Johnson, “The design and implementation of FFTW3,” Proceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005. Special issue on “Program Generation, Optimization, and Platform Adaptation”.
[83] M.
Pippig, “PFFT: An extension of FFTW to massively parallel architectures,” SIAM Journal on Scientific Computing, vol. 35, no. 3, pp. C213–C236, 2013.
[84] J. Y. Jou and J. A. Abraham, “Fault-tolerant FFT networks,” IEEE Transactions on Computers, vol. 37, pp. 548–561, May 1988.
[85] S.-J. Wang and N. K. Jha, “Algorithm-based fault tolerance for FFT networks,” IEEE Transactions on Computers, vol. 43, pp. 849–854, Jul 1994.
[86] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.
[87] A. Soro and J. Lacan, “FNT-based Reed-Solomon erasure codes,” in Consumer Communications and Networking Conference (CCNC), 2010 7th IEEE, pp. 1–5, IEEE, 2010.
[88] D. G. Cantor and E. Kaltofen, “On fast multiplication of polynomials over arbitrary algebras,” Acta Informatica, vol. 28, no. 7, pp. 693–701, 1991.
[89] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coding for distributed fog computing,” IEEE Communications Magazine, vol. 55, pp. 34–40, Apr. 2017.
[90] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “TensorFlow: A system for large-scale machine learning,” in OSDI, vol. 16, pp. 265–283, 2016.
[91] M. Li, D. G. Andersen, A. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, (Cambridge, MA, USA), pp. 19–27, MIT Press, 2014.
[92] N. J. Yadwadkar, B. Hariharan, J. E. Gonzalez, and R. Katz, “Multi-task learning for straggler avoiding predictive job scheduling,” Journal of Machine Learning Research, vol. 17, no. 106, pp. 1–37, 2016.
[93] P. Blanchard, R. Guerraoui, J. Stainer, et al., “Machine learning with adversaries: Byzantine tolerant gradient descent,” in Advances in Neural Information Processing Systems, pp. 118–128, 2017.
[94] R.
Cramer, I. B. Damgård, and J. B. Nielsen, Secure Multiparty Computation and Secret Sharing. New York, NY, USA: Cambridge University Press, 1st ed., 2015.
[95] D. Bogdanov, S. Laur, and J. Willemson, “Sharemind: A framework for fast privacy-preserving computations,” in Proceedings of the 13th European Symposium on Research in Computer Security: Computer Security, ESORICS ’08, (Berlin, Heidelberg), pp. 192–206, Springer-Verlag, 2008.
[96] Q. Yu, S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “How to optimally allocate resources for coded distributed computing?,” in 2017 IEEE International Conference on Communications (ICC), pp. 1–7, May 2017.
[97] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 3368–3376, 2017.
[98] W. Halbawi, N. A. Ruhi, F. Salehi, and B. Hassibi, “Improving distributed gradient descent using Reed-Solomon codes,” CoRR, vol. abs/1706.05436, 2017.
[99] N. Raviv, I. Tamo, R. Tandon, and A. G. Dimakis, “Gradient coding from cyclic MDS codes and expander graphs,” arXiv preprint arXiv:1707.03858, 2017.
[100] S. Dutta, M. Fahim, F. Haddadpour, H. Jeong, V. Cadambe, and P. Grover, “On the optimal recovery threshold of coded matrix multiplication,” IEEE Transactions on Information Theory, vol. 66, pp. 278–301, Jan 2020.
[101] H. A. Nodehi and M. A. Maddah-Ali, “Limited-sharing multi-party computation for massive matrix operations,” in 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1231–1235, June 2018.
[102] L. Chen, Z. Charles, D. Papailiopoulos, et al., “Draco: Robust distributed training via redundant gradients,” arXiv preprint arXiv:1803.09877, 2018.
[103] M. Ben-Or, S. Goldwasser, and A.
Wigderson, “Completeness theorems for non-cryptographic fault-tolerant distributed computation,” in Proceedings of the twentieth annual ACM symposium on Theory of computing, pp. 1–10, ACM, 1988.
[104] P. Mohassel and Y. Zhang, “SecureML: A system for scalable privacy-preserving machine learning,” in 2017 IEEE Symposium on Security and Privacy (SP), vol. 00, pp. 19–38, May 2017.
[105] R. Bitar, P. Parag, and S. E. Rouayheb, “Minimizing latency for secure coded computing using secret sharing via staircase codes,” arXiv preprint arXiv:1802.02640, 2018.
[106] C. Karakus, Y. Sun, S. Diggavi, and W. Yin, “Straggler mitigation in distributed optimization through data encoding,” in Advances in Neural Information Processing Systems, pp. 5440–5448, 2017.
[107] S. Wang, J. Liu, N. Shroff, and P. Yang, “Fundamental limits of coded linear transform,” arXiv preprint arXiv:1804.09791, 2018.
[108] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding,” IEEE Transactions on Information Theory, vol. 66, pp. 1920–1933, March 2020.
[109] P. Renteln, Manifolds, Tensors, and Forms: An Introduction for Mathematicians and Physicists. Cambridge University Press, 2013.
[110] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[111] S. Li, S. M. M. Kalan, Q. Yu, M. Soltanolkotabi, and A. S. Avestimehr, “Polynomially coded regression: Optimal straggler mitigation via data encoding,” arXiv preprint arXiv:1805.09934, 2018.
[112] S. Li, S. M. M. Kalan, A. S. Avestimehr, and M. Soltanolkotabi, “Near-optimal straggler mitigation for distributed gradient methods,” arXiv preprint arXiv:1710.09990, 2017.
[113] S. Li, S. Supittayapornpong, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded TeraSort,” 6th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, 2017.
[114] Y. H. Ezzeldin, M. Karmoose, and C. Fragouli, “Communication vs distributed computation: an alternative trade-off curve,” arXiv preprint arXiv:1705.08966, 2017.
[115] S. Prakash, A. Reisizadeh, R. Pedarsani, and S. Avestimehr, “Coded computing for distributed graph analytics,” arXiv preprint arXiv:1801.05522, 2018.
[116] K. Konstantinidis and A. Ramamoorthy, “Leveraging Coding Techniques for Speeding up Distributed Computing,” ArXiv e-prints, 2018.
[117] A. Shamir, “How to share a secret,” Commun. ACM, vol. 22, pp. 612–613, Nov. 1979.
[118] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft, “Privacy-preserving ridge regression on hundreds of millions of records,” in IEEE Symposium on Security and Privacy, pp. 334–348, IEEE, 2013.
[119] A. Gascón, P. Schoppmann, B. Balle, M. Raykova, J. Doerner, S. Zahur, and D. Evans, “Privacy-preserving distributed linear regression on high-dimensional data,” Proceedings on Privacy Enhancing Technologies, vol. 2017, no. 4, pp. 345–364, 2017.
[120] V. Chen, V. Pastro, and M. Raykova, “Secure computation for machine learning with SPDZ,” arXiv:1901.00329, 2019.
[121] S. Winograd, “On multiplication of 2 × 2 matrices,” Linear Algebra and its Applications, vol. 4, no. 4, pp. 381–388, 1971.
[122] J. Landsberg, “The border rank of the multiplication of 2 × 2 matrices is seven,” Journal of the American Mathematical Society, vol. 19, no. 2, pp. 447–459, 2006.
[123] P. Bürgisser, M. Clausen, and M. A. Shokrollahi, Algebraic Complexity Theory, vol. 315. Springer Science & Business Media, 2013.
[124] F. Le Gall, “Complexity of matrix multiplication and bilinear problems,”
[125] M. Ben-Or, S. Goldwasser, and A. Wigderson, “Completeness theorems for non-cryptographic fault-tolerant distributed computation,” in Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, STOC ’88, (New York, NY, USA), p. 1–10, Association for Computing Machinery, 1988.
[126] I. Tamo and A.
Barg, “A family of optimal locally recoverable codes,” IEEE Transactions on Information Theory, vol. 60, pp. 4661–4676, Aug 2014.
[127] H. Jeong, Y. Yang, and P. Grover, “Systematic matrix multiplication codes,” in 2019 IEEE International Symposium on Information Theory (ISIT), pp. 1–5, July 2019.
[128] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in ACM SIGOPS Operating Systems Review, vol. 41, pp. 59–72, June 2007.
[129] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand, “CIEL: a universal execution engine for distributed data-flow computing,” in Proc. 8th ACM/USENIX Symposium on Networked Systems Design and Implementation, pp. 113–126, 2011.
[130] K. Wan, D. Tuninetti, and P. Piantanida, “On the optimality of uncoded cache placement,” in Information Theory Workshop (ITW), 2016 IEEE, pp. 161–165, IEEE, 2016.
[131] K. Wan, D. Tuninetti, and P. Piantanida, “On caching with more users than files,” in Information Theory (ISIT), 2016 IEEE International Symposium on, pp. 135–139, IEEE, 2016.
[132] D. D. Sleator and R. E. Tarjan, “Amortized efficiency of list update and paging rules,” Commun. ACM, vol. 28, pp. 202–208, Feb. 1985.
[133] L. W. Dowdy and D. V. Foster, “Comparative models of the file assignment problem,” ACM Comput. Surv., vol. 14, pp. 287–313, June 1982.
[134] K. C. Almeroth and M. H. Ammar, “The use of multicast delivery to provide a scalable and interactive video-on-demand service,” IEEE Journal on Selected Areas in Communications, vol. 14, pp. 1110–1122, Aug 1996.
[135] A. Dan, D. Sitaram, and P. Shahabuddin, “Dynamic batching policies for an on-demand video server,” Multimedia Systems, vol. 4, pp. 112–121, Jun 1996.
[136] M. R. Korupolu, C. Plaxton, and R. Rajaraman, “Placement algorithms for hierarchical cooperative caching,” Journal of Algorithms, vol. 38, no. 1, pp. 260–302, 2001.
[137] A. Meyerson, K. Munagala, and S.
Plotkin, “Web caching using access statistics,” in Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’01, (Washington, D.C., USA), pp. 354–363, Society for Industrial and Applied Mathematics, 2001.
[138] I. Baev, R. Rajaraman, and C. Swamy, “Approximation algorithms for data placement problems,” SIAM Journal on Computing, vol. 38, no. 4, pp. 1411–1429, 2008.
[139] S. Borst, V. Gupta, and A. Walid, “Distributed caching algorithms for content distribution networks,” in 2010 Proceedings IEEE INFOCOM, (San Diego, CA, USA), pp. 1–9, March 2010.
[140] M. A. Maddah-Ali and U. Niesen, “Fundamental limits of caching,” IEEE Transactions on Information Theory, vol. 60, pp. 2856–2867, May 2014.
[141] Z. Chen, “Fundamental limits of caching: Improved bounds for small buffer users,” arXiv preprint arXiv:1407.1935, 2014.
[142] S. Sahraei and M. Gastpar, “K users caching two files: An improved achievable rate,” in 2016 Annual Conference on Information Science and Systems (CISS), (Princeton, NJ, USA), pp. 620–624, March 2016.
[143] M. M. Amiri, Q. Yang, and D. Gunduz, “Coded caching for a large number of users,” in 2016 IEEE Information Theory Workshop (ITW), (Cambridge, UK), pp. 171–175, Sept 2016.
[144] M. A. Maddah-Ali and U. Niesen, “Decentralized coded caching attains order-optimal memory-rate tradeoff,” IEEE/ACM Transactions on Networking, vol. 23, pp. 1029–1040, Aug 2015.
[145] R. Pedarsani, M. A. Maddah-Ali, and U. Niesen, “Online coded caching,” IEEE/ACM Transactions on Networking, vol. 24, pp. 836–845, April 2016.
[146] U. Niesen and M. A. Maddah-Ali, “Coded caching with nonuniform demands,” IEEE Transactions on Information Theory, vol. 63, pp. 1146–1158, Feb 2017.
[147] J. Zhang, X. Lin, and X. Wang, “Coded caching under arbitrary popularity distributions,” in 2015 Information Theory and Applications Workshop (ITA), (San Diego, CA, USA), pp. 98–107, Feb 2015.
[148] M. Ji, A. M. Tulino, J. Llorca, and G.
Caire, “Order-optimal rate of caching and coded multicasting with random demands,” IEEE Transactions on Information Theory, vol. 63, pp. 3923–3949, June 2017.
[149] A. Ramakrishnan, C. Westphal, and A. Markopoulou, “An efficient delivery scheme for coded caching,” in 2015 27th International Teletraffic Congress, (Ghent, Belgium), pp. 46–54, Sept 2015.
[150] J. Hachem, N. Karamchandani, and S. Diggavi, “Multi-level coded caching,” in 2014 IEEE International Symposium on Information Theory, (Honolulu, HI, USA), pp. 56–60, June 2014.
[151] N. Karamchandani, U. Niesen, M. A. Maddah-Ali, and S. N. Diggavi, “Hierarchical coded caching,” IEEE Transactions on Information Theory, vol. 62, pp. 3212–3229, June 2016.
[152] J. Hachem, N. Karamchandani, and S. Diggavi, “Effect of number of users in multi-level coded caching,” in 2015 IEEE International Symposium on Information Theory (ISIT), (Hong Kong), pp. 1701–1705, June 2015.
[153] M. Ji, G. Caire, and A. F. Molisch, “Fundamental limits of caching in wireless D2D networks,” IEEE Transactions on Information Theory, vol. 62, pp. 849–869, Feb 2016.
[154] M. A. Maddah-Ali and U. Niesen, “Cache-aided interference channels,” in 2015 IEEE International Symposium on Information Theory (ISIT), (Hong Kong), pp. 809–813, June 2015.
[155] N. Naderializadeh, M. A. Maddah-Ali, and A. S. Avestimehr, “Fundamental limits of cache-aided interference management,” IEEE Transactions on Information Theory, vol. 63, pp. 3092–3107, May 2017.
[156] J. Hachem, U. Niesen, and S. Diggavi, “A layered caching architecture for the interference channel,” in 2016 IEEE International Symposium on Information Theory (ISIT), (Barcelona, Spain), pp. 415–419, July 2016.
[157] J. Hachem, U. Niesen, and S. N. Diggavi, “Degrees of freedom of cache-aided wireless interference networks,” arXiv preprint arXiv:1606.03175, 2016.
[158] N. Naderializadeh, M. A. Maddah-Ali, and A. S.
Avestimehr, “Cache-aided interference management in wireless cellular networks,” in 2017 IEEE International Conference on Communications (ICC), pp. 1–7, May 2017.
[159] C. Y. Wang, S. H. Lim, and M. Gastpar, “Information-theoretic caching,” in 2015 IEEE International Symposium on Information Theory (ISIT), (Hong Kong), pp. 1776–1780, June 2015.
[160] C. Y. Wang, S. H. Lim, and M. Gastpar, “Information-theoretic caching: Sequential coding for computing,” IEEE Transactions on Information Theory, vol. 62, pp. 6393–6406, Nov 2016.
[161] S. H. Lim, C. Y. Wang, and M. Gastpar, “Information theoretic caching: The multi-user case,” in 2016 IEEE International Symposium on Information Theory (ISIT), (Barcelona, Spain), pp. 525–529, July 2016.
[162] R. Timo and M. Wigger, “Joint cache-channel coding over erasure broadcast channels,” in 2015 International Symposium on Wireless Communication Systems (ISWCS), (Brussels, Belgium), pp. 201–205, Aug 2015.
[163] S. S. Bidokhti, M. Wigger, and R. Timo, “Erasure broadcast networks with receiver caching,” in 2016 IEEE International Symposium on Information Theory (ISIT), (Barcelona, Spain), pp. 1819–1823, July 2016.
[164] S. S. Bidokhti, M. Wigger, and R. Timo, “Noisy broadcast networks with receiver caching,” arXiv preprint arXiv:1605.02317, 2016.
[165] S. S. Bidokhti, M. A. Wigger, and R. Timo, “An upper bound on the capacity-memory tradeoff of degraded broadcast channels,” in International Symposium on Turbo Codes & Iterative Information Processing, (Brest, France), pp. 350–354, 2016.
[166] J. Zhang and P. Elia, “Fundamental limits of cache-aided wireless BC: Interplay of coded-caching and CSIT feedback,” IEEE Transactions on Information Theory, vol. 63, pp. 3142–3160, May 2017.
[167] J. Zhang, F. Engelmann, and P. Elia, “Coded caching for reducing CSIT-feedback in wireless communications,” in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), (Monticello, IL, USA), pp.
1099–1105, Sept 2015.
[168] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coding for distributed fog computing,” IEEE Communications Magazine, vol. 55, pp. 34–40, April 2017.
[169] C. Y. Wang, S. H. Lim, and M. Gastpar, “A new converse bound for coded caching,” in 2016 Information Theory and Applications Workshop (ITA), (La Jolla, CA, USA), pp. 1–6, Jan 2016.
[170] C. Tian, “Symmetry, outer bounds, and code constructions: A computer-aided investigation on the fundamental limits of caching,” arXiv preprint arXiv:1611.00024, 2016.
[171] C. Tian, “Symmetry, demand types and outer bounds in caching systems,” in 2016 IEEE International Symposium on Information Theory (ISIT), (Barcelona, Spain), pp. 825–829, July 2016.
[172] H. Hara Suthan C, I. Chugh, and P. Krishnan, “An improved secretive coded caching scheme exploiting common demands,” arXiv preprint arXiv:1705.08092, 2017.
[173] K. Wan, D. Tuninetti, and P. Piantanida, “Novel delivery schemes for decentralized coded caching in the finite file size regime,” in 2017 IEEE International Conference on Communications Workshops (ICC Workshops), (Paris, France), pp. 1183–1188, May 2017.
[174] E. Berlekamp, “Nonbinary BCH decoding (abstr.),” IEEE Transactions on Information Theory, vol. 14, pp. 242–242, March 1968.
[175] J. Massey, “Shift-register synthesis and BCH decoding,” IEEE Transactions on Information Theory, vol. 15, pp. 122–127, January 1969.
[176] M. Sudan, “Notes on an efficient solution to the rational function interpolation problem,” Available from http://people.csail.mit.edu/madhu/FT01/notes/rational.ps, 1999.
[177] M. Rosenblum, “A fast algorithm for rational function approximations,” Available from http://people.csail.mit.edu/madhu/FT01/notes/rosenblum.ps, 1999.
[178] V. Y. Pan, “Matrix structures of Vandermonde and Cauchy types and polynomial and rational computations,” in Structured Matrices and Polynomials, pp. 73–116, Springer, 2001.
Appendix A
Supplement to Chapter 2

A.1 Optimality of Polynomial Code in Latency and Communication Load

In this section we prove the optimality of the polynomial code for distributed matrix multiplication in terms of computation latency and communication load. Specifically, we provide the proofs of Theorem 2.2 and Theorem 2.3.

A.1.1 Proof of Theorem 2.2

Consider an arbitrary computation strategy, and denote its computation latency by T. By definition, T is given as follows:

T = min{t ≥ 0 | C is decodable given results from all workers in {i | T_i ≤ t}},  (A.1)

where T_i denotes the computation time of worker i. To simplify the discussion, given T_0, T_1, ..., T_{N−1}, we define

S(t) = {i | T_i ≤ t}.  (A.2)

As proved in Section 2.2.3, if C is decodable at any time t, at least mn workers must have finished their computation. Consequently, we have

T = min{t ≥ 0 | C is decodable given results from all workers in S(t)}
  = min{t ≥ 0 | C is decodable given results from all workers in S(t) and |S(t)| ≥ mn}
  ≥ min{t ≥ 0 | |S(t)| ≥ mn}.  (A.3)

On the other hand, consider the latency of the polynomial code, denoted by T_poly. Recall that for the polynomial code, the output C is decodable if and only if at least mn workers have finished their computation, i.e., |S(t)| ≥ mn. We have

T_poly = min{t ≥ 0 | |S(t)| ≥ mn}.  (A.4)

Hence, T ≥ T_poly always holds, which proves Theorem 2.2.

A.1.2 Proof of Theorem 2.3

Recall that in Section 2.2.3 we proved that if the input matrices are sampled from a certain distribution, then decoding the output C requires the entropy of the entire message received by the server to be at least rt log_2 q. Consequently, delivering such messages takes at least rt log_2 q bits, which lower bounds the minimum communication load. On the other hand, the polynomial code requires delivering rt elements of F_q in total, which achieves this minimum. Hence, the minimum communication load L* equals rt log_2 q.
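Equation (A.4) says that the latency of the polynomial code is simply the time at which the mn-th fastest worker finishes, i.e., the mn-th order statistic of the worker times. A minimal Python sketch (with hypothetical worker finish times) illustrates this:

```python
# Latency of the polynomial code as an order statistic. Worker times below
# are hypothetical; m and n are the storage-partition parameters.
def polynomial_code_latency(worker_times, m, n):
    """Return min{t >= 0 : |S(t)| >= mn}, where S(t) = {i : T_i <= t}."""
    k = m * n
    assert len(worker_times) >= k, "need at least mn workers"
    return sorted(worker_times)[k - 1]  # time when the mn-th worker finishes

# Example: N = 6 workers, m = n = 2, so decoding needs mn = 4 results.
T_workers = [3.1, 0.9, 5.2, 1.4, 2.0, 9.7]
print(polynomial_code_latency(T_workers, 2, 2))  # -> 3.1
```

Any scheme must wait for at least as many workers, which is exactly the inequality (A.3).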
Appendix B
Supplement to Chapter 4

B.1 An Equivalence Between Fault Tolerance and Straggler Mitigation

In this appendix, we first formulate a fault-tolerant computing problem for matrix multiplication, and then prove Theorem 4.4 by building a connection between straggler mitigation and fault tolerance, extending the concept of Hamming distance to coded computing.

B.1.1 Problem Formulation

We consider a matrix multiplication problem with two input matrices A ∈ F^{s×r} and B ∈ F^{s×t}, and we are interested in computing C ≜ A^⊤B using a master node and N worker nodes, where each worker can store a 1/(pm) fraction of A and a 1/(pn) fraction of B. As in the straggler mitigation problem, each worker i can store two coded matrices Ã_i ∈ F^{(s/p)×(r/m)} and B̃_i ∈ F^{(s/p)×(t/n)}, computed based on A and B respectively. Each worker can compute the product C̃_i ≜ Ã_i^⊤ B̃_i and return it to the master. Unlike the straggler setting, the master waits for all workers before proceeding to recover the final output C. However, a subset of workers can return erroneous results, and the master has no information on which results are erroneous. Under this setting, the master wants to (1) determine whether there is an error in the workers' outputs, and (2) recover the final output C from the possibly erroneous computing results.

Given the above system model, we formulate this fault-tolerant computing problem based on the following terminology. As in our main problem in this paper, we define the encoding functions and denote them by (a, b). We also define the decoding function for the master; however, in this problem it can either return an estimate of C or report an error. We only consider valid decoding functions, which always correctly decode C when no worker makes a mistake.
For any integer E, we say the encoding functions can detect E errors if there is a decoding function that either returns the correct value of C or reports an error whenever no more than E workers make mistakes. Moreover, we say the encoding functions can correct E errors if there is a decoding function that always correctly decodes C under the same condition. We denote the maximum such E under these two criteria by E_detect(a, b) and E_correct(a, b) respectively. We aim to find encoding functions that allow detecting and correcting the maximum possible number of errors. Among all possible computation strategies, we are particularly interested in linear encoding functions, as defined in Section 4.1. Given the above terminology, we define the following concepts.

Definition B.1. For a distributed matrix multiplication problem of computing A^⊤B using N workers, we define the maximum detectable errors and the maximum correctable errors, denoted by E*_detect and E*_correct respectively, as the maximum possible values of E_detect(a, b) and E_correct(a, b) over the set of all linear encoding functions.

Our goal is to characterize the values of E*_detect and E*_correct, and to find optimal computation strategies that achieve them. We are also interested in extending these characterizations to non-linear codes.

B.1.2 Proof of Theorem 4.4

We start by defining some concepts that connect the fault-tolerant computing problem to the straggler mitigation problem.

Definition B.2. We define the Hamming distance of any encoding functions (a, b), denoted by d(a, b), as the maximum integer d such that, for any two pairs of input matrices whose products C differ, at least d workers compute different values of C̃_i.

Definition B.3. We define the recovery threshold of any encoding functions (a, b), denoted by K(a, b), as the minimum possible recovery threshold over all decoding functions.
We prove that all three criteria for designing encoding functions are directly governed by the Hamming distance, as formally stated below.

Lemma B.1. For any (possibly non-linear) computation strategy, we have

K(a, b) = N − d(a, b) + 1,  (B.1)
E_detect(a, b) = d(a, b) − 1,  (B.2)
E_correct(a, b) = ⌊(d(a, b) − 1)/2⌋.  (B.3)

Remark B.1. Lemma B.1 essentially indicates that optimizing the straggler mitigation performance over any class of encoding designs is equivalent to optimizing its performance in the fault tolerance setting. Furthermore, all of the metrics above are simultaneously optimized by codes with the maximum possible Hamming distance; hence, there is no tension among these metrics. This result bridges the rich literature of coding theory and distributed computing.

Remark B.2. In terms of achievability, Lemma B.1 also provides a large class of coding designs for fault-tolerant computing. Specifically, given any computing scheme (e.g., the entangled polynomial code, or its improved version) that achieves a certain recovery threshold K, the same encoding functions yield a fault-tolerant scheme that detects up to N − K errors, or corrects up to ⌊(N − K)/2⌋ errors.

Proof of Lemma B.1. Lemma B.1 is a direct consequence of classical coding theory, given that mitigating straggler effects is essentially correcting erasure errors, and tolerating false computing results is essentially correcting arbitrary errors. Hence, we only provide the proof of (B.1); equations (B.2) and (B.3) can be proved using similar approaches. Specifically, we want to prove that, for any integer K, a recovery threshold of K is achievable by some encoding functions if and only if their Hamming distance is at least N − K + 1. If K is achievable, then we can find decoding functions that uniquely determine the value of C given results from any K workers.
Equivalently, for distinct values of C, at least N − K + 1 workers have to return distinct results. Recall that the recovery threshold is the minimum such integer K, and the Hamming distance is the maximum integer corresponding to N − K + 1. We have K(a, b) = N − d(a, b) + 1.

Now we continue to prove Theorem 4.4 using Lemma B.1. As mentioned in Remark 4.9, the proof of Theorem 4.2 essentially characterizes the optimum recovery threshold over all linear encoding functions for m = 1 or n = 1, which is given by K_entangled-poly. Hence, using Lemma B.1, we directly obtain that if m = 1 or n = 1, we have

E*_detect = N − K_entangled-poly,  (B.4)
E*_correct = ⌊(N − K_entangled-poly)/2⌋.  (B.5)

This concludes the proof of Theorem 4.4.

Appendix C
Supplement to Chapter 6

C.1 Algorithmic Illustration of LCC

Algorithm C.1 LCC Encoding (Precomputation)
1: procedure Encode(X_1, X_2, ..., X_K, T)  ▷ Encode input variables according to LCC
2:   generate uniform random variables Z_{K+1}, ..., Z_{K+T}
3:   jointly compute X̃_i ← Σ_{j∈[K]} X_j · Π_{k∈[K+T]\{j}} (α_i − β_k)/(β_j − β_k) + Σ_{j=K+1}^{K+T} Z_j · Π_{k∈[K+T]\{j}} (α_i − β_k)/(β_j − β_k) for i = 1, 2, ..., N using fast polynomial interpolation
4:   return X̃_1, ..., X̃_N  ▷ The coded variable assigned to worker i is X̃_i

Algorithm C.2 Computation Stage
1: procedure WorkerComputation(X̃)  ▷ Each worker i takes X̃_i as input
2:   return f(X̃)  ▷ Compute as if no coding is taking place

1: procedure Decode(S, A)  ▷ Executed by master
2:   wait for a subset of the fastest N − S workers
3:   N ← identities of the fastest workers
4:   {Ỹ_i}_{i∈N} ← results from the fastest workers
5:   recover Y_1, ..., Y_K from {Ỹ_i}_{i∈N} using fast interpolation or Reed-Solomon decoding  ▷ See Appendix C.2
6:   return Y_1, ..., Y_K

Here β_1, ..., β_{K+T} and α_1, ..., α_N are global constants in F, satisfying¹
1. the β_i's are distinct,
2. the α_i's are distinct,
3. {α_i}_{i∈[N]} ∩ {β_j}_{j∈[K]} = ∅ (this requirement is alleviated if T = 0).
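The Lagrange encoding of Algorithm C.1 and the decoding of Algorithm C.2 can be sketched in a few lines of Python over a prime field. This is a minimal illustration with hypothetical small parameters (K = 3 inputs, T = 0, f(x) = x², so deg f = 2 and the recovery threshold is (K − 1) deg f + 1 = 5), not the fast-interpolation implementation the algorithm calls for:

```python
# Sketch of Lagrange Coded Computing (LCC) over a prime field GF(p).
p = 2**31 - 1  # a Mersenne prime

def lagrange_eval(points, x, p):
    """Evaluate at x the unique polynomial through `points` [(x_j, y_j)], mod p."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for k, (xk, _) in enumerate(points):
            if k != j:
                num = num * (x - xk) % p
                den = den * (xj - xk) % p
        total = (total + yj * num * pow(den, -1, p)) % p
    return total

K, N = 3, 7
betas = [1, 2, 3]                        # interpolation points for the K inputs
alphas = [10, 11, 12, 13, 14, 15, 16]    # evaluation points, disjoint from betas
X = [5, 17, 42]                          # the K input values

# Encoding: worker i stores u(alpha_i), where u interpolates X at the betas.
X_tilde = [lagrange_eval(list(zip(betas, X)), a, p) for a in alphas]

# Computation stage: each worker applies f blindly to its coded input.
Y_tilde = [x * x % p for x in X_tilde]   # f(x) = x^2

# Decoding: f(u(z)) has degree (K-1)*deg f = 4, so any 5 results suffice.
fastest = list(zip(alphas, Y_tilde))[:5]
Y = [lagrange_eval(fastest, b, p) for b in betas]
assert Y == [x * x % p for x in X]       # recovered f(X_1), ..., f(X_K)
```

With N = 7 workers and recovery threshold 5, this toy instance tolerates S = 2 stragglers; adding T > 0 random keys Z_j at extra β points would follow line 3 of Algorithm C.1 directly.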
¹ A variation of LCC is presented in Appendix C.3, obtained by selecting different values of the α_i's.

C.2 Coding Complexities of LCC

By exploiting the algebraic structure of LCC, we can find efficient encoding and decoding algorithms with almost linear computational complexities. The encoding of LCC can be viewed as interpolating polynomials of degree K + T − 1 and then evaluating them at N points. Both operations require only almost linear complexity: interpolating a polynomial of degree k has a complexity of O(k log² k log log k), and evaluating it at any k points requires the same [39]. Hence, the total encoding complexity of LCC is at most O(N log²(K + T) log log(K + T) dim V), which is almost linear in the output size of the encoder, O(N dim V).

Similarly, when no security requirement is imposed on the system (i.e., A = 0), the decoding of LCC can also be completed using polynomial interpolation and evaluation, achieving an almost linear complexity of O(R log² R log log R dim U), where R denotes the recovery threshold.

A less trivial case is the decoding algorithm when A > 0, where the goal is essentially to interpolate a polynomial with at most A erroneous input evaluations, i.e., to decode a Reed-Solomon code. An almost linear time complexity can be achieved using additional techniques developed in [174–177]. Specifically, the following 2A syndrome variables can be computed with a complexity of O((N − S) log²(N − S) log log(N − S) dim U) using fast algorithms for polynomial evaluation and for transposed-Vandermonde-matrix multiplication [178]:

S_k ≜ Σ_{i∈N} Ỹ_i α_i^k / Π_{j∈N\{i}} (α_i − α_j),  ∀k ∈ {0, 1, ..., 2A − 1}.  (C.1)

According to [174, 175], the locations of the errors (i.e., the identities of the adversaries in LCC decoding) can be determined from these syndrome variables by computing their rational function approximation.
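To make (C.1) concrete, here is a minimal Python sketch of the syndrome computation over a small prime field, with hypothetical parameters: when the returned values are evaluations of a polynomial of the expected degree and no worker is adversarial, all syndromes vanish, while a corrupted result makes at least one syndrome nonzero.

```python
# Syndrome computation (C.1) for error detection in LCC decoding, sketched
# over GF(p). Toy example: honest results Y_i are evaluations of a
# degree-(R-1) polynomial at the points alpha_i.
p = 97

def syndromes(alphas, Y, A, p):
    """S_k = sum_i Y_i * alpha_i^k / prod_{j != i} (alpha_i - alpha_j), k < 2A."""
    out = []
    for k in range(2 * A):
        s = 0
        for i, (a, y) in enumerate(zip(alphas, Y)):
            denom = 1
            for j, aj in enumerate(alphas):
                if j != i:
                    denom = denom * (a - aj) % p
            s = (s + y * pow(a, k, p) * pow(denom, -1, p)) % p
        out.append(s)
    return out

alphas = [1, 2, 3, 4]
h = lambda x: (2 * x + 3) % p            # an honest degree-1 results polynomial
Y = [h(a) for a in alphas]

assert syndromes(alphas, Y, A=1, p=p) == [0, 0]    # no errors: syndromes vanish

Y_bad = list(Y)
Y_bad[2] = (Y_bad[2] + 5) % p                      # one adversarial worker
assert any(s != 0 for s in syndromes(alphas, Y_bad, A=1, p=p))
```

The syndromes are zero exactly because each S_k is the leading coefficient of the interpolant of {(α_i, Ỹ_i α_i^k)}, which has degree lower than |N| − 1 when the results are consistent.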
Almost linear time algorithms for this operation are provided in [176, 177], requiring a complexity of only O(A log² A log log A dim U). After identifying the adversaries, the final results can be computed as in the A = 0 case. This approach achieves a total decoding complexity of O((N − S) log²(N − S) log log(N − S) dim U), which is almost linear in the input size of the decoder, O((N − S) dim U). Finally, note that the adversaries can only affect a fixed subset of A workers' results across all entries. The decoding time can therefore be further reduced by computing the final outputs entry-wise: in each iteration, ignore computing results from adversaries identified in earlier steps, and proceed to decode with the rest of the results.

C.3 The Uncoded Version of LCC

In Section 6.3.2, we described the LCC scheme, which provides an S-resilient, A-secure, and T-private scheme as long as (K + T − 1) deg f + S + 2A + 1 ≤ N. Instead of explicitly following the same construction, a variation of LCC can be obtained by selecting the values of the α_i's from the set {β_j}_{j∈[K]} (not necessarily distinctly). We refer to this approach as the uncoded version of LCC; it essentially recovers the uncoded repetition scheme, which simply replicates each X_i onto multiple workers. By replicating every X_i between ⌊N/K⌋ and ⌈N/K⌉ times, it can tolerate at most S stragglers and A adversaries whenever

S + 2A ≤ ⌊N/K⌋ − 1,  (C.2)

which achieves the optimum resiliency and security when the number of workers is small and no data privacy is required (specifically, N < K deg f − 1 and T = 0; see Section 6.4). When privacy is taken into account (i.e., T > 0), an alternative to repetition is to instead store each input variable using Shamir's secret sharing scheme [117] over ⌊N/K⌋ to ⌈N/K⌉ machines. This approach achieves any (S, A, T) tuple whenever N ≥ K(S + 2A + deg f · T + 1). However, it does not improve upon LCC.

C.4 Proof of Lemma 6.1

We start by defining the following notation.
For any multilinear function f defined on V with degree d, let X_{i,1}, X_{i,2}, ..., X_{i,d} denote its d input entries (i.e., X_i = (X_{i,1}, X_{i,2}, ..., X_{i,d}), and f is linear with respect to each entry). Let V_1, ..., V_d be the vector spaces containing the values of the entries. For brevity, we denote deg f by d in this appendix. We first provide the proof of inequality (6.3).

Proof of inequality (6.3). Without loss of generality, we assume both the encoding and decoding functions are deterministic in this proof, as randomness does not help with decodability.² Similar to [16], we define the minimum recovery threshold, denoted by R*(N, K, f), as the minimum number of workers that the master has to wait for to guarantee decodability, among all linear encoding schemes. Then we essentially need to prove that R*(N, K, f) ≥ R*_LCC(N, K, f), i.e., R*(N, K, f) ≥ (K − 1)d + 1 when N ≥ Kd − 1, and R*(N, K, f) ≥ N − ⌊N/K⌋ + 1 when N < Kd − 1.

² Note that this argument requires the assumption that the decoder does not have access to the random keys, as assumed in Section 6.1.

Obviously R*(N, K, f) is a non-decreasing function of N. Hence, it suffices to prove that R*(N, K, f) ≥ N − ⌊N/K⌋ + 1 when N ≤ Kd − 1. We prove this converse bound by induction.

(a) If d = 1, then f is a linear function, and we aim to prove R*(N, K, f) ≥ N + 1 for N ≤ K − 1. This essentially means that no valid computing scheme can be found when N < K. Suppose, for contradiction, that we can find a valid computation design using at most K − 1 workers; then there is a decoding function that computes all the f(X_i)'s given the results from these workers. Because the encoding functions are linear, we can find a non-zero vector (a_1, ..., a_K) ∈ F^K such that when X_i = a_i V for any V ∈ V, the coded variable X̃_i stored by any worker equals the padded random key, which is a constant. This leads to a fixed output from the decoder.
On the other hand, because f is assumed to be non-zero, the computing results {f(X_i)}_{i∈[K]} vary with the value of V, which leads to a contradiction. Hence, we have proved the converse bound for d = 1.

(b) Suppose we have a matching converse for any multilinear function with d = d_0. We now prove the lower bound for any multilinear function f of degree d_0 + 1. Similarly to part (a), it is easy to prove that R*(N, K, f) ≥ N + 1 for N ≤ K − 1. Hence, we focus on N ≥ K. The proof idea is to construct a multilinear function f′ of degree d_0 based on the function f, and to lower bound the minimum recovery threshold of f using that of f′. More specifically, this is done by showing that given any computation design for the function f, a computation design can also be developed for the corresponding f′, which achieves a recovery threshold related to that of the scheme for f. In particular, for any non-zero function f(X_{i,1}, X_{i,2}, ..., X_{i,d_0+1}), we let f′ be a function which takes inputs X_{i,1}, X_{i,2}, ..., X_{i,d_0} and returns a linear map, such that given any X_{i,1}, X_{i,2}, ..., X_{i,d_0+1}, we have f′(X_{i,1}, ..., X_{i,d_0})(X_{i,d_0+1}) = f(X_{i,1}, ..., X_{i,d_0+1}). One can verify that f′ is a multilinear function of degree d_0. Given parameters K and N, we now develop a computation strategy for f′ for a dataset of K inputs and a cluster of N′ ≜ N − K workers, which achieves a recovery threshold of R*(N, K, f) − (K − 1). We construct this computation strategy based on an encoding strategy for f that achieves the recovery threshold R*(N, K, f). For brevity, we refer to these two schemes as the f′-scheme and the f-scheme, respectively. Because the encoding functions are linear, we consider the encoding matrix, denoted by G ∈ F^{K×N} and defined by the coefficients of the encoding functions X̃_i = Σ_{j=1}^{K} X_j G_{ji} + z̃_i, where z̃_i denotes the value of the random key padded to the variable X̃_i.
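As a numerical aside, the d = 1 null-space argument from part (a) can be illustrated with a minimal sketch. The sizes and the matrix below are hypothetical, and integers stand in for a general field F: with fewer workers than inputs, the encoding matrix G has a nonzero left null vector, so two different datasets produce identical coded shares and no decoder can tell them apart.

```python
# K = 3 inputs, N = 2 workers: fewer workers than inputs, d = 1 (linear f).
# Encoding matrix G (K x N): worker j's share is sum_i X_i * G[i][j]
# (plus a constant padded random key, which we omit).
G = [[2, 1],
     [1, 3],
     [3, 4]]          # third row = first row + second row, so rank(G) < K

# a nonzero left null vector of G: a @ G == 0
a = [1, 1, -1]
assert all(sum(a[i] * G[i][j] for i in range(3)) == 0 for j in range(2))

# setting X_i = a_i * V makes every worker's share identical to the X = 0
# case, for ANY value of V, so no decoder can recover f(X_i) from the shares
for V in [0, 5, -7]:
    X = [ai * V for ai in a]
    shares = [sum(X[i] * G[i][j] for i in range(3)) for j in range(2)]
    assert shares == [0, 0]
```

The same indistinguishability argument works over any field once N < K, since a K × N matrix always has a nontrivial left null space.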
Following the same arguments we used in the d = 1 case, the left null space of G must be {0}. Consequently, the rank of G equals K, and we can find a subset K of K workers such that the corresponding columns of G form a basis of F^K. Hence, we can construct the f′-scheme by letting each of the N′ ≜ N − K workers store the coded version of (X_{i,1}, X_{i,2}, ..., X_{i,d_0}) that is stored by a unique respective worker in [N]\K in the f-scheme.³ Now it suffices to prove that the above construction achieves a recovery threshold of R*(N, K, f) − (K − 1). Equivalently, we need to prove that given any subset S of [N]\K of size R*(N, K, f) − (K − 1), the values of f(X_{i,1}, X_{i,2}, ..., X_{i,d_0}, x) for any i ∈ [K] and x ∈ V are decodable from the computing results of the workers in S. We exploit the decodability of the computation design for the function f. For any j ∈ K, the set S ∪ K\{j} has size R*(N, K, f). Consequently, for any vector (x_{1,d_0+1}, ..., x_{K,d_0+1}) ∈ V_{d_0+1}^K, the values {f(X_{i,1}, X_{i,2}, ..., X_{i,d_0}, x_{i,d_0+1})}_{i∈[K]} are decodable given the results from the workers in S ∪ K\{j} computed in the f-scheme, if each x_{i,d_0+1} is used as the (d_0 + 1)th entry of each input. Because the columns of G with indices in K form a basis of F^K, we can find values for each input X_{i,d_0+1} such that the workers in K would store 0 for the X_{i,d_0+1} entry in the f-scheme. We denote these values by x̄_{1,d_0+1}, ..., x̄_{K,d_0+1}. Note that if these values are taken as inputs, the workers in K would return the constant 0 due to the multilinearity of f. Hence, decoding f(X_{i,1}, X_{i,2}, ..., X_{i,d_0}, x̄_{i,d_0+1}) only requires results from workers not in K; i.e., it can be decoded given the computing results from the workers in S using the f-scheme. Note that these results can be directly computed from the corresponding results in the f′-scheme. We have thus proved the decodability of f(X_{i,1}, X_{i,2}, ..., X_{i,d_0}, x) for x = x̄_{i,d_0+1}. It remains to prove the decodability of f(X_{i,1}, X_{i,2}, ..., X_{i,d_0}, x) for each i for general x ∈ V.
For any j ∈ K, let a^{(j)} ∈ F^K be a non-zero vector that is orthogonal to all columns of G with indices in K\{j}. If a^{(j)}_i x + x̄_{i,d_0+1} is used for each input X_{i,d_0+1} in the f-scheme, then the workers in K\{j} would store 0 for the X_{i,d_0+1} entry, and return the constant 0 due to the multilinearity of f. Recall that f(X_{i,1}, X_{i,2}, ..., X_{i,d_0}, a^{(j)}_i x + x̄_{i,d_0+1}) is decodable in the f-scheme given results from the workers in S ∪ K\{j}. Following the same arguments as above, one can prove that f(X_{i,1}, X_{i,2}, ..., X_{i,d_0}, a^{(j)}_i x + x̄_{i,d_0+1}) is also decodable using the f′-scheme. Subtracting the already decoded value f(X_{i,1}, ..., X_{i,d_0}, x̄_{i,d_0+1}) and using the linearity of f in its last entry, the same then applies to a^{(j)}_i f(X_{i,1}, X_{i,2}, ..., X_{i,d_0}, x). Because the columns of G with indices in K form a basis of F^K, the vectors a^{(j)} for j ∈ K also form a basis. Consequently, for any i there is a non-zero a^{(j)}_i, and thus f(X_{i,1}, X_{i,2}, ..., X_{i,d_0}, x) is decodable. This completes the proof of decodability. To summarize, we have essentially proved that R*(N, K, f) − (K − 1) ≥ R*(N − K, K, f′). One can verify that the converse bound R*(N, K, f) ≥ N − ⌊N/K⌋ + 1 under the condition N ≤ Kd − 1 follows from the above result and the induction assumption, for any function f of degree d_0 + 1.

³For brevity, in this proof we index these N − K workers by the set [N]\K, following the natural bijection.

(c) Thus, a matching converse holds for any d ∈ N₊, which proves inequality (6.3).

Now we proceed to prove the rest of Lemma 6.1; explicitly, we aim to prove that the recovery threshold of any T-private encoding scheme is at least R_LCC(N, K, f) + T · deg f. Inequality (6.3) covers the case T = 0, hence we focus on T > 0. To simplify the proof, we prove a stronger version of this statement: when T > 0, any valid T-private encoding scheme uses at least N ≥ R_LCC(N, K, f) + T · deg f workers. Equivalently, we aim to show that N ≥ (K + T − 1) deg f + 1 for any such scheme. We prove this fact using an inductive approach.
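As a side check, the converse just stated matches the LCC achievability of Theorem 6.1; at T = 0 the corresponding recovery threshold is (K − 1) deg f + 1. The following sketch, with hypothetical small parameters (K = 3, deg f = 2, N = 8) and none of the straggler, security, or privacy machinery, verifies this threshold for the Lagrange encoding using exact rational arithmetic.

```python
from fractions import Fraction as F
from itertools import combinations

def lagrange_eval(xs, ys, z):
    """Evaluate the unique degree-(len(xs)-1) interpolating polynomial at z."""
    total = F(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = F(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= F(z - xj, xi - xj)
        total += term
    return total

f = lambda x: x * x                      # polynomial of degree d = 2
K, d, N = 3, 2, 8
X = [F(5), F(-1), F(7)]                  # dataset
betas = [F(1), F(2), F(3)]               # interpolation points for the data
alphas = [F(10 + j) for j in range(N)]   # worker evaluation points

# worker j computes f(u(alpha_j)), where u interpolates X_i at beta_i
results = [f(lagrange_eval(betas, X, a)) for a in alphas]

# any (K-1)*d + 1 = 5 worker results determine all f(X_i):
# f(u(z)) has degree (K-1)*d, so 5 evaluations pin it down exactly
R = (K - 1) * d + 1
for idx in combinations(range(N), R):
    xs = [alphas[j] for j in idx]
    ys = [results[j] for j in idx]
    assert [lagrange_eval(xs, ys, b) for b in betas] == [f(x) for x in X]
```

The loop checks every subset of R = 5 workers, confirming that the master can decode from any R results; with fewer than R evaluations the degree-(K−1)d polynomial f(u(z)) is underdetermined.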
To enable an inductive structure, we prove an even stronger converse by considering a more general class of computing tasks and a larger class of encoding schemes, formally stated in the following lemma.

Lemma C.1. Consider a dataset with inputs X ≜ (X_1, ..., X_K) ∈ (F^d)^K and an input vector Γ ≜ (Γ_1, ..., Γ_K) which belongs to a given subspace of F^K of dimension r > 0; a set of N workers, each of which can take a coded variable in F^{d+1} and return the product of its elements; and a computing task where the master aims to recover Y_i ≜ X_{i,1} · ... · X_{i,d} · Γ_i. Suppose the input entries are encoded separately, such that each of the first d entries assigned to each worker is a T_X-privately (with T_X > 0) linearly coded version of the corresponding entries of the X_i's, and the (d + 1)th entry assigned to each worker is a T-privately⁴ linearly coded version of Γ. If, moreover, each Γ_i (as a variable) is non-zero, then any valid computing scheme requires N ≥ (T_X + K − 1)d + T + r.

Proof. Lemma C.1 is proved by induction with respect to the tuple (d, T, r). Specifically, we prove that: (a) Lemma C.1 holds when (d, T, r) = (0, 0, 1); (b) if Lemma C.1 holds for (d, T, r) = (d_0, 0, r_0), then it holds when (d, T, r) = (d_0, 0, r_0 + 1); (c) if Lemma C.1 holds for (d, T, r) = (d_0, 0, r_0), then it holds when (d, T, r) = (d_0, T, r_0) for any T; (d) if Lemma C.1 holds for d = d_0 and arbitrary values of T and r, then it holds for (d, T, r) = (d_0 + 1, 0, 1). Assuming the correctness of these statements, Lemma C.1 directly follows by the principle of induction. We now prove these statements.

(a). When (d, T, r) = (0, 0, 1), we need to show that at least 1 worker is needed. This directly follows from the decodability requirement: the master aims to recover a variable, and at least one variable is needed to provide the information.

(b).
Assuming that for (d, T, r) = (d_0, 0, r_0), any K, and any T_X, every valid computing scheme requires N ≥ (T_X + K − 1)d_0 + r_0 workers, we need to prove that for (d, T, r) = (d_0, 0, r_0 + 1), at least (T_X + K − 1)d_0 + r_0 + 1 workers are needed. We prove this fact by fixing an arbitrary valid computing scheme for (d, T, r) = (d_0, 0, r_0 + 1). For brevity, let Γ̃_i denote the coded version of Γ stored at worker i. We consider the following two possible scenarios: (i) there is a worker i such that Γ̃_i is not identical (up to a constant factor) to any variable Γ_j, or (ii) for every worker i, Γ̃_i is identical (up to a constant factor) to some Γ_j.

⁴For this lemma, we assume that no padded random variable is used for a 0-private encoding scheme.

For case (i), similarly to the ideas used to prove inequality (6.3), it suffices to show that if the given computing scheme uses N workers, we can construct another computing scheme achieving the same T_X, for a different computing task with parameters d = d_0 and r = r_0, using at most N − 1 workers. Recall that we assumed there is a worker i such that Γ̃_i is not identical (up to a constant factor) to any Γ_j. We can always restrict the value of Γ to a subspace of dimension r_0 such that Γ̃_i becomes the constant 0. After this operation, from the computation results of the remaining N − 1 workers, the master can recover a computing function with r = r_0 and non-zero Γ_j's, which provides the needed computing scheme.

For case (ii), because each Γ_j is assumed to be non-zero, we can partition the set of indices j into disjoint subsets such that j and j′ are in the same subset iff Γ_j is a constant multiple of Γ_{j′}. We denote these subsets by J_1, ..., J_m. Moreover, for any k ∈ [m], let I_k denote the subset of indices i such that Γ̃_i is identical (up to a constant factor) to Γ_j for j ∈ J_k.
Now for any k ∈ [m], we can restrict the value of Γ to a subspace of dimension r_0 such that Γ_j is zero for every j ∈ J_k. After applying this operation, from the computation results of the workers in [N]\I_k, the master can recover a computing function with r = r_0, in which K′ = K − |J_k| sub-functions have non-zero Γ_j's. By applying the induction assumption to this computing scheme, we have N − |I_k| ≥ (T_X + K − |J_k| − 1)d_0 + r_0. Taking the summation of this inequality over k ∈ [m], and using Σ_{k=1}^{m} |J_k| = K, we have

Nm − Σ_{k=1}^{m} |I_k| ≥ (T_X m + Km − K − m)d_0 + r_0 m.    (C.3)

Recall that for any worker i, Γ̃_i is identical (up to a constant factor) to some Γ_j; hence ∪_{k∈[m]} I_k = [N], and thus Σ_k |I_k| ≥ N. Consequently, inequality (C.3) implies that

Nm − N ≥ (T_X m + Km − K − m)d_0 + r_0 m.    (C.4)

Note that r_0 + 1 > 1, which implies that at least two Γ_j's are not identical up to a constant factor. Hence m − 1 > 0, and (C.4) is equivalent to

N ≥ [(T_X m + Km − K − m)d_0 + r_0 m] / (m − 1)    (C.5)
  = (T_X + K − 1)d_0 + r_0 + [(T_X − 1)d_0 + r_0] · 1/(m − 1).    (C.6)

Since T_X and r_0 are both positive, we have (T_X − 1)d_0 + r_0 > 0. Consequently, [(T_X − 1)d_0 + r_0] · 1/(m − 1) > 0, and we have

N ≥ (T_X + K − 1)d_0 + r_0 + 1,    (C.7)

which proves the induction statement.

(c). Assuming that for (d, T, r) = (d_0, 0, r_0), any valid computing scheme requires N ≥ (T_X + K − 1)d_0 + r_0 workers, we need to prove that for (d, T, r) = (d_0, T′, r_0), N ≥ (T_X + K − 1)d_0 + T′ + r_0. Equivalently, we aim to show that for any T′ > 0, providing T′-privacy for the (d_0 + 1)th entry requires T′ extra workers. Similarly to the earlier steps, we consider an arbitrary valid computing scheme for (d, T, r) = (d_0, T′, r_0) that uses N workers, and aim to construct a new scheme for (d, T, r) = (d_0, 0, r_0), for the same computation task and the same T_X, which uses at most N − T′ workers. Recall that if an encoding scheme is T′-private, then for any subset T of at most T′ workers, we have I(Γ; Γ̃_T) = 0.
Consequently, conditioned on Γ̃_T = 0, the entropy of the variable Γ remains unchanged. This indicates that Γ can take any possible value when Γ̃_T = 0. Hence, we can let the values of the padded random variables be some linear combinations of the elements of Γ, such that the workers in T return the constant 0. Now we construct an encoding scheme as follows. First, it is easy to show that when the master aims to recover a non-constant function, at least T′ + 1 workers are needed to provide non-zero information regarding the inputs. Hence, we can arbitrarily select a subset of T′ workers, denoted by T. As shown above, we can fix the values of the padded random variables such that Γ̃_T = 0. Due to the multilinearity of the computing task, the workers in T then return the constant 0. Conditioned on these values, the decoder essentially computes the final output based only on the remaining N − T′ workers, which provides the needed computing scheme. Moreover, since the values of the padded random variables can be chosen as linear combinations of the elements of Γ, the obtained computing scheme encodes Γ linearly. This completes the proof of the induction statement.

(d). Assuming that for d = d_0 and arbitrary values of T and r, any valid computing scheme requires N ≥ (T_X + K − 1)d_0 + T + r workers, we need to prove that for (d, T, r) = (d_0 + 1, 0, 1), N ≥ (T_X + K − 1)(d_0 + 1) + 1. Observe that any computing task with r = 1, after fixing a non-zero Γ, essentially computes K functions, each of which multiplies d_0 + 1 variables. Moreover, for each function, by viewing the first d_0 entries as a vector X′_i and the last entry as a scalar Γ′_i, it essentially recovers the case where the parameter d is reduced by 1, K remains unchanged, and r equals K. By adapting any computing scheme in the same way, T_X remains unchanged, and T becomes T_X.
Then, by the induction assumption, any computing scheme for (d, T, r) = (d_0 + 1, 0, 1) requires at least (T_X + K − 1)d_0 + T_X + K = (T_X + K − 1)(d_0 + 1) + 1 workers.

Remark C.1. Using exactly the same arguments, Lemma C.1 can be extended to the case where the entries of X are encoded under different privacy requirements. Specifically, if the ith entry is T_i-privately encoded, then at least Σ_{i=1}^{d} T_i + (K − 1)d + T + r workers are needed. Lemma C.1 and this extended version are both tight, in the sense that for any parameter values of d, K, and r, there are computing tasks for which a computing scheme using the matching number of workers can be found, via constructions similar to Lagrange coded computing.

Now, using Lemma C.1, we complete the proof of Lemma 6.1 for T > 0. Similarly to the proof of inequality (6.3), part (a), we consider any multilinear function f of degree d, and find constant vectors V_1, ..., V_d such that f(V_1, ..., V_d) is non-zero. Then, by restricting the input variables to be constant multiples of V_1, ..., V_d, the computing task reduces to multiplying d scalars for each of the K inputs. As stated in Lemma C.1 and discussed in part (d) of its induction proof, such a computation requires (T + K − 1)d + 1 workers. This completes the proof of Lemma 6.1.

C.5 Optimality on the Resiliency-Security-Privacy Tradeoff for Multilinear Functions

In this appendix, we prove the first part of Theorem 6.2 using Lemma 6.1. Specifically, we aim to prove that LCC achieves the optimal tradeoff between resiliency, security, and privacy for any multilinear function f. By comparing Lemma 6.1 with the achievability results presented in Theorem 6.1 and Appendix C.3, we essentially need to show that any linear encoding scheme that can tolerate A adversaries and S stragglers can also tolerate S + 2A stragglers.
This converse can be proved by connecting the straggler mitigation problem and the adversary tolerance problem through the extended concept of Hamming distance for coded computing, defined in [16]. Specifically, given any (possibly random) encoding scheme, its Hamming distance is defined as the minimum integer, denoted by d, such that for any two instances of the input X whose outputs Y differ, and for any two possible realizations of the N encoding functions, the computing results given the encoded versions of these two inputs (using the two lists of encoding functions, respectively) differ for at least d workers. It was shown in [16] that this Hamming distance behaves similarly to its classical counterpart: an encoding scheme is S-resilient and A-secure whenever S + 2A ≤ d − 1. Hence, any encoding scheme that is A-secure and S-resilient has a Hamming distance of at least S + 2A + 1, and can consequently tolerate S + 2A stragglers. Combining the above with Lemma 6.1 completes the proof.

C.6 Optimality on the Resiliency-Privacy Tradeoff for General Multivariate Polynomials

In this appendix, we prove the second part of Theorem 6.2 using Lemma 6.1. Specifically, we aim to prove that LCC achieves the optimal tradeoff between resiliency and privacy for a general multivariate polynomial f. The proof is carried out by showing that for any function f that allows S-resilient T-private designs, there exists a multilinear function of the same degree for which a computation scheme achieving the same requirements can be found. Specifically, given any function f of degree d, we provide an explicit construction of a multilinear function, denoted by f′, which achieves the same requirements. The construction satisfies certain properties that ensure this fact. Both the construction and the properties are formally stated in the following lemma (which is proved in Appendix C.7).

Lemma C.2.
Given any function f of degree d, let f′ be a map from V^d → U such that f′(Z_1, ..., Z_d) = Σ_{S⊆[d]} (−1)^{|S|} f(Σ_{j∈S} Z_j) for any {Z_j}_{j∈[d]} ∈ V^d. Then f′ is multilinear with respect to the d inputs. Moreover, if the characteristic of the base field F is 0 or greater than d, then f′ is non-zero.

Assuming the correctness of Lemma C.2, it suffices to prove that f′ enables computation designs that tolerate at least the same number of stragglers, and provide at least the same level of data privacy, as those for f. We prove this fact by constructing such computing schemes for f′ given any design for f. Note that f′ is defined as a linear combination of functions f(Σ_{j∈S} Z_j), each of which is the composition of a linear map with f. Given the linearity of the encoding design, any computation scheme for f can be directly applied to any of these functions, achieving the same resiliency and privacy requirements. Since the decoding functions are linear, the same scheme also applies to linear combinations of them, which include f′. Hence, the resiliency-privacy tradeoff achievable for f can also be achieved by f′. This concludes the proof.

C.7 Proof of Lemma C.2

We first prove that f′ is multilinear with respect to the d inputs. Recall that, by definition, f is a linear combination of monomials, and f′ is constructed from f through a linear operation. By exploiting the commutativity of these two linear relations, we only need to show that each monomial of f is individually transformed into a multilinear function. More specifically, let f be the sum of monomials h_k ≜ U_k · Π_{ℓ=1}^{d_k} h_{k,ℓ}(·), where k ranges over a finite set, U_k ∈ U, d_k ∈ {0, 1, ..., d}, and each h_{k,ℓ} is a linear map from V to F. Let h′_k denote the contribution of h_k to f′; then for any Z = (Z_1, ..., Z_d) ∈ V^d we have h′_k(Z) = Σ_{S⊆[d]} (−1)^{|S|} h_k(Σ_{j∈S} Z_j) = Σ_{S⊆[d]} (−1)^{|S|} U_k · Π_{ℓ=1}^{d_k} h_{k,ℓ}(Σ_{j∈S} Z_j).
(C.8)

By utilizing the linearity of each h_{k,ℓ}, we can write h′_k as

h′_k(Z) = U_k · Σ_{S⊆[d]} (−1)^{|S|} Π_{ℓ=1}^{d_k} Σ_{j∈S} h_{k,ℓ}(Z_j) = U_k · Σ_{S⊆[d]} (−1)^{|S|} Π_{ℓ=1}^{d_k} Σ_{j=1}^{d} 1(j ∈ S) · h_{k,ℓ}(Z_j).    (C.9)

Then, by viewing each subset S of [d] as a map s from [d] to {0, 1}, we have⁵

h′_k(Z) = U_k Σ_{s∈{0,1}^d} (Π_{m=1}^{d} (−1)^{s_m}) · Π_{ℓ=1}^{d_k} Σ_{j=1}^{d} s_j · h_{k,ℓ}(Z_j) = U_k Σ_{j∈[d]^{d_k}} Σ_{s∈{0,1}^d} (Π_{m=1}^{d} (−1)^{s_m}) · Π_{ℓ=1}^{d_k} (s_{j_ℓ} · h_{k,ℓ}(Z_{j_ℓ})).    (C.10)

⁵Here we define 0⁰ = 1.

Note that the product Π_{ℓ=1}^{d_k} s_{j_ℓ} can alternatively be written as Π_{m=1}^{d} s_m^{#(m in j)}, where #(m in j) denotes the number of elements of j that equal m. Hence

h′_k(Z) = U_k · Σ_{j∈[d]^{d_k}} Σ_{s∈{0,1}^d} (Π_{m=1}^{d} (−1)^{s_m} s_m^{#(m in j)}) · Π_{ℓ=1}^{d_k} h_{k,ℓ}(Z_{j_ℓ}) = U_k · Σ_{j∈[d]^{d_k}} Π_{m=1}^{d} (Σ_{s∈{0,1}} (−1)^s s^{#(m in j)}) · Π_{ℓ=1}^{d_k} h_{k,ℓ}(Z_{j_ℓ}).    (C.11)

The sum Σ_{s∈{0,1}} (−1)^s s^{#(m in j)} is non-zero only if m appears in j. Consequently, among all terms that appear in (C.11), only those with degree d_k = d and distinct elements in j have a non-zero contribution. More specifically,⁶

h′_k(Z) = (−1)^d · 1(d_k = d) · U_k · Σ_{g∈S_d} Π_{j=1}^{d} h_{k,g(j)}(Z_j).    (C.12)

⁶Here S_d denotes the symmetric group of degree d.

Recall that f′ is a linear combination of the h′_k's; consequently, it is a multilinear function. Now we prove that f′ is non-zero. From equation (C.12), when all the Z_j's are identical, f′(Z) equals the evaluation of the highest-degree terms of f, with Z_j as the input, multiplied by the constant (−1)^d d!. Given that the highest-degree terms cannot be zero, and that (−1)^d d! is non-zero as long as the characteristic of the field F is 0 or greater than d, we have proved that f′ is non-zero.

Appendix D

Supplement to Chapter 7

D.1 Proof of Lemma 7.1

To prove Lemma 7.1, it suffices to find a linear combination of {g(X̃^{(i,j)})}_{i∈[d−1]} for each j ∈ [K] that computes Q_j.
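(As a quick aside, the multilinearization of Lemma C.2 proved above can be sanity-checked numerically. The sketch below uses a hypothetical scalar example, with V = U = Q and f a degree-3 polynomial; it verifies that f′ is multilinear and that on the diagonal f′ recovers the leading form scaled by (−1)^d d!.)

```python
from itertools import combinations

def multilinearize(f, d):
    """Return f'(z_1,...,z_d) = sum_{S subseteq [d]} (-1)^{|S|} f(sum_{j in S} z_j),
    the construction of Lemma C.2 applied to a degree-d polynomial f."""
    def f_prime(*z):
        assert len(z) == d
        total = 0
        for r in range(d + 1):
            for S in combinations(range(d), r):
                total += (-1) ** r * f(sum(z[j] for j in S))
        return total
    return f_prime

f = lambda x: x**3 + 2 * x**2 + 5        # degree d = 3; lower-order terms cancel
fp = multilinearize(f, 3)

# f' is multilinear: scaling one argument scales the output
assert fp(6, 1, 4) == 3 * fp(2, 1, 4)
# on the diagonal, f' equals the leading form times (-1)^d * d!  (here -6 * z^3)
assert fp(2, 2, 2) == (-1)**3 * 6 * 2**3
```

Over the integers the characteristic condition of the lemma is vacuous; over a field of characteristic at most d, the factor d! could vanish and the check would fail.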
As mentioned in Section 7.3, we construct this function by viewing{g(X (i,j) )} i∈[d−1] , g(X j ), g(P j−1 ), and g(P j ) as evaluations of a degree d polynomial at d + 2 distinct points. Specifically, we view the coded variables{X (i,j) } i∈[d−1] as evaluations of a scalar input linear function, each at point β i c−j+1 c , which evaluates X j and P j−1 at 0 and 1. One can verify that the same linear function givesP j at point c−j+1 c−j . Moreover, thesed + 2 evaluations are at distinct points due to the requirements we imposed on the values of c and β i ’s. After applying function g, the corresponding results become evaluations of a degree d polynomial at d + 2 points. Hence, any one of them can be interpolated using the other d + 1 values. By interpolating g(X j ), we have g(X j ) = X i∈[d−1] g( ˜ X (i,j) ) c−j+1 c−j (1−β i c−j+1 c )( c−j+1 c−j −β i c−j+1 c ) Y i 0 ∈[d−1]\{i} β i 0 β i 0−β i +g(P j−1 ) c−j+1 c−j c−j+1 c−j − 1 Y i∈[d−1] β i c−j+1 c β i c−j+1 c − 1 +g(P j ) 1 1− c−j+1 c−j Y i∈[d−1] β i c−j+1 c β i c−j+1 c − c−j+1 c−j . (D.1) 183 Recall the definition of Q j , equation (D.1) is equivalently Q j = X i∈[d−1] g( ˜ X (i,j) ) c−j+1 c−j (1−β i c−j+1 c )( c−j+1 c−j −β i c−j+1 c ) Y i 0 ∈[d−1]\{i} β i 0 β i 0−β i . (D.2) Thus, Q j can be computed using a linear combination of{g( ˜ X (i,j) )} i∈[d−1] . D.2 Proof of Lemma 7.2 When deg g = 1, we need to prove that at least two workers are needed to compute f with data-privacy. The converse for this case is trivial, because we require that every single worker stores a random value that is independent of the input dataset, at least 1 extra worker is needed to provide any information. Hence, we focus on the case where degg> 1. The rest of the proof relies on a converse bound we proved in [20] for a polynomial evaluation problem, stated as follows. Lemma D.1 (Yu et al.). Consider a dataset with inputs X 1 ,...,X K , and a multilinear function g with degree d≥ 1. 
Any data-private linear encoding scheme that computes g(X 1 ),...,g(X K ) requires at least Kd + 1 workers. Based on Lemma D.1, we essentially need to prove that computing any gradient-type function with a multilinear g with degree d requires at least one more worker compared to evaluating multilinear functions with degree d− 1. We prove this fact by constructing a degree d− 1 multilinear function g 0 and a corresponding scheme that evaluates g 0 usingN− 1 workers, given any multilinear function g with degree d and any scheme for gradient-type computation which uses N workers. Specifically, consider any fixed multilinear function g :V 1 ×...×V d →U with d inputs denoted by X i,1 ,...,X i,d , we let g 0 be a function that maps V 1 ×...×V d−1 to a space of linear functions, such that g 0 (X i,1 ,...,X i,d−1 )(X i,d ) =g(X i,1 ,...,X i,d ) for any values of X i,1 ,...,X i,d . One can verify that g 0 is a multilinear map with degree d− 1. Then consider any fixed valid data-private encoding scheme for a gradient-type computation task specified by g that uses N workers, we construct an encoding scheme for evaluating g 0 using N− 1 workers, such that each worker i encodes the input variables as if encoding only the first d− 1 entries using the encoding function of worker i in the gradient-type computation scheme. For brevity, we refer to these two schemes as g-scheme and g 0 -scheme. 184 Due to the linearity of encoding functions, one can show thatg 0 -scheme is data-private ifg-scheme is data-private. Hence, it remains to prove the validity of g 0 -scheme. We first prove that for any fixed X 1,d ,...,X K,d , g 0 -scheme can compute K P i=1 g 0 (X i,1 ,...,X i,d−1 )(X i,d ). This decodability can be achieved in 2 steps. First, due to the data-privacy requirement of g-scheme, given any fixed X 1,d ,...,X K,d , one can find values of the padded random variables for the dth entry, such that the coded dth entry stored by worker N equals 0. 
We denote the corresponding coded entry stored by any other worker i by X̃_{i,d}. In this case, worker N would return 0 if the g-scheme were used, due to the multilinearity of g. Second, given the results from the N − 1 workers, the master can evaluate them at the points X̃_{1,d}, ..., X̃_{N−1,d}, respectively, which essentially recovers the computing results of workers 1, ..., N − 1 in the g-scheme. Knowing that the absent worker N would return 0, the needed gradient function Σ_{i=1}^{K} g′(X_{i,1}, ..., X_{i,d−1})(X_{i,d}) can thus be computed directly using the decoding function of the g-scheme.

Now, to recover each g′(X_{i,1}, ..., X_{i,d−1}) individually, it is equivalent to recover g′(X_{i,1}, ..., X_{i,d−1})(X_{i,d}) individually for any X_{i,d} ∈ V_d. This can be done by simply zeroing all the other X_{i′,d}'s, which completes the proof of validity. Now that the g′-scheme is valid and data-private, according to Lemma D.1 it uses at least K(d − 1) + 1 workers. Hence the g-scheme uses at least K(d − 1) + 2 workers. As the above proof holds for any possible g-scheme, we have proved the converse bound in Lemma 7.2.

Appendix E

Supplement to Chapter 9

E.1 Proof of Lemma 9.1

Proof. For q ∈ {1, ..., Q} and n ∈ {1, ..., N}, we let V_{q,n} be i.i.d. random variables uniformly distributed on F_2^T. We let the intermediate values v_{q,n} be the realizations of V_{q,n}. For any Q ⊆ {1, ..., Q} and N ⊆ {1, ..., N}, we define

V_{Q,N} ≜ {V_{q,n} : q ∈ Q, n ∈ N}.    (E.1)

Since each message X_k is generated as a function of the intermediate values that are computed at node k, the following equation holds for all k ∈ {1, ..., K}:¹

H(X_k | V_{[Q],M_k}) = 0.    (E.2)

The validity of the shuffling scheme requires that for all k ∈ {1, ..., K}, the following equation holds:

H(V_{W_k,[N]} | X_{[K]}, V_{[Q],M_k}) = 0.    (E.3)

Given M and W, for any disjoint subsets of users S and D, we denote the number of intermediate values that are exclusively available at the servers in S, and exclusively needed by (but not available at)

¹[Q] ≜ {1, ..., Q}.
186 servers inD, by a S,D , i.e.: a S,D =|((∩ k∈S M k )\(∪ i/ ∈S M i ))∩ ((∩ k∈D W k )\( ∪ i/ ∈D∪S W i ))|. (E.4) For any subsetC⊆{1,...,K}, letC { ={1,...,K}\C. We define Y C { , (V W C { ,[N] ,V [Q],M C { ). (E.5) We denote the number of intermediate values that are exclusively available at s servers inC, and exclusively needed by (but not available at) d users inC, by a s,d,C , i.e.: a s,d,C = X S⊆C |S|=s X D⊆C\S |D|=d a S,D . (E.6) Then we prove the following statement by induction: Claim E.1. For any subsetC⊆{1,...,K}, we have H(X C |Y C { )≥T |C| P s=1 |C|−s P d=1 a s,d,C · d s+d−1 . a. IfC =∅, obviously H(X ∅ |Y ∅ c)≥ 0 =T 0 X s=1 0−s X d=1 a s,d,∅ · d s +d− 1 . (E.7) b. Suppose the statement is true for all subsets of size C 0 . For anyC⊆{1,...,K} of size|C| =C 0 + 1, and all k∈C, the subset version of (E.2) and (E.3) can be derived: H(X k |V [Q],M k ,Y C { ) = 0, (E.8) H(V W k ,[N] |X C ,V [Q],M k ,Y C { ) = 0. (E.9) Consequently, the following equation holds: H(X C |V [Q],M k ,Y C { ) =H(X C |V W k ,[N] ,V [Q],M k ,Y C { ) +H(V W k ,[N] |V [Q],M k ,Y C { ). (E.10) 187 Next we lower bound H(X C |Y C { ) as follows: H(X C |Y C { ) = 1 |C| X k∈C H(X C ,X k |Y C { ) (E.11) = 1 |C| X k∈C (H(X C |X k ,Y C { ) +H(X k |Y C { )) (E.12) ≥ 1 |C| X k∈C H(X C |X k ,Y C { ) + 1 |C| H(W C |Y C { ). (E.13) From (E.13), we can derive a lower bound on H(W C |Y C { ) that equals the LHS of (E.10) scaled by 1 C 0 : H(X C |Y C { )≥ 1 |C|− 1 X k∈C H(X C |X k ,Y C { ) (E.14) ≥ 1 C 0 X k∈C H(X C |X k ,V [Q],M k ,Y C { ) (E.15) = 1 C 0 X k∈C H(X C |V [Q],M k ,Y C { ). (E.16) The first term on the RHS of (E.10) is lower bounded by the induction assumption: H(X C |V W k ,[N] ,V [Q],M k ,Y S c) =H(X C\{k} |Y (C\{k}) { ) (E.17) ≥T C 0 X s=1 C 0 −s X d=1 a s,d,C\{k} · d s +d− 1 (E.18) =T X S⊆C\{k} |S|≥1 X D⊆C\{k}\S |D|≥1 a S,D · |D| |S| +|D|− 1 (E.19) =T X S⊆C |S|≥1 X D⊆C\S |D|≥1 a S,D · |D|· 1(k / ∈S∪D) |S| +|D|− 1 . 
(E.20) The second term on the RHS of (E.10) can be calculated based on the independence of intermediate values: H(V W k ,[N] |V [Q],M k ,Y C { ) (E.21) =H(V W k ,[N] |V [Q],M k ,V W C { ,[N] ,V [Q],M C { ) (E.22) =T X S⊆C\{k} X D⊆C\S k∈D a S,D (E.23) 188 ≥T X S⊆C\{k} |S|≥1 X D⊆C\S k∈D a S,D (E.24) =T X S⊆C\{k} |S|≥1 X D⊆C\S |D|≥1 a S,D · 1(k∈D). (E.25) Thus by (E.10), (E.16), (E.20) and (E.25), we have H(W C |Y C { )≥ 1 C 0 X k∈C H(X C |V [Q],M k ,Y C { ) (E.26) = 1 C 0 X k∈C (H(X C |V W k ,[N] ,V [Q],M k ,Y C { ) +H(V W k ,[N] |V [Q],M k ,Y C { )) (E.27) ≥ T C 0 X k∈C X S⊆C |S|≥1 X D⊆C\S |D|≥1 a S,D ( |D|· 1(k / ∈S∪D) |S| +|D|− 1 + 1(k∈D)) (E.28) = T C 0 X S⊆C |S|≥1 X D⊆C\S |D|≥1 a S,D X k∈C ( |D|· 1(k / ∈S∪D) |S| +|D|− 1 + 1(k∈D)) (E.29) = T C 0 X S⊆C |S|≥1 X D⊆C\S |D|≥1 a S,D ( |D|· (|C|−|S|−|D|) |S| +|D|− 1 +|D|) (E.30) = T C 0 X S⊆C |S|≥1 X D⊆C\S |D|≥1 a S,D |D|· (|C|− 1) |S| +|D|− 1 (E.31) =T X S⊆C |S|≥1 X D⊆C\S |D|≥1 a S,D |D| |S| +|D|− 1 . (E.32) From the definition of a s,d,C and (E.32) , we have: H(W C |Y C { )≥T |C| X s=1 |C|−s X d=1 a s,d,C d s +d− 1 . (E.33) c. Thus for all subsetsC⊆{1,...,K}, the following equation holds: H(X C |Y C { )≥T |C| X s=1 |C|−s X d=1 a s,d,C d s +d− 1 , (E.34) which proves Claim 1. 189 Then by Claim 1, letC ={1,...,K} be the set of all K users, L≥ H(X C |Y C { ) QNT ≥ 1 QN K X s=1 K−s X d=1 a s,d d s +d− 1 . (E.35) This completes the proof of Lemma 1. 190 Appendix F Supplement to Chapter 10 F.1 Proof of Lemma 10.2 Proof. The Proof of Lemma 10.2 is organized as follows: We start by proving a lower bound of the communication rate required for a single demand, i.e., R ∗ (d,M). By averaging this lower bound over a single demand typeD s , we automatically obtain a lower bound for the rate R ∗ (s,M). Finally we bound the minimum possible R ∗ (s,M) over all prefetching schemes by solving for the minimum value of our derived lower bound. 
We first use a genie-aided approach to derive a lower bound on R*(d, M) for any demand d and any prefetching M. Given a demand d, let U = {u_1, ..., u_{N_e(d)}} be an arbitrary subset of N_e(d) users that request distinct files. We construct a virtual user whose cache is initially empty. Suppose that for each ℓ ∈ {1, ..., N_e(d)}, a genie fills the cache with the values of the bits that are cached by u_ℓ, but not from files requested by users in {u_1, ..., u_{ℓ−1}}. Then, with all the cached information provided by the genie, the virtual user should be able to inductively decode all files requested by the users in U upon receiving the message X. Consequently, a lower bound on the communication rate R*(d, M) can be obtained by applying a cut-set bound to the virtual user.

Specifically, we prove that the virtual user can decode all N_e(d) requested files with high probability, by inductively decoding each file d_{u_ℓ} using the decoding function of user u_ℓ, from ℓ = 1 to ℓ = N_e(d). Recall that a communication rate is ε-achievable if the error probability of each decoding function is at most ε. Consequently, the probability that all N_e(d) decoding functions correctly decode the requested files is at least 1 − N_e(d)ε. In this scenario, the virtual user can correctly decode all the files, given that at every step of the induction, all bits necessary for the decoding function have either been provided by the genie or decoded in previous inductive steps.

Given this decodability, we can lower bound the needed communication load using Fano's inequality:

R*(d, M)·F ≥ H({W_{d_{u_ℓ}}}_{ℓ=1}^{N_e(d)} | bits cached by the virtual user) − (1 + N_e²(d)εF).    (F.1)

Recall that all bits in the library are i.i.d. and uniformly random; the cut-set bound in the above inequality thus essentially equals the number of bits in the N_e(d) requested files that are not cached by the virtual user. This set includes all bits in each file d_{u_ℓ} that are not cached by any user in {u_1, ..., u_ℓ}.
Hence, the above lower bound can be written as
\begin{align}
R^*_\epsilon(\mathbf{d},\mathcal{M})F\geq \sum_{\ell=1}^{N_e(\mathbf{d})}\sum_{j=1}^{F}\mathbb{1}\big(B_{d_{u_\ell},j}\text{ is not cached by any user in }\{u_1,\ldots,u_\ell\}\big)-\big(1+N_e^2(\mathbf{d})\epsilon F\big),\tag{F.2}
\end{align}
where $B_{i,j}$ denotes the $j$th bit of file $i$. To simplify the discussion, we let $\mathcal{K}_{i,j}$ denote the subset of users that cache $B_{i,j}$. The above lower bound can be equivalently written as
\begin{align}
R^*_\epsilon(\mathbf{d},\mathcal{M})F\geq \sum_{\ell=1}^{N_e(\mathbf{d})}\sum_{j=1}^{F}\mathbb{1}\big(\mathcal{K}_{d_{u_\ell},j}\cap\{u_1,\ldots,u_\ell\}=\emptyset\big)-\big(1+N_e^2(\mathbf{d})\epsilon F\big).\tag{F.3}
\end{align}
Using the above inequality, we derive a lower bound on the average rate as follows. For any positive integer $i$, we denote the set of all permutations on $\{1,\ldots,i\}$ by $\mathcal{P}_i$. Then, for each $p_1\in\mathcal{P}_K$ and $p_2\in\mathcal{P}_N$, given a demand $\mathbf{d}$ we define $\mathbf{d}(p_1,p_2)$ as the demand satisfying, for each user $k$, $d_k(p_1,p_2)=p_2\big(d_{p_1^{-1}(k)}\big)$. We can then apply the above bound to any demand $\mathbf{d}(p_1,p_2)$:
\begin{align}
R^*_\epsilon(\mathbf{d}(p_1,p_2),\mathcal{M})F\geq \sum_{\ell=1}^{N_e(\mathbf{d})}\sum_{j=1}^{F}\mathbb{1}\big(\mathcal{K}_{p_2(d_{u_\ell}),j}\cap\{p_1(u_1),\ldots,p_1(u_\ell)\}=\emptyset\big)-\big(1+N_e^2(\mathbf{d})\epsilon F\big).\tag{F.4}
\end{align}
It is easy to verify that when taking the average of (F.4) over all pairs $(p_1,p_2)$, only the rates for demands of type $\mathcal{D}_{s(\mathbf{d})}$ are counted, and each of them is counted the same number of times due to symmetry. Consequently, this approach provides us with a lower bound on the average rate within type $\mathcal{D}_{s(\mathbf{d})}$, stated as follows:
\begin{align}
R^*_\epsilon(s(\mathbf{d}),\mathcal{M})&=\frac{1}{K!N!}\sum_{p_1\in\mathcal{P}_K}\sum_{p_2\in\mathcal{P}_N}R^*_\epsilon(\mathbf{d}(p_1,p_2),\mathcal{M})\tag{F.5}\\
&\geq \frac{1}{K!N!F}\sum_{p_1\in\mathcal{P}_K}\sum_{p_2\in\mathcal{P}_N}\sum_{\ell=1}^{N_e(\mathbf{d})}\sum_{j=1}^{F}\mathbb{1}\big(\mathcal{K}_{p_2(d_{u_\ell}),j}\cap\{p_1(u_1),\ldots,p_1(u_\ell)\}=\emptyset\big)-\Big(\frac1F+N_e^2(\mathbf{d})\epsilon\Big).\tag{F.6}
\end{align}
We aim to simplify the above lower bound in order to find its minimum and prove Lemma 10.2. To do so, we first exchange the order of the summations and evaluate $\frac{1}{K!}\sum_{p_1\in\mathcal{P}_K}\mathbb{1}\big(\mathcal{K}_{p_2(d_{u_\ell}),j}\cap\{p_1(u_1),\ldots,p_1(u_\ell)\}=\emptyset\big)$. This is essentially the probability of selecting $\ell$ distinct users $\{p_1(u_1),\ldots,p_1(u_\ell)\}$ uniformly at random, such that none of them belongs to $\mathcal{K}_{p_2(d_{u_\ell}),j}$.
Out of the $\binom{K}{\ell}$ subsets, $\binom{K-|\mathcal{K}_{p_2(d_{u_\ell}),j}|}{\ell}$ of them satisfy this condition,¹ which gives the following identity:
\begin{align}
\frac{1}{K!}\sum_{p_1\in\mathcal{P}_K}\mathbb{1}\big(\mathcal{K}_{p_2(d_{u_\ell}),j}\cap\{p_1(u_1),\ldots,p_1(u_\ell)\}=\emptyset\big)=\frac{\binom{K-|\mathcal{K}_{p_2(d_{u_\ell}),j}|}{\ell}}{\binom{K}{\ell}}.\tag{F.7}
\end{align}
Hence, inequality (F.6) can be simplified based on (F.7):
\begin{align}
R^*_\epsilon(s(\mathbf{d}),\mathcal{M})&\geq\frac{1}{N!F}\sum_{p_2\in\mathcal{P}_N}\sum_{\ell=1}^{N_e(\mathbf{d})}\sum_{j=1}^{F}\frac{1}{K!}\sum_{p_1\in\mathcal{P}_K}\mathbb{1}\big(\mathcal{K}_{p_2(d_{u_\ell}),j}\cap\{p_1(u_1),\ldots,p_1(u_\ell)\}=\emptyset\big)-\Big(\frac1F+N_e^2(\mathbf{d})\epsilon\Big)\tag{F.8}\\
&=\frac{1}{N!F}\sum_{p_2\in\mathcal{P}_N}\sum_{\ell=1}^{N_e(\mathbf{d})}\sum_{j=1}^{F}\frac{\binom{K-|\mathcal{K}_{p_2(d_{u_\ell}),j}|}{\ell}}{\binom{K}{\ell}}-\Big(\frac1F+N_e^2(\mathbf{d})\epsilon\Big).\tag{F.9}
\end{align}
We further simplify this result by computing the summation over $p_2$ and $j$, i.e., by evaluating $\frac{1}{N!F}\sum_{p_2\in\mathcal{P}_N}\sum_{j=1}^{F}\binom{K-|\mathcal{K}_{p_2(d_{u_\ell}),j}|}{\ell}$. This is essentially the expectation of $\binom{K-|\mathcal{K}_{i,j}|}{\ell}$ over a uniformly randomly selected bit $B_{i,j}$. Let $a_n$ denote the number of bits in the database that are cached by exactly $n$ users; then $|\mathcal{K}_{i,j}|=n$ holds for an $\frac{a_n}{NF}$ fraction of the bits. Consequently, we have
\begin{align}
\frac{1}{N!F}\sum_{p_2\in\mathcal{P}_N}\sum_{j=1}^{F}\binom{K-|\mathcal{K}_{p_2(d_{u_\ell}),j}|}{\ell}=\sum_{n=0}^{K}\frac{a_n}{NF}\binom{K-n}{\ell}.\tag{F.10}
\end{align}
We simplify (F.9) using the above identity:
\begin{align}
R^*_\epsilon(s(\mathbf{d}),\mathcal{M})&\geq\sum_{\ell=1}^{N_e(\mathbf{d})}\frac{1}{N!F}\sum_{p_2\in\mathcal{P}_N}\sum_{j=1}^{F}\frac{\binom{K-|\mathcal{K}_{p_2(d_{u_\ell}),j}|}{\ell}}{\binom{K}{\ell}}-\Big(\frac1F+N_e^2(\mathbf{d})\epsilon\Big)\tag{F.11}\\
&=\sum_{\ell=1}^{N_e(\mathbf{d})}\sum_{n=0}^{K}\frac{a_n}{NF}\cdot\frac{\binom{K-n}{\ell}}{\binom{K}{\ell}}-\Big(\frac1F+N_e^2(\mathbf{d})\epsilon\Big).\tag{F.12}
\end{align}
It can be easily shown that
\begin{align}
\frac{\binom{K-n}{\ell}}{\binom{K}{\ell}}=\frac{\binom{K-\ell}{n}}{\binom{K}{n}}\tag{F.13}
\end{align}
and
\begin{align}
\sum_{\ell=1}^{N_e(\mathbf{d})}\binom{K-\ell}{n}=\binom{K}{n+1}-\binom{K-N_e(\mathbf{d})}{n+1}.\tag{F.14}
\end{align}
Thus, we can rewrite (F.12) as
\begin{align}
R^*_\epsilon(s(\mathbf{d}),\mathcal{M})\geq\sum_{n=0}^{K}\frac{a_n}{NF}\cdot\frac{\binom{K}{n+1}-\binom{K-N_e(\mathbf{d})}{n+1}}{\binom{K}{n}}-\Big(\frac1F+N_e^2(\mathbf{d})\epsilon\Big).\tag{F.15}
\end{align}
Hence, for any $s\in\mathcal{S}$, by arbitrarily selecting a demand $\mathbf{d}\in\mathcal{D}_s$ and applying the above inequality, the following bound holds for any prefetching $\mathcal{M}$:
\begin{align}
R^*_\epsilon(s,\mathcal{M})\geq\sum_{n=0}^{K}\frac{a_n}{NF}\cdot\frac{\binom{K}{n+1}-\binom{K-N_e(s)}{n+1}}{\binom{K}{n}}-\Big(\frac1F+N_e^2(s)\epsilon\Big).\tag{F.16}
\end{align}
Having proved a lower bound on $R^*_\epsilon(s,\mathcal{M})$, we proceed to bound its minimum possible value over all prefetching schemes. Let $c_n$ denote the sequence
\begin{align}
c_n=\frac{\binom{K}{n+1}-\binom{K-N_e(s)}{n+1}}{\binom{K}{n}}.\tag{F.17}
\end{align}
We have
\begin{align*}
R^*_\epsilon(s,\mathcal{M})\geq\sum_{n=0}^{K}\frac{a_n}{NF}\cdot c_n-\Big(\frac1F+N_e^2(s)\epsilon\Big).
\end{align*}

¹Recall that we define $\binom{n}{k}=0$ when $k>n$.
(F.18)

We denote the lower convex envelope of $c_n$, i.e., the lower convex envelope of the points $\{(t,c_t)\mid t\in\{0,1,\ldots,K\}\}$, by $\mathrm{Conv}(c_t)$. Note that $c_n$ is a decreasing sequence, so its lower convex envelope is a decreasing and convex function. Because every prefetching satisfies
\begin{align}
\sum_{n=0}^{K}a_n&=NF,\tag{F.19}\\
\sum_{n=0}^{K}n\,a_n&\leq NFt,\tag{F.20}
\end{align}
we can lower bound (F.18) using Jensen's inequality and the monotonicity of $\mathrm{Conv}(c_t)$:
\begin{align}
R^*_\epsilon(s,\mathcal{M})\geq \mathrm{Conv}(c_t)-\Big(\frac1F+N_e^2(s)\epsilon\Big).\tag{F.21}
\end{align}
Consequently,
\begin{align}
\min_{\mathcal{M}}R^*_\epsilon(s,\mathcal{M})&\geq\min_{\mathcal{M}}\mathrm{Conv}(c_t)-\Big(\frac1F+N_e^2(s)\epsilon\Big)\tag{F.22}\\
&=\mathrm{Conv}(c_t)-\Big(\frac1F+N_e^2(s)\epsilon\Big)\tag{F.23}\\
&=\mathrm{Conv}\left(\frac{\binom{K}{t+1}-\binom{K-N_e(s)}{t+1}}{\binom{K}{t}}\right)-\Big(\frac1F+N_e^2(s)\epsilon\Big).\tag{F.24}
\end{align}

F.2 Minimum Peak Rate for Centralized Caching

Consider a caching problem with $K$ users, a database of $N$ files, and a local cache size of $M$ files for each user. We define the rate-memory tradeoff for the peak rate as follows. Similar to the average-rate case, for each prefetching $\mathcal{M}$, let $R^{*,\mathrm{peak}}_\epsilon(\mathcal{M})$ denote the peak rate, defined as
\begin{align*}
R^{*,\mathrm{peak}}_\epsilon(\mathcal{M})=\max_{\mathbf{d}}R^*_\epsilon(\mathbf{d},\mathcal{M}).
\end{align*}
We aim to find the minimum peak rate $R^*_{\mathrm{peak}}$, where
\begin{align*}
R^*_{\mathrm{peak}}=\sup_{\epsilon>0}\limsup_{F\to+\infty}\min_{\mathcal{M}}R^{*,\mathrm{peak}}_\epsilon(\mathcal{M}),
\end{align*}
which is a function of $N$, $K$, and $M$. We now prove Corollary 10.1, which completely characterizes the value of $R^*_{\mathrm{peak}}$.

Proof. It is easy to show that the rate stated in Corollary 10.1 can be exactly achieved using the caching scheme introduced in Section 10.3. Hence, we focus on proving the optimality of the proposed coding scheme. Recall the definitions of statistics and types (see Section 10.4). Given a prefetching $\mathcal{M}$ and statistics $s$, we define the peak rate within type $\mathcal{D}_s$, denoted by $R^{*,\mathrm{peak}}_\epsilon(s,\mathcal{M})$, as
\begin{align}
R^{*,\mathrm{peak}}_\epsilon(s,\mathcal{M})=\max_{\mathbf{d}\in\mathcal{D}_s}R^*_\epsilon(\mathbf{d},\mathcal{M}).\tag{F.25}
\end{align}
Note that
\begin{align}
R^*_{\mathrm{peak}}&=\sup_{\epsilon>0}\limsup_{F\to+\infty}\min_{\mathcal{M}}\max_{s}R^{*,\mathrm{peak}}_\epsilon(s,\mathcal{M})\tag{F.26}\\
&\geq\sup_{\epsilon>0}\limsup_{F\to+\infty}\max_{s}\min_{\mathcal{M}}R^{*,\mathrm{peak}}_\epsilon(s,\mathcal{M}).\tag{F.27}
\end{align}
Hence, in order to lower bound $R^*_{\mathrm{peak}}$, it is sufficient to bound the minimum value of $R^{*,\mathrm{peak}}_\epsilon(s,\mathcal{M})$ for each type $\mathcal{D}_s$ individually.
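The quantity $\mathrm{Conv}(c_t)$ appearing in (F.24), and again in the bounds that follow, can be computed explicitly. As an illustration (a Python sketch, not part of the dissertation; the function names are hypothetical), the following builds the sequence $c_n$ of (F.17), sanity-checks the hockey-stick identity (F.14) behind it, and evaluates the lower convex envelope by piecewise-linear interpolation along the lower convex hull of the points $(n, c_n)$.

```python
from math import comb  # Python's comb(n, k) returns 0 when k > n, matching the footnote's convention


def c_seq(K, Ne):
    """The sequence c_n of (F.17), for n = 0, ..., K."""
    # sanity check of the hockey-stick identity (F.14) behind this formula
    for n in range(K + 1):
        assert sum(comb(K - l, n) for l in range(1, Ne + 1)) == comb(K, n + 1) - comb(K - Ne, n + 1)
    return [(comb(K, n + 1) - comb(K - Ne, n + 1)) / comb(K, n) for n in range(K + 1)]


def lower_hull(ys):
    """Vertices of the lower convex hull of the points (n, ys[n])."""
    hull = []
    for x, y in enumerate(ys):
        # pop the last vertex while it lies on or above the chord to the new point
        while len(hull) >= 2 and (hull[-1][1] - hull[-2][1]) * (x - hull[-2][0]) >= \
                (y - hull[-2][1]) * (hull[-1][0] - hull[-2][0]):
            hull.pop()
        hull.append((x, y))
    return hull


def conv(ys, t):
    """Evaluate Conv(c_t) at a possibly fractional t via the lower hull."""
    hull = lower_hull(ys)
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= t <= x2:
            return y1 + (y2 - y1) * (t - x1) / (x2 - x1)
    return ys[-1]


ys = c_seq(K=6, Ne=4)
assert conv(ys, 0) == ys[0] and conv(ys, 6) == ys[-1]       # endpoints lie on the envelope
assert all(conv(ys, n) <= ys[n] + 1e-12 for n in range(7))  # the envelope never exceeds the sequence
```

Since $c_n$ is decreasing, the resulting envelope is decreasing and convex, as used in the Jensen step (F.21).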
Using Lemma 10.2, the following bound holds for each $s\in\mathcal{S}$:
\begin{align}
\min_{\mathcal{M}}R^{*,\mathrm{peak}}_\epsilon(s,\mathcal{M})&\geq\min_{\mathcal{M}}R^*_\epsilon(s,\mathcal{M})\tag{F.28}\\
&\geq\mathrm{Conv}\left(\frac{\binom{K}{t+1}-\binom{K-N_e(s)}{t+1}}{\binom{K}{t}}\right)-\Big(\frac1F+N_e^2(s)\epsilon\Big).\tag{F.29}
\end{align}
Consequently,
\begin{align}
R^*_{\mathrm{peak}}&\geq\sup_{\epsilon>0}\limsup_{F\to+\infty}\max_{s}\left[\mathrm{Conv}\left(\frac{\binom{K}{t+1}-\binom{K-N_e(s)}{t+1}}{\binom{K}{t}}\right)-\Big(\frac1F+N_e^2(s)\epsilon\Big)\right]\tag{F.30}\\
&=\mathrm{Conv}\left(\frac{\binom{K}{t+1}-\binom{K-\min\{N,K\}}{t+1}}{\binom{K}{t}}\right).\tag{F.31}
\end{align}

Remark F.1 (Universal Optimality of Symmetric Batch Prefetching, Peak Rate). Inequality (F.29) characterizes the minimum peak rate for a given type $\mathcal{D}_s$ when the prefetching $\mathcal{M}$ may be designed based on $s$. However, for (F.27) to be tight, the peak rates for all types must be minimized by the same prefetching. Surprisingly, such an optimal prefetching exists, an example being the symmetric batch prefetching, according to Section 10.3. This indicates that the symmetric batch prefetching is also universally optimal for all types in terms of peak rates.

F.3 Proof of Theorem 10.2

To completely characterize $\mathcal{R}$, we propose decentralized caching schemes that achieve all points in $\mathcal{R}$. We also prove a matching information-theoretic outer bound on the achievable region, which implies that no point outside $\mathcal{R}$ is achievable.

F.3.1 The Optimal Decentralized Caching Scheme

To prove the achievability of $\mathcal{R}$, we need to provide an optimal decentralized prefetching scheme $P_{\mathcal{M};F}$, an optimal delivery scheme for every possible user demand $\mathbf{d}$ that achieves the corner point of $\mathcal{R}$, and a valid decoding algorithm for the users. The main idea of our achievability scheme is to first design a decentralized prefetching scheme such that the resulting content delivery problem can be viewed as a list of sub-problems, each of which can be solved individually using the techniques we already developed for the centralized setting. We then solve this delivery problem optimally by greedily applying our proposed centralized delivery and decoding scheme. We consider the following optimal prefetching scheme: each user caches $\frac{MF}{N}$ bits of each file, chosen uniformly and independently at random.
This prefetching scheme was originally proposed in [144]. For convenience, we refer to it as the uniformly random prefetching scheme. Under this scheme, each bit in the database is cached by a random subset of the $K$ users.

During the delivery phase, we first greedily categorize all bits based on the number of users that cache each bit; then, within each category, we deliver the corresponding messages in an opportunistic way using the delivery scheme described in Section 10.3 for centralized caching. For any demand $\mathbf{d}$ in which $K$ users make requests, and any realization of the prefetching on these $K$ users, we divide the bits in the database into $K+1$ sets: for each $j\in\{0,1,\ldots,K\}$, let $\mathcal{B}_j$ denote the set of bits that are cached by exactly $j$ users. To deliver the requested files to the $K$ users, it suffices to deliver the corresponding bits in each $\mathcal{B}_j$ individually.

Within each $\mathcal{B}_j$, first note that with high probability for large $F$, the number of bits belonging to each file is approximately $\binom{K}{j}\big(\frac MN\big)^j\big(1-\frac MN\big)^{K-j}F+o(F)$, which is the same across all files. Furthermore, for any subset $\mathcal{K}\subseteq\{1,\ldots,K\}$ of size $j$, a total of $\big(\frac MN\big)^j\big(1-\frac MN\big)^{K-j}F+o(F)$ bits of file $i$ are cached exclusively by the users in $\mathcal{K}$, which is a $1/\binom{K}{j}$ fraction of the bits in $\mathcal{B}_j$ that belong to file $i$. This is effectively the symmetric batch prefetching, and hence we can directly apply the same delivery and decoding scheme to deliver all requested bits within this subset. Recall that in the centralized setting, when each file has size $F$ and each bit is cached by exactly $t$ users, our proposed delivery scheme achieves a communication load of $\frac{\binom{K}{t+1}-\binom{K-N_e(\mathbf{d})}{t+1}}{\binom{K}{t}}F$. Hence, to deliver all requested bits within $\mathcal{B}_j$, where the equivalent file size approximately equals $\binom{K}{j}\big(\frac MN\big)^j\big(1-\frac MN\big)^{K-j}F$, we need a communication rate of $\big(\frac MN\big)^j\big(1-\frac MN\big)^{K-j}\Big(\binom{K}{j+1}-\binom{K-N_e(\mathbf{d})}{j+1}\Big)$.
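These per-level rates sum to the closed form stated next in (F.32) and (F.33). Assuming a Python environment with exact rational arithmetic, the identity can be confirmed directly (a sketch, not part of the dissertation; the helper names are hypothetical):

```python
from fractions import Fraction
from math import comb  # comb(n, k) == 0 when k > n


def rate_sum(K, Ne, p):
    """Sum over j of (M/N)^j (1-M/N)^(K-j) [C(K, j+1) - C(K-Ne, j+1)], with p = M/N."""
    return sum(p**j * (1 - p)**(K - j) * (comb(K, j + 1) - comb(K - Ne, j + 1))
               for j in range(K + 1))


def rate_closed_form(Ne, p):
    """(N-M)/M * (1 - (1-M/N)^Ne), written in terms of p = M/N."""
    return (1 - p) / p * (1 - (1 - p)**Ne)


p = Fraction(1, 3)  # e.g. M/N = 1/3
for K in range(1, 9):
    for Ne in range(1, K + 1):
        assert rate_sum(K, Ne, p) == rate_closed_form(Ne, p)
```

Using `Fraction` makes the check an exact algebraic equality test for the chosen parameters, with no floating-point tolerance involved.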
Consequently, by applying the delivery scheme for all $j\in\{0,1,\ldots,K\}$, we achieve a total communication rate of
\begin{align}
R_K&=\sum_{j=0}^{K}\Big(\frac MN\Big)^j\Big(1-\frac MN\Big)^{K-j}\cdot\left(\binom{K}{j+1}-\binom{K-N_e(\mathbf{d})}{j+1}\right)\tag{F.32}\\
&=\frac{N-M}{M}\left(1-\Big(1-\frac MN\Big)^{N_e(\mathbf{d})}\right)\tag{F.33}
\end{align}
for any demand $\mathbf{d}$. Hence, for each $K$ we achieve an average rate of $\mathbb{E}\big[\frac{N-M}{M}\big(1-\big(1-\frac MN\big)^{N_e(\mathbf{d})}\big)\big]$, which dominates all points in $\mathcal{R}$. This provides a tight inner bound for Theorem 10.2.

F.3.2 Converse

To prove an outer bound on $\mathcal{R}$, i.e., to bound all rate vectors $\{R_K\}_{K\in\mathbb{N}}$ that can be achieved by a prefetching scheme, it suffices to bound each entry of the vector individually, by providing a lower bound on $R^*_K(P_{\mathcal{M};F})$ that holds for all prefetching schemes. To obtain such a lower bound, for each $K\in\mathbb{N}$ we divide the set of all possible demands into types and derive the minimum average rate within each type separately. For any statistics $s$, we let $R^*_{\epsilon,K}(s,P_{\mathcal{M}})$ denote the average rate within type $\mathcal{D}_s$. Rigorously,
\begin{align}
R^*_{\epsilon,K}(s,P_{\mathcal{M}})=\frac{1}{|\mathcal{D}_s|}\sum_{\mathbf{d}\in\mathcal{D}_s}R^*_{\epsilon,K}(\mathbf{d},P_{\mathcal{M}}).\tag{F.34}
\end{align}
The minimum value of $R^*_{\epsilon,K}(s,P_{\mathcal{M}})$ is lower bounded by the following lemma.

Lemma F.1. Consider a decentralized caching problem with $N$ files and a local cache size of $M$ files for each user. For any type $\mathcal{D}_s$ in which $K$ users make requests, the minimum value of $R^*_{\epsilon,K}(s,P_{\mathcal{M}})$ is lower bounded by
\begin{align}
\min_{P_{\mathcal{M}}}R^*_{\epsilon,K}(s,P_{\mathcal{M}})\geq\frac{N-M}{M}\left(1-\Big(1-\frac MN\Big)^{N_e(s)}\right)-\Big(\frac1F+N_e^2(s)\epsilon\Big).\tag{F.35}
\end{align}

Remark F.2. As proved in Appendix F.3.1, the rate $R^*_{\epsilon,K}(s,P_{\mathcal{M}})$ for any statistics $s$ and any $K$ can be simultaneously minimized using the uniformly random prefetching scheme. This demonstrates that the uniformly random prefetching scheme is universally optimal for the decentralized caching problem in terms of average rates.

Proof. To prove Lemma F.1, we first consider a class of generalized demands, in which not all users in the caching system are required to request a file.
We define a generalized demand $\mathbf{d}=(d_1,\ldots,d_K)\in\{0,1,\ldots,N\}^K$, where a nonzero $d_k$ denotes the index of the file requested by user $k$, while $d_k=0$ indicates that user $k$ is not making a request. We define statistics and their corresponding types in the same way, and let $R^*_{\epsilon,K}(s,\mathcal{M})$ denote the centralized average rate over a generalized type $\mathcal{D}_s$ given prefetching $\mathcal{M}$. For a centralized caching problem, we can easily generalize Lemma 10.2 to the following lemma for generalized demands.

Lemma F.2. Consider a caching problem with $N$ files, $K$ users, and a local cache size of $M$ files for each user. For any generalized type $\mathcal{D}_s$, the minimum value of $R^*_{\epsilon,K}(s,\mathcal{M})$ is lower bounded by
\begin{align}
\min_{\mathcal{M}}R^*_{\epsilon,K}(s,\mathcal{M})\geq\mathrm{Conv}\left(\frac{\binom{K}{t+1}-\binom{K-N_e(s)}{t+1}}{\binom{K}{t}}\right)-\Big(\frac1F+N_e^2(s)\epsilon\Big),\tag{F.36}
\end{align}
where $\mathrm{Conv}(f(t))$ denotes the lower convex envelope of the points $\{(t,f(t))\mid t\in\{0,1,\ldots,K\}\}$.

The above lemma can be proved in exactly the same way as Lemma 10.2, and the universal optimality of the symmetric batch prefetching still holds for generalized demands. For a decentralized caching problem, we can generalize the definition of $R^*_{\epsilon,K}(s,P_{\mathcal{M}})$ correspondingly. One can easily show that, when a decentralized caching scheme is used, the expected value of $R^*_{\epsilon,K}(s,\mathcal{M})$ is no greater than $R^*_{\epsilon,K}(s,P_{\mathcal{M}})$. Consequently,
\begin{align}
R^*_{\epsilon,K}(s,P_{\mathcal{M}})&\geq\mathbb{E}_{\mathcal{M}}\big[R^*_{\epsilon,K}(s,\mathcal{M})\big]\tag{F.37}\\
&\geq\mathrm{Conv}\left(\frac{\binom{K}{t+1}-\binom{K-N_e(s)}{t+1}}{\binom{K}{t}}\right)-\Big(\frac1F+N_e^2(s)\epsilon\Big)\tag{F.38}
\end{align}
for any generalized type $\mathcal{D}_s$ and any $P_{\mathcal{M}}$.

We now prove that the value $R^*_{\epsilon,K}(s,P_{\mathcal{M}})$ is independent of the parameter $K$ given $s$ and $P_{\mathcal{M}}$. Consider a generalized statistics $s$. Let $K_s=\sum_{i=1}^{N}s_i$, which equals the number of active users for demands in $\mathcal{D}_s$. For any caching system with $K>K_s$ users, and for any subset $\mathcal{K}$ of $K_s$ users, let $\mathcal{D}_{\mathcal{K}}$ denote the set of demands in $\mathcal{D}_s$ in which only users in $\mathcal{K}$ are making requests. Note that $\mathcal{D}_s$ equals the union of the disjoint sets $\mathcal{D}_{\mathcal{K}}$ over all subsets $\mathcal{K}$ of size $K_s$.
Thus we have
\begin{align}
R^*_{\epsilon,K}(s,P_{\mathcal{M}})&=\frac{1}{|\mathcal{D}_s|}\sum_{\mathbf{d}\in\mathcal{D}_s}R^*_{\epsilon,K}(\mathbf{d},P_{\mathcal{M}})\tag{F.39}\\
&=\frac{1}{|\mathcal{D}_s|}\sum_{\mathcal{K}:|\mathcal{K}|=K_s}\ \sum_{\mathbf{d}\in\mathcal{D}_{\mathcal{K}}}R^*_{\epsilon,K}(\mathbf{d},P_{\mathcal{M}})\tag{F.40}\\
&=\frac{1}{|\mathcal{D}_s|}\sum_{\mathcal{K}:|\mathcal{K}|=K_s}|\mathcal{D}_{\mathcal{K}}|\,R^*_{\epsilon,K_s}(s,P_{\mathcal{M}})\tag{F.41}\\
&=R^*_{\epsilon,K_s}(s,P_{\mathcal{M}}).\tag{F.42}
\end{align}
Consequently,
\begin{align}
R^*_{\epsilon,K_s}(s,P_{\mathcal{M}})&=\lim_{K\to+\infty}R^*_{\epsilon,K}(s,P_{\mathcal{M}})\tag{F.43}\\
&\geq\lim_{K\to+\infty}\mathrm{Conv}\left(\frac{\binom{K}{t+1}-\binom{K-N_e(s)}{t+1}}{\binom{K}{t}}\right)-\Big(\frac1F+N_e^2(s)\epsilon\Big)\tag{F.44}\\
&=\frac{N-M}{M}\left(1-\Big(1-\frac MN\Big)^{N_e(s)}\right)-\Big(\frac1F+N_e^2(s)\epsilon\Big).\tag{F.45}
\end{align}
Because the above lower bound is independent of the prefetching distribution $P_{\mathcal{M}}$, the minimum value of $R^*_{\epsilon,K_s}(s,P_{\mathcal{M}})$ over all possible prefetchings is also bounded by the same formula. This completes the proof of Lemma F.1.

From Lemma F.1, the following bound holds by definition:
\begin{align}
R^*_K(P_{\mathcal{M};F})&=\sup_{\epsilon>0}\limsup_{F'\to+\infty}\mathbb{E}_s\big[R^*_{\epsilon,K}(s,P_{\mathcal{M};F}(F=F'))\big]\tag{F.46}\\
&\geq\mathbb{E}_{\mathbf{d}}\left[\frac{N-M}{M}\left(1-\Big(1-\frac MN\Big)^{N_e(\mathbf{d})}\right)\right]\tag{F.47}
\end{align}
for any $K\in\mathbb{N}$ and any prefetching scheme $P_{\mathcal{M};F}$. Consequently, any vector $\{R_K\}_{K\in\mathbb{N}}$ in $\mathcal{R}$ satisfies
\begin{align}
R_K&\geq\min_{P_{\mathcal{M};F}}R^*_K(P_{\mathcal{M};F})\tag{F.48}\\
&\geq\mathbb{E}_{\mathbf{d}}\left[\frac{N-M}{M}\left(1-\Big(1-\frac MN\Big)^{N_e(\mathbf{d})}\right)\right]\tag{F.49}
\end{align}
for any $K\in\mathbb{N}$. Hence,
\begin{align}
\mathcal{R}\subseteq\left\{\{R_K\}_{K\in\mathbb{N}}\ \middle|\ R_K\geq\mathbb{E}_{\mathbf{d}}\left[\frac{N-M}{M}\left(1-\Big(\frac{N-M}{N}\Big)^{N_e(\mathbf{d})}\right)\right]\right\}.\tag{F.50}
\end{align}

F.4 Proof of Corollary 10.2

Proof. It is easy to show that all points in $\mathcal{R}_{\mathrm{peak}}$ can be achieved using the decentralized caching scheme introduced in Appendix F.3.1. Hence, we focus on proving the optimality of the proposed decentralized caching scheme. Similar to the average-rate case, we prove an outer bound on $\mathcal{R}_{\mathrm{peak}}$ by bounding $R^*_{K,\mathrm{peak}}(P_{\mathcal{M};F})$ for each $K\in\mathbb{N}$ individually. To do so, we divide the set of all possible demands into types and derive the minimum average rate within each type separately. Recall the definitions of statistics and types (see Section 10.4). Given a caching system with $N$ files and $K$ users, a prefetching distribution $P_{\mathcal{M}}$, and a statistics $s$, we define the peak rate within type $\mathcal{D}_s$, denoted by $R^*_{\epsilon,K,\mathrm{peak}}(s,P_{\mathcal{M}})$, as
\begin{align*}
R^*_{\epsilon,K,\mathrm{peak}}(s,P_{\mathcal{M}})=\max_{\mathbf{d}\in\mathcal{D}_s}R^*_{\epsilon,K}(\mathbf{d},P_{\mathcal{M}}).
\end{align*}
(F.51)

Note that any point $\{R_K\}_{K\in\mathbb{N}}$ in $\mathcal{R}_{\mathrm{peak}}$ satisfies
\begin{align}
R_K&\geq\inf_{P_{\mathcal{M};F}}R^*_{K,\mathrm{peak}}(P_{\mathcal{M};F})\tag{F.52}\\
&=\inf_{P_{\mathcal{M};F}}\sup_{\epsilon>0}\limsup_{F'\to+\infty}\max_{s}\big[R^*_{\epsilon,K,\mathrm{peak}}(s,P_{\mathcal{M};F}(F=F'))\big]\tag{F.53}
\end{align}
for any $K\in\mathbb{N}$. From the min-max inequality, we have
\begin{align}
R_K\geq\sup_{\epsilon>0}\limsup_{F\to+\infty}\max_{s}\Big[\min_{P_{\mathcal{M}}}R^*_{\epsilon,K,\mathrm{peak}}(s,P_{\mathcal{M}})\Big].\tag{F.54}
\end{align}
Hence, in order to outer bound $\mathcal{R}_{\mathrm{peak}}$, it suffices to bound the minimum value of $R^*_{\epsilon,K,\mathrm{peak}}(s,P_{\mathcal{M}})$ for each type $\mathcal{D}_s$ individually. Using Lemma F.1, the following bound holds for each $s\in\mathcal{S}$:
\begin{align}
\min_{P_{\mathcal{M}}}R^*_{\epsilon,K,\mathrm{peak}}(s,P_{\mathcal{M}})&\geq\min_{P_{\mathcal{M}}}R^*_{\epsilon,K}(s,P_{\mathcal{M}})\tag{F.55}\\
&\geq\frac{N-M}{M}\left(1-\Big(1-\frac MN\Big)^{N_e(s)}\right)-\Big(\frac1F+N_e^2(s)\epsilon\Big).\tag{F.56}
\end{align}
Hence, for any $\{R_K\}_{K\in\mathbb{N}}$,
\begin{align}
R_K&\geq\sup_{\epsilon>0}\limsup_{F\to+\infty}\max_{s}\left[\frac{N-M}{M}\left(1-\Big(1-\frac MN\Big)^{N_e(s)}\right)-\Big(\frac1F+N_e^2(s)\epsilon\Big)\right]\tag{F.57}\\
&=\frac{N-M}{M}\left(1-\Big(1-\frac MN\Big)^{\min\{N,K\}}\right).\tag{F.58}
\end{align}
Consequently,
\begin{align}
\mathcal{R}_{\mathrm{peak}}\subseteq\left\{\{R_K\}_{K\in\mathbb{N}}\ \middle|\ R_K\geq\frac{N-M}{M}\left(1-\Big(\frac{N-M}{N}\Big)^{\min\{N,K\}}\right)\right\}.\tag{F.59}
\end{align}

Remark F.3. According to the above discussion, the rate $R^*_{\epsilon,K,\mathrm{peak}}(s,P_{\mathcal{M}})$ for any statistics $s$ and any $K$ can be simultaneously minimized using the uniformly random prefetching scheme. This indicates that the uniformly random prefetching scheme is universally optimal for all types in terms of peak rates.

F.5 Upper Bounds on Decoding Complexity

As a benchmark, we first consider the complexity of the standard decoding approach for leaders. For any centralized caching scheme with symmetric batch prefetching, given parameter $t\in\{0,1,\ldots,K\}$ and demand $\mathbf{d}$, each leader needs to recover $O\big(\frac{K-t}{K}F\big)$ bits, and each bit can be decoded by removing $t$ bits of locally available information. Hence, the decoding complexity required by any leader user is $O\big(\frac{(K-t)(t+1)}{K}F\big)$.

Now consider a non-leader user $k$. Using a straightforward approach, it suffices to first recover all missing messages needed by user $k$ with equation (10.14), and then decode the requested file based on them. In particular, for any needed message $Y_{\mathcal{A}}$, it suffices to compute
\begin{align}
Y_{\mathcal{A}}=\bigoplus_{\mathcal{V}\in\mathcal{V}_F}Y'_{\mathcal{B}\setminus\mathcal{V}},\tag{F.60}
\end{align}
where $\mathcal{B}=\mathcal{A}\cup\mathcal{U}$, and $Y'_{\mathcal{A}'}\triangleq\mathbb{1}[\mathcal{A}'\cap\mathcal{U}\neq\emptyset]\cdot Y_{\mathcal{A}'}$ for each set $\mathcal{A}'$.
Without loss of generality, assume that for the given demand only the files $W_1,W_2,\ldots,W_{N_e(\mathbf{d})}$ are requested, that users $1,\ldots,N_e(\mathbf{d})$ are the corresponding leaders, and that user $k$ requested file $W_1$. Each $Y_{\mathcal{A}}$ can be computed through the following intermediate steps, where the summation of the $Y'_{\mathcal{B}\setminus\mathcal{V}}$'s is first taken over sets $\mathcal{V}$ with a common subset of users that request the first few files. In other words, we aim to compute
\begin{align}
Y_{\mathcal{A},\mathcal{V}_i}\triangleq\bigoplus_{\substack{\mathcal{V}\in\mathcal{V}_F\\ \mathcal{V}\supseteq\mathcal{V}_i}}Y'_{\mathcal{B}\setminus\mathcal{V}}\tag{F.61}
\end{align}
for any $i=0,1,\ldots,N_e(\mathbf{d})$ and any subset $\mathcal{V}_i\subseteq\mathcal{B}$ that contains exactly $i$ users, each requesting a different file among $W_1,W_2,\ldots,W_i$. Note that $Y_{\mathcal{A}}=Y_{\mathcal{A},\emptyset}=Y_{\mathcal{A},\mathcal{V}_0}$.

Each $Y_{\mathcal{A},\mathcal{V}_i}$ can be computed recursively. Recall that for any $j=1,\ldots,N_e(\mathbf{d})$, user $j$ is the leader that requested file $W_j$, and $\mathcal{B}_j$ is the collection of users in $\mathcal{B}$ that request the same file. For any set $\mathcal{V}_i$ with $i=j-1$, we have
\begin{align}
Y_{\mathcal{A},\mathcal{V}_i}=\bigoplus_{b\in\mathcal{B}_j}Y_{\mathcal{A},\mathcal{V}_i\cup\{b\}}.\tag{F.62}
\end{align}
On the other hand, for any set $\mathcal{V}_i$ with $i=N_e(\mathbf{d})$, there is a unique set in $\mathcal{V}_F$ that includes $\mathcal{V}_i$, namely $\mathcal{V}_i$ itself. Hence, the base cases for $i=N_e(\mathbf{d})$ are directly provided by available messages as follows:
\begin{align}
Y_{\mathcal{A},\mathcal{V}_i}=Y'_{\mathcal{B}\setminus\mathcal{V}_i}.\tag{F.63}
\end{align}
Using dynamic programming, we jointly compute the above steps for all needed $Y_{\mathcal{A}}$'s. Note that some of the intermediate variables $Y_{\mathcal{A},\mathcal{V}_i}$ are identical for different sets $\mathcal{A}$, so they do not need to be recomputed.² Precisely, each $Y_{\mathcal{A},\mathcal{V}_i}$ is uniquely determined by only two parameters: the integer $i$ and the set³ $\mathcal{A}\cup[i]\setminus\mathcal{V}_i$. Thus, we can define a set of variables, denoted $Y_{\mathcal{S},i}$, where $Y_{\mathcal{S},i}$ is identical to any $Y_{\mathcal{A},\mathcal{V}_i}$ with $\mathcal{S}=\mathcal{A}\cup[i]\setminus\mathcal{V}_i$. Note that each $Y_{\mathcal{S},i}$ is needed only if $\mathcal{S}\cap\{1,k\}\neq\emptyset$, $|\mathcal{S}|=t+1$, and $\mathcal{S}\cap\{i+1,\ldots,N_e(\mathbf{d})\}=\emptyset$. For brevity, we denote the collection of such sets $\mathcal{S}$ for each $i=0,1,\ldots,N_e(\mathbf{d})$ by $\mathcal{S}_{F,i}$. Moreover, when $i$ is decremented to $i-1$, only a subset of the $Y_{\mathcal{S},i}$'s needs to be updated. We denote the corresponding family of sets $\mathcal{S}$ by $\mathcal{S}'_{F,i}$, which is given by $\{\mathcal{S}\in\mathcal{S}_{F,i-1}\mid\exists\,b\in\mathcal{S}\text{ s.t. }d_b=i\}$.

²The required storage overhead can be made negligible for large $F$, by partitioning the subfiles and messages into smaller fractions and decoding them separately using the same steps.
³$[i]\triangleq\{1,2,\ldots,i\}$.
The intermediate variables $Y_{\mathcal{S},i}$ and the needed messages $Y_{\mathcal{A}}$ can be computed using the following algorithm.

Algorithm F.1 Example Algorithm for Computing Needed Messages
1: initialize $\tilde Y_{\mathcal{S}}\leftarrow Y'_{\mathcal{S}}$ for any $\mathcal{S}\in\mathcal{S}_{F,N_e(\mathbf{d})}$  ▷ $\tilde Y_{\mathcal{S}}$ stores $Y_{\mathcal{S},N_e(\mathbf{d})}$
2: for $i=N_e(\mathbf{d}),\ldots,1$ do
3:   for $\mathcal{S}\in\mathcal{S}'_{F,i}$ do  ▷ compute $Y_{\mathcal{S},i-1}$ at $\tilde Y_{\mathcal{S}}$
4:     for $b\in\mathcal{S}$ with $d_b=i$ do
5:       $\tilde Y_{\mathcal{S}}\leftarrow\tilde Y_{\mathcal{S}}\oplus\tilde Y_{\mathcal{S}\cup\{i\}\setminus\{b\}}$
6: return $Y_{\mathcal{A}}\leftarrow\tilde Y_{\mathcal{A}}$ for any $\mathcal{A}\in\mathcal{S}_{F,0}$  ▷ $Y_{\mathcal{A}}=Y_{\mathcal{A},\mathcal{V}_0}$

The decoding complexity for recovering the $Y_{\mathcal{A}}$'s can be computed as follows. The complexity of the initialization is linear in the size of $\mathcal{S}_{F,N_e(\mathbf{d})}$ multiplied by the length of a single $Y_{\mathcal{A}}$. Because $|\mathcal{S}_{F,N_e(\mathbf{d})}|\leq 2\binom{K-1}{t}$, the complexity of the initialization is $O\big(\binom{K-1}{t}\cdot\frac{F}{\binom{K}{t}}\big)=O\big(\frac{K-t}{K}F\big)$. In the subsequent computation stage, the complexity equals the number of xor operations multiplied by the length of $Y_{\mathcal{A}}$. Note that each xor operation can be injectively mapped to a tuple $(b,\mathcal{S})$ that characterizes the operation,⁴ whose elements satisfy $b\in\mathcal{S}\in\mathcal{S}_{F,N_e(\mathbf{d})}$. The number of such tuples is upper bounded by $|\mathcal{S}|\cdot|\mathcal{S}_{F,N_e(\mathbf{d})}|=O\big(\binom{K-1}{t}(t+1)\big)$. Hence, the complexity of this stage is $O\big(\frac{(K-t)(t+1)}{K}F\big)$. The complexity of returning the needed messages is no greater than that of the initialization, because $\mathcal{S}_{F,0}\subseteq\mathcal{S}_{F,N_e(\mathbf{d})}$, and thus does not increase the overall complexity. Finally, the complexity of recovering the needed subfiles given all messages is no greater than that for leader users. To conclude, we have demonstrated a decoding algorithm with an overall complexity of $O\big(\frac{(K-t)(t+1)}{K}F\big)$, the same as the decoding complexity for leader users with standard decoding approaches.

⁴The injectivity is based on the fact that the iteration number $i$ is determined by $b$.

Remark F.4.
The same decoding approach, the same complexity results, and analogous examples for further reducing decoding computational costs can be directly obtained for decentralized caching, by observing that the required decoding steps are essentially a memory-shared version of centralized caching with symmetric batch prefetching.

Appendix G
Supplement to Chapter 11

G.1 Proof of Theorem 11.3 for the Peak Rate

In this section, we prove Theorem 11.3 assuming the correctness of Theorem 11.4; the proof of Theorem 11.4 can be found in Appendix G.2. For brevity, within this section we only prove Theorem 11.3 for the peak rate (i.e., $R^*=R_u(N,K,r)$ for large $N$). The proof for the average rate (i.e., $R^*_{\mathrm{ave}}=R_u(N,K,r)$ for large $N$) can be found in Appendix G.7.

As mentioned previously, the rate $R_u(N,K,r)$ can be exactly achieved using the caching scheme proposed in [11]. Hence, to prove Theorem 11.3, it is sufficient to show that $R^*\geq R_u(N,K,r)$ for large $N$ (i.e., $N\to+\infty$) when $K\leq 5$. This statement can be easily proved using the following lemma.

Lemma G.1. For a caching problem with parameters $K$, $N$, and $M$, we have $R^*\geq R_u(N,K,r)$ for large $N$ if $r\leq 1$ or $r\geq\lceil\frac{K-3}{2}\rceil$.

Assuming the correctness of Lemma G.1, and noting that its condition (i.e., $r\leq 1$ or $r\geq\lceil\frac{K-3}{2}\rceil$) always holds for $K\leq 5$, we have $R^*\geq R_u(N,K,r)$ for large $N$ and for all possible values of $M$, in any caching system with no more than $5$ users. Hence, to prove Theorem 11.3, it suffices to prove Lemma G.1. We prove this lemma as follows, using Theorem 11.2 and Theorem 11.4.

Proof of Lemma G.1. We start by focusing on two easier cases, $r\leq 1$ and $r\geq K-1$. When $r\leq 1$, the inequality $R^*\geq R_u(N,K,r)$ is already proved in Section 11.3 and given by (11.23), for $N\geq\frac{K(K+1)}{2}$. When $r\geq K-1$, we have $R^*\geq 1-\frac MN=R_u(N,K,r)$, which can be proved by choosing $s=1$ and $\alpha=1$ in Theorem 11.2.
Hence, we only need to focus on the case $r\in\big[\max\{\lceil\frac{K-3}{2}\rceil,1\},\,K-1\big)$, and show that for large $N$ the maximum possible gap between $R^*$ and $R_u(N,K,r)$ approaches $0$. We prove this result using Theorem 11.4. Essentially, we need to find a parameter $n\in\{1,\ldots,K-1\}$ for Theorem 11.4 such that the corresponding converse bound approaches $R_u(N,K,r)$ for large $N$. Let $n=\lfloor r+1\rfloor$. We have
\begin{align}
R_u(N,K,r)=\frac{2K-n+1}{n+1}-\frac{K(K+1)}{n(n+1)}\cdot\frac MN\tag{G.1}
\end{align}
by definition, for sufficiently large $N$ (more specifically, $N\geq K-n+1$). Under the same condition for large $N$, we have $n\in\{\max\{1,K-N+1\},\ldots,K-1\}$ given $r\in[1,K-1)$. Hence, we can use $n$ as the parameter of Theorem 11.4. We now prove the tightness of this converse bound by considering the following two possible cases.

If $n>\frac{K-1}{2}$, we have $K-2n-1<0$. Recall that $\alpha=\lfloor\frac{N-1}{K-n}\rfloor$ and $\beta=N-\alpha(K-n)$. We can prove that when $N$ is sufficiently large (i.e., $N\geq\frac{2(K-n)^2}{2n+1-K}+1$), the condition $\beta+\alpha\frac{K-2n-1}{2}\leq 0$ is always satisfied. Consequently,
\begin{align}
R^*\geq\frac{2K-n+1}{n+1}-\frac{K(K+1)}{n(n+1)}\cdot\frac MN=R_u(N,K,r).\tag{G.2}
\end{align}
If $n\leq\frac{K-1}{2}$, then because we are considering the case $r\geq\lceil\frac{K-3}{2}\rceil$, we have $n=\frac{K-1}{2}$. Hence, we can verify that $\beta+\alpha\frac{K-2n-1}{2}\leq 0$ does not hold for any $N$. Consequently,
\begin{align}
R^*\geq\frac{2K-n+1}{n+1}-\frac{2K(K-n)}{n(n+1)}\cdot\frac{M}{N-\beta}=\frac{2K-n+1}{n+1}-\frac{K(K+1)}{n(n+1)}\cdot\frac{M}{N-\beta}.\tag{G.3}
\end{align}
As $N$ approaches infinity, $\beta$ is upper bounded by a constant. Hence, we have $\lim_{N\to+\infty}\frac{N}{N-\beta}=1$. Therefore, from (G.1) and (G.3), we have
\begin{align}
\lim_{N\to+\infty}\big(R^*-R_u(N,K,r)\big)&\geq\lim_{N\to+\infty}\frac{K(K+1)}{n(n+1)}\cdot\left(\frac MN-\frac{M}{N-\beta}\right)\notag\\
&=\lim_{N\to+\infty}\frac{r(K+1)}{n(n+1)}\cdot\left(1-\frac{N}{N-\beta}\right)=0.\tag{G.4}
\end{align}

G.2 Proof of Theorem 11.4

Before proving the converse bounds stated in Theorem 11.4, we first present the following key lemma, which gives a lower bound on any $\epsilon$-achievable rate given any prefetching scheme.

Lemma G.2. Consider a coded caching problem with parameters $N$ and $K$.
Given a certain prefetching scheme, any $\epsilon$-achievable rate $R$ is lower bounded by¹
\begin{align}
RF\geq H^*(W_1|Z_1)+\frac{2}{n(n+1)\alpha}\Bigg(\alpha n(K-n)F-nH^*(Z_1|W_{\{1,\ldots,\beta\}})-\sum_{i=0}^{K-n-1}H^*(Z_1|W_{\{1,\ldots,\beta+i\alpha\}})\Bigg)-\frac{2K-n+1}{n+1}(1+\epsilon F)\tag{G.5}
\end{align}
for any integer $n\in\{\max\{1,K-N+1\},\ldots,K-1\}$, where $\alpha=\lfloor\frac{N-1}{K-n}\rfloor$ and $\beta=N-\alpha(K-n)$.

¹Here we adopt the notation $H^*(W_{\mathcal{A}},Z_{\mathcal{B}})$, which is defined in the proof of Theorem 11.2.

We postpone the proof of the above lemma to Appendix G.6 and continue to prove Theorem 11.4 assuming its correctness. To simplify the discussion, we define
\begin{align}
R_B(F,\epsilon)=\frac1F\Bigg(H^*(W_1|Z_1)+\frac{2}{n(n+1)\alpha}\bigg(\alpha n(K-n)F-nH^*(Z_1|W_{\{1,\ldots,\beta\}})-\sum_{i=0}^{K-n-1}H^*(Z_1|W_{\{1,\ldots,\beta+i\alpha\}})\bigg)\Bigg).\tag{G.6}
\end{align}
Using Lemma G.2, we have
\begin{align}
R\geq R_B(F,\epsilon)-\frac{2K-n+1}{n+1}\Big(\frac1F+\epsilon\Big)\tag{G.7}
\end{align}
if $R$ is $\epsilon$-achievable. Recall that $R^*$ is defined as the minimum $\epsilon$-achievable rate over all prefetching schemes $\phi$ for large $F$ and any $\epsilon>0$; we thus have the following lower bound on $R^*$:
\begin{align}
R^*&\geq\sup_{\epsilon>0}\limsup_{F\to\infty}\min_{\phi}\Big\{R_B(F,\epsilon)-\frac{2K-n+1}{n+1}\Big(\frac1F+\epsilon\Big)\Big\}\notag\\
&=\sup_{\epsilon>0}\limsup_{F\to\infty}\min_{\phi}R_B(F,\epsilon)\notag\\
&\geq\inf_{F\in\mathbb{N}^+}\min_{\phi}R_B(F,\epsilon).\tag{G.8}
\end{align}
Hence, to prove Theorem 11.4, we only need to prove that for any prefetching scheme, $R_B(F,\epsilon)$ is lower bounded by the converse bounds given in Theorem 11.4 for any valid parameter $n$.

Now consider any $n\in\{\max\{1,K-N+1\},\ldots,K-1\}$. For brevity, we define
\begin{align}
\theta=K\beta+\frac{(K-n)(K-n-1)}{2}\alpha.\tag{G.9}
\end{align}
Equivalently,
\begin{align}
\theta=n\beta+\sum_{i=0}^{K-n-1}(\beta+i\alpha).\tag{G.10}
\end{align}
Hence,
\begin{align}
\theta H^*(W_1|Z_1)&\geq nH^*(W_{\{1,\ldots,\beta\}}|Z_1)+\sum_{i=0}^{K-n-1}H^*(W_{\{1,\ldots,\beta+i\alpha\}}|Z_1)\notag\\
&=\theta F+nH^*(Z_1|W_{\{1,\ldots,\beta\}})+\sum_{i=0}^{K-n-1}H^*(Z_1|W_{\{1,\ldots,\beta+i\alpha\}})-KH^*(Z_1).\tag{G.11}
\end{align}
From (G.6) and (G.11), we have
\begin{align}
R_B(F,\epsilon)F\geq\Big(1-\frac{2\theta}{n(n+1)\alpha}\Big)H^*(W_1|Z_1)+\frac{2}{n(n+1)\alpha}\big(\theta F-KH^*(Z_1)+\alpha n(K-n)F\big).\tag{G.12}
\end{align}
Depending on the value of $\theta$, we bound $H^*(W_1|Z_1)$ in two different ways. When $1\geq\frac{2\theta}{n(n+1)\alpha}$, which is exactly the case where $\beta+\alpha\frac{K-2n-1}{2}\leq 0$ holds, we use the following bound:
\begin{align*}
H^*(W_1|Z_1)\geq F-\frac{H^*(Z_1)}{N}.
\end{align*}
(G.13)

Consequently,
\begin{align}
R_B(F,\epsilon)F\geq\Big(1-\frac{2\theta}{n(n+1)\alpha}\Big)\Big(F-\frac{H^*(Z_1)}{N}\Big)+\frac{2}{n(n+1)\alpha}\big(\theta F-KH^*(Z_1)+\alpha n(K-n)F\big).\tag{G.14}
\end{align}
Given $\theta$ defined in (G.9), and $\beta=N-\alpha(K-n)$ as defined in Lemma G.2, the RHS of (G.14) simplifies, yielding
\begin{align}
R_B(F,\epsilon)F\geq\frac{2K-n+1}{n+1}F-\frac{K(K+1)}{n(n+1)}\cdot\frac{H^*(Z_1)}{N}\geq\frac{2K-n+1}{n+1}F-\frac{K(K+1)}{n(n+1)}\cdot\frac MN F.\tag{G.15}
\end{align}
Hence, we have the following from (G.8):
\begin{align}
R^*\geq\frac{2K-n+1}{n+1}-\frac{K(K+1)}{n(n+1)}\cdot\frac MN.\tag{G.16}
\end{align}
On the other hand, when $1<\frac{2\theta}{n(n+1)\alpha}$, which is exactly the case where $\beta+\alpha\frac{K-2n-1}{2}\leq 0$ does not hold, we use $H^*(W_1|Z_1)\leq F$. Similarly,
\begin{align}
R_B(F,\epsilon)F&\geq\Big(1-\frac{2\theta}{n(n+1)\alpha}\Big)F+\frac{2}{n(n+1)\alpha}\big(\theta F-KH^*(Z_1)+\alpha n(K-n)F\big)\notag\\
&=\frac{2K-n+1}{n+1}F-\frac{2K(K-n)}{n(n+1)}\cdot\frac{H^*(Z_1)}{N-\beta}\geq\frac{2K-n+1}{n+1}F-\frac{2K(K-n)}{n(n+1)}\cdot\frac{M}{N-\beta}F.\tag{G.17}
\end{align}
Hence,
\begin{align}
R^*\geq\inf_{F\in\mathbb{N}^+}\min_{\phi}R_B(F,\epsilon)\geq\frac{2K-n+1}{n+1}-\frac{2K(K-n)}{n(n+1)}\cdot\frac{M}{N-\beta}.\tag{G.18}
\end{align}
To conclude, we have proved that the converse bound given in Theorem 11.4 holds for any valid parameter $n$.

G.3 Proof of Lemma 11.1

In this appendix, we aim to prove that $R_{\mathrm{dec}}(M)\leq 2.00884R$ for any $(M,R)\in S_{\mathrm{Lower}}\cup\{(0,J)\}$. Note that if $(M,R)=(0,J)$, we have $R_{\mathrm{dec}}(M)=J\leq 2.00884R$. Hence, it suffices to consider the case $(M,R)\in S_{\mathrm{Lower}}$. In this case, we can find $s\in\{1,\ldots,J\}$ and $\ell\in\{1,\ldots,s\}$ such that
\begin{align}
(M,R)=\left(\frac{N-\ell+1}{s},\ \frac{s-1}{2}+\frac{\ell(\ell-1)}{2s}\right).\tag{G.19}
\end{align}
Based on the parameter values, we prove $R_{\mathrm{dec}}(M)\leq 2.00884R$ by considering the following three possible scenarios.

a) If $N\geq 9s$, we first have the following from (11.20):
\begin{align}
R_{\mathrm{dec}}(M)\leq\frac{N-M}{M}.\tag{G.20}
\end{align}
Due to (G.19), the above inequality is equivalent to
\begin{align}
R_{\mathrm{dec}}(M)\leq s-1+\frac{s(\ell-1)}{N-\ell+1}.\tag{G.21}
\end{align}
Recalling that $s\geq\ell$ and $N\geq 9s$, we have
\begin{align}
R_{\mathrm{dec}}(M)\leq s-1+\frac{s(\ell-1)}{N-s}\leq s-1+\frac{\ell-1}{8}.\tag{G.22}
\end{align}
Since $s\geq\ell$, we have $\frac{\ell-1}{\ell}\leq\frac{s-1}{s}$. Consequently,
\begin{align}
R_{\mathrm{dec}}(M)\leq s-1+\frac18\sqrt{(\ell-1)\cdot\frac{s-1}{s}\cdot\ell}=s-1+2\sqrt{\frac{s-1}{256}\cdot\frac{\ell(\ell-1)}{s}}.\tag{G.23}
\end{align}
Applying the AM-GM inequality to the second term of the RHS, we have
\begin{align*}
R_{\mathrm{dec}}(M)\leq s-1+\frac{s-1}{256}+\frac{\ell(\ell-1)}{s}.
\end{align*}
(G.24)

Because $\ell\geq 1$, we can thus upper bound $R_{\mathrm{dec}}(M)$ as a function of $R$, which is given in (G.19):
\begin{align}
R_{\mathrm{dec}}(M)\leq\Big(2+\frac{1}{128}\Big)\left(\frac{s-1}{2}+\frac{\ell(\ell-1)}{2s}\right)\leq 2.00884R.\tag{G.25}
\end{align}
b) If $N<9s$ and $N\leq 81$, we upper bound $R_{\mathrm{dec}}(M)$ as follows:
\begin{align}
R_{\mathrm{dec}}(M)\leq\frac{N-M}{M}\left(1-\Big(1-\frac MN\Big)^N\right).\tag{G.26}
\end{align}
Note that both the above bound and $R$ are functions of $N$, $s$, and $\ell$, which can only take values in $\{1,\ldots,81\}$. Through a brute-force search, we can show that $R_{\mathrm{dec}}(M)\leq 2.000R\leq 2.00884R$.

c) If $N<9s$ and $N>81$, recall that $M=\frac{N-\ell+1}{s}$ from (G.19); we have
\begin{align}
M\leq\frac Ns<9.\tag{G.27}
\end{align}
Similarly, $R$ can be lower bounded as follows, given (G.19):
\begin{align}
R=\frac{s-1}{2}+\frac{(N-sM)(N-sM+1)}{2s}=\frac{(1+M^2)s}{2}+\frac{N(N+1)}{2s}-\Big(N+\frac12\Big)M-\frac12.\tag{G.28}
\end{align}
Applying the AM-GM inequality to the first two terms of the RHS, we have
\begin{align}
R\geq\sqrt{(1+M^2)N(N+1)}-\Big(N+\frac12\Big)M-\frac12.\tag{G.29}
\end{align}
From (G.27) and $N>81>M^2$, we have $\sqrt{N(N+1)}\geq\sqrt{M^2(M^2+1)}+N-M^2$. Consequently,
\begin{align}
R&\geq\sqrt{1+M^2}\Big(\sqrt{M^2(M^2+1)}+N-M^2\Big)-\Big(N+\frac12\Big)M-\frac12\notag\\
&=(N-81)\big(\sqrt{1+M^2}-M\big)+(81-M^2)\big(\sqrt{1+M^2}-M\big)+\frac{M-1}{2}.\tag{G.30}
\end{align}
On the other hand, we upper bound $R_{\mathrm{dec}}(M)$ as follows:
\begin{align}
R_{\mathrm{dec}}(M)\leq\frac{N-M}{M}\left(1-\Big(1-\frac MN\Big)^N\right)=\frac{N-M}{M}\Big(1-e^{N\ln(1-\frac MN)}\Big).\tag{G.31}
\end{align}
From (G.27), $\frac MN<\frac{9}{81}=\frac19$, and it is easy to show that $\ln\big(1-\frac MN\big)\geq-\frac MN-\frac{9}{16}\big(\frac MN\big)^2$. Hence,
\begin{align}
R_{\mathrm{dec}}(M)&\leq\frac{N-M}{M}\Big(1-e^{-M-\frac{9}{16}\frac{M^2}{N}}\Big)\leq\frac{N-M}{M}\left(1-e^{-M}\Big(1-\frac{9}{16}\cdot\frac{M^2}{N}\Big)\right)\notag\\
&\leq\frac{N-M}{M}\big(1-e^{-M}\big)+\frac NM e^{-M}\cdot\frac{9}{16}\cdot\frac{M^2}{N}\notag\\
&=(N-81)\,\frac{1-e^{-M}}{M}+\frac{81-M}{M}\big(1-e^{-M}\big)+\frac{9}{16}Me^{-M}.\tag{G.32}
\end{align}
Numerically, we can verify that the following inequalities hold for $M\in[0,9)$:
\begin{align}
\frac{1-e^{-M}}{M}&\leq 2.00884\big(\sqrt{1+M^2}-M\big),\tag{G.33}\\
\frac{81-M}{M}\big(1-e^{-M}\big)+\frac{9}{16}Me^{-M}&\leq 2.00884\left((81-M^2)\big(\sqrt{1+M^2}-M\big)+\frac{M-1}{2}\right).\tag{G.34}
\end{align}
Hence, when $N>81$, by computing $(N-81)\times\text{(G.33)}+\text{(G.34)}$, we have $R_{\mathrm{dec}}(M)\leq 2.00884R$.

To conclude, $R_{\mathrm{dec}}(M)\leq 2.00884R$ holds for any $(M,R)\in S_{\mathrm{Lower}}$ in all three cases. This completes the proof of Lemma 11.1.
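The numerical verification claimed for (G.33) and (G.34) can be reproduced with a fine grid over $M\in(0,9)$; a sketch follows (the grid step is an assumption on our part, since the dissertation does not specify how the verification was carried out):

```python
from math import exp, sqrt

C = 2.00884


def g33_holds(M):
    # (G.33): (1 - e^{-M}) / M <= C * (sqrt(1 + M^2) - M)
    return (1 - exp(-M)) / M <= C * (sqrt(1 + M * M) - M)


def g34_holds(M):
    # (G.34): (81 - M)/M * (1 - e^{-M}) + (9/16) M e^{-M}
    #         <= C * [ (81 - M^2)(sqrt(1 + M^2) - M) + (M - 1)/2 ]
    lhs = (81 - M) / M * (1 - exp(-M)) + (9 / 16) * M * exp(-M)
    rhs = C * ((81 - M * M) * (sqrt(1 + M * M) - M) + (M - 1) / 2)
    return lhs <= rhs


grid = [i / 1000 for i in range(1, 9000)]  # M in (0, 9) with step 0.001
assert all(g33_holds(M) for M in grid)
assert all(g34_holds(M) for M in grid)
```

Note that (G.33) is nearly tight around $M\approx 6$, so the grid check passes there with only a small margin; this is consistent with the constant $2.00884$ not being improvable by much in this argument.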
G.4 Proof of Lemma 11.2

If $R$ is $\epsilon$-achievable, we can find a message $X_{\mathbf{d}}$ such that for each user $k$, $W_{d_k}$ can be decoded from $Z_k$ and $X_{\mathbf{d}}$ with probability of error at most $\epsilon$. Using Fano's inequality, the following bound holds:
\begin{align}
H(W_{d_k}|Z_k,X_{\mathbf{d}})\leq 1+\epsilon F\qquad\forall k\in\{1,\ldots,K\}.\tag{G.35}
\end{align}
Equivalently,
\begin{align}
H(X_{\mathbf{d}}|Z_k)\geq H(W_{d_k}|Z_k)+H(X_{\mathbf{d}}|W_{d_k},Z_k)-(1+\epsilon F)\qquad\forall k\in\{1,\ldots,K\}.\tag{G.36}
\end{align}
Note that the LHS of the above inequality lower bounds the communication load. If we lower bound the term $H(X_{\mathbf{d}}|W_{d_k},Z_k)$ on the RHS by $0$, we obtain the single-user cut-set bound. However, we enhance this cut-set bound by bounding $H(X_{\mathbf{d}}|W_{d_k},Z_k)$ with non-negative functions. At a high level, we view $H(X_{\mathbf{d}}|W_{d_k},Z_k)$ as the communication load of an enhanced caching system in which $W_{d_k}$ and $Z_k$ are known by all the users. Using a similar approach, we can lower bound $H(X_{\mathbf{d}}|W_{d_k},Z_k)$ by the sum of a single cut-set bound on this enhanced system and another entropy function that can be interpreted as the communication load of a further enhanced system. We can apply this bounding technique recursively until all user demands are publicly known.

From (G.35), we have
\begin{align}
H(W_{d_k}|Z_{\{1,\ldots,k\}},X_{\mathbf{d}},W_{\{d_1,\ldots,d_{k-1}\}})\leq 1+\epsilon F\qquad\forall k\in\{1,\ldots,K\}.\tag{G.37}
\end{align}
Equivalently,
\begin{align}
H(X_{\mathbf{d}}|Z_{\{1,\ldots,k\}},W_{\{d_1,\ldots,d_{k-1}\}})\geq\ &H(W_{d_k}|Z_{\{1,\ldots,k\}},W_{\{d_1,\ldots,d_{k-1}\}})+H(X_{\mathbf{d}}|Z_{\{1,\ldots,k\}},W_{\{d_1,\ldots,d_k\}})\notag\\
&-(1+\epsilon F)\qquad\forall k\in\{1,\ldots,K\}.\tag{G.38}
\end{align}
Adding the above inequality for $k\in\{1,\ldots,\min\{N,K\}\}$, we have
\begin{align}
H(X_{\mathbf{d}}|Z_{\{1\}})&\geq\sum_{k=1}^{\min\{N,K\}}H(W_{d_k}|Z_{\{1,\ldots,k\}},W_{\{d_1,\ldots,d_{k-1}\}})-\min\{N,K\}(1+\epsilon F)\notag\\
&\qquad+H(X_{\mathbf{d}}|Z_{\{1,\ldots,\min\{N,K\}\}},W_{\{d_1,\ldots,d_{\min\{N,K\}}\}})\notag\\
&\geq\sum_{k=1}^{\min\{N,K\}}H(W_{d_k}|Z_{\{1,\ldots,k\}},W_{\{d_1,\ldots,d_{k-1}\}})-\min\{N,K\}(1+\epsilon F).\tag{G.39}
\end{align}
Thus, $R$ is bounded by
\begin{align}
R\geq\frac1F H(X_{\mathbf{d}}|Z_{\{1\}})\geq\frac1F\left(\sum_{k=1}^{\min\{N,K\}}H(W_{d_k}|Z_{\{1,\ldots,k\}},W_{\{d_1,\ldots,d_{k-1}\}})\right)-\min\{N,K\}\Big(\frac1F+\epsilon\Big).\tag{G.40}
\end{align}
One can show that this approach strictly improves the compound cut-set bound, which was used in most of the prior works.
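Looking ahead to the next appendix: the sequences $a_x$ and $b_x$ defined there in (G.44) and (G.45) satisfy the algebraic relations (G.46) and (G.47), and this can be confirmed in exact arithmetic before following the derivation. The sketch below is not part of the dissertation; Python's `Fraction` is assumed, and the sample values of $N$, $s$, and $\alpha$ are arbitrary.

```python
from fractions import Fraction


def a_x(x, s, alpha, N):
    """a_x as in (G.44)."""
    return (2 * alpha * s + s * (s - 1) - (x + 1) * x) / Fraction(2 * x * (N - x))


def b_x(x, s, alpha, N):
    """b_x as in (G.45)."""
    return (2 * alpha * s + s * (s - 1) - x * (x - 1)) / Fraction(2 * x * (N - x + 1))


N, s = 10, 5
for num in range(0, 5):          # alpha ranges over [0, 1] in steps of 1/4
    alpha = Fraction(num, 4)
    for x in range(1, s):
        lhs = (1 - a_x(x, s, alpha, N)) / (N - x + 1) + a_x(x, s, alpha, N)
        assert lhs == b_x(x, s, alpha, N)                                            # (G.46)
        assert Fraction(x, x + 1) * a_x(x, s, alpha, N) == b_x(x + 1, s, alpha, N)   # (G.47)
```

Because both relations are rational identities in $x$, $s$, $\alpha$, and $N$, passing on exact `Fraction` inputs is a genuine verification for the sampled parameters rather than a floating-point approximation.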
G.5 Proof of Lemma 11.3

In this appendix, we prove that for any prefetching scheme, the rate $R_A^*(F,\epsilon)$ is lower bounded by the RHS of (11.33), for any parameters $s$ and $\alpha$. Consider any such $s\in\{1,\dots,\min\{N,K\}\}$ and $\alpha\in[0,1]$. From the definition of $R_A^*(F,\epsilon)$ and the non-negativity of entropy functions, we have

$$R_A^*(F,\epsilon)F\ge\Bigl(\sum_{k=1}^{s-1}H^*(W_k\mid Z_{\{1,\dots,k\}},W_{\{1,\dots,k-1\}})\Bigr)+\alpha H^*(W_s\mid Z_{\{1,\dots,s\}},W_{\{1,\dots,s-1\}}). \tag{G.41}$$

Each term in the above lower bound can be bounded in the following two ways (rigorously, (G.43) requires $k<K$; however, we will only apply this bound for $k<s$, which satisfies this condition):

$$H^*(W_k\mid Z_{\{1,\dots,k\}},W_{\{1,\dots,k-1\}})\ge\frac{H^*(W_{\{k,\dots,N\}}\mid Z_{\{1,\dots,k\}},W_{\{1,\dots,k-1\}})}{N-k+1}\ge F-\frac{H^*(Z_{\{1,\dots,k\}}\mid W_{\{1,\dots,k-1\}})}{N-k+1}, \tag{G.42}$$

$$H^*(W_k\mid Z_{\{1,\dots,k\}},W_{\{1,\dots,k-1\}})=F-H^*(Z_{\{1,\dots,k\}}\mid W_{\{1,\dots,k-1\}})+H^*(Z_{\{1,\dots,k\}}\mid W_{\{1,\dots,k\}})$$
$$\ge F-H^*(Z_{\{1,\dots,k\}}\mid W_{\{1,\dots,k-1\}})+\frac{k}{k+1}H^*(Z_{\{1,\dots,k+1\}}\mid W_{\{1,\dots,k\}}). \tag{G.43}$$

We aim to use linear combinations of the above two bounds in (G.41), such that the coefficient of each $H^*(Z_{\{1,\dots,k\}}\mid W_{\{1,\dots,k-1\}})$ in the resulting lower bound is $0$ for all but one $k$. To do so, we construct the following sequences:

$$a_x=\frac{2\alpha s+s(s-1)-(x+1)x}{2x(N-x)}, \tag{G.44}$$
$$b_x=\frac{2\alpha s+s(s-1)-x(x-1)}{2x(N-x+1)}. \tag{G.45}$$

We can verify that these sequences satisfy the following equations:

$$\frac{1-a_x}{N-x+1}+a_x=b_x, \tag{G.46}$$
$$\frac{x}{x+1}\,a_x=b_{x+1}. \tag{G.47}$$

Let $\ell\in\{1,\dots,s\}$ be the minimum value such that (11.12) holds. We can prove that $a_x\in[0,1]$ for $x\in\{\ell,\dots,s-1\}$; because $\ell$ is the minimum such value, we can also prove that $b_\ell\ge\frac{\ell-1}{\ell}$.

Using the above properties of the sequences $a$ and $b$, we lower bound $R_A^*(F,\epsilon)$ as follows. For each $x\in\{\ell,\dots,s-1\}$, by computing $(1-a_x)\times$(G.42)$\,+\,a_x\times$(G.43), we have

$$H^*(W_x\mid Z_{\{1,\dots,x\}},W_{\{1,\dots,x-1\}})\ge(1-a_x)\Bigl(F-\frac{H^*(Z_{\{1,\dots,x\}}\mid W_{\{1,\dots,x-1\}})}{N-x+1}\Bigr)
$$+\,a_x\Bigl(F-H^*(Z_{\{1,\dots,x\}}\mid W_{\{1,\dots,x-1\}})+\frac{x}{x+1}H^*(Z_{\{1,\dots,x+1\}}\mid W_{\{1,\dots,x\}})\Bigr)$$
$$=F-\Bigl(\frac{1-a_x}{N-x+1}+a_x\Bigr)H^*(Z_{\{1,\dots,x\}}\mid W_{\{1,\dots,x-1\}})+a_x\frac{x}{x+1}H^*(Z_{\{1,\dots,x+1\}}\mid W_{\{1,\dots,x\}})$$
$$=F-b_xH^*(Z_{\{1,\dots,x\}}\mid W_{\{1,\dots,x-1\}})+b_{x+1}H^*(Z_{\{1,\dots,x+1\}}\mid W_{\{1,\dots,x\}}). \tag{G.48}$$

Moreover, we have the following from (G.42):

$$\alpha H^*(W_s\mid Z_{\{1,\dots,s\}},W_{\{1,\dots,s-1\}})\ge\alpha\Bigl(F-\frac{H^*(Z_{\{1,\dots,s\}}\mid W_{\{1,\dots,s-1\}})}{N-s+1}\Bigr)=\alpha F-b_sH^*(Z_{\{1,\dots,s\}}\mid W_{\{1,\dots,s-1\}}). \tag{G.49}$$

Consequently,

$$\sum_{k=\ell}^{s-1}H^*(W_k\mid Z_{\{1,\dots,k\}},W_{\{1,\dots,k-1\}})+\alpha H^*(W_s\mid Z_{\{1,\dots,s\}},W_{\{1,\dots,s-1\}})\ge(s-\ell+\alpha)F-b_\ell H^*(Z_{\{1,\dots,\ell\}}\mid W_{\{1,\dots,\ell-1\}}). \tag{G.50}$$

On the other hand,

$$\sum_{k=1}^{\ell-1}H^*(W_k\mid Z_{\{1,\dots,k\}},W_{\{1,\dots,k-1\}})\ge\sum_{k=1}^{\ell-1}\Bigl(F-H^*(Z_{\{1,\dots,k\}}\mid W_{\{1,\dots,k-1\}})+\frac{k}{k+1}H^*(Z_{\{1,\dots,k+1\}}\mid W_{\{1,\dots,k\}})\Bigr)$$
$$=\sum_{k=1}^{\ell-1}\Bigl(F-\frac1kH^*(Z_{\{1,\dots,k\}}\mid W_{\{1,\dots,k-1\}})\Bigr)+\frac{\ell-1}{\ell}H^*(Z_{\{1,\dots,\ell\}}\mid W_{\{1,\dots,\ell-1\}})$$
$$\ge(\ell-1)F-(\ell-1)MF+\frac{\ell-1}{\ell}H^*(Z_{\{1,\dots,\ell\}}\mid W_{\{1,\dots,\ell-1\}}). \tag{G.51}$$

Combining (G.41), (G.50), and (G.51), we have

$$R_A^*(F,\epsilon)F\ge(\ell-1)F-(\ell-1)MF+\frac{\ell-1}{\ell}H^*(Z_{\{1,\dots,\ell\}}\mid W_{\{1,\dots,\ell-1\}})+(s-\ell+\alpha)F-b_\ell H^*(Z_{\{1,\dots,\ell\}}\mid W_{\{1,\dots,\ell-1\}})$$
$$=(s-1+\alpha)F-(\ell-1)MF+\Bigl(\frac{\ell-1}{\ell}-b_\ell\Bigr)H^*(Z_{\{1,\dots,\ell\}}\mid W_{\{1,\dots,\ell-1\}}). \tag{G.52}$$

Recalling that $b_\ell\ge\frac{\ell-1}{\ell}$, we have

$$R_A^*(F,\epsilon)F\ge(s-1+\alpha)F-(\ell-1)MF-\Bigl(b_\ell-\frac{\ell-1}{\ell}\Bigr)\ell MF=(s-1+\alpha)F-\frac{s(s-1)-\ell(\ell-1)+2\alpha s}{2(N-\ell+1)}MF. \tag{G.53}$$

This completes the proof of Lemma 11.3.

G.6 Proof of Lemma G.2

To simplify the discussion, we adopt the notation $H^*(W_{\mathcal A},Z_{\mathcal B})$, which is defined in the proof of Theorem 11.2. Moreover, we generalize this notation to include the variables for the messages $X_{\mathbf d}$. For any permutations $p\in P_N$, $q\in P_K$ and any demand $\mathbf d\in\{1,\dots,N\}^K$, we define $\mathbf d(p,q)$ to be the demand in which, for each $k\in\{1,\dots,K\}$, user $q(k)$ requests file $p(d_k)$. Then, for any subset of demands $\mathcal D\subseteq\{1,\dots,N\}^K$, we define $\mathcal D(p,q)=\{\mathbf d(p,q)\mid\mathbf d\in\mathcal D\}$. Now, for any subsets $\mathcal A\subseteq\{1,\dots,N\}$, $\mathcal B\subseteq\{1,\dots,K\}$, and $\mathcal D\subseteq\{1,\dots,N\}^K$, we define

$$H^*(X_{\mathcal D},W_{\mathcal A},Z_{\mathcal B})\triangleq\frac{1}{N!\,K!}
\sum_{p\in P_N,\,q\in P_K}H(X_{\mathcal D(p,q)},W_{p(\mathcal A)},Z_{q(\mathcal B)}). \tag{G.54}$$

For any $i\in\{1,\dots,n\}$ and $j\in\{1,\dots,\alpha\}$, let $\mathbf d^{i,j}$ be a demand satisfying

$$d^{i,j}_l=\begin{cases}l-i+(j-1)(K-n)+\beta & \text{if }i+1\le l\le i+K-n,\\ 1 & \text{otherwise}.\end{cases} \tag{G.55}$$

Note that in all demands $\mathbf d^{i,j}$, user 1 requests file 1; hence, using Fano's inequality, we have

$$H(W_1\mid X_{\mathbf d^{i,j}},Z_1)\le 1+\epsilon F. \tag{G.56}$$

Consequently,

$$RF\ge H(X_{\mathbf d^{i,j}})\ge H(X_{\mathbf d^{i,j}}\mid Z_1)+H(W_1\mid X_{\mathbf d^{i,j}},Z_1)-(1+\epsilon F)=H(W_1\mid Z_1)+H(X_{\mathbf d^{i,j}}\mid W_1,Z_1)-(1+\epsilon F). \tag{G.57}$$

Due to the homogeneity of the problem, we have

$$RF\ge H^*(W_1\mid Z_1)+H^*(X_{\mathbf d^{i,j}}\mid W_1,Z_1)-(1+\epsilon F). \tag{G.58}$$

For each $i\in\{1,\dots,n\}$, $j\in\{1,\dots,\alpha\}$, and $k\in\{1,\dots,i\}$, we have the following identity:

$$H^*(X_{\mathbf d^{i,j}}\mid W_1,Z_1)=H^*(X_{\mathbf d^{i,j}}\mid W_1,Z_k). \tag{G.59}$$

Hence, we have

$$RF\ge H^*(W_1\mid Z_1)+\frac{2}{n(n+1)\alpha}\sum_{k=1}^{n}\sum_{i=k}^{n}\sum_{j=1}^{\alpha}H^*(X_{\mathbf d^{i,j}}\mid W_1,Z_k)-(1+\epsilon F). \tag{G.60}$$

For $k\in\{1,\dots,n\}$, let $\mathcal D_k$ and $\mathcal D_k^+$ denote the following sets of demands:

$$\mathcal D_k=\{\mathbf d^{k,j}\mid j\in\{1,\dots,\alpha\}\}, \tag{G.61}$$
$$\mathcal D_k^+=\bigcup_{i=k}^{n}\mathcal D_i. \tag{G.62}$$

We have

$$RF\ge H^*(W_1\mid Z_1)+\frac{2}{n(n+1)\alpha}\sum_{k=1}^{n}H^*(X_{\mathcal D_k^+}\mid W_1,Z_k)-(1+\epsilon F)$$
$$\ge H^*(W_1\mid Z_1)+\frac{2}{n(n+1)\alpha}\sum_{k=1}^{n}H^*(X_{\mathcal D_k^+}\mid W_{\{1,\dots,\beta\}},Z_k)-(1+\epsilon F)$$
$$\ge H^*(W_1\mid Z_1)+\frac{2}{n(n+1)\alpha}\sum_{k=1}^{n}\Bigl(H^*(Z_k,X_{\mathcal D_k^+}\mid W_{\{1,\dots,\beta\}})-H^*(Z_k\mid W_{\{1,\dots,\beta\}})\Bigr)-(1+\epsilon F). \tag{G.63}$$

To further bound $R$, we only need a lower bound on $\sum_{k=1}^{n}H^*(Z_k,X_{\mathcal D_k^+}\mid W_{\{1,\dots,\beta\}})$, which is derived as follows. For each $i\in\{1,\dots,K-n\}$, let $\mathcal S_i$ be the subset of files defined as

$$\mathcal S_i=\{i+(j-1)(K-n)+\beta\mid j\in\{1,\dots,\alpha\}\}. \tag{G.64}$$

From the decodability constraint, for any $k\in\{1,\dots,n\}$, each file in $\mathcal S_i$ can be decoded by user $i+k$ given $X_{\mathcal D_k}$. Using Fano's inequality, we have

$$H^*(W_{\mathcal S_i}\mid X_{\mathcal D_k},Z_{i+k})\le\alpha(1+\epsilon F). \tag{G.65}$$

Let $\mathcal S_i^-$ be the subset of files defined as

$$\mathcal S_i^-=\bigcup_{j=1}^{i}\mathcal S_j\cup\{1,\dots,\beta\}. \tag{G.66}$$

We have

$$0\ge H^*(W_{\mathcal S_i}\mid X_{\mathcal D_k^+},Z_{i+k},W_{\mathcal S_{i-1}^-})-\alpha(1+\epsilon F)$$
$$=H^*(X_{\mathcal D_k^+},Z_{i+k}\mid W_{\mathcal S_i},W_{\mathcal S_{i-1}^-})+H^*(W_{\mathcal S_i}\mid W_{\mathcal S_{i-1}^-})-H^*(X_{\mathcal D_k^+},Z_{i+k}\mid W_{\mathcal S_{i-1}^-})-\alpha(1+\epsilon F)$$
$$=H^*(X_{\mathcal D_k^+},Z_{i+k}\mid W_{\mathcal S_i^-})+\alpha F-H^*(X_{\mathcal D_k^+},Z_{i+k}\mid W_{\mathcal S_{i-1}^-})-\alpha(1+\epsilon F).$$
(G.67)

Consequently,

$$0\ge\sum_{k=1}^{n}\sum_{i=1}^{K-n}\Bigl(H^*(X_{\mathcal D_k^+},Z_{i+k}\mid W_{\mathcal S_i^-})+\alpha F-H^*(X_{\mathcal D_k^+},Z_{i+k}\mid W_{\mathcal S_{i-1}^-})-\alpha(1+\epsilon F)\Bigr)$$
$$=\sum_{k=1}^{n}\Bigl(\sum_{i=1}^{K-n}\Bigl(H^*(X_{\mathcal D_k^+},Z_{i+k-1}\mid W_{\mathcal S_{i-1}^-})-H^*(X_{\mathcal D_k^+},Z_{i+k}\mid W_{\mathcal S_{i-1}^-})\Bigr)+H^*(X_{\mathcal D_k^+},Z_{K-n+k}\mid W_{\mathcal S_{K-n}^-})-H^*(X_{\mathcal D_k^+},Z_k\mid W_{\mathcal S_0^-})\Bigr)+\alpha n(K-n)(F-1-\epsilon F)$$
$$\ge\sum_{k=1}^{n}\Bigl(\sum_{i=1}^{K-n}\Bigl(H^*(X_{\mathcal D_k^+},Z_{i+k-1}\mid W_{\mathcal S_{i-1}^-})-H^*(X_{\mathcal D_k^+},Z_{i+k}\mid W_{\mathcal S_{i-1}^-})\Bigr)-H^*(X_{\mathcal D_k^+},Z_k\mid W_{\mathcal S_0^-})\Bigr)+\alpha n(K-n)(F-1-\epsilon F). \tag{G.68}$$

Hence, we obtain the following lower bound:

$$\sum_{k=1}^{n}H^*(X_{\mathcal D_k^+},Z_k\mid W_{\mathcal S_0^-})\ge\sum_{k=1}^{n}\sum_{i=1}^{K-n}\Bigl(H^*(X_{\mathcal D_k^+},Z_{i+k-1}\mid W_{\mathcal S_{i-1}^-})-H^*(X_{\mathcal D_k^+},Z_{i+k}\mid W_{\mathcal S_{i-1}^-})\Bigr)+\alpha n(K-n)(F-1-\epsilon F)$$
$$=\sum_{i=1}^{K-n}\sum_{k=1}^{n}\Bigl(H^*(Z_{i+k-1}\mid X_{\mathcal D_k^+},W_{\mathcal S_{i-1}^-})-H^*(Z_{i+k}\mid X_{\mathcal D_k^+},W_{\mathcal S_{i-1}^-})\Bigr)+\alpha n(K-n)(F-1-\epsilon F). \tag{G.69}$$

Noting that $\mathcal D_{k+1}^+\subseteq\mathcal D_k^+$, we have $H^*(Z_{i+k}\mid X_{\mathcal D_{k+1}^+},W_{\mathcal S_{i-1}^-})\ge H^*(Z_{i+k}\mid X_{\mathcal D_k^+},W_{\mathcal S_{i-1}^-})$. Consequently,

$$\sum_{k=1}^{n}H^*(X_{\mathcal D_k^+},Z_k\mid W_{\mathcal S_0^-})\ge\sum_{i=1}^{K-n}\Bigl(H^*(Z_i\mid X_{\mathcal D_1^+},W_{\mathcal S_{i-1}^-})-H^*(Z_{i+n}\mid X_{\mathcal D_n^+},W_{\mathcal S_{i-1}^-})\Bigr)+\alpha n(K-n)(F-1-\epsilon F)$$
$$\ge-\sum_{i=1}^{K-n}H^*(Z_{i+n}\mid W_{\mathcal S_{i-1}^-})+\alpha n(K-n)(F-1-\epsilon F)$$
$$=-\sum_{i=0}^{K-n-1}H^*(Z_1\mid W_{\{1,\dots,\beta+i\alpha\}})+\alpha n(K-n)(F-1-\epsilon F). \tag{G.70}$$

Applying (G.70) to (G.63), we have

$$RF\ge H^*(W_1\mid Z_1)+\frac{2}{n(n+1)\alpha}\Bigl(\alpha n(K-n)(F-1-\epsilon F)-\sum_{k=1}^{n}H^*(Z_k\mid W_{\{1,\dots,\beta\}})-\sum_{i=0}^{K-n-1}H^*(Z_1\mid W_{\{1,\dots,\beta+i\alpha\}})\Bigr)-(1+\epsilon F)$$
$$=H^*(W_1\mid Z_1)+\frac{2}{n(n+1)\alpha}\Bigl(\alpha n(K-n)F-nH^*(Z_1\mid W_{\{1,\dots,\beta\}})-\sum_{i=0}^{K-n-1}H^*(Z_1\mid W_{\{1,\dots,\beta+i\alpha\}})\Bigr)-\frac{2K-n+1}{n+1}(1+\epsilon F). \tag{G.71}$$

G.7 Proof of Theorem 11.1 for average rate

Here we prove Theorem 11.1 for the average rate (i.e., inequalities (11.8) and (11.10)). The upper bounds on $R^*_{\mathrm{ave}}$ in these inequalities can be achieved using the caching scheme provided in [11], hence we only need to prove the lower bounds. To do so, we define the following terminology: we divide the set of all demands, denoted by $\mathcal D$, into smaller subsets, which we refer to as types.
We use the same definition as in [11], stated as follows. Given an arbitrary demand $\mathbf d$, we define its statistics, denoted by $\mathbf s(\mathbf d)$, as a sorted array of length $N$ such that $s_i(\mathbf d)$ equals the number of users that request the $i$-th most requested file. We denote the set of all possible statistics by $\mathcal S$. Grouping by statistics, the set of all demands $\mathcal D$ can be broken into many subsets: for any statistics $\mathbf s\in\mathcal S$, we define the type $\mathcal D_{\mathbf s}$ as the set of demands with statistics $\mathbf s$. Note that for each demand $\mathbf d$, the value $N_e(\mathbf d)$ depends only on its statistics $\mathbf s(\mathbf d)$, and is thus identical across all demands in $\mathcal D_{\mathbf s}$. For convenience, we denote that value by $N_e(\mathbf s)$.

Given a prefetching scheme and a type $\mathcal D_{\mathbf s}$, we say a rate $R$ is $\epsilon$-achievable for type $\mathcal D_{\mathbf s}$ if we can find a function $R(\mathbf d)$ that is $\epsilon$-achievable for every demand $\mathbf d$ in $\mathcal D_{\mathbf s}$ and satisfies $R=\mathbb E_{\mathbf d}[R(\mathbf d)]$, where $\mathbf d$ is uniformly random in $\mathcal D_{\mathbf s}$. Hence, to characterize $R^*_{\mathrm{ave}}$, it is sufficient to lower bound the $\epsilon$-achievable rates for each type individually, and to show that for each type, the caching scheme provided in [11] is within the given constant factors of optimal for large $F$ and small $\epsilon$.

We first lower bound any $\epsilon$-achievable rate for each type as follows. Within a type $\mathcal D_{\mathbf s}$, we can find a demand $\mathbf d$ such that the users in $\{1,\dots,N_e(\mathbf s)\}$ request distinct files. We can easily generalize Lemma 11.2 to this demand, and any achievable rate for this demand, denoted by $R_{\mathbf d}$, is lower bounded by the following inequality:

$$R_{\mathbf d}\ge\frac1F\sum_{k=1}^{N_e(\mathbf s)}H(W_{d_k}\mid Z_{\{1,\dots,k\}},W_{\{d_1,\dots,d_{k-1}\}})-N_e(\mathbf s)\Bigl(\frac1F+\epsilon\Bigr). \tag{G.72}$$

Applying the same bounding technique to all demands in type $\mathcal D_{\mathbf s}$, we can prove that any rate that is $\epsilon$-achievable for $\mathcal D_{\mathbf s}$, denoted by $R_{\mathbf s}$, is bounded as follows:

$$R_{\mathbf s}\ge\frac1F\sum_{k=1}^{N_e(\mathbf s)}H^*(W_k\mid Z_{\{1,\dots,k\}},W_{\{1,\dots,k-1\}})-N_e(\mathbf s)\Bigl(\frac1F+\epsilon\Bigr), \tag{G.73}$$

where the function $H^*(\cdot)$ is defined in the proof of Theorem 11.2.
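The notions of statistics and types can be made concrete with a short helper. This is illustrative only; it assumes, consistent with the usage above, that $N_e(\mathbf d)$ is the number of distinct files requested by demand $\mathbf d$:

```python
from collections import Counter

def statistics(d, N):
    """s(d): sorted (descending) array of length N whose i-th entry is the
    multiplicity of the i-th most requested file in demand d (a tuple of file indices)."""
    counts = sorted(Counter(d).values(), reverse=True)
    return tuple(counts + [0] * (N - len(counts)))

def N_e(d):
    """Number of distinct files requested; it depends only on s(d)."""
    return len(set(d))

# demands with the same statistics belong to the same type
assert statistics((1, 1, 2), N=4) == statistics((3, 3, 1), N=4) == (2, 1, 0, 0)
assert N_e((1, 1, 2)) == 2
```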
Following the same steps as in the proof of Theorem 11.2, we can prove that

$$R_{\mathbf s}\ge s-1+\alpha-\frac{s(s-1)-\ell(\ell-1)+2\alpha s}{2(N-\ell+1)}\,M-N_e(\mathbf s)\Bigl(\frac1F+\epsilon\Bigr) \tag{G.74}$$

for arbitrary $s\in\{1,\dots,N_e(\mathbf s)\}$ and $\alpha\in[0,1]$, where $\ell\in\{1,\dots,s\}$ is the minimum value such that

$$\frac{s(s-1)-\ell(\ell-1)}{2}+\alpha s\le(N-\ell+1)\ell. \tag{G.75}$$

On the other hand, the caching scheme provided in [11] achieves an average rate of $\mathrm{Conv}\Bigl(\frac{\binom{K}{r+1}-\binom{K-N_e(\mathbf s)}{r+1}}{\binom{K}{r}}\Bigr)$ within each type $\mathcal D_{\mathbf s}$. Using the results in [11], we can easily prove that this average rate is upper bounded by $R_{\mathrm{dec}}(M,\mathbf s)$, defined as

$$R_{\mathrm{dec}}(M,\mathbf s)\triangleq\frac{N-M}{M}\Bigl(1-\Bigl(1-\frac MN\Bigr)^{N_e(\mathbf s)}\Bigr). \tag{G.76}$$

Hence, in order to prove (11.8) and (11.10), it suffices to prove that for large $F$ and small $\epsilon$, any $\epsilon$-achievable rate $R_{\mathbf s}$ for any type $\mathcal D_{\mathbf s}$ satisfies $R_{\mathbf s}\ge R_{\mathrm{dec}}(M,\mathbf s)/2.00884$ in the general case, and $R_{\mathbf s}\ge R_{\mathrm{dec}}(M,\mathbf s)/2$ when $N\ge\frac{K(K+1)}{2}$.

Note that the above characterization of $R_{\mathbf s}$ exactly matches a characterization of $R^*$ for a caching system with $N$ files and $N_e(\mathbf s)$ users. Specifically, the lower bound on $R_{\mathbf s}$ given by (G.74) exactly matches Theorem 11.2, and the upper bound $R_{\mathrm{dec}}(M,\mathbf s)$ defined in (G.76) exactly matches the upper bound $R_{\mathrm{dec}}(M)$ defined in (11.20). Thus, by reusing the arguments in the proof of Theorem 11.1 for the peak rate, we can easily prove that $R_{\mathbf s}\ge R_{\mathrm{dec}}(M,\mathbf s)/2.00884$ holds in the general case, and that $R_{\mathbf s}\ge R_{\mathrm{dec}}(M,\mathbf s)/2$ holds for sufficiently large $N$ when $\frac{N_e(\mathbf s)M}{N}>1$. Hence, to prove Theorem 11.1 for the average rate, we only need $R_{\mathbf s}\ge R_{\mathrm{dec}}(M,\mathbf s)/2$ for sufficiently large $N$ to also hold when $\frac{N_e(\mathbf s)M}{N}\le1$, which can be easily proved as follows. Using the same arguments as in the proof of Theorem 11.1 for the peak rate, the following inequality can be derived from (G.74) for large $N$, large $F$, and small $\epsilon$:

$$R_{\mathbf s}\ge N_e(\mathbf s)-\frac{N_e(\mathbf s)(N_e(\mathbf s)+1)}{2}\cdot\frac MN, \tag{G.77}$$

which is a linear function of $M$. Furthermore, since $R_{\mathrm{dec}}(M,\mathbf s)$ is convex, we only need to check that

$$\frac{R_{\mathrm{dec}}(M,\mathbf s)}{2}\le N_e(\mathbf s)-\frac{N_e(\mathbf s)(N_e(\mathbf s)+1)}{2}\cdot\frac MN \tag{G.78}$$

holds at $\frac{N_e(\mathbf s)M}{N}\in\{0,1\}$.
For $\frac{N_e(\mathbf s)M}{N}=0$, we have

$$\frac{R_{\mathrm{dec}}(M,\mathbf s)}{2}=\frac{N_e(\mathbf s)}{2}\le N_e(\mathbf s)=N_e(\mathbf s)-\frac{N_e(\mathbf s)(N_e(\mathbf s)+1)}{2}\cdot\frac MN. \tag{G.79}$$

For $\frac{N_e(\mathbf s)M}{N}=1$, we have

$$\frac{R_{\mathrm{dec}}(M,\mathbf s)}{2}=\frac{N_e(\mathbf s)-1}{2}\Bigl(1-\Bigl(1-\frac{1}{N_e(\mathbf s)}\Bigr)^{N_e(\mathbf s)}\Bigr)\le\frac{N_e(\mathbf s)-1}{2}=N_e(\mathbf s)-\frac{N_e(\mathbf s)(N_e(\mathbf s)+1)}{2}\cdot\frac MN. \tag{G.80}$$

This completes the proof of Theorem 11.1.

G.8 The exact rate-memory tradeoff for the two-user case

As mentioned in Remark 11.4, we can completely characterize the rate-memory tradeoff for the average rate in the two-user case, for any possible values of $N$ and $M$. We formally state this result in the following corollary.

Corollary G.1. For a caching system with 2 users, a database of $N$ files, and a local cache size of $M$ files at each user, we have

$$R^*_{\mathrm{ave}}=R_{u,\mathrm{ave}}(N,K,r), \tag{G.81}$$

where $R_{u,\mathrm{ave}}(N,K,r)$ is defined in Definition 11.1.

Proof. For the single-file case, only one possible demand exists; the average rate thus equals the peak rate, which can be easily characterized. Hence, we omit that case and focus on $N\ge2$. Since $R_{u,\mathrm{ave}}$ can be achieved using the scheme provided in [11], we only need to prove that $R^*_{\mathrm{ave}}\ge R_{u,\mathrm{ave}}(N,K,r)$.

As shown in Appendix G.7, the average rate within each type $\mathcal D_{\mathbf s}$ is bounded by (G.73). Hence, the minimum average rate under uniform file popularity given a prefetching scheme $\phi$, denoted by $R(\epsilon)$, is lower bounded by

$$R(\epsilon)\ge\mathbb E_{\mathbf s}\Bigl[\frac1F\sum_{k=1}^{N_e(\mathbf s)}H^*(W_k\mid Z_{\{1,\dots,k\}},W_{\{1,\dots,k-1\}})-N_e(\mathbf s)\Bigl(\frac1F+\epsilon\Bigr)\Bigr]. \tag{G.82}$$

Note that in the two-user case, $N_e(\mathbf s)$ equals $1$ with probability $\frac1N$ and $2$ with probability $\frac{N-1}{N}$. Consequently,

$$R(\epsilon)\ge\frac1F\Bigl(H^*(W_1\mid Z_1)+\frac{N-1}{N}\cdot H^*(W_2\mid Z_{\{1,2\}},W_1)\Bigr)-\frac{2N-1}{N}\cdot\Bigl(\frac1F+\epsilon\Bigr).$$
(G.83)

Using the technique developed in the proof of Theorem 11.2, we have the following two lower bounds:

$$R(\epsilon)\ge\frac1F H^*(W_1\mid Z_1)-\frac{2N-1}{N}\cdot\Bigl(\frac1F+\epsilon\Bigr)\ge1-\frac MN-\frac{2N-1}{N}\cdot\Bigl(\frac1F+\epsilon\Bigr), \tag{G.84}$$

$$R(\epsilon)\ge\frac1F\Bigl(H^*(W_1\mid Z_1)+\frac{N-1}{N}\cdot H^*(W_2\mid Z_{\{1,2\}},W_1)\Bigr)-\frac{2N-1}{N}\cdot\Bigl(\frac1F+\epsilon\Bigr)$$
$$\ge\frac1F\Bigl(H^*(W_1\mid Z_1)+\frac1N\bigl((N-1)F-2H^*(Z_1\mid W_1)\bigr)\Bigr)-\frac{2N-1}{N}\cdot\Bigl(\frac1F+\epsilon\Bigr)$$
$$\ge\frac{2N-1}{N}-\frac{3N-2}{N}\cdot\frac MN-\frac{2N-1}{N}\cdot\Bigl(\frac1F+\epsilon\Bigr). \tag{G.85}$$

Hence we have

$$R^*_{\mathrm{ave}}\ge\max\Bigl\{1-\frac MN,\ \frac{2N-1}{N}-\frac{3N-2}{N}\cdot\frac MN\Bigr\}=R_{u,\mathrm{ave}}(N,K,r). \tag{G.86}$$

G.9 Proof of Theorem 11.3 for average rate

To prove Theorem 11.3 for the average rate, we need to show that $R^*_{\mathrm{ave}}=R_u(N,K,r)$ for large $N$, for any caching system with no more than 5 users. Note that when $N$ is large, with high probability all users request distinct files. Hence, we only need to prove that the minimum average rate within the type of worst-case demands (i.e., the set of demands in which all users request distinct files) equals $R_u(N,K,r)$. Since $R_u(N,K,r)$ can already be achieved according to [11], it suffices to prove that this average rate is lower bounded by $R_u(N,K,r)$.

As in the peak-rate case, we prove that this holds if $\frac{KM}{N}\le1$ or $\frac{KM}{N}\ge\lceil\frac{K-3}{2}\rceil$ for large $N$. When $\frac{KM}{N}\le1$ or $\frac{KM}{N}\ge K-1$, this can be proved in the same way as Lemma G.1, while for the remaining case (i.e., $\frac{KM}{N}\in\bigl[\max\{\lceil\frac{K-3}{2}\rceil,1\},K-1\bigr)$), we need to prove a new version of Theorem 11.4, which lower bounds the average rate within the type of worst-case demands.

To simplify the discussion, we adopt the notation $H^*(X_{\mathcal D},W_{\mathcal A},Z_{\mathcal B})$ defined in (G.54), together with the corresponding notation for conditional entropies. Suppose a rate $R$ is $\epsilon$-achievable for the worst-case type; we start by proving converse bounds on $R$ for large $N$. Recall that $r=\frac{KM}{N}$, and let $n=\lfloor r+1\rfloor$. Because $r\in[1,K-1)$, we have $n\in\{2,\dots,K-1\}$. Let $\alpha=\lfloor\frac{N-K}{K-n}\rfloor$ and $\beta=N-\alpha(K-n)$, and suppose $N$ is large enough that $\alpha>0$.
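For concreteness, the parameter choices in the last paragraph can be sketched as follows (illustrative only; the sample values of $N$, $K$, and $M$ are hypothetical and simply follow the definitions $r=KM/N$, $n=\lfloor r+1\rfloor$, $\alpha=\lfloor(N-K)/(K-n)\rfloor$, $\beta=N-\alpha(K-n)$):

```python
import math

def converse_params(N, K, M):
    """Parameters used in the converse bound for the worst-case type."""
    r = K * M / N                    # normalized cache size, assumed in [1, K-1)
    n = math.floor(r + 1)
    alpha = (N - K) // (K - n)
    beta = N - alpha * (K - n)
    return r, n, alpha, beta

r, n, alpha, beta = converse_params(N=100, K=5, M=40)
assert 1 <= r < 5 - 1 and 2 <= n <= 5 - 1   # n lies in {2, ..., K-1}
assert alpha > 0 and beta + alpha * (5 - n) == 100
```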
For any $i\in\{1,\dots,n\}$ and $j\in\{1,\dots,\alpha\}$, let $\mathbf d^{i,j}$ be a demand satisfying

$$d^{i,j}_l=\begin{cases}l-i+(j-1)(K-n)+\beta & \text{if }i+1\le l\le i+K-n,\\ l & \text{otherwise}.\end{cases} \tag{G.87}$$

Note that the above demands belong to the worst-case type, so we have $RF\ge H^*(X_{\mathbf d^{i,j}})$ for any $i$ and $j$. Following the same steps as in the proof of Lemma G.2, we have

$$RF\ge H^*(W_1\mid Z_1)+\frac{2}{n(n+1)\alpha}\Bigl(\alpha n(K-n)F-nH^*(Z_1\mid W_{\{1,\dots,\beta\}})-\sum_{i=0}^{K-n-1}H^*(Z_1\mid W_{\{1,\dots,\beta+i\alpha\}})\Bigr)-\frac{2K-n+1}{n+1}(1+\epsilon F). \tag{G.88}$$

Then, following the steps of the proof of Theorem 11.4, we have

$$R\ge\frac{2K-n+1}{n+1}-\frac{K(K+1)}{n(n+1)}\cdot\frac MN-\frac{2K-n+1}{n+1}\Bigl(\epsilon+\frac1F\Bigr) \tag{G.89}$$

if the following inequality holds:

$$K\beta+\alpha\,\frac{(K-n)(K-n-1)}{2}\le\frac{n(n+1)\alpha}{2}. \tag{G.90}$$

Otherwise, we have

$$R\ge\frac{2K-n+1}{n+1}-\frac{2K(K-n)}{n(n+1)}\cdot\frac{M}{N-\beta}-\frac{2K-n+1}{n+1}\Bigl(\epsilon+\frac1F\Bigr). \tag{G.91}$$

Similarly to the proof of Lemma G.1, the above bounds show that $R\ge R_u(N,K,r)$ if $r\in\bigl[\max\{\lceil\frac{K-3}{2}\rceil,1\},K-1\bigr)$ for large $N$, large $F$, and small $\epsilon$. Consequently, we have proved that $R^*_{\mathrm{ave}}=R_u(N,K,r)$ if $r\le1$ or $r\ge\lceil\frac{K-3}{2}\rceil$ for large $N$. For systems with no more than 5 users, this gives the exact characterization.

G.10 Convexity of $R_u(N,K,r)$ and $R_{u,\mathrm{ave}}(N,K,r)$

In this appendix, we prove the convexity of $R_u(N,K,r)$ and $R_{u,\mathrm{ave}}(N,K,r)$ as functions of $r$, for given parameters $N$ and $K$. We start by proving the convexity of $R_u(N,K,r)$. Recall that for any non-integer $r$, the value of $R_u(N,K,r)$ is defined by linear interpolation. Hence, it suffices to show that $R_u(N,K,r)$ is convex on $r\in\{0,1,\dots,K\}$. Equivalently, we only need to prove

$$2R_u(N,K,r)-R_u(N,K,r-1)-R_u(N,K,r+1)\le0 \tag{G.92}$$

for any $r\in\{1,\dots,K-1\}$. The proof is as follows. We first observe that $R_u(N,K,r)$ can be written as

$$R_u(N,K,r)=\frac{\binom{K}{r+1}-\binom{K-\min\{K,N\}}{r+1}}{\binom{K}{r}} \tag{G.93}$$
$$=\frac{\sum_{i=1}^{\min\{K,N\}}\binom{K-i}{r}}{\binom{K}{r}} \tag{G.94}$$
$$=\sum_{i=1}^{\min\{K,N\}}\frac{\binom{K-r}{i}}{\binom{K}{i}}.$$
(G.95)

Consequently, the LHS of inequality (G.92) can be written as

$$2R_u(N,K,r)-R_u(N,K,r-1)-R_u(N,K,r+1)=\sum_{i=1}^{\min\{K,N\}}\frac{2\binom{K-r}{i}-\binom{K-r-1}{i}-\binom{K-r+1}{i}}{\binom{K}{i}} \tag{G.96}$$
$$=\sum_{i=1}^{\min\{K,N\}}\frac{\binom{K-r-1}{i-1}-\binom{K-r}{i-1}}{\binom{K}{i}} \tag{G.97}$$
$$=\sum_{i=2}^{\min\{K,N\}}\frac{-\binom{K-r-1}{i-2}}{\binom{K}{i}}. \tag{G.98}$$

Since both $\binom{K-r-1}{i-2}$ and $\binom{K}{i}$ are non-negative, we have proved inequality (G.92). This guarantees the convexity of $R_u(N,K,r)$.

Note that by substituting the variable $\min\{K,N\}$ in the function $R_u(N,K,r)$ by $N_e(\mathbf d)$ and taking the expectation over a uniformly random demand $\mathbf d$, we exactly obtain the function $R_{u,\mathrm{ave}}(N,K,r)$. Consequently, by applying the same substitution in the above proof, we obtain a proof of the convexity of $R_{u,\mathrm{ave}}(N,K,r)$.
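The rewriting (G.93) = (G.94) (a hockey-stick identity) and the discrete convexity (G.92) can be double-checked with exact arithmetic. This is a sketch using Python's integer binomials (note `math.comb(n, k)` returns 0 when k > n, matching the convention $\binom{K}{K+1}=0$):

```python
from fractions import Fraction
from math import comb

def R_u(N, K, r):
    """R_u(N, K, r) at integer r, via the binomial form (G.93)."""
    return Fraction(comb(K, r + 1) - comb(K - min(K, N), r + 1), comb(K, r))

for N in range(1, 9):
    for K in range(2, 9):
        # the hockey-stick identity behind (G.93) = (G.94)
        for r in range(0, K + 1):
            total = sum(comb(K - i, r) for i in range(1, min(K, N) + 1))
            assert R_u(N, K, r) == Fraction(total, comb(K, r))
        # discrete convexity (G.92)
        for r in range(1, K):
            assert 2 * R_u(N, K, r) <= R_u(N, K, r - 1) + R_u(N, K, r + 1)
```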