University of Southern California Dissertations and Theses
Theoretical foundations for dealing with data scarcity and distributed computing in modern machine learning (USC Thesis)
Theoretical Foundations for Dealing with Data Scarcity and Distributed Computing in Modern Machine Learning

by
Mohammadreza Mousavi Kalan

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2022

Copyright 2022 Mohammadreza Mousavi Kalan

Dedication

To My Dearest Parents

Table of Contents

Dedication
List of Tables
List of Figures
Abstract
Chapter 1: Minimax Lower Bounds for Transfer Learning with Linear and One-hidden Layer Neural Networks
  1.1 Introduction
  1.2 Problem Setup
    1.2.1 Transfer Learning Models
    1.2.2 Minimax Framework for Transfer Learning
  1.3 Main Results
  1.4 Experiments and Numerical Results
    1.4.1 ImageNet Experiments
    1.4.2 Numerical Results
  1.5 Proof outline and proof of Theorem 3.1 in the linear model
Chapter 2: Statistical Minimax Lower Bounds for Transfer Learning in Linear Binary Classification
  2.1 Introduction
  2.2 Prior Works
  2.3 Problem Formulation
    2.3.1 Transfer Learning Model
    2.3.2 Minimax Framework
  2.4 Main Results
  2.5 Experimental Results
Chapter 3: Fundamental Limits of Transfer Learning in Multiclass Classification
  3.1 Introduction
  3.2 Prior Works
  3.3 Problem Formulation
  3.4 Main Results
    3.4.1 Binary Classification
    3.4.2 Extension to Multiclass Classification
  3.5 Experimental Results
    3.5.1 Action Recognition
    3.5.2 Image Classification
  3.6 Proof outline
Chapter 4: Near-Optimal Straggler Mitigation for Distributed Gradient Methods
  4.1 Introduction
  4.2 Problem Formulation
  4.3 The Batched Coupon's Collector (BCC) Scheme
    4.3.1 Description of BCC
    4.3.2 Near Optimal Performance Guarantees for BCC
    4.3.3 Empirical Evaluations of BCC
      4.3.3.1 Experimental Setup
      4.3.3.2 Results
  4.4 Extension to Heterogeneous Clusters
    4.4.1 System Model
    4.4.2 Lower and Upper Bounds on the Optimal Value of P1
    4.4.3 Numerical Results
  4.5 Conclusion
Chapter 5: Lagrange Coded Computing: Optimal Design for Resiliency, Security, and Privacy
  5.1 Introduction
  5.2 Problem Formulation and Examples
  5.3 Main Results
  5.4 Lagrange Coded Computing
    5.4.1 Illustrating Example
    5.4.2 General Description
  5.5 Optimality
  5.6 Application to Linear Regression and Experiments on AWS EC2
Chapter 6: Fitting ReLUs via SGD and Quantized SGD
  6.1 Introduction
  6.2 Algorithms: SGD and QSGD
    6.2.1 Stochastic Gradient Descent (SGD) for Fitting ReLUs
    6.2.2 Reducing the Communication Overhead via Quantized SGD
  6.3 Main Results
    6.3.1 SGD for Fitting ReLUs
    6.3.2 Quantized SGD
  6.4 Numerical Results and Experiments on Amazon EC2
  6.5 Related Work
Bibliography
Chapter 7: Appendices
  7.1 Appendix for Chapter 1
    7.1.1 Calculating the Generalization Errors (Proof of Proposition 1.1)
    7.1.2 Calculating KL-Divergences for the Linear Model (Proof of Lemma 1.1)
    7.1.3 Lower Bound for Minimax Risk When $\Delta \geq \sqrt{\frac{\sigma^2 D \log 2}{r_T n_T}}$ and $\Delta < \sqrt{\frac{\sigma^2 D \log 2}{r_T n_T}}$ (Proof of Lemma 1.2)
    7.1.4 Lower Bound for Minimax Risk When $\Delta \leq \frac{1}{45}\sqrt{\frac{\sigma^2 D}{r_S n_S + r_T n_T}}$ (Proof of Lemma 1.3)
    7.1.5 Proof of Theorem 1.1 (One-hidden layer neural network with fixed hidden-to-output layer)
    7.1.6 Bounding the KL-Divergences in the Neural Network Model (Proof of Lemma 7.1)
    7.1.7 Proof of Theorem 1.1 (One-hidden layer neural network model with fixed input-to-hidden layer)
  7.2 Appendix for Chapter 2
    7.2.1 Calculating the Error of the Estimated Classifier (Proof of Proposition 2.3.2)
    7.2.2 Proof of Theorem 2.1
  7.3 Appendix for Chapter 3
    7.3.1 Proof of Theorem 3.1
    7.3.2 Proof of Theorem 3.2
    7.3.3 Proof of Theorem 3.3 (Extension to Multiclass Classification)
    7.3.4 Additional Experimental Results
  7.4 Appendix for Chapter 5
    7.4.1 The MDS property of $U_{\mathrm{bottom}}$
    7.4.2 The Uncoded Version of LCC
    7.4.3 Proof of Theorem 5.2a
    7.4.4 Proof of Theorem 5.2b
    7.4.5 Proof of Theorem 5.2c
    7.4.6 Proof of Lemma 7.8
    7.4.7 Proof of Lemma 7.9
  7.5 Appendix for Chapter 6
    7.5.1 Convergence Analysis for Fitting ReLUs via SGD (Proof of Theorem 6.1)
      7.5.1.1 Proof of Lemma 7.11
    7.5.2 Convergence of Quantized SGD (Theorem 6.2)
    7.5.3 Ensemble Method

List of Tables

1.1 Transfer distance and noise level for various source-target pairs.
2.1 Three pairs of source and target tasks along with the corresponding semantic distance.
3.1 Three pairs of source and target tasks for video recognition.
3.2 Transfer distance of pairs of source and target on UCF101 action recognition.
3.3 Three pairs of source and target tasks for image classification.
3.4 Transfer distance of pairs of source and target on DomainNet image classification.
4.1 Breakdown of the running times of the uncoded, cyclic repetition, and BCC schemes in scenario one.
4.2 Breakdown of the running times of the uncoded, cyclic repetition, and BCC schemes in scenario two.
5.1 Comparison between BGW-based designs and LCC. The computational complexity is normalized by that of evaluating f; randomness, which refers to the number of random entries used in the encoding functions, is normalized by the length of $X_i$.
6.1 Experiment scenarios.
6.2 Breakdown of the run-times in both scenarios.

List of Figures

1.1 Sample images from the source/target datasets derived from ImageNet. Transfer distance increases from top to bottom.
1.2 Train and test loss of a one-hidden layer network trained on the cat breeds dataset.
1.3 Target generalization error for a linear model ((a) and (b)) and a neural network model with fixed hidden-to-output layer ((c) and (d)).
1.4 Target generalization error for a linear model (a) and a neural network model with fixed hidden-to-output layer (b).
2.1 Theoretical lower bound along with upper bounds for three pairs of source and target obtained by weighted empirical risk minimization.
2.2 Average λ used in weighted ERM (2.6) for three different pairs of source and target tasks shown in Table 2.1.
3.1 Theoretical lower bounds.
3.2 Lower and upper bounds.
3.3 (a) depicts our lower bounds for three pairs of source and target tasks on action classification. (b) depicts the lower bounds along with the upper bounds obtained via weighted empirical risk minimization.
3.4 Theoretical lower bounds.
3.5 Lower and upper bounds.
3.6 (a) depicts our lower bounds for three pairs of source and target tasks on image classification. (b) depicts the lower bounds along with the upper bounds obtained via weighted empirical risk minimization.
3.7 Average λ in weighted empirical risk minimization for three different pairs of source and target tasks for action recognition.
3.8 Average λ in weighted empirical risk minimization for three different pairs of source and target tasks for image classification.
4.1 A master-worker distributed computing model for distributed gradient descent.
4.2 The tradeoff between the computational load and the average recovery threshold, for distributed GD using m = 100 training examples across n = 100 workers.
4.3 The data distribution of the proposed BCC scheme. The training dataset is evenly partitioned into m/r batches of size r, from which each worker independently selects one uniformly at random.
4.4 Running time comparison of the uncoded, cyclic repetition, and BCC schemes on Amazon EC2. In scenario one, we have n = 50 workers and m = 50 data batches. In scenario two, we have n = 100 workers and m = 100 data batches. Each data batch contains 100 data points. In both scenarios, the computational loads of the cyclic repetition scheme and the BCC scheme are chosen to minimize the total running times.
4.5 Illustration of the performance gain achieved by the generalized BCC scheme for a heterogeneous cluster.
5.1 An overview of the problem considered in this chapter, where the goal is to evaluate a not necessarily linear function f on a given dataset $X = (X_1,\ldots,X_K)$ using N workers. Each worker applies f on a possibly coded version of the inputs (denoted by $\tilde{X}_i$'s). By carefully designing the coding strategy, the master can decode all the required results from a subset of workers, in the presence of stragglers (workers $s_1,\ldots,s_S$) and Byzantine workers (workers $m_1,\ldots,m_A$), while keeping the dataset private from colluding workers (workers $c_1,\ldots,c_T$).
5.2 Run-time comparison of LCC with three other schemes: uncoded, GC, and MVM.
6.1 (a) Convergence behavior of mini-batch SGD iterates for various mini-batch sizes m. (b) Convergence behavior of QSGD iterates for various bits of quantization b, with the mini-batch size fixed at m = 800.
6.4 Empirical probability that (a) SGD and (b) QSGD with b = 7 finds the global optimum for different numbers of data points (n) and feature dimensions (d).
6.5 Run-time comparison of SGD and QSGD on Amazon EC2 clusters for the two scenarios.
7.1 Configuration of the parameters of source and target distributions in Lemma 1.2.
7.2 Configuration of the parameters of source and target distributions in Lemma 1.3.
7.3 Lower bounds along with the upper bounds obtained via weighted empirical risk minimization. In this setting the number of target samples is fixed at $n_T = 3$.

Abstract

In recent years machine learning has affected every aspect of human life, to the extent that it is hard to find a realm in which it does not have an application. For instance, machine learning has dramatically improved areas such as social networks, autonomous driving, computer vision, and medical diagnosis. When using and implementing machine learning techniques, however, there are some major challenges which can considerably limit their performance. One major challenge in many applications is insufficient data. Since the fuel of machine learning is data, and its performance depends mainly on the availability of a sufficient amount of data, the lack of an adequate amount of data can be a serious issue in many applications where the data collection process is expensive and time consuming. The other serious challenge when implementing machine learning algorithms is at the system level. Dealing with the sheer size and complexity of today's massive data sets, in applications where plenty of data is available, requires computational platforms that can analyze data in a parallel and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slow. These nodes, a.k.a. stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. This document discusses the aforementioned challenges and proposes some effective solutions from theoretical and practical perspectives.
As previously mentioned, training data is the backbone of machine learning, and the more data we have, the better performance and accuracy we can obtain from machine learning algorithms. In many applications, however, we suffer from data scarcity, as data collection procedures are expensive, time consuming, or impossible due to the sensitive nature of the data. One popular method to tackle this challenge is transfer learning. Suppose that the goal is to learn an inferential task in a specific domain, called the target, where labeled training data may be scarce. For learning the target task, we can get help from a different but related task, called the source, where plenty of labeled training data is available, which can compensate for the data deficiency in the target domain. Despite the recent empirical success of transfer learning approaches, the benefits and fundamental limits of transfer learning are poorly understood. In Chapter 1, we develop a statistical minimax framework to characterize the fundamental limits of transfer learning in the context of regression with linear and one-hidden layer neural network models. Specifically, we derive a lower bound for the target generalization error achievable by any algorithm as a function of the number of labeled source and target data, as well as appropriate notions of similarity between the source and target tasks. Our lower bound provides new insights into the benefits and limitations of transfer learning. We further corroborate our theoretical findings with various experiments. In Chapter 2, we take a step towards a foundational understanding of transfer learning by focusing on binary classification with linear models, and develop statistical minimax lower bounds in terms of the number of source and target samples and an appropriate notion of similarity between source and target tasks.
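The weighted empirical risk minimization baseline that appears in the experiments (e.g., Figures 2.1 and 2.2) trades off source and target losses with a weight λ. Below is a minimal sketch on synthetic linear-regression data; the dimensions, sample sizes, and the 0.3 parameter shift are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_S, n_T, sigma = 10, 500, 20, 0.1

# Related source/target tasks: parameters differ by a small shift ("transfer distance").
w_T = rng.standard_normal(d)
w_S = w_T + 0.3 * rng.standard_normal(d)

X_S = rng.standard_normal((n_S, d)); y_S = X_S @ w_S + sigma * rng.standard_normal(n_S)
X_T = rng.standard_normal((n_T, d)); y_T = X_T @ w_T + sigma * rng.standard_normal(n_T)

def weighted_erm(lam):
    """Minimize lam*||X_T w - y_T||^2 + (1 - lam)*||X_S w - y_S||^2 in closed form."""
    A = lam * X_T.T @ X_T + (1 - lam) * X_S.T @ X_S
    b = lam * X_T.T @ y_T + (1 - lam) * X_S.T @ y_S
    return np.linalg.solve(A, b)

for lam in (0.0, 0.5, 1.0):
    w_hat = weighted_erm(lam)
    # Estimation error vs. the true target weights: source-only (0), mixed, target-only (1).
    print(lam, np.linalg.norm(w_hat - w_T))
```

Sweeping λ and picking the value with the best target error is the simple upper-bound procedure against which the theoretical lower bounds are compared in the figures.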
This serves as a stepping stone for more general models and provides guidelines for the development of more effective transfer learning algorithms. In Chapter 3, we develop minimax lower bounds for multiclass classification problems. Key features of our lower bound are that it applies to arbitrary source/target data distributions and requires minimal assumptions, which enables its application to a broad range of problems. We also consider a more general setting in which there are multiple source domains for knowledge transfer to the target task, and develop new bounds on the generalization error in this setting.

At the system level, implementing machine learning algorithms faces challenges in applications where there is a massive amount of training data. In order to handle massive data sizes, it is crucial to develop parallel/distributed implementations of machine learning algorithms over multiple cores or GPUs on a single machine, or over multiple machines in computing clusters. As we distribute computations across many servers, however, several fundamental challenges arise. Cheap commodity hardware tends to vary greatly in computation time, and it has been demonstrated [30, 61, 110] that a small fraction of servers, referred to as stragglers, can be 5 to 8 times slower than the average, thus creating significant delays in computations. Also, as we distribute computations across many servers, massive amounts of data must be moved between them to execute the computational tasks, often over many iterations of a running algorithm, and this creates a substantial bandwidth bottleneck. In Chapters 4, 5, and 6 we discuss these issues and propose efficient solutions to them. In Chapter 4, we focus on the straggler issue, specifically in distributed gradient algorithms.
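To see concretely why a single slow node can dominate, consider a toy simulation (not from the text) that assumes i.i.d. Exponential(1) worker runtimes: the expected time to wait for all n workers grows like log n, even though each worker's mean runtime is 1.

```python
import random

random.seed(1)
n_workers = 100
trials = 2000

# An uncoded scheme must wait for every worker, i.e., for the slowest one.
slowest = [max(random.expovariate(1.0) for _ in range(n_workers))
           for _ in range(trials)]
avg_slowest = sum(slowest) / trials

# Theory: E[max of n i.i.d. Exp(1)] = H_n = 1 + 1/2 + ... + 1/n ~ log n.
H_n = sum(1.0 / i for i in range(1, n_workers + 1))
print(avg_slowest, H_n)  # both close to 5.2, vs. a mean runtime of 1.0 per worker
```

This roughly 5x blow-up from waiting on the last of 100 workers is the gap that straggler-mitigation schemes such as those in the following chapters aim to close.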
In these methods, each worker is assigned a portion of the data, computes partial gradients based on its local data set, and sends the results to a master node, where all the computations are aggregated into a full gradient and the learning model is updated. However, a major performance bottleneck that arises is that some of the worker nodes may run slow. These nodes, a.k.a. stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. We propose a distributed computing scheme, called Batched Coupon's Collector (BCC), to alleviate the effect of stragglers in gradient methods. We prove that our BCC scheme is robust to a near-optimal number of random stragglers. We also empirically demonstrate that our proposed BCC scheme reduces the run-time by up to 85.4% over Amazon EC2 clusters when compared with other straggler mitigation strategies. We also generalize the proposed BCC scheme to minimize the completion time when implementing gradient descent-based algorithms over heterogeneous worker nodes.

In Chapter 5, considering the distributed platform, we propose Lagrange Coded Computing (LCC), a new framework to simultaneously provide (1) resiliency against stragglers that may prolong computations; (2) security against Byzantine (or malicious) workers that deliberately modify the computation for their benefit; and (3) (information-theoretic) privacy of the dataset amidst possible collusion of workers. LCC, which leverages the well-known Lagrange polynomial to create computation redundancy in a novel coded form across workers, can be applied to any computation scenario in which the function of interest is an arbitrary multivariate polynomial of the input dataset, hence covering many computations of interest in machine learning. LCC significantly generalizes prior works to go beyond linear computations.
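The Lagrange-encoding idea can be illustrated with a toy example, not the chapter's exact construction: scalar data chunks, the polynomial computation f(x) = x², and arbitrary evaluation points. Workers compute f on coded inputs, and the master interpolates to recover f on the original data despite a straggler.

```python
from fractions import Fraction

def lagrange_eval(points, x):
    """Evaluate the unique interpolating polynomial through `points` at x."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

# Dataset: K = 3 scalar chunks; goal: compute f(X_i) = X_i**2 for each chunk.
X = [3, 1, 4]
K = len(X)
f = lambda v: v * v                       # a degree-2 polynomial computation

# Encoding: u is the degree-(K-1) polynomial with u(i) = X[i-1].
data_points = [(i + 1, X[i]) for i in range(K)]
N = 6                                     # number of workers
alphas = [10 + a for a in range(N)]       # distinct coded evaluation points
coded = [lagrange_eval(data_points, a) for a in alphas]

# Each worker applies f to its coded input; f(u(z)) has degree (K-1)*2 = 4,
# so any 5 of the 6 results determine it (one straggler tolerated).
results = [(a, f(c)) for a, c in zip(alphas, coded)]
survivors = results[:5]                   # pretend worker 6 straggles

# Decoding: interpolate f(u(z)) from the survivors and read off f(u(i)) = f(X[i-1]).
decoded = [lagrange_eval(survivors, i + 1) for i in range(K)]
print([int(v) for v in decoded])          # → [9, 1, 16]
```

The same interpolation structure is what lets the full scheme also correct maliciously corrupted results and, by mixing random masks into the encoding, keep the data private from colluding workers.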
It also enables secure and private computing in distributed settings, improving the computation and communication efficiency of the state-of-the-art. Furthermore, we prove the optimality of LCC by showing that it achieves the optimal tradeoff between resiliency, security, and privacy. Finally, we show via experiments on Amazon EC2 that LCC speeds up the conventional uncoded implementation of distributed least-squares linear regression by up to 13.43x, and also achieves a 2.36x-12.65x speedup over the state-of-the-art straggler mitigation strategies.

Finally, in Chapter 6, we study a quantized stochastic gradient descent method to resolve communication bottlenecks in distributed computing settings. In this chapter, we focus on the problem of finding the optimal weights of the shallowest of neural networks, consisting of a single Rectified Linear Unit (ReLU). We first show that mini-batch stochastic gradient descent, when suitably initialized, converges at a geometric rate to the planted model with a number of samples that is optimal up to numerical constants. Next, we focus on a parallel implementation where in each iteration the mini-batch gradient is calculated in a distributed manner across multiple processors and then broadcast to a master or to all other processors. To reduce the communication cost in this setting, we utilize a Quantized Stochastic Gradient Descent (QSGD) scheme in which the partial gradients are quantized. Perhaps unexpectedly, we show that QSGD maintains the fast convergence of SGD to a globally optimal model while significantly reducing the communication cost. We further corroborate our numerical findings via various experiments, including distributed implementations over Amazon EC2.

Chapter 1: Minimax Lower Bounds for Transfer Learning with Linear and One-hidden Layer Neural Networks

1.1 Introduction

Deep learning approaches have recently enjoyed wide empirical success in many applications spanning natural language processing to object recognition.
A major challenge with deep learning techniques, however, is that training accurate models typically requires lots of labeled data. While for many of the aforementioned tasks labeled data can be collected by crowd-sourcing, in many other settings such data collection procedures are expensive, time consuming, or impossible due to the sensitive nature of the data. Furthermore, deep learning techniques are often brittle and do not adapt well to changes in the data or the environment. Transfer learning approaches have emerged as a way to mitigate these issues. Roughly speaking, the goal of transfer learning is to borrow knowledge from a source domain, where lots of training data is available, to improve the learning process in a related but different target domain. Despite recent empirical success, the benefits as well as the fundamental limitations of transfer learning remain unclear, with many open challenges: What is the best possible accuracy that can be obtained via any transfer learning algorithm? How does this accuracy depend on how similar the source and target domain tasks are? What is a good way to measure similarity/distance between source and target domains? How does the transfer learning accuracy scale with the number of source and target data? How do the answers to the above questions change for different learning models?

At the heart of answering these questions is the ability to predict the best possible accuracy achievable by any algorithm and to characterize how this accuracy scales with how related the source and target data are, as well as with the number of labeled data in the source and target domains. In this chapter we take a step towards this goal by developing statistical minimax lower bounds for transfer learning, focusing on regression problems with linear and one-hidden layer neural network models.
Specifically, we derive a minimax lower bound for the generalization error in the target task as a function of the number of labeled training data from the source and target tasks. Our lower bound also explicitly captures the impact of the noise in the labels, as well as of an appropriate notion of transfer distance between source and target tasks, on the target generalization error. Our analysis reveals that in the regime where the transfer distance between the source and target tasks is large (i.e., the source and target are dissimilar), the best achievable accuracy mainly depends on the number of labeled training data available from the target domain, and there is a limited benefit to having access to more training data from the source domain. However, when the transfer distance between the source and target domains is small (i.e., the source and target are similar), both source and target play an important role in improving the target training accuracy. Furthermore, we provide various experiments on real data sets as well as synthetic simulations to empirically investigate the effect of the parameters appearing in our lower bound on the target generalization error.

Related work. There is a vast theoretical literature on the problem of domain adaptation, which is closely related to transfer learning [16, 112, 21, 109, 6, 98, 73]. The key difference is that in domain adaptation there is no labeled target data, while in transfer learning a few labeled target data are available in addition to the source data. Most of the existing results in the domain adaptation literature give an upper bound for the target generalization error. For instance, the papers [8, 9] provide an upper bound on the target generalization error in classification problems in terms of quantities such as the source generalization error, the optimal joint error of source and target, and the VC-dimension of the hypothesis class.
A more recent work [77] generalizes these results to a broad family of loss functions using Rademacher complexity measures. Relatedly, [121] derives a similar upper bound for the target generalization error as in [8], but in terms of other quantities. Finally, the recent paper [120] generalizes the results of [8, 77] to multiclass classification using a margin loss. More closely related to this work, there are a few interesting results that provide lower bounds for the target generalization error. For instance, focusing on domain adaptation, the paper [10] provides necessary conditions for successful target learning under a variety of assumptions, such as a covariate shift, similarity of unlabeled distributions, and existence of a joint optimal hypothesis. More recently, the paper [43] defines a new discrepancy measure between source and target domains, called the transfer exponent, and proves a minimax lower bound on the target generalization error under a relaxed covariate-shift assumption and a Bernstein class condition. [121] derives an information-theoretic lower bound on the joint optimal error of the source and target domains defined in [8]. Most of the above results are based on a covariate shift assumption, which requires the conditional distributions of the source and target tasks to be equal and the source and target tasks to have the same best classifier. In this chapter, however, we consider a more general case in which the source and target tasks are allowed to have different optimal classifiers. Furthermore, these results do not specifically study a neural network model. To the extent of our knowledge, this is the first work to develop minimax lower bounds for transfer learning with neural networks.

1.2 Problem Setup

We now formalize the transfer learning problem considered in this chapter. We begin by describing the linear and one-hidden layer neural network transfer learning regression models that we study. We then discuss the minimax approach to deriving transfer learning lower bounds.
1.2.1 Transfer Learning Models Weconsider a transferlearning probleminwhichtherearelabeledtrainingdatafromasourceand a targettaskandthegoalistofindamodelthathasgoodperformanceinthetargettask. Specifically, we assumewe haven S labeled training datafrom thesource domaingenerated accordingto asource domain distribution (x S ,y S )∼ P with x S ∈ R d representing the input/feature and y S ∈ R k the corresponding output/label. Similarly, we assume we haven T training data from the target domain generated according to (x T ,y T )∼ Q with x T ∈ R d and y T ∈ R k . Furthermore, we assume that the features are distributed asx S ∼N (0, Σ S ),x T ∼N (0, Σ T ) with Σ S and Σ T ∈R d×d denoting the covariance matrices. We also assume that the labels y S /y T are generated from ground truth mappings relating the features to the labels as follows y S =f(θ S ;x S ) +w S and y T =f(θ T ;x T ) +w T (1.1) where θ S and θ T are the parameters of the function f and w S ,w T ∼ N (0,σ 2 I k ) represents source/target label noise. In this chapter we focus on the following linear and one-hidden layer neural network models. Linear model. In this case, we assume thatf(θ S ;x S ) :=f(W S ;x S ) =W S x S andf(θ T ;x T ) := f(W T ;x T ) =W T x T whereW S ,W T ∈R k×d are two unknown matrices denoting the source/target 4 parameters. The goal is to use the source and target training data to find a parameter ma- trix c W T with estimated label b y T = c W T x T that achieves the smallest risk/generalization error E[∥y T −b y T ∥ 2 ℓ 2 ]. One-hidden layer neural network models. We consider two different neural network models where in one the hidden-to-output layer is fixed and in the other the input-to-hidden layer is fixed. Specifically, in the first model, we assume that f(θ S ;x S ) := f(W S ;x S ) = Vφ(W S x S ) and f(θ T ;x T ) := f(W T ;x T ) = Vφ(W T x T ) where W S ,W T ∈ R ℓ×d are two unknown weight matrices,V ∈R k×ℓ is a fixed and known matrix, and φ is the ReLU activation function. 
Similarly, in the second model, we assume that $f(\theta_S; x_S) := f(V_S; x_S) = V_S\varphi(W x_S)$ and $f(\theta_T; x_T) := f(V_T; x_T) = V_T\varphi(W x_T)$, with $V_S, V_T\in\mathbb{R}^{k\times\ell}$ two unknown weight matrices and $W\in\mathbb{R}^{\ell\times d}$ a known matrix. In both cases the goal is to use the source and target training data to find the unknown target parameter weights ($\widehat{W}_T$ or $\widehat{V}_T$) that achieve the smallest risk/generalization error $\mathbb{E}[\|y_T - \widehat{y}_T\|_{\ell_2}^2]$. Here, $\widehat{y}_T = V\varphi(\widehat{W}_T x_T)$ in the first model and $\widehat{y}_T = \widehat{V}_T\varphi(W x_T)$ in the second.

1.2.2 Minimax Framework for Transfer Learning

We now describe our minimax framework for developing lower bounds for transfer learning. As with most lower bounds, in a minimax framework we need to define a class of transfer learning problems for which the lower bound is derived. Therefore, we define $(P_{\theta_S}, Q_{\theta_T})$ as a pair of joint distributions of features and labels over a source and a target task, that is, $(x_S, y_S)\sim P_{\theta_S}$ and $(x_T, y_T)\sim Q_{\theta_T}$, with the labels obeying (1.1). In this notation, each pair of a source and target task is parametrized by $\theta_S$ and $\theta_T$. We stress that over the different pairs of source and target tasks, $\Sigma_S$, $\Sigma_T$, and $\sigma^2$ are fixed and only the parameters $\theta_S$ and $\theta_T$ change. As mentioned earlier, in a transfer learning problem we are interested in using both source and target training data to find an estimate $\widehat{\theta}_T$ of $\theta_T$ with small target generalization error. In a minimax framework, $\theta_T$ is chosen in an adversarial way, and the goal is to find an estimate $\widehat{\theta}_T$ that achieves the smallest worst-case target generalization risk

$$\sup\; \mathbb{E}_{\text{source and target samples}}\; \mathbb{E}_{Q_{\theta_T}}\big[\|y_T - \widehat{y}_T\|_{\ell_2}^2\big].$$

Here, the supremum is taken over the class of transfer problems under study (possible $(P_{\theta_S}, Q_{\theta_T})$ pairs). We are interested in considering classes of transfer learning problems which properly reflect the difficulty of transfer learning.
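As a concrete reference, the three data-generating models above can be sketched in a few lines of numpy. The dimensions, covariance, and noise level below are illustrative choices of ours, not values used anywhere in this chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, ell, n = 20, 5, 8, 100          # illustrative dimensions, not from the text
sigma = 0.1                           # label noise level
relu = lambda z: np.maximum(z, 0.0)   # the ReLU activation phi

Sigma_T = np.eye(d)                   # target feature covariance
X = rng.multivariate_normal(np.zeros(d), Sigma_T, size=n)   # rows are samples x_T

# Linear model: y = W x + w
W = rng.normal(size=(k, d))
Y_lin = X @ W.T + sigma * rng.normal(size=(n, k))

# One-hidden layer network, fixed hidden-to-output layer V: y = V phi(W x) + w
V = rng.normal(size=(k, ell))
W1 = rng.normal(size=(ell, d))
Y_nn1 = relu(X @ W1.T) @ V.T + sigma * rng.normal(size=(n, k))

# One-hidden layer network, fixed input-to-hidden layer W: y = V_T phi(W x) + w
Vt = rng.normal(size=(k, ell))
Y_nn2 = relu(X @ W1.T) @ Vt.T + sigma * rng.normal(size=(n, k))

print(Y_lin.shape, Y_nn1.shape, Y_nn2.shape)
```

In all three cases only the matrix being estimated ($W$ or $V_T$) differs between source and target; the covariances and noise level are shared, matching the class of problems described above.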
To this aim we need an appropriate notion of similarity, or transfer distance, between source and target tasks. To define the appropriate measure of transfer distance we are guided by the following proposition (see Section 7.1.1 for the proof), which characterizes the target generalization error for the linear and one-hidden layer neural network models.

Proposition 1.1. Let $Q_{\theta_T}$ be the data distribution over the target task with parameter $\theta_T$ according to one of the models defined in Section 1.2.1. The target generalization error of an estimated model with parameter $\widehat{\theta}_T$ is given by:

• Linear model:

$$\mathbb{E}_{Q_{\theta_T}}\big[\|\widehat{y}_T - y_T\|_{\ell_2}^2\big] = \big\|\Sigma_T^{1/2}(\widehat{W}_T - W_T)^T\big\|_F^2 + k\sigma^2 \tag{1.2}$$

• One-hidden layer neural network model with fixed hidden-to-output layer:

$$\mathbb{E}_{Q_{\theta_T}}\big[\|\widehat{y}_T - y_T\|_{\ell_2}^2\big] \ge \frac{1}{4}\,\sigma_{\min}^2(V)\,\big\|\Sigma_T^{1/2}(\widehat{W}_T - W_T)^T\big\|_F^2 + k\sigma^2 \tag{1.3}$$

• One-hidden layer neural network model with fixed input-to-hidden layer:

$$\mathbb{E}_{Q_{\theta_T}}\big[\|\widehat{y}_T - y_T\|_{\ell_2}^2\big] = \big\|\widetilde{\Sigma}_T^{1/2}(\widehat{V}_T - V_T)^T\big\|_F^2 + k\sigma^2 \tag{1.4}$$

Here,

$$\widetilde{\Sigma}_T := \left[\frac{1}{2}\,\|a_i\|_{\ell_2}\|a_j\|_{\ell_2}\,\frac{\sqrt{1-\gamma_{ij}^2} + \big(\pi - \cos^{-1}(\gamma_{ij})\big)\gamma_{ij}}{\pi}\right]_{ij}$$

where $a_i$ is the $i$th row of the matrix $W\Sigma_T^{1/2}$ and $\gamma_{ij} := \frac{a_i^T a_j}{\|a_i\|_{\ell_2}\|a_j\|_{\ell_2}}$.

Proposition 1.1 essentially shows how the generalization error is related to an appropriate distance between the estimated and ground truth parameters. This in turn motivates our notion of transfer distance/similarity between source and target tasks, discussed next.

Definition 1.1.
(Transfer distance) For a source and target task generated according to one of the models in Section 1.2.1, parametrized by $\theta_S$ and $\theta_T$, we define the transfer distance between these two tasks as follows:

• Linear model and one-hidden layer neural network model with fixed hidden-to-output layer:

$$\rho(\theta_S, \theta_T) = \rho(W_S, W_T) := \big\|\Sigma_T^{1/2}(W_S - W_T)^T\big\|_F \tag{1.5}$$

• One-hidden layer neural network model with fixed input-to-hidden layer:

$$\rho(\theta_S, \theta_T) = \rho(V_S, V_T) := \big\|\widetilde{\Sigma}_T^{1/2}(V_S - V_T)^T\big\|_F \tag{1.6}$$

where $\widetilde{\Sigma}_T$ is defined in Proposition 1.1.

With the notion of transfer distance in hand, we are now ready to formally define the class of pairs of distributions over source and target tasks on which we focus in this chapter.

Definition 1.2. (Class of pairs of distributions) For a given $\Delta\in\mathbb{R}^+$, $\mathcal{P}_\Delta$ is the class of pairs of distributions over source and target tasks whose transfer distance according to Definition 1.1 is at most $\Delta$. That is, $\mathcal{P}_\Delta = \{(P_{\theta_S}, Q_{\theta_T})\;|\;\rho(\theta_S, \theta_T)\le\Delta\}$.

With these ingredients in place, we are now ready to formally state the transfer learning minimax risk:

$$R_T(\mathcal{P}_\Delta) := \inf_{\widehat{\theta}_T}\;\sup_{(P_{\theta_S}, Q_{\theta_T})\in\mathcal{P}_\Delta}\;\mathbb{E}_{S_{P_{\theta_S}}\sim P_{\theta_S}^{1:n_S}}\;\mathbb{E}_{S_{Q_{\theta_T}}\sim Q_{\theta_T}^{1:n_T}}\;\mathbb{E}_{Q_{\theta_T}}\big[\|y_T - \widehat{y}_T\|_{\ell_2}^2\big] \tag{1.7}$$

Here, $S_{P_{\theta_S}}$ and $S_{Q_{\theta_T}}$ denote i.i.d. samples $\{(x_S^{(i)}, y_S^{(i)})\}_{i=1}^{n_S}$ and $\{(x_T^{(i)}, y_T^{(i)})\}_{i=1}^{n_T}$ generated from the source and target distributions. We would like to emphasize that $\widehat{y}_T$, as defined in Section 1.2.1, is a function of the samples $(S_{P_{\theta_S}}, S_{Q_{\theta_T}})$.

1.3 Main Results

In this section, we provide a lower bound on the transfer learning minimax risk (1.7) for the three transfer learning models defined in Section 1.2.1. As with any other quantity related to generalization error, this risk naturally depends on the size of the model and on how correlated the features are in the target model. The following definition aims to capture the effective number of parameters of the model.

Definition 1.3.
(Effective dimension) The effective dimensions of the three models defined in Section 1.2.1 are defined as follows:

• Linear model: $D := \mathrm{rank}(\Sigma_T)\,k - 1$.

• One-hidden layer neural network model with fixed hidden-to-output layer: $D := \mathrm{rank}(\Sigma_T)\,\ell - 1$.

• One-hidden layer neural network model with fixed input-to-hidden layer: $D := \mathrm{rank}(\widetilde{\Sigma}_T)\,k - 1$.

Our results also depend on another quantity which we refer to as the transfer coefficient. Roughly speaking, these quantities are meant to capture the relative effectiveness of a source training data point from the perspective of the generalization error of the target task, and vice versa.

Definition 1.4. (Transfer coefficients) Let $n_S$ and $n_T$ be the numbers of source and target training data. We define the transfer coefficients in the three models defined in Section 1.2.1 as follows:

• Linear model: $r_S := \big\|\Sigma_S^{1/2}\Sigma_T^{-1/2}\big\|^2$ and $r_T := 1$.

• One-hidden layer neural net with fixed output layer: $r_S := \big\|\Sigma_S^{1/2}\Sigma_T^{-1/2}\big\|^2\,\|V\|^2$ and $r_T := \|V\|^2$.

• One-hidden layer neural net model with fixed input layer: $r_S := \big\|\widetilde{\Sigma}_S^{1/2}\widetilde{\Sigma}_T^{-1/2}\big\|^2$ and $r_T := 1$.

Here,

$$\widetilde{\Sigma}_S := \left[\frac{1}{2}\,\|c_i\|_{\ell_2}\|c_j\|_{\ell_2}\,\frac{\sqrt{1-\widetilde{\gamma}_{ij}^2} + \big(\pi - \cos^{-1}(\widetilde{\gamma}_{ij})\big)\widetilde{\gamma}_{ij}}{\pi}\right]_{ij}$$

where $c_i$ is the $i$th row of $W\Sigma_S^{1/2}$ and $\widetilde{\gamma}_{ij} := \frac{c_i^T c_j}{\|c_i\|_{\ell_2}\|c_j\|_{\ell_2}}$; $\widetilde{\Sigma}_T$ is defined per Proposition 1.1. In the above expressions $\|\cdot\|$ stands for the operator norm. Furthermore, we define the effective numbers of source and target samples as $r_S n_S$ and $r_T n_T$, respectively.

With these definitions in place, we now present our lower bounds on the transfer learning minimax risk of any algorithm for the linear and one-hidden layer neural network models (see Sections 1.5 and 7.1 for the proofs).

Theorem 1.1. Consider the three transfer learning models defined in Section 1.2.1, consisting of $n_S$ and $n_T$ source and target training data generated i.i.d. according to a class of source/target distributions with transfer distance at most $\Delta$ per Definition 1.2.
Moreover, let $r_S$ and $r_T$ be the source and target transfer coefficients per Definition 1.4. Furthermore, assume the effective dimension $D$ per Definition 1.3 obeys $D\ge 20$. Then, the transfer learning minimax risk (1.7) obeys the following lower bounds:

• Linear model: $R_T(\mathcal{P}_\Delta)\ge B + k\sigma^2$.

• One-hidden layer neural network with fixed hidden-to-output layer: $R_T(\mathcal{P}_\Delta)\ge \frac{1}{4}\sigma_{\min}^2(V)\,B + k\sigma^2$.

• One-hidden layer neural network model with fixed input-to-hidden layer: $R_T(\mathcal{P}_\Delta)\ge B + k\sigma^2$.

Here, $\sigma_{\min}(V)$ denotes the minimum singular value of $V$ and

$$B := \begin{cases} \dfrac{\sigma^2 D}{256\, r_T n_T}, & \text{if } \Delta\ge\sqrt{\dfrac{\sigma^2 D\log 2}{r_T n_T}} \\[2ex] \dfrac{1}{100}\,\Delta^2\left[1 - 0.8\,\dfrac{r_T n_T\,\Delta^2}{\sigma^2 D}\right], & \text{if } \dfrac{1}{45}\sqrt{\dfrac{\sigma^2 D}{r_S n_S + r_T n_T}}\le\Delta<\sqrt{\dfrac{\sigma^2 D\log 2}{r_T n_T}} \\[2ex] \dfrac{\Delta^2}{1000} + \dfrac{6}{1000}\,\dfrac{D\sigma^2}{r_S n_S + r_T n_T}, & \text{if } \Delta<\dfrac{1}{45}\sqrt{\dfrac{\sigma^2 D}{r_S n_S + r_T n_T}} \end{cases} \tag{1.8}$$

Note that the nature of the lower bounds and the final conclusions provided by the above theorem are similar for all three models. More specifically, Theorem 1.1 leads to the following conclusions:

• Large transfer distance ($\Delta\ge\sqrt{\frac{\sigma^2 D\log 2}{r_T n_T}}$). When the transfer distance between the source and target tasks is large, source samples are helpful in decreasing the target generalization error until the error reaches $\frac{\sigma^2 D}{256\, r_T n_T}$. Beyond this point, increasing the number of source samples does not decrease the target generalization error further, and the error becomes dominated by the target samples. In other words, when the distance is large, source samples cannot compensate for target samples.

• Moderate distance ($\frac{1}{45}\sqrt{\frac{\sigma^2 D}{r_S n_S + r_T n_T}}\le\Delta<\sqrt{\frac{\sigma^2 D\log 2}{r_T n_T}}$). The lower bound in this regime suggests that if the distance between the source and target tasks is strictly positive, i.e., $\Delta>0$, then even with infinitely many source samples the target generalization error still does not go to zero and depends on the number of available target samples. In other words, source samples cannot compensate for the lack of target samples.

• Small distance ($\Delta<\frac{1}{45}\sqrt{\frac{\sigma^2 D}{r_S n_S + r_T n_T}}$).
In this case, the lower bound on the target generalization error scales with $\frac{1}{r_S n_S + r_T n_T}$, where $r_S n_S$ and $r_T n_T$ are the effective numbers of source and target samples per Definition 1.4. Hence, when $\Delta$ is small, the target generalization error scales with the reciprocal of the total effective number of source and target samples, which means that source samples are indeed helpful in reducing the target generalization error, and every source sample is roughly equivalent to $\frac{r_S}{r_T}$ target samples. Furthermore, when the distance between the source and target is zero, i.e., $\Delta = 0$, the lower bound reduces to $\frac{6}{1000}\frac{D\sigma^2}{r_S n_S + r_T n_T}$. Conforming with our intuition, in this case the bound resembles a non-transfer-learning scenario in which a combination of source and target samples is used. Indeed, the lower bound is proportional to the noise level and the effective dimension and inversely proportional to the total number of samples, matching typical statistical learning lower bounds.

1.4 Experiments and Numerical Results

We demonstrate the validity of our theoretical framework through experiments on real datasets sampled from ImageNet as well as synthetic simulated data. The experiments on ImageNet data allow us to investigate the impact of the transfer distance and noise parameters appearing in Theorem 1.1 on the target generalization error. However, since the source and target tasks are both image classification, they are inherently correlated with each other and we cannot expect a wide range of transfer distances between them. Therefore, we carry out a more in-depth study on simulated data to investigate the effect of the number of source and target samples on the target generalization error in the different transfer distance regimes. Full source code to reproduce the results can be found at [1].

1.4.1 ImageNet Experiments

Here we verify our theoretical formulation on a subset of ImageNet, a well-known image classification dataset, and show that our main theorem conforms with practical transfer learning scenarios.
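Throughout the experiments it is useful to be able to evaluate the bound of Theorem 1.1 directly. The following sketch is our own helper (the function name is ours; the constants are copied verbatim from (1.8)) and makes the three regime boundaries explicit:

```python
import numpy as np

def lower_bound_B(delta, sigma2, D, rs_ns, rt_nt):
    """Evaluate B from (1.8) given the transfer distance delta, noise level
    sigma^2, effective dimension D, and the effective numbers of source
    (r_S * n_S) and target (r_T * n_T) samples."""
    large = np.sqrt(sigma2 * D * np.log(2) / rt_nt)        # large-distance threshold
    small = np.sqrt(sigma2 * D / (rs_ns + rt_nt)) / 45.0   # small-distance threshold
    if delta >= large:   # large transfer distance: source samples saturate
        return sigma2 * D / (256 * rt_nt)
    if delta >= small:   # moderate transfer distance
        return (delta**2 / 100) * (1 - 0.8 * rt_nt * delta**2 / (sigma2 * D))
    # small transfer distance: every effective sample helps
    return delta**2 / 1000 + (6 / 1000) * D * sigma2 / (rs_ns + rt_nt)

# Even with a million effective source samples, a large delta leaves the
# bound pinned at sigma^2 D / (256 r_T n_T):
print(lower_bound_B(10.0, 1.0, 30, 10**6, 50))
```

Sweeping `delta` through this helper reproduces the saturation behavior discussed in the bullets above: the bound decreases with the effective source sample count only in the small-distance regime.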
Sample datasets. We create five datasets by sub-sampling 2000 images from five classes of ImageNet (400 examples per class). As depicted in Figure 1.1, we deliberately compile datasets covering a spectrum of semantic distances from each other in order to study the utility/effect of transfer distance on transfer learning. The picked datasets are as follows: cat breeds, big cats, dog breeds, butterflies, planes. For details of the classes in each dataset please refer to the code provided in [1].

[Figure 1.1: Sample images from the source/target datasets derived from ImageNet (cat breeds, dog breeds, big cats, butterflies, planes). Semantic distance increases from top to bottom.]

Table 1.1: Transfer distance and noise level for various source-target pairs.

Source / target task        ρ(source, target)   Validation loss   Noise level (σ)
cat breeds / dog breeds     11.62               0.2194            0.2095
cat breeds / big cats       12.35               0.1682            0.1834
cat breeds / butterflies    13.48               0.1367            0.1653
cat breeds / planes         16.41               0.1450            0.1703

We pass the images through a VGG16 network pretrained on ImageNet with the fully connected top classifier removed, and use the extracted features instead of the actual images. We set aside 10% of the dataset as a test set. Furthermore, 10% of the remaining data is used for validation and 90% for training. In the following we fix identifying cat breeds as the source task and the four other datasets as target tasks.

Training. We trained a one-hidden layer neural network for each dataset. To facilitate a fair comparison between the trained models in weight space, we fixed a random initialization of the hidden-to-output layer shared between all networks, and we only trained over the input-to-hidden layer (in accordance with the theoretical formulation). Moreover, we used the same initialization of the input-to-hidden weights. We trained a separate model on each of the five datasets with the MSE loss and one-hot encoded labels.
We use an Adam optimizer with a learning rate of 0.0001 and train for 100 epochs or until the network reaches 99.5% accuracy, whichever occurs first. The target noise levels are calculated based on the average loss of the trained ground truth models on the target validation set (note that this average loss equals $k\sigma^2 = 5\sigma^2$).

[Figure 1.2: Train and test loss of a one-hidden layer network trained on the cat breeds dataset. Left: source train loss vs. epochs. Right: target test loss vs. epochs for the big cats, dog breeds, butterflies, and planes target tasks.]

Results. First, we calculate the transfer distance from Definition 1.1 between the model trained on the source task (cat breeds) and the other four models trained on the target tasks, by fitting a ground truth model to each task using the complete training data. Our results, reported in Table 1.1, demonstrate that the introduced metric strongly correlates with perceived semantic distance. The closest tasks, cat breeds and dog breeds, are both characterized by pets with similar features and by humans frequently appearing in the images. Images in the second closest pair, cat breeds and big cats, include animals with similar features, but big cats have more natural scenes and fewer humans compared with dog breeds, resulting in a slightly higher distance from the source task. As expected, the distance between cat breeds and butterflies is significantly higher than in the case of the previous two targets, but these tasks share some characteristics such as the presence of natural backgrounds. The largest distance is between cat breeds and planes, which is clearly the furthest task semantically as well. Our next set of experiments focuses on checking whether the transfer distance is indicative of the target risk/generalization error. To this aim we use a very simple transfer learning approach in which we use only source data to train a one-hidden layer network as described before and measure its performance on the target tasks.
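For reference, the distance in Table 1.1 is an instance of (1.5), $\rho(W_S, W_T) = \|\Sigma_T^{1/2}(W_S - W_T)^T\|_F$, evaluated on the trained input-to-hidden weights. A minimal sketch of the computation (our own implementation, not the released code [1]; the use of an eigendecomposition for the matrix square root is our choice):

```python
import numpy as np

def transfer_distance(W_src, W_tgt, Sigma_T):
    """rho(W_S, W_T) = || Sigma_T^{1/2} (W_S - W_T)^T ||_F  (Definition 1.1)."""
    # Sigma_T is symmetric PSD, so its square root can be taken via eigh.
    evals, evecs = np.linalg.eigh(Sigma_T)
    sqrt_Sigma = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    return np.linalg.norm(sqrt_Sigma @ (W_src - W_tgt).T, ord="fro")

# For whitened features (Sigma_T = I) the distance is simply ||W_S - W_T||_F:
rng = np.random.default_rng(1)
Ws, Wt = rng.normal(size=(30, 64)), rng.normal(size=(30, 64))
d = transfer_distance(Ws, Wt, np.eye(64))
print(np.isclose(d, np.linalg.norm(Ws - Wt)))  # True
```

In the ImageNet experiments the covariance would be estimated from the extracted VGG16 features rather than assumed to be the identity.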
Note that the network has never seen examples of the target dataset. Figure 1.2 depicts how the train and test loss evolved over the training process. We stop after 10 epochs, when the validation losses on the target tasks have more or less stabilized. The results closely match our expectations from Theorem 1.1. Based on Table 1.1, the noise levels of the ground truth models for big cats, butterflies, and planes are about the same, and therefore their test loss follows the same ordering as their distances from the source task (see Table 1.1). Moreover, even though dog breeds has the lowest distance from the source task, it is also the noisiest. The lower bound in Theorem 1.1 includes an additive noise term, and therefore the change in ordering between dog breeds and butterflies is justified by our theory and demonstrates the effect of the target task noise level on generalization.

1.4.2 Numerical Results

In this section we perform synthetic numerical simulations in order to carefully cover all regimes of transfer distance from our main theorem, and show how the target generalization error depends on the number of source and target samples in the different regimes.

Experimental setup 1. First, we generate data according to the linear model with parameters $d = 200$, $k = 30$, $\sigma = 1$, $\Sigma_S = 2\cdot I_d$, $\Sigma_T = I_d$. Then we generate the source parameter matrix $W_S\in\mathbb{R}^{k\times d}$ with elements sampled from $\mathcal{N}(0, 10)$. Furthermore, we generate two target parameter matrices $W_{T_1}, W_{T_2}\in\mathbb{R}^{k\times d}$ for tasks $T_1$ and $T_2$ such that $W_{T_1} = W_S + M_1$ and $W_{T_2} = W_S + M_2$, where the elements of $M_1$ and $M_2$ are sampled from $\mathcal{N}(0, 10^{-3})$ and $\mathcal{N}(0, 3.6\times 10^5)$, respectively. Similarly, for the one-hidden layer neural network model with the output layer fixed, we set the parameters $k = 1$, $\ell = 30$, $d = 200$, $\sigma = 1$, $\Sigma_S = 2\cdot I_d$, $\Sigma_T = I_d$, and $V = \mathbf{1}_{k\times\ell}$. We also use the same $W_S$, $W_{T_1}$, $W_{T_2}$ as in the linear model.
We note that the transfer distance from the source task to target task $T_1$ is small, but the transfer distance from the source task to target task $T_2$ is large ($\rho(W_S, W_{T_1}) = 0.0183$ and $\rho(W_S, W_{T_2}) = 116.694$).

[Figure 1.3: Target generalization error for small and large transfer distances. (a) Linear model, varying $n_S$ with $n_T = 50$. (b) Linear model, varying $n_T$ with $n_S = 50$. (c) Neural network model with fixed hidden-to-output layer, varying $n_S$ with $n_T = 50$. (d) Neural network model with fixed hidden-to-output layer, varying $n_T$ with $n_S = 50$.]

Training approach 1. We test the performance of a simple transfer learning approach. Given $n_S$ source samples and $n_T$ target samples, we estimate $\widehat{W}_T$ by minimizing the weighted empirical risk

$$\min_W\;\frac{1}{2n_T}\sum_{i=1}^{n_T}\Big\|f(W; x_T^{(i)}) - y_T^{(i)}\Big\|_{\ell_2}^2 + \frac{\lambda}{2n_S}\sum_{j=1}^{n_S}\Big\|f(W; x_S^{(j)}) - y_S^{(j)}\Big\|_{\ell_2}^2 \tag{1.9}$$

We then evaluate the generalization error by testing the estimated model $\widehat{W}_T$ on 200 unseen test data points generated by the target model. All reported plots are the average of 10 trials.

Results 1. Figure 1.3 (a) depicts the target generalization error for target tasks $T_1$ and $T_2$ in the linear model for different $n_S$ values, with $\lambda = 1$ and $n_T = 50$. Figure 1.3 (b) depicts the target generalization error for target tasks $T_1$ and $T_2$ in the linear model for different $n_T$ values, with the number of source samples fixed at $n_S = 50$. Here, we set $\lambda = 1$ for target task $T_1$, where the transfer distance from the source is small, and $\lambda = 0.001$ for target task $T_2$, where the transfer distance from the source is large.
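For the linear model, the weighted empirical risk (1.9) admits a closed-form minimizer: setting the gradient with respect to $W$ to zero yields a $d\times d$ linear system. The following sketch is our own implementation, separate from the released code [1]:

```python
import numpy as np

def weighted_erm_linear(Xs, Ys, Xt, Yt, lam):
    """Minimize (1/2n_T) sum ||W x_T - y_T||^2 + (lam/2n_S) sum ||W x_S - y_S||^2.
    Rows of Xs/Xt are samples; rows of Ys/Yt are the corresponding labels."""
    n_s, n_t = Xs.shape[0], Xt.shape[0]
    A = Xt.T @ Xt / n_t + lam * Xs.T @ Xs / n_s   # d x d normal-equations matrix
    B = Xt.T @ Yt / n_t + lam * Xs.T @ Ys / n_s   # d x k right-hand side
    return np.linalg.solve(A, B).T                # estimated W_hat, shape k x d

# Sanity check: with identical, noiseless source and target data the ground
# truth parameter is recovered exactly (up to numerical precision).
rng = np.random.default_rng(2)
d, k, n = 20, 5, 500
W_true = rng.normal(size=(k, d))
X = rng.normal(size=(n, d))
Y = X @ W_true.T
W_hat = weighted_erm_linear(X, Y, X, Y, lam=1.0)
print(np.allclose(W_hat, W_true))  # True
```

For the neural network model no such closed form exists, and (1.9) is instead minimized by gradient descent on the input-to-hidden weights.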
Figures 1.3 (c) and 1.3 (d) have the same settings as Figures 1.3 (a) and 1.3 (b), but we use a one-hidden layer neural network model with fixed hidden-to-output weights in lieu of the linear model. Figures 1.3 (a) and (c) clearly demonstrate that when the transfer distance between the source and target tasks is large, increasing the number of source samples is not helpful beyond a certain point. In particular, the target generalization error starts to saturate and does not decrease further. Stated differently, in this case the source samples cannot compensate for the target samples. This observation conforms with our main theoretical result. Indeed, when the transfer distance $\Delta$ is large, $B$ is lower bounded by $\frac{\sigma^2 D}{256\, r_T n_T}$, which is independent of the number of source samples $n_S$. Furthermore, these figures also demonstrate that when the transfer distance is small, increasing the number of source samples is helpful and results in a lower target generalization error. This also matches our theoretical results, as when the transfer distance $\Delta$ is small, the target generalization error is proportional to $\frac{D\sigma^2}{r_S n_S + r_T n_T}$. Figures 1.3 (b) and (d) indicate that, regardless of the transfer distance between the source and target tasks, the target generalization error steadily decreases as the number of target samples increases. This is a good match with our theoretical results, as $n_T$ appears in the denominator of our lower bound in all three regimes. To further investigate the effect of the transfer distance between the source and target on the target generalization error, we consider another set of experiments below.

Experimental setup 2. For the linear model, we use the parameters $d = 50$, $k = 30$, $\sigma = 0.3$, $\Sigma_S = 2\cdot I_d$, and $\Sigma_T = I_d$. We generate the target parameter $W_T\in\mathbb{R}^{k\times d}$ with entries generated i.i.d. $\mathcal{N}(0, 10)$.
To create different transfer distances between the source and target data, we then generate the source parameter $W_S\in\mathbb{R}^{k\times d}$ as $W_S = W_T + i\cdot M$, where the elements of the matrix $M$ are sampled from $\mathcal{N}(0, 10^{-4})$ and $i$ varies between 1 and 140000 in increments of 400. Similarly, for the one-hidden layer neural network model with the output layer fixed, we pick the parameter values $k = 1$, $\ell = 30$, $d = 50$, $\sigma = 0.3$, $\Sigma_S = 2\cdot I_d$, and $\Sigma_T = I_d$, and set all of the entries of $V$ equal to one. Furthermore, we use the same source and target parameters $W_S$ and $W_T$ as in the linear model.

Training approach 2. Given $n_S = 300$ and $n_T = 20$ source and target samples, we minimize the weighted empirical risk (1.9). In this experiment we pick the $\lambda\in\{0, \frac{1}{4}, \frac{1}{2}, \frac{3}{4}, 1\}$ that minimizes the loss on a validation set consisting of 50 data points created from the same distribution as the target task. Finally, we test the estimated model on 200 unseen target test data points. The reported numbers are based on an average of 20 trials.

[Figure 1.4: Target generalization error as a function of the transfer distance for (a) a linear model and (b) a neural network model with fixed hidden-to-output layer, with $n_S = 300$ and $n_T = 20$.]

Results 2. Figure 1.4 depicts the target generalization error as a function of the transfer distance between the source and target in the linear and neural network models. This figure clearly shows that when the transfer distance is small, the generalization error exhibits quadratic growth. However, as the distance increases, the error saturates, which matches the behavior in $\Delta$ predicted by our lower bounds.

1.5 Proof outline and proof of Theorem 1.1 in the linear model

In this section we present a sketch of the proof of Theorem 1.1 for the linear model.
The proofs for the neural network models follow a similar approach and appear in Sections 7.1.5 and 7.1.7. Note that by Proposition 1.1, the generalization error is given by

$$\mathbb{E}_{Q_{\theta_T}}\big[\|\widehat{y}_T - y_T\|_{\ell_2}^2\big] = \big\|\Sigma_T^{1/2}(\widehat{W}_T - W_T)^T\big\|_F^2 + k\sigma^2.$$

Therefore, in order to find a minimax lower bound on the target generalization error, it suffices to find a lower bound for the following quantity:

$$R_T(\mathcal{P}_\Delta;\phi\circ\rho) := \inf_{\widehat{W}_T}\;\sup_{(P_{W_S}, Q_{W_T})\in\mathcal{P}_\Delta}\;\mathbb{E}_{S_{P_{W_S}}\sim P_{W_S}^{1:n_S}}\;\mathbb{E}_{S_{Q_{W_T}}\sim Q_{W_T}^{1:n_T}}\;\phi\Big(\rho\big(\widehat{W}_T(S_{P_{W_S}}, S_{Q_{W_T}}),\, W_T\big)\Big) \tag{1.10}$$

where $\phi(x) = x^2$ for $x\in\mathbb{R}$ and $\rho$ is per Definition 1.1. By using well-known techniques from the statistical minimax literature, we reduce the problem of finding a lower bound to a hypothesis testing problem (e.g., see [105, Chapter 15]). Since we are estimating the target parameter, i.e., $W_T$, to apply this framework we need to pick $N$ pairs of distributions from the set $\mathcal{P}_\Delta$ such that their target parameters are $2\delta$-separated in the transfer distance per Definition 1.1. To be more precise, we pick $N$ arbitrary pairs of distributions from $\mathcal{P}_\Delta$:

$$\big(P_{W_S^{(1)}}, Q_{W_T^{(1)}}\big), \ldots, \big(P_{W_S^{(N)}}, Q_{W_T^{(N)}}\big)$$

such that

$$\rho\big(W_T^{(i)}, W_T^{(j)}\big)\ge 2\delta \quad\text{for each } i\neq j\in[N] \quad (2\delta\text{-separated set})$$

and

$$\rho\big(W_S^{(i)}, W_T^{(i)}\big)\le\Delta \quad\text{for each } i\in[N] \quad (\text{as they belong to } \mathcal{P}_\Delta).$$

With these $N$ distribution pairs in place, we can follow a proof similar to that of [105, Proposition 15.1] to reduce the minimax problem to a hypothesis testing problem. In particular, consider the following $N$-ary hypothesis testing problem:

• $J$ is the uniform distribution over the index set $[N] := \{1, 2, \ldots, N\}$.

• Given $J = i$, generate $n_S$ i.i.d. samples from $P_{W_S^{(i)}}$ and $n_T$ i.i.d. samples from $Q_{W_T^{(i)}}$.

Here the goal is to find the true index from the $n_S + n_T$ available samples via a testing function $\psi$ mapping the samples to the indices. Let $E$ and $F$ be random variables such that $E\,|\,\{J = i\}\sim P_{W_S^{(i)}}$ and $F\,|\,\{J = i\}\sim Q_{W_T^{(i)}}$.
Furthermore, let $Z_P$ and $Z_Q$ consist of $n_S$ independent copies of the random variable $E$ and $n_T$ independent copies of the random variable $F$, respectively. In this setting, by slightly modifying [105, Proposition 15.1] we can conclude that

$$R_T(\mathcal{P}_\Delta;\phi\circ\rho)\ge\phi(\delta)\,\frac{1}{N}\sum_{i=1}^N \mathrm{Prob}\big(\psi(Z_P, Z_Q)\neq i\big)$$

where $\psi(Z_P, Z_Q) := \arg\min_{n\in[N]}\rho\big(\widehat{W}_T, W_T^{(n)}\big)$. Furthermore, by using Fano's inequality we can conclude that

$$\begin{aligned} R_T(\mathcal{P}_\Delta;\phi\circ\rho) &\ge \phi(\delta)\,\frac{1}{N}\sum_{i=1}^N \mathrm{Prob}\big(\psi(Z_P, Z_Q)\neq i\big) \\ &\ge \phi(\delta)\left(1 - \frac{I\big(J; (Z_P, Z_Q)\big) + \log 2}{\log N}\right) \\ &\ge \phi(\delta)\left(1 - \frac{I(J; Z_P) + I(J; Z_Q) + \log 2}{\log N}\right) \\ &\ge \phi(\delta)\left(1 - \frac{n_S\, I(J; E) + n_T\, I(J; F) + \log 2}{\log N}\right). \end{aligned} \tag{1.11}$$

Here the third inequality is due to the fact that, given $J = i$, $Z_P$ and $Z_Q$ are independent. To continue further, note that we can bound the mutual information by the following KL divergences:

$$I(J; E)\le\frac{1}{N^2}\sum_{i,j} D_{KL}\big(P_{W_S^{(i)}}\,\|\,P_{W_S^{(j)}}\big), \qquad I(J; F)\le\frac{1}{N^2}\sum_{i,j} D_{KL}\big(Q_{W_T^{(i)}}\,\|\,Q_{W_T^{(j)}}\big). \tag{1.12}$$

In the next lemma, proven in Section 7.1.2, we explicitly calculate the above KL divergences.

Lemma 1.1. Suppose that $P_{W_S^{(i)}}$ and $P_{W_S^{(j)}}$ are joint distributions of features and labels in a source task, and $Q_{W_T^{(i)}}$ and $Q_{W_T^{(j)}}$ are joint distributions of features and labels in a target task, as defined in Section 1.2.1 for the linear model. Then

$$D_{KL}\big(P_{W_S^{(i)}}\,\|\,P_{W_S^{(j)}}\big) = \frac{\big\|\Sigma_S^{1/2}(W_S^{(i)} - W_S^{(j)})^T\big\|_F^2}{2\sigma^2} \quad\text{and}\quad D_{KL}\big(Q_{W_T^{(i)}}\,\|\,Q_{W_T^{(j)}}\big) = \frac{\big\|\Sigma_T^{1/2}(W_T^{(i)} - W_T^{(j)})^T\big\|_F^2}{2\sigma^2}.$$

In the following two lemmas, we use local packing techniques to further simplify (1.11) using (1.12) and find minimax lower bounds in the different transfer distance regimes. We defer the proofs of these lemmas to Sections 7.1.3 and 7.1.4.

Lemma 1.2. Assume $\Delta\ge\sqrt{\frac{\sigma^2 D\log 2}{r_T n_T}}$, where $n_T$ is the number of target samples and $D$ and $r_T$ are defined per Definitions 1.3 and 1.4. Then we have the following lower bound:

$$R_T(\mathcal{P}_\Delta;\phi\circ\rho)\ge\frac{\sigma^2 D}{256\, r_T n_T}. \tag{1.13}$$

Furthermore, if $\Delta<\sqrt{\frac{\sigma^2 D\log 2}{r_T n_T}}$, then

$$R_T(\mathcal{P}_\Delta;\phi\circ\rho)\ge\frac{1}{100}\,\Delta^2\left(1 - 0.8\,\frac{r_T n_T\,\Delta^2}{\sigma^2 D}\right). \tag{1.14}$$

Lemma 1.3.
Assume we have access to $n_S$ source samples as well as $n_T$ target samples, and the transfer distance obeys $\Delta\le\frac{1}{45}\sqrt{\frac{\sigma^2 D}{r_S n_S + r_T n_T}}$, where $D$, $r_S$, and $r_T$ are per Definitions 1.3 and 1.4. Then,

$$R_T(\mathcal{P}_\Delta;\phi\circ\rho)\ge\frac{\Delta^2}{1000} + \frac{6}{1000}\,\frac{D\sigma^2}{r_S n_S + r_T n_T}. \tag{1.15}$$

The proof of the lower bound in Theorem 1.1 is complete by combining Lemmas 1.2 and 1.3.

Chapter 2

Statistical Minimax Lower Bounds for Transfer Learning in Linear Binary Classification

2.1 Introduction

Modern machine learning models have achieved unprecedented success in numerous applications spanning computer vision to natural language processing. Most of these models consist of millions of parameters, which require an abundance of labeled data for training. In many applications, however, due to scarcity of data, training models that also generalize well is challenging. Yet another challenge is that these models do not adapt well to new environments. In particular, their performance degrades with modest changes in the data set, and they may require as much data as training from scratch in the new environment. Transfer learning is a recent promising approach to tackle the aforementioned challenges by effectively utilizing the samples of a different but related source task, where there are typically many labeled samples, in order to improve the performance of the model on a target task with only a few labeled samples available for training. Indeed, in the modern deep learning literature such transfer learning approaches, which use pretrained models and fine-tuning, have enjoyed wide empirical success. Nevertheless, the fundamental limits and benefits of transfer learning have not been well understood and many key questions remain unanswered. How can we measure the similarity of two tasks to decide whether they are appropriate for transfer learning? Given access to a limited number of samples, what is the best possible accuracy we can achieve using any algorithm, regardless of its computational complexity?
How can we characterize the generalization error of the target task as a function of the number of source and target samples as well as a measure of similarity between them? In this chapter, we focus on answering these questions for linear models. This serves as a stepping stone toward more general models and provides guidelines for the development of more effective transfer learning algorithms. To this aim, we first define a measure that quantitatively captures the similarity distance between different tasks. We then derive statistical minimax lower bounds for binary classification with Gaussian features as a function of the numbers of source and target samples as well as this measure of similarity. Our lower bounds consist of different regimes. When the distance between the source and target tasks is large, the corresponding lower bound is only a function of the target samples, indicating that the target error is determined by the number of target samples and that source samples are useful only up to a point. On the other hand, in the regime where the distance between the source and target is small, the target error depends on both the number of source and target samples, which demonstrates that source samples are useful when the source is similar to the target. Finally, we perform various experiments on real data sets to corroborate our theoretical findings and investigate the utility of the measure defined in this chapter in practical scenarios.

2.2 Prior Works

Related to transfer learning, [16, 112, 21, 109, 6, 98, 73] study the domain adaptation problem, where the goal is to adapt the hypothesis learned on the source domain to the target domain with small target generalization error. The common assumption in domain adaptation is that the source and target share the same labeling function, which is referred to as covariate shift, and that the marginal distributions have a small shift under an appropriate notion of similarity measure.
There are numerous results in this literature that provide sufficiency results by finding upper bounds on the target generalization error. These upper bounds guarantee that certain types of algorithms achieve a target generalization error not exceeding a threshold. [8, 9] first introduce a similarity measure between source and target which can be estimated from a finite number of unlabeled source and target samples. They then provide an upper bound on the target generalization error of a hypothesis in terms of the error of the hypothesis in the source domain as well as the distance between source and target under the introduced measure. [77] introduces a new discrepancy distance utilizing Rademacher complexity, which extends the results of [8, 9] to a broad family of loss functions. More recently, [59] develops novel algorithms that achieve near-optimal minimax risk in linear regression problems for two scenarios: (1) the source and target share the same conditional distribution, which is also referred to as covariate shift; (2) the marginal distributions of the source and target are the same, which is referred to as model shift. There are also a few results providing lower bounds on the target generalization error. Under the assumptions of covariate shift, existence of a joint optimal hypothesis, and similarity of distributions, [10] provides impossibility results. By defining a new discrepancy measure, called the transfer exponent, [43] derives a minimax lower bound on the target generalization error; [43] makes a relaxed covariate-shift assumption as well as a Bernstein class condition assumption on the label noise. More closely related to this work, [81] derives minimax lower bounds for transfer learning with one-hidden layer neural networks in regression problems. Unlike most results in this literature, [81] does not make the covariate-shift assumption. Our work derives minimax lower bounds for classification with linear models without making the covariate-shift assumption.
2.3 Problem Formulation

In this section we formalize transfer learning in binary classification. We first introduce the model considered in this chapter and then describe the minimax framework used to derive the desired lower bounds.

2.3.1 Transfer Learning Model

We consider a problem where there are n_S and n_T labeled samples from a source and a target domain. Specifically, we denote the labeled source and target samples by (x_S, y_S) ~ P and (x_T, y_T) ~ Q, where x_S, x_T ∈ R^d denote the features/inputs and y_S, y_T ∈ {−1, 1} denote the labels/outputs. Moreover, P and Q denote the underlying joint distributions of the source and target samples. We also assume that the features are generated by normal distributions, namely x_S, x_T ~ N(0, I_d), and that the labels are distributed as follows:

Prob(y_S = 1 | x_S) = 1 / (1 + exp(−x_S^T θ_S))      (2.1)
Prob(y_T = 1 | x_T) = 1 / (1 + exp(−x_T^T θ_T))      (2.2)

where θ_S, θ_T ∈ R^d are the ground truth parameters of the source and target tasks. The optimal Bayes classifier for the target is then

C_{θ_T}(x_T) = 1 if x_T^T θ_T ≥ 0, and −1 otherwise.      (2.3)

In a transfer learning problem we aim to find an estimate θ̂_T, by exploiting both the source and target samples, such that the corresponding classifier C_{θ̂_T} is close to the optimal Bayes classifier as measured by the risk

Prob( C_{θ̂_T}(x_T) ≠ C_{θ_T}(x_T) ).

Note that the Bayes and estimated classifiers do not depend on the magnitudes of θ_T and θ̂_T. Therefore, without loss of generality, we can assume that they lie on the unit sphere in R^d.

2.3.2 Minimax Framework

In order to develop a minimax framework for the transfer learning problem, we need to define a class of transfer learning problems consisting of pairs of source and target tasks. Here we denote each pair of source and target tasks by (P_{θ_S}, Q_{θ_T}), parametrized by θ_S and θ_T, where the distributions P_{θ_S} and Q_{θ_T} denote the joint distributions of features and labels over the source and target, i.e.
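As a concrete illustration, the generative model above is easy to simulate. The sketch below is our own (function names are not from the dissertation): features are drawn from N(0, I_d), labels are drawn from the logistic model (2.1)-(2.2) with labels in {−1, +1}, and the Bayes classifier (2.3) simply thresholds x^T θ at zero.

```python
import numpy as np

def sample_task(theta, n, rng):
    """Draw n samples with x ~ N(0, I_d) and labels from the logistic
    model Prob(y = 1 | x) = 1 / (1 + exp(-x^T theta)), y in {-1, +1}."""
    d = theta.shape[0]
    X = rng.standard_normal((n, d))
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # Eq. (2.1)/(2.2)
    y = np.where(rng.random(n) < p, 1, -1)
    return X, y

def bayes_classifier(theta, X):
    """Optimal Bayes classifier (2.3): +1 if x^T theta >= 0, else -1."""
    return np.where(X @ theta >= 0, 1, -1)

rng = np.random.default_rng(0)
theta_T = rng.standard_normal(5)
theta_T /= np.linalg.norm(theta_T)           # WLOG theta_T lies on the unit sphere
X, y = sample_task(theta_T, 1000, rng)
preds = bayes_classifier(theta_T, X)
```

Since the labels are noisy under the logistic model, even the Bayes classifier does not agree with every sampled label; it only achieves the best possible agreement rate.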
(x_S, y_S) ~ P_{θ_S} and (x_T, y_T) ~ Q_{θ_T}. In a transfer learning problem we use the source and target samples to find an estimate θ̂_T of θ_T; in other words, θ̂_T is a function of the source and target samples. In a minimax framework, the target parameter θ_T is chosen in an adversarial manner, and we are interested in minimizing the risk

sup E_{source and target samples}[ Prob( C_{θ̂_T}(x_T) ≠ C_{θ_T}(x_T) ) ],

where the supremum is taken over an appropriate class of source and target tasks within a given distance, to reflect the difficulty of transfer learning. In order to define the classes of source and target tasks, we need an appropriate notion of transfer distance between them. First, we state the following proposition regarding the risk used to measure the performance of the estimate of the target parameter.

Proposition 2.3.2. Let θ_T be the target parameter in Equation (2.2) and C_{θ_T} the optimal Bayes classifier defined in (2.3). Furthermore, let C_{θ̂_T} be an estimate of the Bayes classifier using an estimate θ̂_T of θ_T. Then the risk measuring the performance of the estimate is given by

Prob( C_{θ̂_T}(x_T) ≠ C_{θ_T}(x_T) ) = (1/π) arccos( θ̂_T^T θ_T ).

Proposition 2.3.2 motivates us to define the transfer distance between a source and target as follows.

Definition 2.1. (Transfer distance) For a source and target with parameters θ_S and θ_T, we define the transfer distance between them as

ρ(θ_S, θ_T) := (1/π) arccos( θ_S^T θ_T ).

Equipped with this notion of transfer distance, we can now state the transfer learning minimax risk:

R_T^Δ := inf_{θ̂_T} sup_{ρ(θ_S, θ_T) ≤ Δ} E_{S_{P_{θ_S}} ~ P_{θ_S}^{1:n_S}, S_{Q_{θ_T}} ~ Q_{θ_T}^{1:n_T}}[ Prob( C_{θ̂_T}(x_T) ≠ C_{θ_T}(x_T) ) ].      (2.4)

Here, S_{P_{θ_S}} and S_{Q_{θ_T}} denote the i.i.d. samples {(x_S^(i), y_S^(i))}_{i=1}^{n_S} and {(x_T^(i), y_T^(i))}_{i=1}^{n_T} generated from the source and target distributions. Moreover, the parameter 0 ≤ Δ ≤ 1 determines the class of pairs of source and target tasks over which the supremum is taken.
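The angular risk identity in Proposition 2.3.2, which also defines the transfer distance of Definition 2.1, can be checked numerically. The sketch below (our own illustration) compares the closed form (1/π)·arccos(θ̂^T θ) against a Monte Carlo estimate of the disagreement probability of the two linear classifiers under Gaussian features.

```python
import numpy as np

def transfer_distance(theta_a, theta_b):
    """(1/pi) * arccos(theta_a^T theta_b): the transfer distance of
    Definition 2.1, and equally the misclassification risk of
    Proposition 2.3.2 when theta_a is an estimate of theta_b."""
    inner = np.clip(theta_a @ theta_b, -1.0, 1.0)  # guard against rounding
    return float(np.arccos(inner) / np.pi)

# Monte Carlo check of the identity
# Prob( C_{theta_hat}(x) != C_{theta}(x) ) = (1/pi) arccos(theta_hat^T theta)
rng = np.random.default_rng(1)
theta = rng.standard_normal(8); theta /= np.linalg.norm(theta)
theta_hat = rng.standard_normal(8); theta_hat /= np.linalg.norm(theta_hat)

X = rng.standard_normal((200_000, 8))              # x ~ N(0, I_d)
mc_risk = np.mean((X @ theta >= 0) != (X @ theta_hat >= 0))
closed_form = transfer_distance(theta_hat, theta)
```

With 200,000 Gaussian samples the Monte Carlo estimate matches the closed form to within about a third decimal place, which is why the proposition lets us work with angles instead of probabilities.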
2.4 Main Results

We now present our lower bounds on the transfer learning minimax risk (2.4).

Theorem 2.1. Consider the transfer learning model defined in Section 2.3.1, consisting of n_S and n_T source and target training data generated i.i.d. according to a class of source/target tasks with transfer distance at most Δ per Definition 2.1. Furthermore, assume the dimension d obeys d ≥ 300 and n_T > d/800. Then the transfer learning minimax risk (2.4) obeys the following lower bounds:

R_T^Δ ≥
  c · d / n_T,                                                          if Δ ≥ B1,
  (1 − cos²(Δ)) · ( 1 − [ n_T (1 − cos²(Δ)) + log 2 ] / (.04 d) ),      if B2 ≤ Δ < B1,      (2.5)
  c · d / (n_S + n_T),                                                  if Δ < B2,

where

B1 = (1/π) arccos( sqrt( 1 − (d / (200 n_T)) · [ 1/4 − (100 log 2) / d ] ) ),
B2 = (.04 d − log 2) / (16 π (n_S + n_T)),

and c is a numerical constant.

Theorem 2.1 consists of three regimes:

• Large transfer distance (Δ ≥ B1). In this regime the lower bound is independent of the number of source samples n_S, which indicates that source samples are helpful only until the target error for estimating the Bayes classifier reaches c · d / n_T. Beyond this point, increasing n_S no longer helps reduce the target error, since in this regime the source is far from the target and the similarity between them is low.

• Moderate distance (B2 ≤ Δ < B1). The distance between source and target in this regime is lower than in the previous regime. The lower bound still does not depend on n_S, which shows that even when the distance is not large but is strictly positive, i.e. Δ > 0, source samples cannot fully compensate for target samples: even with infinitely many source samples, the error does not go to zero.

• Small distance (Δ < B2). In this regime, since the source and target are similar to each other, source samples are as useful as target samples in reducing the target error. Furthermore, in a non-transfer-learning setting, the minimax risk is proportional to the dimension and to the reciprocal of the number of samples.
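The regime structure of Theorem 2.1 can be made concrete with a small helper. The sketch below is our own transcription of the stated thresholds B1 and B2 (we read "log 2" as the natural logarithm of 2, an assumption on our part, and the numerical constant c is left symbolic); it reports which regime a given (Δ, n_S, n_T) falls into.

```python
import numpy as np

def regime_thresholds(d, n_S, n_T):
    """Thresholds B1, B2 of Theorem 2.1, transcribed from the statement;
    treat this as a sketch of the stated formulas, not a verified result."""
    assert d >= 300 and n_T > d / 800, "Theorem 2.1 assumes d >= 300 and n_T > d/800"
    ln2 = np.log(2.0)  # assumption: "log 2" denotes the natural log
    B1 = np.arccos(np.sqrt(1.0 - (d / (200.0 * n_T)) * (0.25 - 100.0 * ln2 / d))) / np.pi
    B2 = (0.04 * d - ln2) / (16.0 * np.pi * (n_S + n_T))
    return B1, B2

def regime(delta, d, n_S, n_T):
    """Classify the transfer distance delta into one of the three regimes."""
    B1, B2 = regime_thresholds(d, n_S, n_T)
    if delta >= B1:
        return "large distance: bound ~ c*d/n_T (extra source samples stop helping)"
    if delta >= B2:
        return "moderate distance: bound independent of n_S"
    return "small distance: bound ~ c*d/(n_S + n_T)"
```

For example, with d = 300, n_S = 10,000, and n_T = 500, a distance of Δ = 0.5 lands in the large-distance regime, while a near-zero Δ lands in the small-distance regime where source and target samples count alike.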
Similarly, in this regime the risk is proportional to the dimension and to the reciprocal of the combined number of source and target samples, as if source samples were as effective as target samples.

2.5 Experimental Results

In this section we evaluate our theoretical results on a subset of the DomainNet data set [88]. By plotting the theoretical lower bounds together with upper bounds obtained by weighted empirical risk minimization, we investigate the sharpness of the lower bounds. Furthermore, we investigate whether the semantic transfer distance defined in Definition 2.1 conforms with practical settings.

Experimental setup. We use DomainNet to perform an image classification task. We first pick three pairs of source and target tasks as described in Table 2.1. We then extract features of dimension 2048 by passing the raw images through a ResNet50 network pretrained on ImageNet.

Training. For each pair, we train linear networks separately for the source and target tasks. Using the estimated parameters appearing in Definition 2.1, we calculate the semantic distance for each pair as shown in Table 2.1. To find the corresponding upper bounds, we run weighted empirical risk minimization using the following formulation:

min_{θ_1, θ_2}  (1 − λ)/n_T · Σ_{i=1}^{n_T} Cost( C_{θ_1}(x_T^(i)), y_T^(i) ) + λ/n_S · Σ_{i=1}^{n_S} Cost( C_{θ_2}(x_S^(i)), y_S^(i) )      (2.6)

where λ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} and the cost function is the logistic loss. We run each experiment five times and report the average of the results.

Results. As Table 2.1 shows, the pair (Source1, Target) has the lowest transfer distance among the pairs, since both the source and target share the same objects, namely Clock and Ambulance. The semantic distance of pair 2 is less than that of pair 3 because in pair 2 the source and target share at least one common object, namely Clock. Plotting the theoretical lower bounds requires knowing the numerical constants appearing in Theorem 2.1; we discuss this in Appendix 7.2.
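The weighted empirical risk minimization of (2.6) can be sketched in a few lines. The implementation below is our own illustration: we simplify by sharing a single parameter θ across the two terms (the formulation above allows separate θ_1, θ_2), use labels in {−1, +1}, and minimize the logistic loss with plain gradient descent on synthetic data whose source direction is a small perturbation of the target direction.

```python
import numpy as np

def weighted_erm(X_T, y_T, X_S, y_S, lam, lr=0.5, iters=300):
    """Gradient descent on the weighted logistic risk in the spirit of (2.6).
    A single shared parameter theta is used (a simplifying choice of ours);
    labels are in {-1, +1}."""
    theta = np.zeros(X_T.shape[1])

    def grad(X, y):
        # gradient of the mean loss log(1 + exp(-y * x^T theta));
        # 1/(1+exp(m)) is computed as 0.5*(1 - tanh(m/2)) for stability
        m = y * (X @ theta)
        s = -y * 0.5 * (1.0 - np.tanh(m / 2.0))
        return (X * s[:, None]).mean(axis=0)

    for _ in range(iters):
        theta -= lr * ((1.0 - lam) * grad(X_T, y_T) + lam * grad(X_S, y_S))
    return theta

# toy data: target labels from theta_true, source labels from a nearby direction
rng = np.random.default_rng(2)
d = 10
theta_true = np.ones(d) / np.sqrt(d)
theta_src = theta_true + 0.1 * rng.standard_normal(d)
theta_src /= np.linalg.norm(theta_src)
X_T = rng.standard_normal((40, d));  y_T = np.where(X_T @ theta_true >= 0, 1, -1)
X_S = rng.standard_normal((400, d)); y_S = np.where(X_S @ theta_src >= 0, 1, -1)
theta_hat = weighted_erm(X_T, y_T, X_S, y_S, lam=0.6)

# evaluate the learned direction against the true Bayes classifier
X_test = rng.standard_normal((5000, d))
acc = np.mean(np.where(X_test @ theta_hat >= 0, 1, -1)
              == np.where(X_test @ theta_true >= 0, 1, -1))
```

Sweeping λ over a grid, as in the experiments above, and keeping the value with the best target test error reproduces the selection procedure used for the upper-bound curves.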
Fig 2.1 demonstrates that pairs with small semantic distance have lower target generalization error when the number of target samples is small, because source samples are then more useful and compensate for the scarcity of target samples. Furthermore, Fig 2.2 shows that pairs with lower semantic distance select a higher λ in (2.6), which again suggests the effectiveness of source samples when the distance is small.

Tasks                                            Transfer distance
Target: Clock vs. Ambulance (Clipart)            -
Source1: Clock vs. Ambulance (Sketch)            0.35
Source2: Clock vs. apple (Sketch)                0.41
Source3: apple vs. animal-migration (Sketch)     0.48
Table 2.1: Three pairs of source and target tasks along with the corresponding semantic distance.

[Figure 2.1: Theoretical lower bound along with upper bounds for three pairs of source and target (Source1-Target, Source2-Target, Source3-Target) obtained by weighted empirical risk minimization; generalization error vs. number of target samples.]

[Figure 2.2: Average λ used in weighted ERM (2.6) for the three different pairs of source and target tasks shown in Table 2.1.]

Chapter 3
Fundamental Limits of Transfer Learning in Multiclass Classification

3.1 Introduction

Modern machine learning models such as deep neural networks have enjoyed wide success in many domains [54]. The success of such deep models critically relies on the enormous amount of data required for training these massive models. For instance, GPT-3, the state-of-the-art model for natural language processing, has 175 billion parameters and requires a data set of 45 terabytes for training. However, in new or emerging application domains it is often extremely difficult or costly to gather such large labeled training data. A promising approach to this problem has been transfer learning, which aims at leveraging abundant labeled data available from a related source task to reduce the amount of labeled data required for the target task [87, 108].
From a practical perspective, transfer learning has been rather successful empirically. In particular, state-of-the-art transfer learning approaches based on pre-trained models and fine-tuning have led to significant improvements on various benchmark datasets. Despite this empirical success, however, there is a huge gap between theory and practice in transfer learning, and the fundamental limits and benefits of transfer learning are not well understood. Key challenging questions include: What is an appropriate notion of similarity between different tasks, and how can it be quantitatively defined and computed on real data? What is the best achievable accuracy of any transfer learning algorithm with only a limited number of source and target samples? How does this accuracy depend on the number of samples and the similarity between the source and target tasks?

While the answers to these challenging questions are still not fully understood, they have attracted a lot of interesting theoretical work in this area [36]. We discuss this literature in thorough detail in Section 3.2. In this chapter, we take a step towards answering the aforementioned key questions, enabling a better understanding of the fundamental limits of transfer learning. We first focus on binary classification, where the goal is to learn a classifier from a hypothesis class with finite VC dimension, and then develop extensions to multiclass classification. This covers most contemporary classification models, including training deep neural networks for classification tasks. In this setting, we first define a natural notion of similarity between source and target tasks via the performance of the best source hypothesis on the target task.
Then, equipped with this notion of similarity, we derive a statistical minimax lower bound on the target generalization error in terms of the number of labeled data from the source and target tasks; the VC dimension of the hypothesis class for binary classification (and the Natarajan dimension [83] for multiclass classification); and the similarity between the source and target tasks. Furthermore, we extend this result to the case where there are multiple sources with different similarities to the target. Our results demonstrate that sources with high similarity to the target are more effective at reducing the target generalization error. Towards bridging the theory-practice gap in transfer learning, we also demonstrate the utility of our theoretical results in concrete applications. Indeed, a key feature of our results is that our lower bounds can be easily and efficiently computed on real data sets and apply to a broad class of practical settings. In summary, our key contributions are as follows:

• We develop a novel statistical minimax lower bound on the generalization error that can be achieved for binary classification by any transfer learning algorithm, as a function of the amount of source and target samples and a natural notion of similarity between source and target tasks.

• We develop extensions of our results to multiclass classification problems based on a generalization of the VC dimension called the Natarajan dimension.

• A key feature of our lower bound (including our notion of similarity) is that it can be easily computed on real-world data sets. Furthermore, our lower bound holds for any source/target distribution and applies, with minimal assumptions, to a wide variety of contemporary learning models including deep neural networks.

• We investigate the sharpness of our lower bounds and demonstrate their utility via experiments on action recognition and image classification.
3.2 Prior Works

A literature closely related to transfer learning is domain adaptation, where there are no or very few labeled target data and the goal is to adapt the hypothesis learned on the source domain to achieve a low target generalization error [20, 15, 5, 73, 98]. Most of this literature assumes that the source and target share a common labeling rule but that there is a shift in the marginal distributions. There are many upper bounds on the target generalization error in this setting. For instance, [9, 8] gives an upper bound on the target generalization error in terms of the source generalization error and a divergence measure between the domains that can be estimated from finitely many unlabeled data from the source and target. In another work, [77] introduces a new discrepancy distance and generalizes the results of [9] to a wide family of loss functions using Rademacher complexity. In a similar setting, but for the multiple-source domain adaptation scheme, [76] proposes a family of algorithms based on the idea of model selection, under the assumption that the target distribution is close to some convex combination of the sources. A more recent work [59] studies linear regression under distribution shift, including covariate shift (i.e., the conditional distributions of the source and target are the same) as well as model shift (i.e., only the distributions of the features of the source and target are the same), and develops algorithms achieving near optimal minimax risk in this setting. In addition to upper bounds, there are also a few results that provide lower bounds on the target generalization error. [28] provides impossibility results under the assumption of covariate shift and small discrepancy of the unlabeled distributions. [82] studies transfer learning with one-hidden layer neural networks for regression problems.
That work defines a notion of similarity between the source and target tasks based on a distance between the ground truth parameters of the source and target networks. Using this distance, it develops a statistical minimax lower bound on the target generalization error in terms of the number of source and target samples as well as the defined similarity of the source and target under distribution shift, with the assumption that the features are generated by Gaussian distributions. Compared to [82], our result has quite a few unique advantages: (1) We do not assume that the source and target data are generated according to a planted (teacher) network, and our results hold even in the agnostic setting. (2) [82] applies to regression problems, while our result covers classification. (3) [82] considered only one-hidden layer neural networks for predicting the labels of extracted features; our result can handle arbitrary deep neural networks. (4) Our notion of similarity between the source and target distributions can be much more easily estimated using only a few target data, without the need to estimate the ground truth target parameters, which requires a lot of labeled target data. Most closely related to this work, [43] derives a minimax lower bound on the target generalization error in binary classification under a relaxed version of the covariate shift assumption and a small transfer exponent parameter, which is defined to measure the discrepancy of the source and target distributions. Our work differs from this previous work in that, except for assuming that the VC dimension of the model is finite, we do not make any further assumptions. This makes our results applicable to a much broader set of classification and decision-making problems.
Furthermore, our lower bound can be evaluated on real data sets and can serve as a guideline to practitioners, helping them decide when utilizing additional knowledge from a source domain is useful for a given target task. Most of the transfer learning literature tries to provide sufficiency and necessity results by deriving upper and lower bounds on the target generalization error in a relatively general setting. However, these papers often require a variety of assumptions to find the optimal classifier in a target domain in closed form. For instance, [52, 51] defines a joint prior distribution of the source and target domains using a Wishart distribution, which relates the source and target tasks and makes it possible to study and understand the transferability between domains. Furthermore, in this setting, the authors develop a closed-form optimal Bayesian transfer learning rule and demonstrate its advantage over a classifier obtained from target data alone. Related to this setting but for regression, [51] obtains the optimal Bayesian transfer learning rule under a joint Gaussian feature/label distribution. In contrast with the above, in our work we do not make any assumptions about the distribution of the data.

3.3 Problem Formulation

We consider a transfer learning problem where there are labeled training data from a source task and a target task, with the goal of inferring a hypothesis function with small generalization error on the target task. More specifically, we assume we have n_S and n_T source and target labeled data, where each training data point consists of an input/feature as well as an output/label. We denote the source and target training data by (x_S, y_S) ~ P and (x_T, y_T) ~ Q, respectively, where y_S, y_T ∈ {0, 1} and P, Q are the joint feature-label distributions of the source and target data. Additionally, we assume that source and target features/inputs share the same domain, x_S, x_T ∈ χ, and H ⊂ 2^χ denotes a fixed hypothesis class with VC dimension d_H.
In transfer learning the goal is to find a hypothesis from H that minimizes the target excess risk, defined below, based on a combination of source and target data.

Definition 3.1. (Excess risk) For a hypothesis function h ∈ H and source and target feature-label data generated according to distributions P and Q ((x_S, y_S) ~ P and (x_T, y_T) ~ Q), we define the target and source excess risks as

E_T(h) = Q[h(x_T) ≠ y_T] − Q[h*_T(x_T) ≠ y_T]  and  E_S(h) = P[h(x_S) ≠ y_S] − P[h*_S(x_S) ≠ y_S],

where

h*_T = argmin_{h ∈ H} Q[h(x_T) ≠ y_T]  and  h*_S = argmin_{h ∈ H} P[h(x_S) ≠ y_S].

Next, we need to define an appropriate notion of distance between the source and target. In the domain adaptation literature, where the conditional expectation remains unchanged and there is only a shift in the input distributions, it is common to define the distance as the error of the best source hypothesis on the target task. We likewise define the distance between source and target as the target excess risk of the best source hypothesis.

Definition 3.2. (Transfer distance) We define the transfer distance between a source and a target with distributions P and Q as

ρ(P, Q) := Q[h*_S(x_T) ≠ y_T] − Q[h*_T(x_T) ≠ y_T].      (3.1)

Since we aim to derive a minimax lower bound for transfer learning in binary classification, we consider the class of pairs of distributions whose transfer distance is within a fixed number Δ. As we will elaborate further in Remark 3.7 below, this notion of distance can be easily estimated/computed in practice.

3.4 Main Results

3.4.1 Binary Classification

In this section we characterize the fundamental limits of transfer learning in binary classification by deriving a minimax lower bound via information-theoretic arguments.

Theorem 3.1. Consider a transfer learning problem with n_S source and n_T target data, where the hypothesis class H has VC dimension d_H obeying d_H ≥ 10.
Furthermore, suppose that ĥ = ĥ(S_P, S_Q) is an estimated hypothesis for the target task using source and target data, where S_P and S_Q denote the i.i.d. feature-label data pairs {(x_S^(i), y_S^(i))}_{i=1}^{n_S} and {(x_T^(i), y_T^(i))}_{i=1}^{n_T} generated according to the source and target distributions P and Q. Fix a transfer distance Δ < 0.99. Then for any ĥ there exists (P, Q) with ρ(P, Q) ≤ Δ and a universal constant c such that

Prob_{S_P, S_Q}( E_T(ĥ) > c · ε(n_S, n_T, d_H, Δ) ) ≥ (3 − 2√2)/8,

where

ε(n_S, n_T, d_H, Δ) = sqrt( 1 / ( n_T/d_H + n_S/(d_H + n_S Δ) ) ).      (3.2)

This also implies that

inf_ĥ sup_{ρ(P,Q) ≤ Δ} E_{S_P, S_Q}[ E_T(ĥ) ] ≥ c · ε(n_S, n_T, d_H, Δ).      (3.3)

Remark 3.1. The bound above characterizes the fundamental limits of transfer learning by providing a lower bound on the excess risk of any algorithm (regardless of computational tractability) as a function of the number of source and target training data, the similarity/distance between the source and target tasks, and the dimension of the hypothesis class used.

Remark 3.2. The assumption Δ < 0.99 in the statement of Theorem 3.1 is made merely to simplify the analysis; the upper bound of 0.99 can be replaced by any constant in the interval (0, 1).

Remark 3.3. One can show that the numerical constant c in (3.3) obeys c > (3 − 2√2)/48.

Remark 3.4. (Connection to PAC learning) The well-known agnostic PAC learning result for a single task gives a lower bound of c · sqrt(d_H / n), where n is the number of samples for the task. Theorem 3.1 recovers this result when there is no source task, namely n_S = 0, and the transfer learning problem reduces to learning a task without any prior knowledge from the source.

Remark 3.5. (Identical source and target) When the source and target tasks are identical, the transfer learning problem reduces to learning a single task with n_S + n_T training data.
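The rate ε(n_S, n_T, d_H, Δ) of Eq. (3.2) is a one-liner to compute, and the sanity checks in the remarks that follow (no source, identical tasks, saturation for Δ > 0) fall out directly. The sketch below is our own illustration of the formula.

```python
import numpy as np

def eps_lower_bound(n_S, n_T, d_H, delta):
    """epsilon(n_S, n_T, d_H, Delta) from Eq. (3.2):
    sqrt( 1 / ( n_T/d_H + n_S/(d_H + n_S*Delta) ) )."""
    return float(np.sqrt(1.0 / (n_T / d_H + n_S / (d_H + n_S * delta))))
```

With n_S = 0 the rate reduces to sqrt(d_H/n_T), the agnostic PAC rate (Remark 3.4); with Δ = 0 it becomes sqrt(d_H/(n_S + n_T)), the single-task rate on the pooled data (Remark 3.5); and as n_S → ∞ with Δ > 0 it saturates at sqrt(1/(n_T/d_H + 1/Δ)), so source data alone can never drive the bound to zero.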
Theorem 3.1 also leads to the same conclusion in this special case: when the source and target are identical, Δ = 0 and thus ε = sqrt(d_H / (n_S + n_T)), which states that the lower bound is proportional to the reciprocal of the combined number of source and target samples, as expected.

Remark 3.6. (Sharpness in a special case) The above lower bound is known to be tight in special cases. For instance, when there is a small amount of source data and Δ is rather large, the lower bound reduces to sqrt(d_H / n_T), which is known to be tight by known agnostic PAC learning bounds.

Remark 3.7. (How to apply Theorem 3.1 in practical settings) In this remark we explain how Theorem 3.1 can be applied when using contemporary machine learning models involving artificial neural networks. In this case, the hypothesis class corresponds to all neural networks with a fixed architecture but different parameters. It is known that the class of neural networks with a fixed architecture has finite VC dimension, and [7] gives upper and lower bounds on the VC dimension of neural networks with ReLU activation functions. Thus, to apply Theorem 3.1, one only needs an estimate of the transfer distance per Definition 3.2. The transfer distance (3.1) consists of two terms. To estimate the first term, note that h*_S can be easily estimated due to the abundance of source data in most applications. With an estimate of h*_S in hand, one can estimate Q[h*_S(x_T) ≠ y_T] rather accurately using a simple empirical average over a few target test data, as well-known concentration results for bounded functions imply that this empirical average concentrates well around Q[h*_S(x_T) ≠ y_T]. At first glance, it seems that estimating the second term, which corresponds to the lowest possible error in the target domain over the hypothesis class, requires a large amount of labeled target data, which is not available in a practical problem.
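The estimation procedure that Remark 3.7 describes can be sketched directly: evaluate a fitted source hypothesis on a handful of labeled target points, average the 0/1 errors, and (optionally) subtract an estimate of the best achievable target error. The toy example below is our own construction (a mismatched linear source hypothesis on synthetic, noiseless target data); the Hoeffding radius function quantifies the concentration the remark alludes to.

```python
import numpy as np

def estimate_transfer_distance(h_source, X_T, y_T, best_target_err=0.0):
    """Empirical version of Definition 3.2: average target error of the best
    source hypothesis minus the best achievable target error (often taken
    as ~0 in overparametrized settings, per Remark 3.7)."""
    return float(np.mean(h_source(X_T) != y_T)) - best_target_err

def hoeffding_radius(n, failure_prob=0.05):
    """Two-sided Hoeffding deviation for an average of n {0,1}-valued errors:
    with prob >= 1 - failure_prob the estimate is within this radius."""
    return float(np.sqrt(np.log(2.0 / failure_prob) / (2.0 * n)))

# toy example: a linear source hypothesis evaluated on synthetic target data
rng = np.random.default_rng(3)
d, n = 20, 2000
theta_T = np.ones(d) / np.sqrt(d)
theta_S = theta_T.copy()
theta_S[0] *= -1                                  # a mismatched source direction
theta_S /= np.linalg.norm(theta_S)
X_T = rng.standard_normal((n, d))
y_T = np.where(X_T @ theta_T >= 0, 1, -1)         # noiseless target labels

def h_S(X):
    return np.where(X @ theta_S >= 0, 1, -1)

rho_hat = estimate_transfer_distance(h_S, X_T, y_T)
```

For this construction the population transfer distance is (1/π)·arccos(18/20) ≈ 0.14, and with 2000 target samples the empirical estimate lands within a Hoeffding radius of about 0.03 of it.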
However, in an overparametrized setting it is typical to assume that there exists a network that achieves very small target generalization error, so we can ignore the second term in most practical problems. Finally, as stated earlier, the lower bound on the target excess risk gives an estimate of the generalization performance we can expect with a certain number of source and target samples. Furthermore, by comparing the estimated transfer distances of different pairs of tasks, we can find the pairs that are more suitable for transfer learning. This knowledge can in turn significantly reduce the number of target samples required to achieve a certain accuracy.

Next, we extend our result to a multiple-source transfer learning setup, where instead of only one source task there are several source tasks available, and the goal is to transfer knowledge from multiple sources to a given target task to achieve a small target generalization error.

Theorem 3.2. Suppose that there are n_{S_1}, n_{S_2}, ..., n_{S_N} samples from N source tasks as well as n_T samples from a target task, and that the hypothesis class H has VC dimension d_H obeying d_H ≥ max(N + 9, N/2). Furthermore, suppose that ĥ = ĥ(S_{P_1}, S_{P_2}, ..., S_{P_N}, S_Q) is an estimated hypothesis for the target task using the N sources and target data, where S_{P_j} and S_Q denote the i.i.d. data {(x_{S_j}^(i), y_{S_j}^(i))}_{i=1}^{n_{S_j}} and {(x_T^(i), y_T^(i))}_{i=1}^{n_T} generated according to the source and target distributions P_j and Q for j = 1, ..., N. Fix transfer distances {Δ_j}_{j=1}^N with 0 ≤ Δ_j ≤ 1. Then for any ĥ there exists (P_1, ..., P_N, Q) with ρ(P_j, Q) ≤ Δ_j and a universal constant c such that

Prob_{S_{P_1},...,S_{P_N},S_Q}( E_T(ĥ) > c · ε(n_{S_1}, ..., n_{S_N}, n_T, d_H, Δ_1, ..., Δ_N) ) ≥ (3 − 2√2)/8,

where

ε(n_{S_1}, ..., n_{S_N}, n_T, d_H, Δ_1, ..., Δ_N) = sqrt( 1 / ( n_T/d_H + n_{S_1}/(d_H + n_{S_1} Δ_1) + ... + n_{S_N}/(d_H + n_{S_N} Δ_N) ) ).
(3.4)

This in turn implies that

inf_ĥ sup_{ρ(P_j, Q) ≤ Δ_j, j = 1,...,N} E_{S_{P_1},...,S_{P_N},S_Q}[ E_T(ĥ) ] ≥ c · ε(n_{S_1}, ..., n_{S_N}, n_T, d_H, Δ_1, ..., Δ_N).      (3.5)

Remark 3.8. Similar to the previous theorem, Theorem 3.2 provides a minimax lower bound on the target excess risk, with the key distinction that it now applies in the setting where there are multiple sources with different transfer distances to the target. This theorem characterizes the excess risk achievable by any algorithm as a function of these transfer distances as well as the number of samples from the different sources and the target. Theorem 3.2 indicates that the more sources we have, the better the performance we can achieve in the target domain. However, this performance gain may be marginal for source tasks that have a large transfer distance to the target or from which there are very few training data. In such cases it may of course be more computationally efficient to discard these sources, given the marginal improvement in generalization performance suggested by this theorem.

Remark 3.9. (Identical sources) If all the source tasks are identical, then there are effectively n_{S_1} + ... + n_{S_N} source samples, and by Theorem 3.1 the lower bound would be

sqrt( 1 / ( n_T/d_H + (Σ_{j=1}^N n_{S_j}) / (d_H + Δ Σ_{j=1}^N n_{S_j}) ) ).

Theorem 3.2 also gives the same order-wise lower bound, as

sqrt( 1 / ( n_T/d_H + Σ_{j=1}^N n_{S_j}/(d_H + Δ n_{S_j}) ) ) ≤ sqrt( 1 / ( n_T/d_H + (Σ_{j=1}^N n_{S_j}) / (d_H + Δ Σ_{j=1}^N n_{S_j}) ) ) ≤ √N · sqrt( 1 / ( n_T/d_H + Σ_{j=1}^N n_{S_j}/(d_H + Δ n_{S_j}) ) ).

Remark 3.10. (Infinitely many source samples) When Δ_i > 0 and n_{S_i} → ∞, the fraction n_{S_i}/(d_H + n_{S_i} Δ_i) saturates at 1/Δ_i, which shows that when the source and target have positive distance, the source can never fully compensate for the target samples.
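The multi-source rate of Eq. (3.4) and the saturation behavior of Remark 3.10 are easy to verify numerically. The sketch below is our own illustration of the formula.

```python
import numpy as np

def eps_multi(n_S_list, delta_list, n_T, d_H):
    """Multi-source epsilon of Eq. (3.4):
    sqrt( 1 / ( n_T/d_H + sum_j n_Sj / (d_H + n_Sj * Delta_j) ) )."""
    total = n_T / d_H
    for n_Sj, delta_j in zip(n_S_list, delta_list):
        total += n_Sj / (d_H + n_Sj * delta_j)
    return float(np.sqrt(1.0 / total))
```

Each source contributes its own additive term, so a close source (small Δ_j) tightens the bound more than a distant one of the same size, while a source at positive distance contributes at most 1/Δ_j no matter how many samples it supplies.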
Remark 3.11. In the lower bound, the product terms Δ_i n_{S_i} appear, which indicates that a source with a large transfer distance can sometimes be as useful as a source with a small transfer distance when a large amount of training data is available from that source.

3.4.2 Extension to Multiclass Classification

Theorems 3.1 and 3.2 cover only binary classification. In order to extend the results to multiclass classification, we first need a generalization of the VC dimension that captures the complexity of hypothesis classes of functions with multiple outputs. [83] introduces two such generalizations, namely the Natarajan dimension and the graph dimension. Here we utilize the former to extend our results to multiclass classification.

Definition 3.3. (Natarajan dimension) [[27], Definition 4] Let H ⊆ Y^X be a hypothesis class, where Y is a discrete space with |Y| = k. A subset S ⊆ X is N-shattered by H if there exist f_1, f_2 : S → Y such that for all u ∈ S, f_1(u) ≠ f_2(u), and for every T ⊆ S there is a g ∈ H such that

for all x ∈ T, g(x) = f_1(x), and for all x ∈ S \ T, g(x) = f_2(x).

The Natarajan dimension of H, denoted by d_N(H), is the maximal cardinality of a set that is N-shattered by H.

Note that the transfer distance of Definition 3.2 is compatible with the multiclass case. Equipped with the definition of the Natarajan dimension, we can now state the minimax lower bound for the multiclass setting.

Theorem 3.3. Consider the setting described in Theorems 3.1 and 3.2, with the exception that the source and target labels y_S and y_T are not necessarily binary, while the source and target still have the same number of classes. Moreover, H has Natarajan dimension d_N(H) obeying d_N(H) ≥ 10. Then for the single-source setting we have

inf_ĥ sup_{ρ(P,Q) ≤ Δ} E_{S_P, S_Q}[ E_T(ĥ) ] ≥ c · ε(n_S, n_T, d_N(H), Δ),      (3.6)

where

ε(n_S, n_T, d_N(H), Δ) = sqrt( 1 / ( n_T/d_N(H) + n_S/(d_N(H) + n_S Δ) ) ).
(3.7)

Furthermore, for the multiple-source setting we have

inf_ĥ sup_{ρ(P_j, Q) ≤ Δ_j, j = 1,...,N} E_{S_{P_1},...,S_{P_N},S_Q}[ E_T(ĥ) ] ≥ c · ε(n_{S_1}, ..., n_{S_N}, n_T, d_N(H), Δ_1, ..., Δ_N),      (3.8)

where

ε(n_{S_1}, ..., n_{S_N}, n_T, d_N(H), Δ_1, ..., Δ_N) = sqrt( 1 / ( n_T/d_N(H) + n_{S_1}/(d_N(H) + n_{S_1} Δ_1) + ... + n_{S_N}/(d_N(H) + n_{S_N} Δ_N) ) ).      (3.9)

Remark 3.12. Note that if |Y| = 2, the Natarajan dimension coincides with the VC dimension, and Theorem 3.3 reduces to Theorems 3.1 and 3.2.

Remark 3.13. (Connection to multiclass PAC learning) When there is no source, Theorem 3.3 recovers the agnostic multiclass PAC learning bound of [27].

3.5 Experimental Results

In this section we evaluate our theoretical results on real data sets for binary action recognition and image classification tasks. By estimating the parameters appearing in Theorem 3.1 for different pairs of tasks, we first plot the lower bounds and then, by running weighted empirical risk minimization, investigate the sharpness of the bounds. We also investigate the effectiveness of different source tasks, with different transfer distances, on the target generalization error.

3.5.1 Action Recognition

Experimental setup. We first perform experiments on the UCF101 action recognition data set. We pick CricketBowling and TableTennis videos from UCF101 as the target task, as well as three different pairs of classes as the source tasks: (1) CricketBowling and BaseballPitch, (2) Cricketshot and Archery, (3) BasketballDunk and Basketball. We pass the videos through an i3d network pretrained on Kinetics400 [18] with the fully connected top classifier removed and extract the corresponding features of dimension 2048 from the raw videos. We then work with the extracted features instead of the raw videos.

Training. We train a one-hidden layer neural network with 15 hidden units and ReLU activation functions for each pair of data sets. Table 3.1 reports the test accuracy on CricketBowling vs.
TableTennis, when using the network trained on each source task. We use these accuracies to derive the corresponding lower bounds. Furthermore, we run weighted empirical risk minimization as a simple transfer learning approach to find upper bounds on the target generalization error. Given n_S source samples and n_T target samples, we estimate the parameters of the one-hidden-layer neural network by minimizing the weighted empirical risk

min_{W_1, W_2}  (1−λ)/n_T · Σ_{i=1}^{n_T} Cost(W_2 ReLU(W_1 x_T^{(i)}), y_T^{(i)}) + λ/n_S · Σ_{i=1}^{n_S} Cost(W_2 ReLU(W_1 x_S^{(i)}), y_S^{(i)}),   (3.10)

where Cost denotes the logistic regression cost and λ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}. We then pick the λ that minimizes the target test error.

Results. First, we calculate the transfer distance of Definition 3.2 for each source/target pair using Table 3.1. To this end, we assume that the best target generalization error is zero, and from Table 3.1 we obtain the transfer distance for each pair, shown in Table 3.2. As can be observed from Table 3.2, the pair (Source1, Target) has the lowest transfer distance among the pairs, since the source and target tasks share a class, CricketBowling. Table 3.2 thus indicates which pairs are more suitable for transferring source knowledge to the target. Next, we draw the lower bound curves for each pair in Fig. 3.1. To this end, we need the VC dimension of the hypothesis class, which consists of neural networks with the architecture 2048-15-1 and ReLU activation functions. Theorem 1 in [7] lower bounds the VC dimension of neural networks with ReLU activations by (1/640)·W·L·log2(W/L), where W and L are the number of parameters and layers, respectively. Then, in Figure 3.2, we plot the lower bounds along with the upper bounds obtained via Formula (3.10) for the three pairs of source and target, as well as for learning from target samples only.
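The weighted objective (3.10) can be minimized with any standard optimizer. As an illustration, the following self-contained NumPy sketch trains a one-hidden-layer ReLU network on the weighted logistic loss by plain gradient descent; it is a minimal stand-in for the training loop used in our experiments, and the function name, step size, and iteration count are illustrative choices rather than the exact settings used here.

```python
import numpy as np

def weighted_erm(X_T, y_T, X_S, y_S, lam, hidden=15, lr=0.05, iters=300, seed=0):
    """Minimize (1-lam)/n_T * sum_i loss_T + lam/n_S * sum_i loss_S, as in (3.10),
    for a one-hidden-layer ReLU network with logistic loss and labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    W1 = 0.1 * rng.standard_normal((hidden, X_T.shape[1]))
    W2 = 0.1 * rng.standard_normal((1, hidden))
    loss = np.inf
    for _ in range(iters):
        gW1, gW2, loss = np.zeros_like(W1), np.zeros_like(W2), 0.0
        for X, y, w in ((X_T, y_T, (1 - lam) / len(y_T)),
                        (X_S, y_S, lam / len(y_S))):
            H = np.maximum(W1 @ X.T, 0.0)          # hidden activations, (h, n)
            z = (W2 @ H).ravel()                   # logits, (n,)
            loss += w * np.log1p(np.exp(-y * z)).sum()
            s = w * (-y / (1.0 + np.exp(y * z)))   # d(loss)/d(logit)
            gW2 += s[None, :] @ H.T
            gW1 += ((W2.T @ s[None, :]) * (H > 0)) @ X
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1, W2, loss
```

The weight λ can then be selected by sweeping it over a grid and keeping the value with the smallest target test error, exactly as described above.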
We obtained these upper bounds by running Formula (3.10) five times and averaging the results. Fig. 3.2 shows that when the distance of a source from the target is small, the source is more effective in achieving a small target generalization error. We note that in all of these plots we use the same number of source samples for each pair. Figure 3.7 shows the average λ, the weight appearing in Formula (3.10), when the number of target samples ranges from 100 to 150. For the pair (Source1, Target) the average λ is high, which demonstrates the usefulness of the source for the target task. Conversely, the small value of λ for the pair (Source3, Target) suggests that when the transfer distance is high, source samples are no longer useful.

3.5.2 Image Classification

Experimental setup. In this section we focus on image classification tasks and utilize Theorem 3.1 to recognize pairs of tasks that are suitable for transfer learning.

Task                                        | Target test accuracy using the source network
Target: CricketBowling vs. TableTennis      | –
Source1: CricketBowling vs. BaseballPitch   | 0.946
Source2: Cricketshot vs. Archery            | 0.61
Source3: BasketballDunk vs. Basketball      | 0.52
Table 3.1: Three pairs of source and target tasks for video recognition.

Pair of tasks       | ρ(Source, Target)
(Source1, Target)   | 0.053
(Source2, Target)   | 0.39
(Source3, Target)   | 0.48
Table 3.2: Transfer distance of source/target pairs on UCF101 action recognition.

Task                                        | Target test accuracy using the source network
Target: Clock vs. Ambulance (Clipart)       | –
Source1: Clock vs. Ambulance (Sketch)       | 0.916
Source2: Clock vs. Crow (Sketch)            | 0.697
Source3: Crow vs. Basket (Sketch)           | 0.65
Table 3.3: Three pairs of source and target tasks for image classification.

Pair of tasks       | ρ(Source, Target)
(Source1, Target)   | 0.083
(Source2, Target)   | 0.3
(Source3, Target)   | 0.35
Table 3.4: Transfer distance of source/target pairs on DomainNet image classification.

We choose some
classes of the DomainNet data set [88] as source and target tasks. We pick Clock and Ambulance from DomainNet Clipart for the target task, and three different pairs of classes from DomainNet Sketch as the source tasks: 1) Clock and Ambulance, 2) Clock and Crow, 3) Crow and Basket (see Table 3.3). Here we use a ResNet50 network pretrained on ImageNet to extract features of dimension 2048, and in the sequel we work with the extracted features rather than the raw image data.

[Figure 3.1: Theoretical lower bounds. Figure 3.2: Lower and upper bounds. Figure 3.3: (a) depicts our lower bounds for three pairs of source and target tasks on action classification; (b) depicts the lower bounds along with the upper bounds obtained via weighted empirical risk minimization.]

Training. We train a one-hidden-layer neural network with 15 hidden units and ReLU activation functions for each pair of tasks. Table 3.3 includes the test accuracy on the target task when using the networks trained on the different sources, which is needed for estimating the transfer distances reported in Table 3.4. As in Subsection 3.5.1, we also run weighted empirical risk minimization to find upper bounds for the pairs of source and target tasks.

[Figure 3.4: Theoretical lower bounds. Figure 3.5: Lower and upper bounds. Figure 3.6: (a) depicts our lower bounds for three pairs of source and target tasks on image classification; (b) depicts the lower bounds along with the upper bounds obtained via weighted empirical risk minimization.]

[Figure 3.7: Average λ in weighted empirical risk minimization for three different pairs of source and target tasks for action recognition.]

Results. Similar to the previous section on action recognition, using Table 3.3 we obtain the transfer distances, and based on these distances we can identify suitable pairs of source and target
[Figure 3.8: Average λ in weighted empirical risk minimization for three different pairs of source and target tasks for image classification.]

tasks for transfer learning. In pair 1, the source and target tasks share the same objects, Clock and Ambulance, which results in a low transfer distance. In pair 2, one of the objects, Clock, is still shared between the source and the target, and accordingly the transfer distance for pair 2 is lower than that for pair 3. We then plot the lower bounds in Fig. 3.4 and the corresponding upper bounds obtained by weighted empirical risk minimization in Fig. 3.5. One can see that sources that are closer to the target, according to our notion of distance, are more effective in achieving a small target generalization error.

In Fig. 3.8 we plot the average λ, the weight appearing in Formula (3.10), when the number of target samples varies from 150 to 200. Fig. 3.8 demonstrates that when a source is close to the target, the weight of the source risk in the weighted empirical risk becomes high, which shows the effectiveness of source samples in achieving a small target generalization error.

3.6 Proof Outline

The main idea of the proof is based on the following proposition, proved in [104].

Proposition 3.6 (Theorem 2.5 of [104]). Assume that M ≥ 2 and the function d(·,·) is a semi-distance.
Also suppose that {P_θ}_{θ∈Θ} is a family of distributions indexed by a parameter space Θ, and that Θ contains elements θ_0, θ_1, …, θ_M such that:

(i) d(θ_j, θ_k) ≥ 2s > 0 for all 0 ≤ j < k ≤ M;

(ii) P_j ≪ P_0 for all j = 1, …, M, and

(1/M) Σ_{j=1}^{M} D_kl(P_j ‖ P_0) ≤ α log M

with 0 < α < 1/8, where P_j = P_{θ_j} for j = 0, 1, …, M and D_kl denotes the KL divergence. Then

inf_θ̂ sup_{θ∈Θ} P_θ( d(θ̂, θ) ≥ s ) ≥ ( sqrt(M) / (1 + sqrt(M)) ) · ( 1 − 2α − sqrt( 2α / log M ) ).

Based on Proposition 3.6, we construct a family of pairs of source and target distributions whose transfer distances satisfy the Δ-constraint. To do so, we pick some points of the domain X that are shattered by the hypothesis class and define appropriate distributions on this set of points. Furthermore, this family of distributions is indexed by the space {−1, 1}^d, which becomes a metric space under the Hamming distance. To satisfy condition (i) of Proposition 3.6, the indices have to be well separated, which can be achieved using the well-known Gilbert-Varshamov bound. Finally, we show that estimating a parameter with small Hamming distance is equivalent to estimating an appropriate hypothesis with small excess risk.

Chapter 4

Near-Optimal Straggler Mitigation for Distributed Gradient Methods

4.1 Introduction

Gradient descent (GD) serves as a workhorse for modern inferential learning tasks spanning computer vision to recommendation engines. In these learning tasks one is interested in fitting models to a training data set of m training examples {x_j}_{j=1}^m (usually consisting of input-output pairs). The fitting problem often consists of finding a mapping that minimizes the empirical risk

L(w) := (1/m) Σ_{j=1}^m ℓ(x_j; w).

Here, ℓ(x_j; w) is a loss function measuring the misfit between the model and the output on x_j, with w denoting the model parameters. GD solves the above optimization problem via the iterative updates

w_{t+1} = w_t − μ_t ∇L(w_t) = w_t − (μ_t/m) Σ_{j=1}^m g_j(w_t).
(4.1)

Here, g_j(w_t) = ∇ℓ(x_j; w_t) is the partial gradient with respect to w_t computed from x_j, and μ_t is the learning rate in the t-th iteration.

[Figure 4.1: A master-worker distributed computing model for distributed gradient descent.]

To scale GD to massive amounts of training data, it is of significant importance to develop parallel/distributed implementations of gradient descent over multiple cores or GPUs on a single machine, or over multiple machines in computing clusters [90, 37, 122, 95]. In this chapter we consider a distributed computing model consisting of a master node and n workers, as depicted in Fig. 4.1. Each worker i stores and processes a subset of r_i training examples locally, generates a message z_i from the partial gradients it computes on its local training data, and sends this message to the master node. The master collects the messages from the workers, uses them to compute the total gradient, and updates the model via (4.1). If each worker processes a disjoint subset of the examples, the master needs to gather all partial gradients from all the workers. Therefore, when different workers compute and communicate at different speeds, the run-time of each iteration of distributed GD is limited by the slowest worker (or straggler). This phenomenon, known as the straggler effect, significantly delays the execution of distributed computing tasks when some workers compute or communicate much more slowly than others. For example, it was shown in [4] that over a wide range of production jobs, stragglers can prolong the completion time by 34% at the median.

We focus on straggler mitigation in the above distributed GD framework. To formulate the problem, we first define two key performance metrics that respectively characterize how much local processing is needed at each worker, and how many workers the master needs to wait for before it can compute the gradient.
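To make the update (4.1) and the master-worker division of labor concrete, here is a minimal NumPy simulation; the squared-error loss and the helper names (`partial_grad`, `distributed_gd_step`) are illustrative stand-ins, since the discussion applies to any differentiable loss.

```python
import numpy as np

def partial_grad(x, y, w):
    # Partial gradient g_j(w) for the squared loss l((x, y); w) = (x @ w - y)^2 / 2.
    return (x @ w - y) * x

def distributed_gd_step(batches, w, mu):
    # batches[i] is Worker i's local data (X_i, y_i); each worker returns the
    # sum of its partial gradients, and the master applies the update (4.1).
    m = sum(len(y) for _, y in batches)
    total = np.zeros_like(w)
    for X, y in batches:                         # one worker's contribution each
        total += sum(partial_grad(x, t, w) for x, t in zip(X, y))
    return w - (mu / m) * total

# Toy run: 4 workers hold disjoint shards of a linear-regression problem.
rng = np.random.default_rng(0)
X, w_star = rng.standard_normal((40, 3)), np.array([1.0, -2.0, 0.5])
y = X @ w_star
batches = [(X[i::4], y[i::4]) for i in range(4)]
w = np.zeros(3)
for _ in range(500):
    w = distributed_gd_step(batches, w, mu=0.3)
```

On this noiseless toy problem the iterates converge to w*; the point of the sketch is only the flow of partial gradients from workers to the master's update.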
In particular, we define the computational load, denoted by r, as the number of training examples each worker processes locally, and the average recovery threshold, denoted by K, as the average number of workers from whom the master collects results before it can recover the gradient. The average recovery threshold K decreases as the computational load r increases. For example, when r = m/n, so that each worker processes a disjoint subset of the examples, K attains its maximum of n. On the other hand, if each worker processes all m examples, i.e., r = m, the master only needs to wait for one of them to return its result, achieving the minimum K = 1. For an arbitrary computational load m/n ≤ r ≤ m, we aim to characterize the minimum average recovery threshold over all computing schemes, denoted by K*(r), which provides the maximum robustness to the straggler effect. Moreover, due to the high communication overhead of transferring the results to the master (especially for a high-dimensional model vector w), we are also interested in characterizing the minimum average communication load, denoted by L*(r), defined as the average (normalized) size of the messages received at the master before it can recover the gradient.

To reduce the effect of stragglers, in this chapter we propose a distributed computing scheme named "Batched Coupon's Collector" (BCC). We will show that this scheme achieves the average recovery threshold

K_BCC(r) = ⌈m/r⌉ · H_⌈m/r⌉ ≈ (m/r) log(m/r),   (4.2)

where H_n denotes the n-th harmonic number. We also prove a simple lower bound on the minimum average recovery threshold, demonstrating that K*(r) ≥ m/r. Thus, our proposed BCC scheme achieves the minimum average recovery threshold to within a logarithmic factor, that is,

K*(r) ≤ K_BCC(r) ≤ ⌈K*(r)⌉ · H_⌈m/r⌉ ≈ K*(r) log(m/r).
(4.3)

We will also demonstrate that the BCC scheme achieves the minimum average communication load to within a logarithmic factor, that is,

L*(r) ≤ L_BCC(r) ≤ ⌈L*(r)⌉ · H_⌈m/r⌉ ≈ L*(r) log(m/r).   (4.4)

The basic idea of the proposed BCC scheme is to obtain "coverage" of the computed partial gradients at the master. Specifically, we first partition the entire training dataset into m/r batches of size r, and then each worker independently and uniformly at random selects a batch to process. As a result, the process of collecting messages at the master emulates the coupon collecting process in the well-known coupon collector's problem (see, e.g., [93]), which requires collecting a total of m/r different types of coupons using n independent trials. Since the examples in different batches are disjoint, each worker can compress its computed partial gradients by simply summing them up, and send the sum to the master.

Beyond the theoretical analysis, we also implement the proposed BCC scheme on Amazon EC2 clusters, and empirically demonstrate its performance gain over state-of-the-art straggler mitigation schemes. In particular, we run a baseline uncoded scheme, in which the training examples are uniformly distributed across the workers without any redundant data placement, the cyclic repetition scheme of [102], designed to combat stragglers in the worst-case scenario, and the proposed BCC scheme, on clusters consisting of 50 and 100 worker nodes, respectively. We observe that the BCC scheme speeds up job execution by up to 85.4% compared with the uncoded scheme, and by up to 69.9% compared with the cyclic repetition scheme.

Finally, we generalize the BCC scheme to accelerate distributed GD in heterogeneous clusters, in which each worker may be assigned a different number of training examples according to its computation and communication capabilities.
In particular, we derive analytical lower and upper bounds on the minimum job execution time, by developing and analyzing a generalized BCC scheme for heterogeneous clusters. We have also numerically evaluated the performance of the proposed generalized BCC scheme. In particular, compared with a baseline strategy in which the dataset is distributed without repetition and the number of examples a worker processes is proportional to its processing speed, we numerically demonstrate a 29.28% reduction in average computation time.

Prior Art and Comparisons

For the aforementioned distributed GD problem, a simple data placement strategy is for each worker to select r of the m examples uniformly at random. Under this data placement, each worker processes each of the selected examples and communicates each computed partial gradient individually to the master. Following the arguments of the coupon collector's problem, this simple randomized computing scheme achieves an average recovery threshold

K_random ≈ (m/r) log m.   (4.5)

Similar to the proposed BCC scheme, this randomized scheme achieves the minimum average recovery threshold to within a logarithmic factor. However, since each worker communicates r times more messages than in the BCC scheme, in which each worker sends a single aggregated message, the average communication load increases to

L_random ≈ m log m.   (4.6)

Recently, a few interesting papers [102, 41, 89] utilize coding theory to mitigate the effect of stragglers in distributed GD. In particular, a cyclic repetition (CR) scheme was proposed in [102] to randomly generate a coding matrix, which specifies the data placement and how the computed partial gradients are encoded across workers for communication. Furthermore, in [41] and [89], the same performance was achieved using deterministic constructions of Reed-Solomon (RS) codes and cyclic MDS (CM) codes. These coding schemes can tolerate r − 1 stragglers in the worst case when the computational load is r.
More specifically, when the number of examples equals the number of workers (m = n)*, the above coded schemes achieve the recovery threshold

K_CR = K_RS = K_CM = m − r + 1.   (4.7)

In all of these coded schemes, each worker encodes its computed partial gradients into a linear combination and communicates this single coded message to the master. This yields a communication load

L_CR = L_RS = L_CM = m − r + 1.   (4.8)

* When m > n, we can partition the dataset into n groups, and view each group of m/n training examples as a "super example".

[Figure 4.2: The tradeoff between the computational load r and the average recovery threshold K, for distributed GD using m = 100 training examples across n = 100 workers, comparing the lower bound, the proposed BCC scheme, the simple randomized scheme, and the CR scheme.]

While the above simple randomized scheme and the coding theory-inspired schemes are effective in reducing the recovery threshold and the communication load, respectively, the proposed BCC scheme achieves the best of both. In Fig. 4.2, we numerically compare the average recovery threshold of the randomized scheme, the CR scheme of [102], and the proposed BCC scheme, and demonstrate the performance gain of BCC. To summarize, the proposed BCC scheme has the following advantages:

• Simplicity: Unlike computing schemes that rely on delicate code designs for data placement and communication, the BCC scheme is simple to implement and has little coding overhead.

• Reliability: The BCC scheme simultaneously achieves a near-minimal average recovery threshold and average communication load, enabling good straggler mitigation and fast job execution.

• Universality: In contrast to coding theory-inspired schemes like CR, the proposed BCC scheme does not require any prior knowledge about the number of stragglers in the cluster, which may not be available or may vary across iterations.
• Scalability: The data placement in the BCC scheme is performed in a completely decentralized manner. This allows the BCC scheme to seamlessly scale up to larger clusters with minimal overhead for reshuffling the data.

Finally, we highlight some recent developments in utilizing coding theory to speed up a broad class of distributed computing tasks. Coding theory was introduced into the design of distributed computing systems as an effective tool to alleviate the communication bottlenecks caused by data shuffling. In [68, 63], for a general MapReduce framework implemented on a distributed computing cluster, an optimal tradeoff between the local computation on individual workers and the communication between workers was characterized, exploiting coded multicasting opportunities created by carefully designing redundant computations across workers. This tradeoff was empirically demonstrated to provide significant speedups for distributed computing benchmarks like TeraSort [69]. Since the development of this optimal tradeoff, the idea of coding across redundant computations has been extended to the scenarios of mobile edge computing [65], asymmetric computation tasks [34], and heterogeneous computing clusters [53]. On the other hand, in [57, 32, 117, 50], error-correcting codes were applied to combat the straggler effect and speed up distributed linear algebra operations (e.g., matrix multiplications) and distributed optimization problems. For example, maximum distance separable (MDS) codes were utilized to generate redundant coded computing tasks, providing robustness to missing results from stragglers. The coded computing schemes proposed in [63] and [57] were further generalized in [67, 62], where it was shown that the solutions of [63] and [57] are two end operating points on a more general tradeoff between computation latency and communication overhead.

4.2 Problem Formulation

We focus on a data-distributed implementation of the gradient descent updates in (4.1).
In particular, as shown in Fig. 4.1 of Section 4.1, we employ a distributed computing system that consists of a master node and n worker nodes (denoted Worker 1, Worker 2, …, Worker n). Worker i stores and processes locally a subset of r_i ≤ m training examples. We use G_i ⊆ {1, …, m} to denote the set of indices of the examples processed by Worker i. In the t-th iteration, Worker i computes a partial gradient g_j(w_t) with respect to the current weight vector w_t for each j ∈ G_i. Ideally, we would like the workers to process as few examples as possible. This leads us to the following definition for characterizing the computational load of distributed GD schemes.

Definition 4.1 (Computational Load). We define the computational load, denoted by r, as the maximum number of training examples processed by a single worker across the cluster, i.e., r := max_{i=1,…,n} r_i.

The assignment of the training examples to the workers, or the data distribution, can be represented by a bipartite graph G that contains a set of data vertices {d_1, d_2, …, d_m} and a set of worker vertices {k_1, k_2, …, k_n}. There is an edge connecting d_j and k_i if Worker i computes g_j locally, or in other words, if j belongs to G_i. Since each data point needs to be processed by some worker, we require that N(k_1) ∪ … ∪ N(k_n) = {d_1, …, d_m}, where N(k_i) denotes the neighboring set of k_i. After Worker i, i = 1, …, n, finishes its local computations, it communicates a function of its local computation results to the master node. More specifically, as shown in Fig. 4.1, Worker i communicates to the master a message z_i of the form

z_i = φ_i({g_j : j ∈ G_i}),   (4.9)

via an encoding function φ_i.

Let W ⊆ {1, …, n} denote the index set of the workers whose messages are received at the master. After receiving these messages, the master node calculates the complete gradient (based on all training data) by using a decoding function ψ.
More specifically,

ψ({z_i : i ∈ W}) = (1/m) Σ_{j=1}^m g_j(w_t).   (4.10)

In order for the master to be able to calculate the complete gradient from the received messages, it needs to wait for a sufficient number of workers. We quantify this and a related parameter more precisely below.

Definition 4.2 (Average Recovery Threshold). The average recovery threshold, denoted by K, is the average number of workers from whom the master collects messages before recovering the final gradient, i.e., K := E[|W|].

Definition 4.3 (Average Communication Load). We define the average communication load, denoted by L, as the average aggregate size of the messages the master receives from the workers with indices in W (i.e., {z_i : i ∈ W}), normalized by the size of a partial gradient computed from a single example.

We say that a pair (r, K) is achievable if, for a computational load r, there exists a distributed computing scheme such that the master recovers the gradient after receiving messages from, on average, K or fewer workers.

Definition 4.4 (Minimum Average Recovery Threshold). We define the minimum average recovery threshold, denoted by K*(r), as

K*(r) := min{K : (r, K) is achievable}.   (4.11)

We also define the minimum average communication load, denoted by L*(r), in a similar manner.

In the next section, we propose and analyze a computing scheme for distributed GD over a homogeneous cluster, and show that it simultaneously achieves a near-optimal average recovery threshold and communication load (up to a logarithmic factor).

4.3 The Batched Coupon's Collector (BCC) Scheme

In this section, we consider homogeneous workers with identical computation and communication capabilities. As a result, each worker processes the same number of training examples, and we have r_1 = r_2 = ··· = r_n = r. We note that in this case, for the entire dataset to be stored and processed across the cluster, we must have m/r ≤ n.
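A minimal simulation of the encoding (4.9) and decoding (4.10) for batch-structured placements, i.e., placements where the index sets of any two workers are either identical or disjoint, as in the scheme of the next subsection. The helper names are ours, and the decoder below implements exactly the "keep one message per distinct batch" rule.

```python
import numpy as np

def encode(grads, G_i):
    # z_i = phi_i({g_j : j in G_i}): here, simply the sum of the local partial gradients.
    return sum(grads[j] for j in G_i)

def decode(messages, placement, W, m):
    # psi({z_i : i in W}): keep one message per distinct batch; valid only when
    # any two workers' index sets are either identical or disjoint.
    kept, covered = [], set()
    for i in W:
        b = set(placement[i])
        if b.isdisjoint(covered):
            kept.append(messages[i])
            covered |= b
    assert covered == set(range(m)), "workers in W do not cover all examples"
    return sum(kept) / m

m = 6
grads = [np.array([float(j), 1.0]) for j in range(m)]        # toy partial gradients
placement = {0: [0, 1], 1: [2, 3], 2: [2, 3], 3: [4, 5]}     # G_i for 4 workers
messages = {i: encode(grads, G) for i, G in placement.items()}
full = decode(messages, placement, W=[0, 1, 2, 3], m=m)
```

Worker 2 duplicates Worker 1's batch, so its message is discarded, and `full` equals the average of all m partial gradients, as required by (4.10).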
For this setting, we propose the following scheme, which we refer to as the "batched coupon's collector" (BCC).

4.3.1 Description of BCC

The key idea of the proposed BCC scheme is to obtain "coverage" of the computed partial gradients at the master. As the name suggests, BCC is composed of two steps: "batching" and "coupon collecting". In the first step, the training examples are partitioned into batches, which are selected randomly by the workers for local processing. In the second step, the processing results from the data batches are collected at the master, emulating the process of the well-known coupon collector's problem. Next, we describe the proposed BCC scheme in detail.

Data Distribution. For a given computational load r, as illustrated in Fig. 4.3, we first evenly partition the entire data set into ⌈m/r⌉ data batches, and denote the index sets of the examples in these batches by B_1, B_2, …, B_⌈m/r⌉. Each batch contains r examples (with the last batch possibly being zero-padded). Each worker node independently picks one of the data batches uniformly at random for local processing. The probability that one or more batches are not picked by any worker vanishes as the number of workers increases; that is, with high probability, the entire dataset is stored across the worker nodes. We denote the index set of the data points selected by Worker i by B_{σ_i}, i.e., G_i = B_{σ_i}.

[Figure 4.3: The data distribution of the proposed BCC scheme. The training dataset is evenly partitioned into m/r batches of size r, from which each worker independently selects one uniformly at random.]

Communication. After computing the partial gradient g_j for all j ∈ B_{σ_i}, Worker i computes the single message

z_i = Σ_{j ∈ B_{σ_i}} g_j,   (4.12)

and sends z_i to the master.

Data Aggregation at the Master.
When the master node receives a message from a worker, it discards the message if it has already received the result of processing the same batch, and keeps the message otherwise. The master keeps collecting messages until the processing results from all data batches have been received. Finally, the master reduces the kept messages to the final result by simply computing their sum.

We would like to note that the above BCC scheme is fully decentralized and coordination-free. Each worker selects its data batch independently of the other workers, and performs its local computation and communication in a completely asynchronous manner. There is no need for any feedback from the master to the workers or between the workers. All of these features make the scheme very simple to implement in practical scenarios.

4.3.2 Near-Optimal Performance Guarantees for BCC

In this subsection, we theoretically analyze the BCC scheme, whose performance provides upper bounds on the minimum average recovery threshold and the minimum average communication load of the distributed GD problem. We state the main result of this chapter in the following theorem, which characterizes the minimum average recovery threshold and the minimum average communication load to within a logarithmic factor.

Theorem 4.1. For a distributed gradient descent problem of training on m data examples distributed over n worker nodes, we have

m/r ≤ K*(r) ≤ K_BCC(r) = ⌈m/r⌉ · H_⌈m/r⌉,   (4.13)

m/r ≤ L*(r) ≤ L_BCC(r) = ⌈m/r⌉ · H_⌈m/r⌉,   (4.14)

for sufficiently large n, where K*(r) and L*(r) are the minimum average recovery threshold and the minimum average communication load, respectively, K_BCC(r) and L_BCC(r) are the average recovery threshold and the average communication load achieved by the BCC scheme, and H_t = Σ_{k=1}^t 1/k is the t-th harmonic number.
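The bounds of Theorem 4.1 are easy to tabulate; the following sketch (function names ours) computes K_BCC(r) exactly and checks it against the m/r lower bound and the (m/r) log(m/r) approximation.

```python
from math import ceil, log

def harmonic(t):
    # H_t = sum_{k=1}^t 1/k
    return sum(1.0 / k for k in range(1, t + 1))

def K_BCC(m, r):
    # Average recovery threshold of BCC, the upper bound in (4.13):
    # ceil(m/r) * H_{ceil(m/r)}.
    N = ceil(m / r)
    return N * harmonic(N)

m = 100
for r in (1, 2, 5, 10, 20):
    assert m / r <= K_BCC(m, r)          # lower bound of Theorem 4.1
    print(r, round(K_BCC(m, r), 1), round((m / r) * log(m / r), 1))
```

The printed columns (exact value vs. the (m/r) log(m/r) approximation) track each other to within the logarithmic slack discussed above.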
Before proving Theorem 4.1, we first provide the following remarks, which discuss some implications of this result and compare the performance of BCC with other straggler mitigation strategies.

Remark 4.1. Given that H_⌈m/r⌉ ≈ log(⌈m/r⌉), the results of Theorem 4.1 imply that, in the homogeneous setting, the proposed BCC scheme simultaneously achieves a near-minimal average recovery threshold and communication load (up to a logarithmic factor). □

Remark 4.2. As mentioned before, other coding-based approaches [102, 41, 89] mostly focus on the worst-case scenario, resulting in a high recovery threshold, e.g., K_CR = m − r + 1.† In contrast, instead of focusing on worst-case scenarios, our proposed scheme aims at achieving "coverage" of the partial computation results at the master by collecting the computations of a much smaller number of workers (on average). As numerically demonstrated in Fig. 4.2 in Section 4.1, the BCC scheme brings the average recovery threshold down from m − r + 1 to roughly (m/r) log(m/r). □

Remark 4.3. In the coded computing schemes proposed in [102, 41, 89], a linear combination of the locally computed partial gradients is carefully designed at each worker, such that the final gradient can be recovered at the master with messages of minimum size communicated by the workers. In the BCC scheme, each worker also communicates a message of minimum size, created by summing up its local partial gradients. As a result, on average, BCC achieves a much smaller recovery threshold and hence can substantially reduce the total amount of network traffic. This is especially true when the dimension of the gradient is large, leading to significant speedups in job execution. □

Remark 4.4. The coded schemes in [102, 41, 89] are designed to make the system robust to a fixed number of stragglers.
Specifically, for a cluster with s stragglers, a code can be designed such that the master can proceed after receiving m − s messages, no matter which s workers are slow. However, it is often difficult to predict the number of stragglers in a cluster, and it can change across iterations of the GD algorithm, which makes the optimal selection of this parameter for the coding schemes in [102, 41, 89] practically challenging. In contrast, our proposed BCC scheme is universal, i.e., it does not require any prior knowledge about the stragglers in the cluster, and still promises near-optimal straggler mitigation. □

† This assumes m = n. We would like to point out that although it is designed for the worst case, the fractional scheme proposed in [102] may finish when the master collects results from fewer than m − r + 1 workers. However, it only applies to the case where r divides m.

Proof of Theorem 4.1. The lower bound m/r in (4.13) and (4.14) is straightforward: it corresponds to the best-case scenario in which all workers the master hears from before completing the task hold mutually disjoint training examples. The upper bound in (4.13) and (4.14) is simultaneously achieved by the BCC scheme described above. To see this, we view the process of collecting messages at the master node as the classic coupon collector's problem (see, e.g., [93]), in which, given a collection of N types of coupons, we draw one coupon at a time uniformly at random with replacement until we have collected all types. In our case, there are ⌈m/r⌉ batches of training examples, from which each worker independently selects one uniformly at random to process, so the process of collecting messages at the master is equivalent to collecting coupons of N = ⌈m/r⌉ types. Since the expected number of draws needed to collect all N types of coupons is N·H_N, setting N = ⌈m/r⌉ yields the upper bound on the minimum average recovery threshold.
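The coupon-collector argument in the proof can be checked by simulation. The Monte Carlo sketch below (names and trial count ours) compares the empirical average number of workers needed for coverage against N·H_N.

```python
import random

def draws_to_cover(N, rng):
    # One coupon-collector run: draw batches uniformly until all N are seen.
    seen, draws = set(), 0
    while len(seen) < N:
        seen.add(rng.randrange(N))
        draws += 1
    return draws

def avg_recovery_threshold(N, trials=20000, seed=0):
    # Empirical estimate of the expected number of draws (workers) for coverage.
    rng = random.Random(seed)
    return sum(draws_to_cover(N, rng) for _ in range(trials)) / trials

N = 10                                      # plays the role of ceil(m/r)
H_N = sum(1.0 / k for k in range(1, N + 1))
print(avg_recovery_threshold(N), N * H_N)   # empirical mean vs. N * H_N
```

With N = 10, both numbers land near 29.3, matching the N·H_N expectation used in the proof.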
To characterize the average communication load of the BCC scheme, we first note that since each worker communicates the summation of its computed partial gradients, the message size from each worker is the same as the size of the gradient computed from a single example. As a result, a communication load of 1 is accumulated from each surviving worker, and the BCC scheme achieves an average communication load that is the same as the achieved average recovery threshold.

Beyond the theoretical analysis, we also implement the proposed BCC scheme for distributed GD over Amazon EC2 clusters. In the next section, we describe the implementation details, and compare its empirical performance with two baseline schemes.

4.3.3 Empirical Evaluations of BCC

In this subsection, we present the results of experiments performed over Amazon EC2 clusters. In particular, we compare the performance of our proposed BCC scheme with the following two schemes.

• uncoded scheme: In this case, there is no repetition of data among the workers, and the master has to wait for all the workers to finish their computations.

• cyclic repetition scheme of [102]: In this case, each worker processes r training examples, and in every iteration, the master waits for the fastest m − r + 1 workers to finish their computations.

4.3.3.1 Experimental Setup

We train a logistic regression model using Nesterov's accelerated gradient method, and compare the performance of the BCC, the uncoded, and the cyclic repetition schemes on this task. We use Python as our programming language and MPI4py [25] for message passing across EC2 instances. In our implementation, we load the assigned training examples onto the workers before the algorithms start. We measure the total running time via time.time(), by subtracting the starting time of the iterations from the completion time at the master.
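For reference, one iteration of this training task (Nesterov's accelerated gradient on the logistic loss) can be sketched in serial form as follows; the step size and momentum coefficient are illustrative choices, not the experimental values, and in the actual implementation the gradient is assembled at the master from worker messages rather than computed in one place:

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the logistic loss sum_i log(1 + exp(-y_i * x_i^T w)) at w."""
    z = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(z))))

def nesterov_step(w, w_prev, X, y, lr=0.01, beta=0.9):
    """One Nesterov update: the gradient is evaluated at the lookahead point."""
    v = w + beta * (w - w_prev)          # momentum lookahead
    return v - lr * logistic_grad(v, X, y), w

# Toy serial usage on synthetic separable data.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = rng.choice([-1.0, 1.0], size=10)
y = np.where(X @ w_true > 0, 1.0, -1.0)
w = w_prev = np.zeros(10)
for _ in range(100):
    w, w_prev = nesterov_step(w, w_prev, X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```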
In the t-th iteration, the master communicates the latest model w_t to all the workers using Isend(), and each worker receives the updated model using Irecv(). In the cyclic repetition scheme, each worker sends the master a linear combination of the computed partial gradients, whose coefficients are specified by the coding scheme in [102]. In the BCC and uncoded schemes, the workers simply send the summation of the partial gradients back to the master. When the master receives enough messages from the workers, it computes the overall gradient and updates the model.

Data Generation. We generate artificial data using a model similar to that of [102]. Specifically, we create a dataset consisting of d input-output pairs of the form D = {(x_1, y_1), (x_2, y_2), ..., (x_d, y_d)}, where the input vector x_i ∈ R^p contains p features, and the output y_i ∈ {−1, 1} is the corresponding label. In our experiments we set p = 8000. To create the dataset, we first generate the true weight vector w* whose coordinates are chosen randomly from {−1, 1}. Then, we generate each input vector according to x ∼ 0.5 × N(µ_1, I) + 0.5 × N(µ_2, I), where µ_1 = (1.5/p) w* and µ_2 = (−1.5/p) w*, and its corresponding output label according to y ∼ Ber(κ), with κ = 1/(exp(x^T w*) + 1).

We run Nesterov's accelerated gradient descent distributedly for 100 iterations, using the aforementioned three schemes. We compare their performance in the following two scenarios:

• scenario one: We use 51 t2.micro instances, with one master and n = 50 workers. We have m = 50 data batches, each of which contains 100 data points generated according to the aforementioned model.

• scenario two: We use 101 t2.micro instances, with one master and n = 100 workers. We have m = 100 data batches, each of which contains 100 data points.

4.3.3.2 Results

For the uncoded scheme, each worker processes r = m/n data batches.
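Stepping back to the data-generation model described above, it can be sketched at reduced scale as follows (d and p are shrunk from the experimental values; mapping Ber(κ) to the labels so that y = 1 is drawn with probability κ is our assumption about the convention):

```python
import numpy as np

def generate_dataset(d, p, rng):
    """Generate d samples from the Gaussian-mixture model described above."""
    w_star = rng.choice([-1.0, 1.0], size=p)            # true weights in {-1, 1}
    mu1, mu2 = (1.5 / p) * w_star, (-1.5 / p) * w_star
    # Each x is drawn from 0.5 * N(mu1, I) + 0.5 * N(mu2, I).
    component = rng.integers(0, 2, size=d)
    means = np.where(component[:, None] == 0, mu1, mu2)
    X = means + rng.standard_normal((d, p))
    # Label y ~ Ber(kappa), kappa = 1 / (exp(x^T w*) + 1), mapped to {-1, +1}.
    kappa = 1.0 / (np.exp(X @ w_star) + 1.0)
    y = np.where(rng.random(d) < kappa, 1.0, -1.0)
    return X, y, w_star

rng = np.random.default_rng(0)
X, y, w_star = generate_dataset(d=1000, p=80, rng=rng)
print(X.shape, y.shape, set(np.unique(y)))
```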
For the cyclic repetition and the BCC schemes, we vary the value of the computational load r to find the one that achieves the minimum total running time, subject to the memory constraints on the EC2 instances, and report the minimum running times for these two schemes.

We plot the total running times of the three schemes in both scenarios in Fig. 4.4. We also list the breakdowns of the running times for scenario one in Table 4.1 and scenario two in Table 4.2, respectively. Within each iteration, we measure the computation time as the maximum computation time among the workers whose results are received by the master before the iteration ends. After the last iteration, we add the computation times in all iterations to reach the total computation time.

Figure 4.4: Running time comparison of the uncoded, the cyclic repetition, and the BCC schemes on Amazon EC2. In scenario one, we have n = 50 workers and m = 50 data batches. In scenario two, we have n = 100 workers and m = 100 data batches. Each data batch contains 100 data points. In both scenarios, the computational loads of the cyclic repetition scheme and the BCC scheme are chosen to minimize the total running times.

The communication time is computed as the difference between the total running time and the computation time.‡ We draw the following conclusions from these results.

• As we observe in Fig. 4.4, in scenario one, the BCC scheme speeds up the job execution by 85.4% over the uncoded scheme, and by 69.9% over the cyclic repetition scheme. In scenario two, the BCC scheme speeds up the job execution by 73.0% over the uncoded scheme, and by 69.7% over the cyclic repetition scheme. In scenario one, we observe the master waiting for on average 11 workers to finish their computations, compared with 41 workers for the cyclic repetition scheme and all 50 workers for the uncoded scheme.
In scenario two, we observe the master waiting for on average 25 workers to finish their computations, compared with 91 workers for the cyclic repetition scheme and all 100 workers for the uncoded scheme.

‡ Due to the asynchronous nature of the distributed GD, we cannot exactly characterize the time spent on computation and communication (e.g., often both are happening at the same time). The numbers listed in Tables 4.1 and 4.2 provide approximations of the time breakdowns.

scheme              recovery threshold   communication time (sec.)   computation time (sec.)   total running time (sec.)
uncoded             50                   28.556                      0.230                     28.786
cyclic repetition   41                   12.031                      1.959                     13.990
BCC                 11                   3.043                       1.162                     4.205

Table 4.1: Breakdowns of the running times of the uncoded, the cyclic repetition, and the BCC schemes in scenario one.

scheme              recovery threshold   communication time (sec.)   computation time (sec.)   total running time (sec.)
uncoded             100                  31.567                      1.453                     33.020
cyclic repetition   91                   24.698                      4.784                     29.482
BCC                 25                   7.246                       1.685                     8.931

Table 4.2: Breakdowns of the running times of the uncoded, the cyclic repetition, and the BCC schemes in scenario two.

• As we note in Fig. 4.4, the performance gains of both the cyclic repetition and BCC schemes over the uncoded scheme become smaller as the number of workers increases. This is because, as the number of workers increases, in order to optimize the total running time, we need to also increase the computational load r at each worker to maintain a low recovery threshold. However, due to the memory constraints of the EC2 instances, we cannot increase r beyond the value 10 to fully optimize the run-time performance.

• We observe from Table 4.1 and Table 4.2 that having a smaller recovery threshold benefits both the computation time and the communication time. While the BCC scheme and the cyclic repetition scheme have the same computational load at each worker, the computation time of BCC is much shorter since it needs to wait for a smaller number of workers to finish.
On the other hand, the lower recovery threshold of BCC yields a lower communication load, which is directly proportional to the communication time. As a result, since in all experiments the communication time dominates the computation time, the total running time of each scheme is approximately proportional to its recovery threshold.

4.4 Extension to Heterogeneous Clusters

For distributed GD in heterogeneous clusters, workers have different computational and communication capabilities. In this case, the BCC scheme proposed above is in general sub-optimal due to its obliviousness to network heterogeneity. In this section, we extend the BCC scheme to tackle distributed GD over heterogeneous clusters. We also theoretically demonstrate that the extended BCC scheme provides an approximate characterization of the minimum job execution time.

4.4.1 System Model

In the heterogeneous setting, we consider an uncoded communication scheme where, after processing the local training examples, each worker communicates each of its locally computed partial gradients separately to the master. That is, Worker i, i = 1,...,n, communicates z_i = {g_j : j ∈ G_i} to the master. Under this communication scheme, the master computes the final gradient as soon as it collects the partial gradients computed from all examples. When this occurs, we say that coverage is achieved at the master node.

We assume that the times required for the workers to process the local examples and deliver the partial gradients are independent from each other. We assume that this time interval, denoted by T_i for Worker i, is a random variable with a shifted-exponential distribution, i.e.,

Pr[T_i ≤ t] = 1 − exp(−(µ_i/r_i)(t − a_i r_i)),  t ≥ a_i r_i.  (4.15)

Here, µ_i ≥ 0 and a_i ≥ 0 are the fixed straggling and shift parameters of Worker i. In this case, the total job execution time, or the time to achieve coverage at the master, is given by

T := min{t : ∪_{i: T_i ≤ t} G_i = {1,...,m}}.
(4.16)

We are interested in characterizing the minimum average execution time in a heterogeneous cluster, which can be formulated as the following optimization problem:

P_1:  minimize_G  E[T].  (4.17)

In the rest of this section, we develop lower and upper bounds on the optimal value of P_1.

4.4.2 Lower and Upper Bounds on the Optimal Value of P_1

To start, we first define the waiting time for the master to receive at least s partial gradients (possibly with repetitions):

T̂(s) := min{t : Σ_{i: T_i ≤ t} r_i ≥ s}.  (4.18)

We also consider the following optimization problem:

P_2:  minimize_{r_1,...,r_n}  E[T̂(s)].  (4.19)

For the master to collect all m partial gradients, one computed from each training example, for any dataset placement, it has to receive at least s ≥ m partial gradients (possibly with repetitions) from the workers. Therefore, it is obvious that the coverage time T cannot be shorter than T̂(m), and the optimal value min_{r_1,...,r_n} E[T̂(m)] provides a lower bound on the optimal value of the coverage problem P_1.

For the above optimization problem P_2, an algorithm is developed in [91] for distributed matrix multiplication on heterogeneous clusters. This algorithm obtains computation loads r_1,...,r_n that are asymptotically optimal in the large-n limit. Therefore, utilizing the results in [91], we can obtain a good estimate of the optimal value min_{r_1,...,r_n} E[T̂(s)].

It is intuitive that once we fix the workloads at the workers, i.e., (r_1, r_2,...,r_n), the time T̂(s) for the master to receive s results should increase as s increases. We formally state this phenomenon in the following lemma.

Lemma 4.1 (Monotonicity). Consider an arbitrary dataset placement G where Worker i processes |G_i| = r_i training examples. For any 0 ≤ s_1, s_2 ≤ Σ_{i=1}^n r_i such that s_1 ≤ s_2, we have

E_G[T̂(s_1)] ≤ E_G[T̂(s_2)].  (4.20)

Proof.
For a fixed dataset placement G, we consider a particular realization of the computation times across the n workers, denoted by δ = (t_1, t_2,...,t_n), where t_i is the realization of T_i for Worker i to process r_i data points. We denote the realization of T̂(s) under δ as t̂_δ(s). Obviously, for s_1 ≤ s_2, we have t̂_δ(s_1) ≤ t̂_δ(s_2). Since this is true for all realizations δ, we have E_G[T̂(s_1)] ≤ E_G[T̂(s_2)].

To tackle the distributed GD problem over heterogeneous clusters, we generalize the above BCC scheme, and characterize the completion time of the generalized scheme using the optimal value of the above problem P_2. The characterized completion time serves as an upper bound on the minimum average coverage time. We state this result in the following theorem.

Theorem 4.2. For a distributed gradient descent problem of training on m data examples distributedly over n heterogeneous worker nodes, where the computation and communication time at Worker i has an exponential tail with a straggling parameter µ_i and a shift parameter a_i, the minimum average time to achieve coverage is bounded as

min_G E[T] ≥ min_{r_1,...,r_n} E[T̂(m)],  (4.21)
min_G E[T] ≤ min_{r_1,...,r_n} E[T̂(⌊cm log m⌋)] + 1,  (4.22)

where c = 2 + log(a + H_n/µ)/log m, a = max(a_1,...,a_n), and µ = min(µ_1,...,µ_n).

The proof of Theorem 4.2 is deferred to the appendix.

Remark 4.5. The above theorem, when combined with the results in [91] on evaluating min_{r_1,...,r_n} E[T̂(s)], allows us to obtain a good estimate of the average minimum coverage time. Specifically, we can apply the results in [91] to evaluate the lower and upper bounds in Theorem 4.2 for s = m and s = ⌊cm log m⌋, respectively. □

Remark 4.6. The upper bound on the average coverage time is achieved by a generalized BCC scheme, in which, given the optimal data assignments (r*_1,...,r*_n) for P_2 with s = ⌊cm log m⌋, Worker i independently selects r*_i examples uniformly at random.
We emphasize that, similar to the BCC data distribution policy in the homogeneous setting, the main advantage of the generalized BCC lies in its simplicity and decentralized nature. That is, each node selects its training examples randomly and independently from the other nodes, and we do not need to enforce a global plan for the data distribution. This also provides a scalable design: when a new worker is added to the cluster, according to the updated dataset assignments computed from P_2 with n + 1 workers and s = ⌊cm log m⌋, each worker can individually adjust its workload by randomly adding or dropping some training examples, without needing to coordinate with the master or other workers. □

4.4.3 Numerical Results

We numerically evaluate the performance of the generalized BCC scheme in heterogeneous clusters, using the proposed random data assignment. In this case, we compute the optimal assignment (r*_1,...,r*_n) to minimize the average time for the master to collect ⌊m log m⌋ partial gradients. In comparison, we also consider a "load balancing" (LB) assignment strategy where the m data points are distributed across the cluster based on the workers' processing speeds, i.e., r_i = (µ_i / Σ_j µ_j) m.

We consider the computation task of processing m = 500 examples over a heterogeneous cluster of n = 100 workers. All workers have the same shift parameter a_i = 20, for all i = 1,...,n. The straggling parameter µ_i = 1 for 95 workers, and µ_i = 20 for the remaining 5 workers. As shown in Fig. 4.5, the computation time of the LB assignment is long since the master needs to wait for every worker to finish. However, utilizing the proposed random assignment, the master can terminate the computation once it has achieved coverage, which significantly alleviates the straggler effect. As a result, the generalized BCC scheme reduces the average computation time by 29.28% compared with the LB scheme.
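A Monte Carlo sketch of this comparison, using the shifted-exponential model (4.15) and the parameters above, is given below. Instead of the optimal assignment from P_2, it uses a simple heuristic of ours in which each worker's load is set by a common deadline so that the expected total number of random selections is about m log m; the resulting numbers therefore differ from the reported 29.28%, but the qualitative gap between the two policies is visible:

```python
import math
import random

def sample_time(mu, a, r, rng):
    """Completion time from model (4.15): shift a*r, exponential rate mu/r."""
    return a * r + rng.expovariate(mu / r)

def lb_time(mu_list, a, m, rng):
    """Load balancing: r_i proportional to mu_i (rounded); wait for everyone."""
    total_mu = sum(mu_list)
    return max(sample_time(mu, a, max(1, round(m * mu / total_mu)), rng)
               for mu in mu_list)

def bcc_time(mu_list, a, m, rng):
    """Generalized BCC with heuristic loads: a common deadline is chosen so the
    expected total number of random selections is about m*log(m); the master
    stops as soon as all m examples are covered."""
    t_star = m * math.log(m) / sum(1 / (a + 1 / mu) for mu in mu_list)
    loads = [min(m, max(1, round(t_star / (a + 1 / mu)))) for mu in mu_list]
    finish = sorted((sample_time(mu, a, r, rng), rng.sample(range(m), r))
                    for mu, r in zip(mu_list, loads))
    covered = set()
    for t, picks in finish:
        covered.update(picks)
        if len(covered) == m:
            return t
    # With these heuristic loads a few examples occasionally stay uncovered;
    # report the last worker's finish time as an approximation in that case.
    return finish[-1][0]

rng = random.Random(1)
m, a = 500, 20
mu_list = [1.0] * 95 + [20.0] * 5     # 95 slow workers, 5 fast workers
trials = 200
lb = sum(lb_time(mu_list, a, m, rng) for _ in range(trials)) / trials
bcc = sum(bcc_time(mu_list, a, m, rng) for _ in range(trials)) / trials
print(f"LB average time: {lb:.0f}   generalized BCC average time: {bcc:.0f}")
```

Under LB the fast workers receive large loads and hence large deterministic shifts a_i r_i, which dominate the completion time, while the coverage-based stopping rule lets BCC finish earlier on average.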
Figure 4.5: Illustration of the performance gain achieved by the generalized BCC scheme for a heterogeneous cluster (average computation time of the LB and generalized BCC schemes).

4.5 Conclusion

We propose a distributed computing scheme, named batched coupon's collector (BCC), which effectively mitigates the straggler effect in distributed gradient descent algorithms. We theoretically illustrate that the BCC scheme is robust to the maximum number of stragglers to within a logarithmic factor. We also empirically demonstrate the performance gain of BCC over baseline straggler mitigation strategies on EC2 clusters. Finally, we generalize the BCC scheme to minimize the job execution time over heterogeneous clusters.

Chapter 5
Lagrange Coded Computing: Optimal Design for Resiliency, Security, and Privacy

5.1 Introduction

The massive size of modern datasets necessitates computational tasks to be performed distributedly, where data is dispersed among many servers operating in parallel [2]. As we "scale out" computations across many servers, several fundamental challenges arise. Cheap commodity hardware tends to vary greatly in computation time, and it has been demonstrated [30, 61, 110] that a small fraction of servers, referred to as stragglers, can be 5 to 8 times slower than average, creating significant delays in computations. Also, as we distribute computations across many servers, massive amounts of data must be moved between them to execute the computational tasks, often over many iterations of a running algorithm, and this creates a substantial bandwidth bottleneck [60]. Distributed computing systems are also much more susceptible to adversarial servers, making security and privacy a major concern [14, 24, 17].
We consider a general scenario where computation is carried out distributively across several workers, and propose Lagrange Coded Computing (LCC), a new framework to simultaneously provide

(i) resiliency against straggler workers that may prolong computations;

(ii) security against Byzantine (or malicious, adversarial) workers, with no computational restriction, that deliberately send erroneous data in order to affect the computation for their benefit; and

(iii) (information-theoretic) privacy of the dataset amidst possible collusion of workers.

Figure 5.1: An overview of the problem considered in this chapter, where the goal is to evaluate a not necessarily linear function f on a given dataset X = (X_1,...,X_K) using N workers. Each worker applies f on a possibly coded version of the inputs (denoted by X̃_i's). By carefully designing the coding strategy, the master can decode all the required results from a subset of workers, in the presence of stragglers (workers s_1,...,s_S) and Byzantine workers (workers m_1,...,m_A), while keeping the dataset private from colluding workers (workers c_1,...,c_T).

LCC can be applied to any computation scenario where the function of interest is an arbitrary multivariate polynomial of the input dataset. This covers many computations of interest in machine learning, such as various gradient and loss-function computations and tensor algebraic operations (e.g., low-rank tensor approximation). The key idea of LCC is to encode the input using Lagrange polynomials, to create computational redundancy in a novel coded form across workers.
This redundancy can then be exploited to provide resiliency, security, and privacy. Specifically, as illustrated in Fig. 5.1, using a master-worker distributed computing architecture with N workers, the goal is to compute f(X_i) for every X_i in a large dataset X = (X_1, X_2,...,X_K), where f is a given polynomial with degree deg f. To do so, N coded versions of the input dataset, denoted by X̃_1, X̃_2,...,X̃_N, are created, and the workers then compute f over the coded data, as if no coding were taking place. For a given N and f, we say that the tuple (S, A, T) is achievable if there exists an encoding and decoding scheme that can complete the computations in the presence of up to S stragglers and up to A adversarial workers, whilst keeping the dataset private against sets of up to T colluding workers.

Our main result is that by carefully encoding the dataset, the proposed LCC achieves (S, A, T) if

(K + T − 1) deg f + S + 2A + 1 ≤ N.

The significance of this result is that with one additional worker (i.e., increasing N by 1) LCC can increase the resiliency to stragglers by 1 or increase the robustness to malicious servers by 1/2, while maintaining the privacy constraint. Hence, this result essentially extends the well-known optimal scaling of error-correcting codes (i.e., adding one parity can provide robustness against one erasure or 1/2 error in optimal maximum distance separable codes) to the distributed secure computing paradigm. We prove the optimality of LCC by showing that it achieves the optimal tradeoff between resiliency, security, and privacy. This further extends the scaling law in coding theory, showing that, similar to the resiliency and security requirements, it is fundamental that any additional worker increases data privacy against colluding workers by 1/deg f.
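The achievability condition can be packaged as a small helper (the function names and interface are ours, for illustration): given N, K, and deg f, it checks whether a tuple (S, A, T) is supported and how many stragglers can be tolerated.

```python
def lcc_achievable(N, K, deg_f, S, A, T):
    """Check the LCC condition (K + T - 1) * deg_f + S + 2A + 1 <= N."""
    return (K + T - 1) * deg_f + S + 2 * A + 1 <= N

def max_stragglers(N, K, deg_f, A=0, T=0):
    """Largest S such that (S, A, T) is achievable (negative if none is)."""
    return N - 1 - 2 * A - (K + T - 1) * deg_f

# The illustrating example of Section 5.4.1: f(X) = X^2 (deg f = 2),
# K = 2 data blocks, N = 8 workers, tolerating A = 1 adversary, T = 1 colluder.
print(lcc_achievable(N=8, K=2, deg_f=2, S=1, A=1, T=1))   # True: 4+1+2+1 <= 8
print(max_stragglers(N=8, K=2, deg_f=2, A=1, T=1))        # 1

# One extra worker buys one more straggler, with the privacy level T unchanged:
print(max_stragglers(N=9, K=2, deg_f=2, A=1, T=1))        # 2
```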
Finally, we specialize our theoretical guarantees of LCC in the context of least-squares linear regression, an elemental learning task, and demonstrate its performance gain by optimally suppressing stragglers. Leveraging the algebraic structure of gradient computation, several strategies have been developed recently to exploit data and gradient coding for straggler mitigation in the training process (e.g., [58, 103, 75, 50, 66]). We implement LCC for regression on Amazon EC2 clusters, and empirically compare its performance with conventional uncoded approaches and two state-of-the-art straggler mitigation schemes: gradient coding (GC) [103, 42, 89, 111] and matrix-vector multiplication (MVM) based approaches [58, 75]. Our experiments show that compared with the uncoded, GC, and MVM schemes, LCC improves the run-time by 6.79×-13.43×, 2.36×-4.29×, and 1.01×-12.65×, respectively.

Related works. There has recently been a surge of interest in using coding-theoretic approaches to alleviate key bottlenecks (e.g., stragglers, bandwidth, and security) in distributed machine learning applications (e.g., [TandonLDK17, 56, 68, 32, 116, 42, 89, 114, 64, 33, 85, 19]). The proposed LCC scheme significantly advances prior works in this area by 1) generalizing coded computing to arbitrary multivariate polynomial computations, which are of particular importance in learning applications; 2) extending the application of coded computing to secure and private computing; 3) reducing the computation/communication load in distributed computing (and distributed learning) by factors that scale with the problem size, without compromising security and privacy guarantees; and 4) enabling 2.36×-12.65× speedup over the state of the art in distributed least-squares linear regression in cloud networks.

Secure/private multiparty computing (MPC) and machine learning (e.g., [11, 79]) are also extensively studied topics that address a problem setting similar to LCC.
As we elaborate in Section ??, compared with conventional methods (e.g., the celebrated BGW scheme for MPC [11]), LCC achieves a substantial reduction in the amount of randomness, storage overhead, and computation complexity.

5.2 Problem Formulation and Examples

We consider the problem of evaluating a multivariate polynomial f : V → U over a dataset X = (X_1,...,X_K), where V and U are vector spaces over a field 𝔽. The goal is to compute Y_1 ≜ f(X_1),...,Y_K ≜ f(X_K), in a distributed computing environment with a master and N workers (Figure 5.1). We define the degree of f, denoted by deg f, as the total degree of the polynomial.∗

In this setting, each worker has already stored a fraction of the dataset prior to computation, in a possibly coded manner. Specifically, for i ∈ [N] (where [N] ≜ {1,...,N}), worker i stores X̃_i ≜ g_i(X_1,...,X_K), where g_i is a (possibly random) function, referred to as the encoding function of that worker. We restrict our attention to linear encoding schemes (see [115] for a formal definition), which guarantee low encoding complexity and simple implementation. Each worker i computes Ỹ_i ≜ f(X̃_i) and returns it to the master. The master waits for a subset of the fastest workers and decodes Y_1,...,Y_K. The procedure must satisfy several additional requirements:

• Resiliency, i.e., robustness against stragglers. Formally, the master must be able to obtain the correct values of Y_1,...,Y_K even if up to S workers fail to respond (or respond after the master executes the decoding algorithm), where S is the resiliency parameter. A scheme that guarantees resiliency against S stragglers is called S-resilient.

• Security, i.e., robustness against adversaries. That is, the master must be able to obtain the correct values of Y_1,...,Y_K even if up to A workers return arbitrarily erroneous results, where A is the security parameter. A scheme that guarantees security against A adversaries is called A-secure.
• Privacy, i.e., the workers must remain oblivious to the content of the dataset, even if up to T of them collude, where T is the privacy parameter. Formally, for every set 𝒯 ⊆ [N] of size at most T, we must have I(X; X̃_𝒯) = 0, where I is mutual information, X̃_𝒯 represents the collection of the encoded dataset stored at the workers in 𝒯, and X is seen as chosen uniformly at random. A scheme which guarantees privacy against T colluding workers is called T-private.†

∗ We focus on the non-trivial case where K > 0 and f is not constant. The total degree of a polynomial f is the maximum among all the total degrees of its monomials. When discussing finite 𝔽, we resort to the canonical representation of polynomials, in which the individual degree within each term is no more than |𝔽| − 1.

More concretely, given any subset of workers that return the computing results (denoted by 𝒦), the master computes (Ŷ_1,...,Ŷ_K) = h_𝒦({Ỹ_i}_{i∈𝒦}), where each h_𝒦 is a deterministic function (or is random but independent of both the encoding functions and the input data). We refer to the h_𝒦's as decoding functions.‡ We say that a scheme is S-resilient, A-secure, and T-private if the master always returns the correct results (i.e., each Y_i = Ŷ_i), and all of the above requirements are satisfied.

Given this framework, we aim to characterize the region for (S, A, T) such that an S-resilient, A-secure, and T-private scheme can be found, given N, K, and the function f, for any sufficiently large field 𝔽. This framework encapsulates many computations of interest, including linear computation [13, 50, 58, 107], bilinear computation [118], general tensor algebra [92], and gradient computation [96].

5.3 Main Results

We now state our main results and discuss their connections with prior works. Our first theorem characterizes the region for (S, A, T) that LCC achieves (i.e., the set of all feasible S-resilient, A-secure, and T-private schemes via LCC, as defined in the previous section).
† Equivalently, it requires that X̃_𝒯 and X are independent. Under this condition, the input data X still appears uniformly random after the colluding workers learn X̃_𝒯, which guarantees the privacy. To guarantee that the privacy requirement is well defined, we assume that 𝔽 and V are finite whenever T > 0.

‡ Similar to encoding, we also require the decoding functions to have low complexity. When there is no adversary (A = 0), we restrict our attention to linear decoding schemes.

Theorem 5.1. Given a number of workers N and a dataset X = (X_1,...,X_K), Lagrange Coded Computing (LCC) provides an S-resilient, A-secure, and T-private scheme for computing {f(X_i)}_{i=1}^K for any polynomial f, as long as

(K + T − 1) deg f + S + 2A + 1 ≤ N.  (5.1)

Remark 5.1. To prove Theorem 5.1, we formally present LCC in Section ??, which achieves the stated resiliency, security, and privacy. The key idea is to encode the input dataset using the Lagrange polynomial. In particular, the encoding functions (i.e., the g_i's) in LCC amount to evaluations of a Lagrange polynomial of degree K − 1 at N distinct points. Hence, computations at the workers amount to evaluations of a composition of that polynomial with the desired function f. Therefore, inequality (5.1) may simply be seen as the number of evaluations that are necessary and sufficient in order to interpolate the composed polynomial, which is later evaluated at a certain point to finalize the computation.

LCC also has a number of additional properties of interest. First, the proposed encoding is identical for all computations f, which allows pre-encoding of the data without knowing the identity of the computing task (i.e., universality). Second, decoding and encoding rely on polynomial interpolation and evaluation, and hence efficient off-the-shelf subroutines can be used.

Remark 5.2.
Note that the LHS of inequality (5.1) is independent of the number of workers N; hence the key property of LCC is that adding 1 worker can increase its resilience to stragglers by 1 or its security against malicious servers by 1/2, while keeping the privacy constraint T the same. Note that using an uncoded replication-based approach, to increase the resiliency to stragglers by 1, one needs to essentially repeat each computation once more (i.e., requiring K more machines as opposed to 1 machine in LCC). This result essentially extends the well-known optimal scaling of error-correcting codes (i.e., adding one parity can provide robustness against one erasure or 1/2 error in optimal maximum distance separable codes) to the distributed computing paradigm.

Our next theorem demonstrates the optimality of LCC.

Theorem 5.2. Lagrange Coded Computing (LCC) achieves the optimal trade-off between resiliency, security, and privacy for any multilinear function f among all computing schemes that use linear encoding, for all problem scenarios. Moreover, when focusing on the case where no security constraint is imposed, LCC, which uses linear decoding functions, is optimal for any polynomial f among all schemes with this additional decoding constraint, when the characteristic of 𝔽 is sufficiently large.

Remark 5.3. Theorem 5.2 is formally stated and proved in the long version [115]. The main proof idea is to show that any computing strategy that outperforms LCC (or its uncoded version) would violate the decodability requirement, by finding two instances of the computation process where the same intermediate computing results correspond to different output values.

Remark 5.4.
LCC improves upon and generalizes previous works on coded computing in a few aspects. Generality: LCC significantly generalizes prior works to go beyond linear and bilinear computations, which have so far been the main focus in this area, and can be applied to arbitrary multivariate polynomial computations that arise in machine learning applications. Universality: once the data has been coded, any polynomial up to a certain degree can be computed distributedly via LCC. In other words, the data encoding of LCC can be universally used for any polynomial computation. This is in stark contrast to previous task-specific coding techniques in the literature. Furthermore, workers apply the same computation as if no coding took place; a feature that reduces computational costs, and prevents ordinary servers from carrying the burden of outliers. Security and Privacy: other than a handful of works discussed above, straggler mitigation (i.e., resiliency) has been the primary focus of the coded computing literature. This work extends the application of coded computing to secure and private computing for general polynomial computations.

                           BGW                  LCC
Complexity per worker      K                    1
Frac. data per worker      1                    1/K
Randomness                 KT                   T
Min. num. of workers       deg(f)(T + 1)        deg(f)(K + T − 1) + 1

Table 5.1: Comparison between BGW-based designs and LCC. The computational complexity is normalized by that of evaluating f; randomness, which refers to the number of random entries used in the encoding functions, is normalized by the length of X_i.

Remark 5.5. To illustrate the significant role of LCC in secure and private computing, we consider the celebrated BGW scheme [11].§ As we elaborate below, in comparison with the BGW scheme, LCC results in a factor of K reduction in the amount of randomness, storage overhead, and computation complexity, while requiring more workers to guarantee the same level of privacy (see Table 5.1).¶
A key distinction between the two is that BGW uses Shamir's scheme [97] to secret-share the entire dataset, while LCC instead uses Lagrange polynomials for encoding. LCC operates on a 1/K fraction of the input dataset as a unit, resulting in a significant reduction (a factor of K) in storage and randomness. The BGW scheme then evaluates f over all the stored coded data at the nodes. In LCC, however, each worker ℓ only needs to store one encoded X̃_ℓ and compute f(X̃_ℓ). This leads to the second key advantage of LCC, namely a factor-of-K reduction in the computation complexity at each worker. In BGW, after the computation each worker ℓ has essentially evaluated a polynomial of degree at most deg(f)·T. With no straggler or adversary, the master can recover all required results through polynomial interpolation, as long as N ≥ deg(f)·T + 1 workers participated.‖ Under the same condition, LCC requires deg(f)·(K + T − 1) + 1 workers, larger than that of the BGW scheme.

§ Conventionally, the BGW scheme operates in multiple rounds, requiring significantly more communication overhead than one-shot schemes. For simplicity of comparison, we present a modified one-shot version of BGW.
¶ A BGW scheme was also proposed in [11] for secure MPC, though for a substantially different setting. A similar comparison can be made by adapting it to our setting, leading to similar results, omitted for brevity.

5.4 Lagrange Coded Computing

In this section we prove Theorem 5.1 by presenting LCC and characterizing the region of (S, A, T) that it achieves. We start with an example to illustrate the key components of LCC.

5.4.1 Illustrating Example

Consider the function f(X_i) = X_i², where the X_i's are square matrices. We demonstrate LCC in the scenario where the input data X is partitioned into K = 2 batches X_1 and X_2, and the computing system has N = 8 workers. Assume that the underlying field is GF(11). We aim to achieve (S, A, T) = (1, 1, 1).
The gist of LCC is picking a uniformly random matrix Z and encoding (X_1, X_2, Z) using a Lagrange interpolation polynomial:

u(z) ≜ X_1 · [(z − 2)(z − 3)] / [(1 − 2)(1 − 3)] + X_2 · [(z − 1)(z − 3)] / [(2 − 1)(2 − 3)] + Z · [(z − 1)(z − 2)] / [(3 − 1)(3 − 2)].

We then fix distinct {α_i}_{i=1}^{8} in the field such that {α_i}_{i=1}^{8} ∩ [2] = ∅, and let workers 1, ..., 8 store u(α_1), ..., u(α_8). First, note that for every j ∈ [8], worker j sees X̃_j, a linear combination of X_1 and X_2 that is masked by the addition of λ·Z for some nonzero λ ∈ GF(11); since Z is uniformly random, this guarantees perfect privacy for T = 1. Next, note that worker j computes f(X̃_j) = f(u(α_j)), which is an evaluation at α_j of the composition polynomial f(u(z)), whose degree is at most 4. Normally, a polynomial of degree 4 can be interpolated from 5 evaluations at distinct points. However, the presence of A = 1 adversary and S = 1 straggler requires the master to employ a Reed–Solomon decoder and to have three additional evaluations at distinct points (in general, two additional evaluations per adversary and one per straggler). Finally, after decoding the polynomial f(u(z)), the master can obtain f(X_1) and f(X_2) by evaluating it at z = 1 and z = 2.

‖ It is also possible to use the conventional multi-round BGW, which only requires N ≥ 2T + 1 workers to ensure T-privacy. However, multiple rounds of computation and communication (Ω(log deg(f)) rounds) are then needed, which further increases its communication overhead.

5.4.2 General Description

Similar to Subsection 5.4.1, we select any K + T distinct elements β_1, ..., β_{K+T} from the field, and find a polynomial u : 𝔽 → V of degree at most K + T − 1 such that u(β_i) = X_i for every i ∈ [K], and u(β_i) = Z_i for i ∈ {K + 1, ..., K + T}, where all the Z_i's are chosen uniformly at random from V. This is accomplished simply by letting u be the Lagrange interpolation polynomial

u(z) ≜ Σ_{j∈[K]} X_j · Π_{k∈[K+T]\{j}} (z − β_k)/(β_j − β_k) + Σ_{j=K+1}^{K+T} Z_j · Π_{k∈[K+T]\{j}} (z − β_k)/(β_j − β_k).
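The illustrating example of Subsection 5.4.1 can be checked end-to-end with a short script. This is a minimal sketch, not from the source: scalars over GF(11) stand in for the matrix blocks, and no straggler or adversary is simulated, so 5 evaluations of the degree-4 composition f(u(z)) suffice; all names are illustrative.

```python
# LCC example: K = 2 data blocks, T = 1 random mask, f(X) = X^2, over GF(11).
p = 11

def interp_eval(xs, ys, x, p):
    """Evaluate at x the unique poly of degree < len(xs) through (xs, ys), mod p."""
    total = 0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for k, xk in enumerate(xs):
            if k != j:
                num = num * ((x - xk) % p) % p
                den = den * ((xj - xk) % p) % p
        # den^(p-2) mod p is the modular inverse of den (Fermat's little theorem)
        total = (total + yj * num * pow(den, p - 2, p)) % p
    return total

X1, X2, Z = 3, 5, 7          # two data blocks and one uniformly random mask
betas = [1, 2, 3]            # u(1) = X1, u(2) = X2, u(3) = Z

# Encoding: worker i stores u(alpha_i), with the alphas disjoint from {1, 2}.
alphas = [4, 5, 6, 7, 8]
shares = [interp_eval(betas, [X1, X2, Z], a, p) for a in alphas]
results = [s * s % p for s in shares]        # each worker computes f = square

# Decoding: interpolate f(u(z)) from the 5 evaluations, then read off z = 1, 2.
f1 = interp_eval(alphas, results, 1, p)
f2 = interp_eval(alphas, results, 2, p)
print(f1, f2)   # equals X1^2 mod 11 and X2^2 mod 11
```

Since u interpolates the data at z = 1, 2, decoding f(u(z)) and evaluating at those points recovers f(X_1) and f(X_2) exactly.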
We then select N distinct elements {α_i}_{i∈[N]} from the field such that {α_i}_{i∈[N]} ∩ {β_j}_{j∈[K]} = ∅ (this requirement is alleviated if T = 0), and let X̃_i = u(α_i) for every i ∈ [N]. That is,

X̃_i = u(α_i) = (X_1, ..., X_K, Z_{K+1}, ..., Z_{K+T}) · U_i,

where U ∈ 𝔽_q^{(K+T)×N} is the encoding matrix with entries U_{i,j} ≜ Π_{ℓ∈[K+T]\{i}} (α_j − β_ℓ)/(β_i − β_ℓ), and U_i is its i-th column. Following the above encoding, each worker i applies f to X̃_i and sends the result back to the master. Hence the master obtains N − S evaluations of the polynomial f(u(z)), at most A of which are incorrect. Since deg(f(u(z))) ≤ deg(f)·(K + T − 1) and N ≥ (K + T − 1) deg(f) + S + 2A + 1, the master can obtain all coefficients of f(u(z)) by applying Reed–Solomon decoding. Having this polynomial, the master evaluates it at β_i for every i ∈ [K] to obtain f(u(β_i)) = f(X_i); hence we have shown that the above scheme is S-resilient and A-secure.

As for T-privacy, our proof relies on the fact that the bottom T × N submatrix U_bottom of U is an MDS matrix (i.e., every T × T submatrix of U_bottom is invertible; see Appendix 7.4.1). Hence, for a colluding set 𝒯 ⊆ [N] of workers of size T, their encoded data X̃_𝒯 is masked by Z·U_bottom^𝒯, where Z ≜ (Z_{K+1}, ..., Z_{K+T}) and U_bottom^𝒯 ∈ 𝔽_q^{T×T} is the submatrix of U_bottom formed by the columns indexed by 𝒯. Now, the fact that any such U_bottom^𝒯 is invertible implies that the random padding added for these colluding workers is uniformly random, which completely masks the coded data. This directly guarantees T-privacy.

5.5 Optimality

In this section we provide a layout for the proof of the optimality of LCC (i.e., Theorem 5.2). Specifically, we need to prove its optimality from three aspects, formally described in the following theorems:

Theorem 5.2a (Optimal Resiliency).
Any computing scheme in which both the encoding and decoding functions are linear can tolerate at most S = N − (K − 1) deg(f) − 1 stragglers when N ≥ K deg(f) − 1, and S = ⌊N/K⌋ − 1 stragglers when N < K deg(f) − 1, assuming the characteristic of 𝔽_q is 0 or greater than deg(f).

More specifically, we define a linear encoding function as one that computes a linear combination of the input variables and possibly a random key, while a linear decoding function computes a linear combination of the workers' outputs. The above theorem proves that LCC achieves the optimal resiliency, as it can tolerate S = N − (K − 1) deg(f) − 1 stragglers, and its uncoded version (described in Appendix 7.4.2) can tolerate S = ⌊N/K⌋ − 1 stragglers. Theorem 5.2a is proved in Appendix 7.4.3 by constructing, for any assumed scheme that outperforms LCC or its uncoded version, instances of the computation process in which such a scheme fails to achieve decodability.

Theorem 5.2b (Optimal Security). For any linear encoding scheme and any multilinear function f, security can be provided against at most A = ⌊(N − (K − 1) deg(f) − 1)/2⌋ adversaries when N ≥ K deg(f) − 1, and A = ⌊N/(2K) − 1/2⌋ adversaries when N < K deg(f) − 1.

Similar to the resiliency requirement, the above theorem proves that LCC achieves the optimal security. We prove Theorem 5.2b in Appendix 7.4.4, using an extended concept of Hamming distance defined in [118] for coded computing.

Finally, the LCC scheme pads the dataset X with T additional random entries to guarantee T-privacy, and this amount of randomness is shown to be minimal.

Theorem 5.2c (Optimal Randomness). Any linear encoding scheme that universally achieves the same trade-off (S, A, T) specified in Theorem 5.1 for all linear functions f must use an amount of randomness no less than that of LCC.

The proof of Theorem 5.2c is given in Appendix 7.4.5.
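The achievable region of Theorem 5.1 (a triple (S, A, T) is achievable whenever (K + T − 1) deg(f) + S + 2A + 1 ≤ N, per Section 5.4.2) can be explored numerically. A small sketch, with parameters matching the illustrating example (K = 2, deg(f) = 2, N = 8); the helper name is illustrative:

```python
# A triple (S, A, T) is achievable by LCC iff (K + T - 1) deg(f) + S + 2A + 1 <= N.
def achievable(N, K, deg_f, S, A, T):
    return (K + T - 1) * deg_f + S + 2 * A + 1 <= N

# The example's target (S, A, T) = (1, 1, 1) with K = 2, deg(f) = 2, N = 8:
print(achievable(8, 2, 2, 1, 1, 1))      # True: the budget is exactly met
# One extra worker buys one more straggler; two extra buy one more adversary:
print(achievable(9, 2, 2, 2, 1, 1))      # True
print(achievable(10, 2, 2, 1, 2, 1))     # True
```

This also makes the "one worker per straggler, two per adversary" scaling of the theorem concrete.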
5.6 Application to Linear Regression and Experiments on AWS EC2

We demonstrate a practical application of LCC in accelerating distributed linear regression, whose gradient computation is a quadratic function of the input and hence matches the LCC framework well. We also show its performance gain over the state of the art via experiments on AWS EC2 clusters.

Applying LCC to linear regression. Given a feature matrix X ∈ R^{m×d} containing m data points with d features each, and a label vector y ∈ R^m, a linear regression problem aims to find the weight vector w ∈ R^d that minimizes the loss ||Xw − y||². Gradient descent (GD) solves this by iteratively moving the weight along the negative gradient, which at iteration t is computed as 2X^⊤(Xw^(t) − y). To run GD distributedly over a system with a master and n workers, each worker stores an r/n fraction of coded columns, for some parameter 1 ≤ r ≤ n. Given the current weight w, each worker performs its computation using its local storage. We cast this computation into the model of Section ?? by grouping the columns into K = ⌈n/r⌉ blocks such that X = [X̄_1 ··· X̄_K]^⊤, and letting the system compute XX^⊤w, which is the sum of the degree-2 polynomial f(X̄_k) = X̄_k X̄_k^⊤ w evaluated over X̄_1, ..., X̄_K.∗∗ Using LCC, we can achieve a recovery threshold of R_LCC = 2(K − 1) + 1 = 2⌈n/r⌉ − 1 (Theorem 5.1).††

Figure 5.2: Run-time comparison of LCC with three other schemes: uncoded, GC, and MVM.

Comparison with the state of the art. The conventional uncoded scheme picks r = 1 and has each worker j compute X̄_j X̄_j^⊤ w, yielding a recovery threshold of R_uncoded = n. By redundantly storing/processing r > 1 uncoded sub-matrices at each worker, the "gradient coding" (GC) methods [103, 42, 89] code across partial gradients computed from uncoded data, and reduce the recovery threshold to R_GC = n − r + 1.

∗∗ Since the value of X^⊤y does not vary across iterations, it only needs to be computed once. We assume that it is available at the master for the weight updates.
†† The recovery threshold (denoted by R) is defined as the minimum number of workers the master needs to wait for in order to guarantee decodability (i.e., tolerating the remaining stragglers).
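The block decomposition behind this casting can be sanity-checked numerically. A minimal sketch, assuming the X̄_k are taken as column blocks of a d × m matrix (the layout that makes the identity hold; dimensions and names are illustrative):

```python
# Check: M M^T w equals the sum over column blocks of Xbar_k Xbar_k^T w,
# each summand a degree-2 polynomial in the block, as used by LCC.
import numpy as np

rng = np.random.default_rng(0)
d_, m_, K_ = 5, 12, 3
M = rng.normal(size=(d_, m_))           # d x m data matrix
w = rng.normal(size=d_)                 # current weight vector

blocks = np.split(M, K_, axis=1)        # K column blocks Xbar_1, ..., Xbar_K
lhs = M @ M.T @ w                       # full gradient-style product
rhs = sum(B @ (B.T @ w) for B in blocks)  # sum of f(Xbar_k) = Xbar_k Xbar_k^T w
print(np.allclose(lhs, rhs))            # True
```

Because each summand is a polynomial of degree 2 in its block, it fits directly into the LCC encoding with deg(f) = 2.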
An alternative "matrix-vector multiplication based" (MVM) approach [56] requires two rounds of computation, and achieves a recovery threshold of R_MVM = ⌈2n/r⌉ in each round when the storage is split evenly between the two rounds.

Compared with GC, LCC codes directly on the data and reduces the recovery threshold by a factor of about r/2. While the amount of computation and communication per node is the same for both, LCC is expected to finish much faster due to its much smaller recovery threshold. Compared with MVM, LCC achieves a recovery threshold smaller than that of each MVM round (assuming an even storage split). While each MVM worker performs less computation per iteration, it sends two vectors with sizes proportional to m and d, respectively, whereas each LCC worker sends only one d-dimensional vector.

We run linear regression on EC2 using Nesterov's accelerated gradient descent, implemented on t2.micro nodes. We generate synthetic datasets of m data points by randomly sampling 1) the true weight w* and 2) each input x_i of d features, and computing the output y_i = x_i^⊤ w*. For each dataset, we run GD for 100 iterations over 40 workers. We use different dimensions for the input X: 8000 × 7000 for scenarios 1 and 2, and 160000 × 500 for scenario 3. In scenario 1, we let the system run with natural stragglers. To mimic slow or failed workers, we artificially introduce stragglers in scenarios 2 and 3 by imposing a 0.5-second delay on each worker with probability 5% in each iteration. To avoid numerical instability due to large entries, we can embed the input data into a large finite field and apply LCC with exact computations.
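The recovery thresholds quoted in this comparison can be tabulated directly. A small sketch for n = 40 workers and a few redundancy levels r (the helper name is illustrative; MVM assumes an even storage split across its two rounds):

```python
# Recovery thresholds: uncoded R = n, GC R = n - r + 1,
# MVM R = ceil(2n / r) per round, LCC R = 2 ceil(n / r) - 1.
import math

def thresholds(n, r):
    return {
        "uncoded": n,
        "GC": n - r + 1,
        "MVM": math.ceil(2 * n / r),          # per round
        "LCC": 2 * math.ceil(n / r) - 1,
    }

for r in (2, 4, 8):
    print(r, thresholds(40, r))
```

For example, at n = 40 and r = 4 this gives GC = 37 versus LCC = 19, illustrating the roughly r/2 reduction over GC.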
However, in all our experiments the gradients are calculated correctly without carrying out this step.

Results. For GC and LCC, we optimize the total run-time over r subject to the local memory size. For MVM, we further optimize the run-time over the storage assigned between the two rounds of multiplications. See Fig. 5.2 for the run-time measurements, and the long version [115] for a detailed breakdown by scenario. We draw the following conclusions from the experiments.

• LCC achieves the least run-time in all scenarios. In particular, LCC speeds up the uncoded scheme by 6.79×–13.43×, the GC scheme by 2.36×–4.29×, and the MVM scheme by 1.01×–12.65×.

• In scenarios 1 and 2, where m is close to the number of features d, LCC achieves a performance similar to MVM. However, with many more data points in scenario 3, LCC finishes substantially faster than MVM, by as much as 12.65×. The main reason is that MVM requires large amounts of data transfer from the workers to the master in the first round and from the master to the workers in the second round (both proportional to m), whereas the amount of communication from each worker or the master is proportional to d for all other schemes, which is much smaller than m in scenario 3.

Chapter 6

Fitting ReLUs via SGD and Quantized SGD

6.1 Introduction

Many modern learning tasks involve fitting nonlinear models to data. Given training data consisting of n pairs of input features x_i ∈ R^d and desired outputs y_i ∈ R, we wish to infer a function that best explains the training data. A prominent example is neural network models, which have enabled impressive empirical successes in applications spanning natural language processing to robotics. Guaranteed training of nonlinear data models, however, remains elusive.
The main challenge is that fitting such nonlinear models requires solving highly nonconvex optimization problems, and it is not clear why local search methods such as stochastic gradient descent converge to globally optimal solutions without getting stuck in spurious local optima and saddle points.

In this chapter we focus on fitting Rectified Linear Units (ReLUs) to data; these are functions ϕ_w : R^d → R of the form ϕ_w(x) = max(0, ⟨w, x⟩). We study a nonlinear least-squares formulation of the form

min_{w∈R^d} L(w) := (1/n) Σ_{i=1}^{n} ℓ_i(w) = (1/n) Σ_{i=1}^{n} (max(0, ⟨w, x_i⟩) − y_i)².   (6.1)

A popular approach to solving problems of this kind is Stochastic Gradient Descent (SGD). Indeed, SGD, due to its manageable memory footprint and highly parallelizable nature, has become a mainstay of modern learning systems. Despite its wide use, however, due to the nonconvex nature of the loss it is completely unclear why SGD converges to a globally optimal model. Fitting ReLUs via SGD poses new challenges: When are the iterates able to converge to global optima? How many samples are required? What is the convergence rate, and is it possible to ensure a fast, geometric rate of convergence? How do the answers to the above change based on the mini-batch size?

Yet another challenge, which arises when implementing SGD in a distributed framework, is the high communication overhead required for transferring gradient updates between processors. A recent remedy is the use of quantized gradient updates, such as Quantized SGD (QSGD), to reduce the communication overhead. However, there is little understanding of how such quantization schemes perform on nonlinear learning tasks. Do quantized updates converge to the same solution as their unquantized variants? If so, how is the convergence rate affected? How many quantization levels or bits are required to achieve good accuracy? In this chapter we wish to address the above challenges.
Our main contributions are as follows:

• We study the problem of fitting ReLUs and show that SGD converges at a fast linear rate to a globally optimal solution. This holds with a near-minimal number of data observations. We also characterize the convergence rate as a function of the SGD mini-batch size.

• We show that the QSGD approach of [3] also converges at a linear rate to a globally optimal solution. This holds even when the number of quantization levels grows only logarithmically in the problem dimension. We also characterize the various trade-offs between communication and computational resources when using such low-precision algorithms.

• We provide experimental results corroborating our theoretical findings.

6.2 Algorithms: SGD and QSGD

In this section we discuss the details of the algorithms we plan to study. We begin by discussing the specifics of the SGD iterates we will use. We then discuss how to use quantization techniques in order to reduce the communication overhead in a distributed implementation.

6.2.1 Stochastic Gradient Descent (SGD) for Fitting ReLUs

To solve the optimization problem (6.1) we use a mini-batch SGD scheme. While the loss function (6.1) is not differentiable, one can still use an update akin to SGD by defining a generalized notion of gradient for the non-differentiable points as a limit of gradients of points converging to the non-differentiable point [22]. In each iteration we sample the indices i_t^(1), i_t^(2), ..., i_t^(m) uniformly with replacement from {1, 2, ..., n} and apply updates of the form

w_{t+1} = w_t − η · (1/m) Σ_{j=1}^{m} ∇ℓ_{i_t^(j)}(w_t).   (6.2)

Here, ∇ℓ_i denotes the generalized gradient of the loss ℓ_i and is equal to

∇ℓ_i(w) = 2 (ReLU(⟨w, x_i⟩) − y_i) (1 + sgn(⟨w, x_i⟩)) x_i.

6.2.2 Reducing the Communication Overhead via Quantized SGD

One of the major advantages of SGD is that it is highly scalable to massive amounts of data.
SGD can easily be implemented on a distributed platform where each processor calculates a portion of the mini-batch based on the available local data. The partial gradients are then sent back to a master node, or to the other processors, to calculate the full mini-batch gradient and update the next iterate. The latter case, for example, is common in modern deep learning implementations [3]. Both distributed approaches, however, suffer from a major bottleneck due to the cost of transmitting the gradients to the master or between the processors. A recent remedy for reducing this cost is utilizing lossy compression to quantize the gradients prior to transmission/broadcast. In particular, a recent paper [3] proposes the quantized SGD (QSGD) algorithm, based on a randomized quantization function Q_s : R^d → R^d, with s denoting the number of quantization levels. Specifically, for a vector v ∈ R^d, the i-th entry of the quantization function Q_s(v) is given by

Q_s(v_i) = ||v||_2 · sgn(v_i) · ξ_i(v, s),

where the ξ_i(v, s)'s are independent random variables defined as

ξ_i(v, s) = h/s with probability 1 − (|v_i| s / ||v||_2 − h), and (h + 1)/s otherwise.

Here, h ∈ [0, s) is an integer such that |v_i| / ||v||_2 ∈ [h/s, (h + 1)/s], and we follow the convention that sgn(0) = 0.

To see how this quantization scheme can be used in a distributed setting, consider a master–worker distributed platform consisting of K worker processors, numbered 1, 2, ..., K, and a master processor. To run SGD on this platform, each worker processor chooses a batch of size n/K out of the n data points randomly with replacement. In each iteration, the master broadcasts the latest model to all the workers. Each worker then chooses m_k points randomly from its locally available data points, computes a partial gradient based on the selected data points using the latest model received from the master, and quantizes the resulting partial gradient. The workers then send the quantized stochastic gradients to the master.
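The randomized quantizer Q_s defined above is unbiased by construction, since E[ξ_i(v, s)] = |v_i| / ||v||_2. A minimal numpy sketch (function and variable names are illustrative):

```python
# Stochastic quantizer Q_s from [3]: each coordinate is rounded to one of
# the levels {0, 1/s, ..., 1} of |v_i| / ||v||_2, randomly and unbiasedly.
import numpy as np

def quantize(v, s, rng):
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    ratio = np.abs(v) * s / norm          # |v_i| s / ||v||_2, lies in [0, s]
    h = np.floor(ratio)                   # lower level index
    prob_up = ratio - h                   # P(round to level h + 1)
    xi = (h + (rng.random(v.shape) < prob_up)) / s
    return norm * np.sign(v) * xi

rng = np.random.default_rng(1)
v = rng.normal(size=5)
avg = np.mean([quantize(v, 4, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(avg - v)))            # small: E[Q_s(v)] = v
```

Each quantized coordinate needs only a sign and a level index (plus one shared norm), which is the source of the communication savings.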
The master waits for all the quantized partial stochastic gradients from the workers and then updates the model using their average. As a result, the aggregate effect of QSGD leads to updates of the form

w_{t+1} = w_t − η · (1/K) Σ_{k=1}^{K} Q_s( (1/m_k) Σ_{j=1}^{m_k} ∇ℓ_{i_t^(j)}(w_t) ).   (6.3)

Since only the quantized partial gradients are transmitted between the processors, QSGD significantly reduces the number of communicated bits.

6.3 Main Results

6.3.1 SGD for Fitting ReLUs

In this section we discuss our results for the convergence of the SGD iterates.

Theorem 6.1. Let w* be a fixed weight vector, and let the feature vectors x_i ∈ R^d be i.i.d. Gaussian random vectors distributed as N(0, I), with the corresponding labels given by y_i = max(0, ⟨x_i, w*⟩). Furthermore, assume

(I) the number of samples obeys n > c_0 d for a fixed numerical constant c_0;

(II) the initial estimate w_0 obeys ||w_0 − w*||_2 ≤ (7√δ_1/200) ||w*||_2 for some 0 < δ_1 ≤ 1/2.

Then the Stochastic Gradient Descent (SGD) updates in (6.2) with mini-batch size m ∈ N and learning rate η = 3 / (4 (36d/m + 25/16)) obey

E[||w_t − w*||_2²] ≤ ρ^t ||w_0 − w*||_2²,  with convergence rate ρ = 1 − 9 / (16 (36d/m + 25/16)),   (6.4)

with probability at least 1 − δ_1 − 2(n + 2)e^{−γd} − 2(n + 25)e^{−γn}. Furthermore, if t ≥ (log(2/ϵ) + log(1/δ_2)) · 1/(1 − ρ) for 0 < δ_2 ≤ 1, then

||w_t − w*||_2² ≤ ϵ ||w_0 − w*||_2²   (6.5)

holds with probability at least 1 − δ_1 − δ_2 − 2(n + 2)e^{−γd} − 2(n + 25)e^{−γn}.

Remark 6.1. We note that Theorem 6.1 requires the initial estimate w_0 to be sufficiently close to the planted model (i.e., within (7√δ_1/200) ||w*||_2). Such an initial estimate can be easily obtained by using one full gradient descent step at zero [99]. Moreover, [99] shows that gradient descent does not get stuck on a saddle point. The required sample complexity for this initialization to be effective is on the order of n ≥ c d/δ_1² for c a fixed numerical constant.
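As a quick numerical sanity check of Theorem 6.1, the following minimal simulation runs the iterates (6.2) on the planted Gaussian model with the theorem's step size. This is an illustrative sketch, not the thesis code: the dimensions are reduced for speed, and the initialization is placed inside the theorem's basin by hand.

```python
# Mini-batch SGD for the ReLU least-squares objective (6.1), planted model.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 50, 64
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))                  # i.i.d. N(0, I) features
y = np.maximum(X @ w_star, 0.0)              # planted ReLU labels

eta = 3.0 / (4.0 * (36.0 * d / m + 25.0 / 16.0))   # theorem's learning rate
w = w_star + 0.01 * rng.normal(size=d)       # start inside the theorem's basin

for _ in range(2000):
    idx = rng.integers(0, n, size=m)         # sample with replacement
    z = X[idx] @ w
    # generalized gradient: 2 (ReLU(z) - y) (1 + sgn(z)) x
    coeff = 2.0 * (np.maximum(z, 0.0) - y[idx]) * (1.0 + np.sign(z))
    w = w - eta * (coeff @ X[idx]) / m

rel_err = np.linalg.norm(w - w_star) / np.linalg.norm(w_star)
print(rel_err)                                # decays geometrically toward 0
```

Since the labels are exactly realizable, the stochastic gradients vanish at w*, so the iterates converge to w* itself rather than to a noise floor.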
For the purposes of this result we take δ_1 to be a small constant, so that on the order of n ≳ d samples are sufficient for this initialization.

Remark 6.2. Theorem 6.1 shows that the SGD iterates (6.2) converge to a globally optimal solution at a geometric rate. Furthermore, the required number of samples for this convergence to occur is nearly minimal and is on the order of the number of parameters d.

Remark 6.3. Theorem 6.1 also characterizes the influence of the mini-batch size on the convergence rate, illustrating the trade-off between the computational load and the convergence speed.

Remark 6.4. In Theorem 6.1, the probability of success depends on the distance of the initial point to the optimal solution. As we detail in the appendix, we can use an ensemble algorithm to make the failure probability arbitrarily small, without the need to start from an initial point that is very close to the optimal solution.

Remark 6.5. We note that for m = n, the updates in (6.2) reduce to full gradient descent, and the guarantee (6.4) takes the form ||w_t − w*||_2² ≤ ρ^t ||w_0 − w*||_2². In this special case our result recovers that of [99] up to a constant factor.

6.3.2 Quantized SGD

We next focus on providing guarantees for QSGD.

Theorem 6.2. Consider the same setting and assumptions as in Theorem 6.1. Furthermore, consider a parallel setting with K worker processors and a master per Section 6.2.2, and assume that each worker computes m/K partial gradients in each iteration (i.e., m_k = m/K). We run QSGD over these processors via the iterative updates in (6.3). Then

E[||w_t − w*||_2²] ≤ α^t ||w_0 − w*||_2²,  with convergence rate α = 1 − 9 / (16 [(1 + min(d/s², √d/s)) (36d/m + 25/16) + 25/16]),   (6.6)

holds with probability at least 1 − δ_1 − 2(n + 2)e^{−γd} − 2(n + 25)e^{−γn}. Furthermore, if t ≥ (log(2/ϵ) + log(1/δ_2)) · 1/(1 − α) for 0 < δ_2 ≤ 1, then

||w_t − w*||_2² ≤ ϵ ||w_0 − w*||_2²   (6.7)

holds with probability at least 1 − δ_1 − δ_2 − 2(n + 2)e^{−γd} − 2(n + 25)e^{−γn}.

Remark 6.6.
As mentioned in Remark 6.1 of Theorem 6.1, in order to address the initialization issue we can use one full gradient descent pass from zero to find an initialization obeying the conditions of this theorem.

Remark 6.7. Similar to the results of Theorem 6.1, Theorem 6.2 shows geometric convergence of the QSGD iterates (6.3) with a near-minimal number of samples (n ≳ d). Furthermore, it characterizes the effect of both the mini-batch size and the number of quantization levels on the convergence rate. Specifically, by increasing the number of quantization levels, the iterates (6.3) converge faster. Perhaps unexpectedly, by choosing the number of bits to be on the order of log √d, the iterates (6.2) and (6.3) converge at the same rate up to a constant factor. This allows QSGD to significantly reduce the communication load while maintaining a computational effort comparable to SGD.

6.4 Numerical Results and Experiments on Amazon EC2

In this section we wish to investigate the results of Theorems 6.1 and 6.2 using numerical simulations and experiments on Amazon EC2. We first wish to investigate how the rate of convergence of the mini-batch SGD and QSGD iterates depends on the different parameters. To this aim, we generate the planted weight vector w* ∈ R^d with d = 1000, with entries distributed i.i.d. ∼ N(200, 3). In addition, we generate n = 10000 feature vectors x_i ∈ R^d i.i.d. ∼ N(0, 1) and set the corresponding output labels y_i = max(0, ⟨x_i, w*⟩). To estimate w*, we start from a random initial point and run SGD and QSGD with learning rates η = m/d and m·b/(9d), respectively, where b is the number of bits used for quantization.

In Figure 6.1(a) we focus on corroborating our convergence analysis for SGD. To this aim we vary the mini-batch size m and plot the relative error (||w − w*||_2 / ||w*||_2) as a function of the iterations. This figure demonstrates that the convergence rate is indeed linear and that increasing the batch size m results in faster convergence.
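The batch-size effect seen in Figure 6.1(a) matches the rate formula of Theorem 6.1. A one-liner sketch evaluating ρ(m) = 1 − 9 / (16 (36 d/m + 25/16)) at the simulation's dimension d = 1000 for the plotted batch sizes:

```python
# Theorem 6.1's convergence-rate factor: smaller rho means faster decay.
d = 1000
rho = lambda m: 1 - 9 / (16 * (36 * d / m + 25 / 16))
rates = {m: rho(m) for m in (200, 400, 600, 800)}
print(rates)   # rho decreases monotonically as m grows
```

Larger m shrinks the 36d/m term in the denominator, so ρ decreases and the expected error contracts faster per iteration, consistent with the plot.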
In Figure 6.1(b) we focus on understanding the effect of the number of bits on the convergence behavior of QSGD. To this aim we fix the mini-batch size at m = 800 and vary the number of bits b. This plot confirms that QSGD maintains a linear convergence rate comparable to SGD (especially when b = 7).

To understand the required sample complexity of the algorithms, we plot phase transition curves depicting the empirical probability that gradient descent converges to w* for different data set sizes (n) and feature dimensions (d). For each value of n and d we perform 10 trials, with the data generated according to the model discussed above. In each trial we run the algorithm for 2000 iterations; if the relative error after 2000 iterations is less than 10^{−3} we consider the trial a success, and otherwise a failure. Figures 6.4(a) and 6.4(b) depict the phase transition for mini-batch SGD and for QSGD with b = 7 bits, respectively. As can be seen, the figures corroborate the linear relationship of the required sample complexity with the feature dimension. Furthermore, these figures demonstrate that quantization does not substantially change the required sample complexity.

In order to understand the effectiveness of QSGD at reducing the communication overhead, we provide experiments on Amazon EC2 clusters and compare the performance of QSGD with SGD. We utilize a master–worker architecture and make use of t2.micro instances. We use Python to implement the two algorithms and use MPI4py [25] for message passing across the instances. Before starting the iterative updates, each worker receives its portion of the training data. In each iteration t, after the workers receive the latest model w_t from the master, they compute the stochastic gradients at w_t based on the local data and then send them back to the master. In SGD and QSGD we use float64 and int8 for sending gradients, respectively.
Additionally, we use isend() and irecv() for sending and receiving, respectively, and time.time() to measure the running time. We compare the performance of the two algorithms in the following two scenarios.

• Scenario one: We use 41 t2.micro instances, with one master and 40 workers. We have n = 20000 data points with feature dimension d = 4000. We partition and distribute the data into 40 equal batches of size 20000/40 = 500, with each worker performing updates based on its own batch.

• Scenario two: We use 51 t2.micro instances, with one master and 50 workers. In this scenario we use n = 25000 and d = 4000, and again distribute the data evenly among the workers.

Figure 6.1: (a) This plot depicts the convergence behavior of mini-batch SGD iterates for various mini-batch sizes m. (b) This plot depicts the convergence behavior of QSGD iterates for various bits of quantization b, with the mini-batch size fixed at m = 800.

Table 6.1 summarizes the experiment scenarios. In both cases we run the SGD and QSGD algorithms for 300 iterations. Figure 6.5 depicts the total running times, and Table 6.2 shows the breakdowns of the run-times. These experiments indicate that the total running time of QSGD is 5 times less than that of SGD. Since both algorithms have a similar convergence rate, and hence a similar computational time, this clearly demonstrates that the communication time of QSGD is significantly smaller.

Figure 6.4: Empirical probability that (a) SGD and (b) QSGD with b = 7 find the global optimum for different numbers of data points (n) and feature dimensions (d).

Figure 6.5: Run-time comparison of SGD and QSGD on Amazon EC2 clusters for the two scenarios.
Table 6.1: Experiment scenarios.

    scenario index    # of workers (K)    # of data points (n)    feature dimension (d)
    1                 40                  20000                   4000
    2                 50                  25000                   4000

Table 6.2: Breakdowns of the run-times in both scenarios.

    scheme    scenario index    comm. time    comp. time    total time
    SGD       1                 28.5100 s     4.921 s       33.431 s
    QSGD      1                 3.2470 s      5.056 s       8.303 s
    SGD       2                 38.2910 s     4.94 s        43.231 s
    QSGD      2                 6.5010 s      5.169 s       11.67 s

6.5 Related Work

The problem of fitting non-linear models to data has a rich history in statistics and learning theory, using a variety of algorithms [49, 38, 44, 46]. In the context of training deep models, such nonlinear learning problems have led to major breakthroughs in various applications [55, 23]. Despite these theoretical and empirical advances, a rigorous understanding of local search heuristics such as stochastic gradient descent, which are arguably the most widely used techniques in practice, has remained elusive.

Recently, there has been a surge of activity surrounding nonlinear data fitting via local search. Focusing on ReLU nonlinearities, [99] shows that in generic instances full gradient descent converges to a globally optimal model at a geometric rate. See also [86, 106, 47, 119, 72, 39, 40, 35] for a variety of related theoretical and empirical work on fitting ReLUs and shallow neural networks via local search heuristics. In contrast to the above, which require full gradient updates, in this chapter we focus on fitting ReLU nonlinearities via mini-batch SGD and QSGD. We would also like to mention a recent result [48] which studies a stochastic method for fitting ReLUs via first-order Perturbed Stochastic Gradient Descent. This approach differs from ours in a variety of ways, including the update strategy and the required sample complexity. In particular, this approach is based on stochastic methods with random search strategies which use only function evaluations.
Furthermore, this result requires n ≳ d⁴ samples and provides a polynomial convergence guarantee, whereas in this work we have established a geometric rate of convergence to the global optimum with a near-minimal number of samples that scales linearly in the problem dimension (i.e., n ≳ d). Another result we would like to mention is [71], which studies an SGD method for training two-layer neural networks with ReLU activations. [71] assumes that there is an identity mapping in the network and shows that SGD converges to the global minimum in two phases: in phase one, the defined potential function decreases but the gradient does not point in the correct direction; in phase two, SGD enters a one-point convex region and converges to the global minimum. However, the rate of convergence is polynomial rather than geometric.

Recently there has been a lot of exciting activity surrounding SGD analysis. Classic SGD convergence analysis shows that the distance to the global optimum (in loss or parameter value) decreases polynomially in the number of iterations (e.g., 1/t after t iterations). More recent results demonstrate that in certain cases a significantly faster, geometric rate of convergence (i.e., ρ^t with ρ < 1) is possible [84, 80]. For instance, [100] showed that the randomized Kaczmarz algorithm converges to the solution of a consistent linear system of equations at a geometric rate. More recently, [74] showed that under some assumptions, such as smoothness, strong convexity, and perfect interpolation, SGD achieves a geometric rate of convergence. In this work we add to this growing literature and demonstrate that such a fast geometric rate of convergence to a globally optimal solution is also possible despite the nonlinear and nonconvex nature of fitting ReLUs.

Reducing the communication overhead by compressing the gradients has become increasingly popular in the recent literature [113, 29, 31].
Most notably, [94] empirically demonstrated that one-bit quantization of gradients is highly effective for scalable training of deep models. The QSGD paper [3] develops convergence guarantees for convex losses as well as convergence of the gradient to zero for nonconvex losses. Relatedly, [12] also shows that under the assumptions of smoothness and bounded variance, a quantized SGD procedure converges to a stationary point of a general non-convex function at a polynomial convergence rate. In contrast, in this work we focus on geometric convergence to the global optima, but for the specific problem of fitting a ReLU nonlinearity.

Bibliography

[1] https://github.com/z-fabian/transfer_lowerbounds_arXiv. 2020.
[2] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. “TensorFlow: A System for Large-Scale Machine Learning.” In: OSDI. Vol. 16. 2016, pp. 265–283.
[3] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. “QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding”. In: Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017, pp. 1709–1720.
[4] Ganesh Ananthanarayanan, Srikanth Kandula, Albert G Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. “Reining in the Outliers in Map-Reduce Clusters using Mantri”. In: OSDI. Vol. 10. 1. 2010, p. 24.
[5] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. “Regularized Learning for Domain Adaptation under Label Shifts”. In: International Conference on Learning Representations. 2018.
[6] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. “Regularized learning for domain adaptation under label shifts”. In: arXiv preprint arXiv:1903.09734 (2019).
[7] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. “Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks”.
In: The Journal of Machine Learning Research 20.1 (2019), pp. 2285–2301. [8] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. “A theory of learning from different domains”. In: Machine learning 79.1-2 (2010), pp. 151–175. [9] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. “Analysis of representations for domain adaptation”. In: Advances in neural information processing systems. 2007, pp. 137–144. 108 [10] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. “Impossibility theorems for domain adaptation”. In: International Conference on Artificial Intelligence and Statistics . 2010, pp. 129–136. [11] Michael Ben-Or, Shafi Goldwasser, and Avi Wigderson. “Completeness theorems for non-cryptographic fault-tolerant distributed computation”. In: Proceedings of the twentieth annual ACM symposium on Theory of computing. ACM. 1988, pp. 1–10. [12] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. “signSGD: compressed optimisation for non-convex problems”. In: CoRR abs/1802.04434 (). arXiv: 1802.04434. [13] Rawad Bitar, Parimal Parag, and Salim El Rouayheb. “Minimizing Latency for Secure Coded Computing Using Secret Sharing via Staircase Codes”. In: arXiv preprint arXiv:1802.02640 (2018). [14] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. “Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent”. In: Advances in Neural Information Processing Systems. 2017, pp. 118–128. [15] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. “Learning Bounds for Domain Adaptation”. In: NIPS. 2007. [16] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. “Learning bounds for domain adaptation”. In: Advances in neural information processing systems. 2008, pp. 129–136. [17] Dan Bogdanov, Sven Laur, and Jan Willemson. “Sharemind: A Framework for Fast Privacy-Preserving Computations”. 
In: Proceedings of the 13th European Symposium on Research in Computer Security: Computer Security. ESORICS ’08. Málaga, Spain: Springer-Verlag, 2008, pp. 192–206. isbn: 978-3-540-88312-8. doi: 10.1007/978-3-540-88313-5_13. [18] Joao Carreira and Andrew Zisserman. “Quo vadis, action recognition? a new model and the kinetics dataset”. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308. [19] Lingjiao Chen, Zachary Charles, Dimitris Papailiopoulos, et al. “DRACO: Robust Distributed Training via Redundant Gradients”. In: arXiv preprint arXiv:1803.09877 (2018). [20] Xinyang Chen, Sinan Wang, Mingsheng Long, and Jianmin Wang. “Transferability vs. Discriminability: Batch Spectral Penalization for Adversarial Domain Adaptation”. In: Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019. 109 [21] Xinyang Chen, Sinan Wang, Mingsheng Long, and Jianmin Wang. “Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation”. In: International Conference on Machine Learning. 2019, pp. 1081–1090. [22] Frank H. Clark. “Optimization and Nonsmooth Analysis”. In: SIAM (). [23] Ronan Collobert and Jason Weston. “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning”. In: Proceedings of the 25th International Conference on Machine Learning. ICML ’08. Helsinki, Finland: ACM, 2008, pp. 160–167. isbn: 978-1-60558-205-4. doi: 10.1145/1390156.1390177. [24] Ronald Cramer, Ivan Bjerre Damgrd, and Jesper Buus Nielsen. Secure Multiparty Computation and Secret Sharing. 1st. New York, NY, USA: Cambridge University Press, 2015. isbn: 1107043050, 9781107043053. [25] Lisandro D Dalcin, Rodrigo R Paz, Pablo A Kler, and Alejandro Cosimo. “Parallel distributed computing using python”. In: Advances in Water Resources (2011). 
[26] Amit Daniely, Roy Frostig, and Yoram Singer. “Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity”. In: Advances In Neural Information Processing Systems. 2016, pp. 2253–2261. [27] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. “Multiclass learnability and the erm principle”. In: Proceedings of the 24th Annual Conference on Learning Theory. JMLR Workshop and Conference Proceedings. 2011, pp. 207–232. [28] Shai Ben David, Tyler Lu, Teresa Luu, and Dávid Pál. “Impossibility theorems for domain adaptation”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings. 2010, pp. 129–136. [29] Christopher M De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré, and Christopher Ré. “Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms”. In: Advances in Neural Information Processing Systems 28. Ed. by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett. Curran Associates, Inc., 2015, pp. 2674–2682. [30] Jeffrey Dean and Luiz André Barroso. “The tail at scale”. In: Communications of the ACM 56.2 (2013), pp. 74–80. [31] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. “Large Scale Distributed Deep Networks”. In: Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 1223–1231. [32] Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. “Short-dot: Computing large linear transforms distributedly using coded short dot products”. In: Advances In Neural Information Processing Systems. 2016, pp. 2100–2108. 110 [33] Sanghamitra Dutta, Mohammad Fahim, Farzin Haddadpour, Haewon Jeong, Viveck R. Cadambe, and Pulkit Grover. “On the Optimal Recovery Threshold of Coded Matrix Multiplication”. In: arXiv preprint arXiv:1801.10292 (2018). 
[34] Yahya H Ezzeldin, Mohammed Karmoose, and Christina Fragouli. “Communication vs Distributed Computation: an alternative trade-off curve”. In: e-print arXiv:1705.08966 (2017). [35] Haoyu Fu, Yuejie Chi, and Yingbin Liang. “Local Geometry of One-Hidden-Layer Neural Networks for Logistic Regression”. In: arXiv preprint arXiv:1802.06463 (2018). [36] Tomer Galanti, Lior Wolf, and Tamir Hazan. “A theoretical framework for deep transfer learning”. In: Information and Inference: A Journal of the IMA 5.2 (2016), pp. 159–209. [37] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. “Large-scale matrix factorization with distributed stochastic gradient descent”. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2011, pp. 69–77. [38] Surbhi Goel, Varun Kanade, Adam R. Klivans, and Justin Thaler. “Reliably Learning the ReLU in Polynomial Time”. In: CoRR abs/1611.10258 (2016). arXiv: 1611.10258. [39] Surbhi Goel and Adam Klivans. “Learning depth-three neural networks in polynomial time”. In: arXiv preprint arXiv:1709.06010 (2017). [40] Surbhi Goel, Adam Klivans, and Raghu Meka. “Learning One Convolutional Layer with Overlapping Patches”. In: arXiv preprint arXiv:1802.02547 (2018). [41] Wael Halbawi, Navid Azizan-Ruhi, Fariborz Salehi, and Babak Hassibi. “Improving Distributed Gradient Descent Using Reed-Solomon Codes”. In: e-print arXiv:1706.05436 (2017). [42] Wael Halbawi, Navid Azizan Ruhi, Fariborz Salehi, and Babak Hassibi. “Improving Distributed Gradient Descent Using Reed-Solomon Codes”. In: CoRR abs/1706.05436 (2017). arXiv: 1706.05436. url: http://arxiv.org/abs/1706.05436. [43] Steve Hanneke and Samory Kpotufe. “On the Value of Target Data in Transfer Learning”. In: Advances in Neural Information Processing Systems. 2019, pp. 9867–9877. [44] Joel L. Horowitz and Wolfgang Härdle. “Direct Semiparametric Estimation of Single-Index Models with Discrete Covariates”. English (US). 
In: Journal of the American Statistical Association 91.436 (Dec. 1996), pp. 1632–1640. issn: 0162-1459. [45] Wentao Huang. “Coding for Security and Reliability in Distributed Systems”. PhD thesis. California Institute of Technology, 2017. [46] Hidehiko Ichimura. Semiparametric Least Squares (sls) and Weighted SLS Estimation of Single- Index Models. Working Papers. Minnesota - Center for Economic Research, 1991. 111 [47] Gauri Jagatap and Chinmay Hegde. “Learning ReLU Networks via Alternating Minimization”. In: arXiv preprint arXiv:1806.07863 (2018). [48] Chi Jin, Lydia T. Liu, Rong Ge, and Michael I. Jordan. “Minimizing Nonconvex Population Risk from Rough Empirical Risk”. In: CoRR abs/1803.09357 (2018). arXiv: 1803.09357. url: http://arxiv.org/abs/1803.09357. [49] Adam Tauman Kalai and Ravi Sastry. “The Isotron Algorithm: High-Dimensional Isotonic Regression.” In: COLT. 2009. [50] Can Karakus, Yifan Sun, Suhas Diggavi, and Wotao Yin. “Straggler mitigation in distributed optimization through data encoding”. In: NIPS (2017), pp. 5440–5448. [51] Alireza Karbalayghareh, Xiaoning Qian, and Edward R Dougherty. “Optimal Bayesian transfer regression”. In: IEEE Signal Processing Letters 25.11 (2018), pp. 1655–1659. [52] Alireza Karbalayghareh, Xiaoning Qian, and Edward Russell Dougherty. “Optimal bayesian transfer learning for count data”. In: IEEE/ACM transactions on computational biology and bioinformatics (2019). [53] Mehrdad Kiamari, Chenwei Wang, and A Salman Avestimehr. “On Heterogeneous Coded Distributed Computing”. In: IEEE GLOBECOM (Dec. 2017). [54] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems 25 (2012), pp. 1097–1105. [55] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. 
In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. NIPS’12. Lake Tahoe, Nevada: Curran Associates Inc., 2012, pp. 1097–1105. [56] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran. “Speeding Up Distributed Machine Learning Using Codes”. In: NIPS Workshop on Machine Learning Systems (Dec. 2015). [57] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran. “Speeding Up Distributed Machine Learning Using Codes”. In: IEEE Trans. Inf. Theory 64.3 (Mar. 2018), pp. 1514–1529. issn: 0018-9448. doi: 10.1109/TIT.2017.2736066. [58] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran. “Speeding Up Distributed Machine Learning Using Codes”. In: IEEE Transactions on Information Theory 64.3 (Mar. 2018), pp. 1514–1529. issn: 0018-9448. doi: 10.1109/TIT.2017.2736066. [59] Qi Lei, Wei Hu, and Jason Lee. “Near-optimal linear regression under distribution shift”. In: International Conference on Machine Learning. PMLR. 2021, pp. 6164–6174. 112 [60] Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. “Communication efficient distributed machine learning with the parameter server”. In: Advances in Neural Information Processing Systems. 2014, pp. 19–27. [61] Mu Li, David G. Andersen, Alexander Smola, and Kai Yu. “Communication Efficient Distributed Machine Learning with the Parameter Server”. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. NIPS’14. Montreal, Canada: MIT Press, 2014, pp. 19–27. url: http://dl.acm.org/citation.cfm?id=2968826.2968829. [62] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr. “Coding for Distributed Fog Computing”. In: IEEE Commun. Mag. 55.4 (Apr. 2017), pp. 34–40. issn: 0163-6804. doi: 10.1109/MCOM.2017.1600894. [63] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr. “A Fundamental Tradeoff Between Computation and Communication in Distributed Computing”. In: IEEE Trans. Inf. Theory 64.1 (Jan. 2018). 
issn: 0018-9448. doi: 10.1109/TIT.2017.2756959. [64] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr. “A Fundamental Tradeoff Between Computation and Communication in Distributed Computing”. In: IEEE Transactions on Information Theory 64.1 (2018), pp. 109–128. [65] S. Li, Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr. “A Scalable Framework for Wireless Distributed Computing”. In: IEEE/ACM Trans. Netw. 25.5 (Oct. 2017), pp. 2643–2654. issn: 1063-6692. doi: 10.1109/TNET.2017.2702605. [66] Songze Li, Seyed Mohammadreza Mousavi Kalan, A Salman Avestimehr, and Mahdi Soltanolkotabi. “Near-Optimal Straggler Mitigation for Distributed Gradient Methods”. In: arXiv preprint arXiv:1710.09990 (2017). [67] Songze Li, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. “A Unified Coding Framework for Distributed Computing with Straggling Servers”. In: IEEE NetCod (Sept. 2016). [68] Songze Li, Mohammad Ali Maddah-Ali, and Amir Salman Avestimehr. “Coded MapReduce”. In: 53rd Allerton Conference (Sept. 2015). [69] Songze Li, Sucha Supittayapornpong, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. “Coded TeraSort”. In: IPDPSW (May 2017). [70] Tianyang Li, Xinyang Yi, Constantine Carmanis, and Pradeep Ravikumar. “Minimax gaussian classification & clustering”. In: Artificial Intelligence and Statistics . PMLR. 2017, pp. 1–9. [71] Yuanzhi Li and Yang Yuan. “Convergence Analysis of Two-layer Neural Networks with ReLU Activation”. In: NIPS. 2017. 113 [72] Shiyu Liang, Ruoyu Sun, Yixuan Li, and R Srikant. “Understanding the Loss Surface of Single-Layered Neural Networks for Binary Classification”. In: CoRR abs/1803.00909 (2018). [73] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. “Unsupervised Domain Adaptation with Residual Transfer Networks”. In: Advances in Neural Information Processing Systems (2016). [74] Siyuan Ma, Raef Bassily, and Mikhail Belkin. “The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning”. 
In: CoRR abs/1712.06559 (2017). arXiv: 1712.06559. [75] Raj Kumar Maity, Ankit Singh Rawat, and Arya Mazumdar. “Robust Gradient Descent via Moment Encoding with LDPC Codes”. In: SysML Conference (2018). [76] Yishay Mansour, Mehryar Mohri, Jae Ro, Ananda Theertha Suresh, and Ke Wu. “A theory of multiple-source adaptation with limited target labeled data”. In: International Conference on Artificial Intelligence and Statistics . PMLR. 2021, pp. 2332–2340. [77] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. “Domain adaptation: Learning bounds and algorithms”. In: arXiv preprint arXiv:0902.3430 (2009). [78] Pascal Massart. “Concentration inequalities and model selection”. In: (2007). [79] P. Mohassel and Y. Zhang. “SecureML: A System for Scalable Privacy-Preserving Machine Learning”. In: 2017 IEEE Symposium on Security and Privacy (SP). Vol. 00. May 2017, pp. 19–38. doi: 10.1109/SP.2017.12. [80] Eric Moulines and Francis R. Bach. “Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning”. In: Advances in Neural Information Processing Systems 24. Ed. by J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger. Curran Associates, Inc., 2011, pp. 451–459. [81] Mohammadreza Mousavi Kalan, Zalan Fabian, Salman Avestimehr, and Mahdi Soltanolkotabi. “Minimax Lower Bounds for Transfer Learning with Linear and One-hidden Layer Neural Networks”. In: Advances in Neural Information Processing Systems 33 (2020). [82] Seyed Mohammadreza Mousavi Kalan, Zalan Fabian, Salman Avestimehr, and Mahdi Soltanolkotabi. “Minimax Lower Bounds for Transfer Learning with Linear and One-hidden Layer Neural Networks”. In: Advances in Neural Information Processing Systems. 2020. [83] Balas K Natarajan. “On learning sets and functions”. In: Machine Learning 4.1 (1989), pp. 67–97. 114 [84] Deanna Needell, Rachel Ward, and Nati Srebro. “Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm”. 
In: Advances in Neural Information Processing Systems 27. Ed. by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger. Curran Associates, Inc., 2014, pp. 1017–1025. [85] H. A. Nodehi and M. A. Maddah-Ali. “Limited-Sharing Multi-Party Computation for Massive Matrix Operations”. In: 2018 IEEE International Symposium on Information Theory (ISIT). June 2018, pp. 1231–1235. doi: 10.1109/ISIT.2018.8437651. [86] Samet Oymak. “Stochastic Gradient Descent Learns State Equations with Nonlinear Activations”. In: arXiv preprint arXiv:1809.03019 (2018). [87] Sinno Jialin Pan and Qiang Yang. “A survey on transfer learning”. In: IEEE Transactions on knowledge and data engineering 22.10 (2009), pp. 1345–1359. [88] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. “Moment matching for multi-source domain adaptation”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 1406–1415. [89] Netanel Raviv, Itzhak Tamo, Rashish Tandon, and Alexandros G Dimakis. “Gradient Coding from Cyclic MDS Codes and Expander Graphs”. In: e-print arXiv:1707.03858 (2017). [90] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”. In: Advances in neural information processing systems. 2011, pp. 693–701. [91] Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, and Salman Avestimehr. “Coded computation over heterogeneous clusters”. In: IEEE ISIT. 2017, pp. 2408–2412. [92] Paul Renteln. Manifolds, Tensors, and Forms: An Introduction for Mathematicians and Physicists. Cambridge University Press, 2013. [93] Sheldon Ross. A First Course in Probability. 9th ed. Pearson, 2012. [94] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. “1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs”. In: Interspeech 2014. Sept. 2014. [95] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 
“1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns”. In: Fifteenth Annual Conference of the International Speech Communication Association. 2014. [96] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [97] Adi Shamir. “How to Share a Secret”. In: Commun. ACM 22.11 (Nov. 1979), pp. 612–613. issn: 0001-0782. doi: 10.1145/359168.359176. 115 [98] Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. “Wasserstein distance guided representation learning for domain adaptation”. In: Thirty-Second AAAI Conference on Artificial Intelligence . 2018. [99] Mahdi Soltanolkotabi. “Learning ReLUs via Gradient Descent”. In: Advances in Neural Information Processing Systems 30. 2017. [100] Thomas Strohmer and Roman Vershynin. “A Randomized Kaczmarz Algorithm with Exponential Convergence”. In: Journal of Fourier Analysis and Applications 15.2 (Apr. 2008), p. 262. issn: 1531-5851. doi: 10.1007/s00041-008-9030-4. [101] Yan Shuo Tan and Roman Vershynin. “Phase retrieval via randomized Kaczmarz: theoretical guarantees”. In: Information and Inference: A Journal of the IMA (2018), iay005. doi: 10.1093/imaiai/iay005. eprint: /oup/backfile/content_public/journal/imaiai/pap/10.1093_imaiai_iay005/2/iay005.pdf. [102] Rashish Tandon, Qi Lei, Alexandros Dimakis, and Nikos Karampatziakis. “Gradient Coding”. In: NIPS Machine Learning Systems Workshop (2016). [103] Rashish Tandon, Qi Lei, Alexandros G. Dimakis, and Nikos Karampatziakis. “Gradient Coding: Avoiding Stragglers in Distributed Learning”. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. 2017, pp. 3368–3376. url: http://proceedings.mlr.press/v70/tandon17a.html. [104] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer series in statistics. Dordrecht: Springer, 2009. doi: 10.1007/b13794. 
[105] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Vol. 48. Cambridge University Press, 2019. [106] Gang Wang, Georgios B Giannakis, and Jie Chen. “Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization”. In: arXiv preprint arXiv:1808.04685 (2018). [107] Sinong Wang, Jiashang Liu, Ness Shroff, and Pengyu Yang. “Fundamental Limits of Coded Linear Transform”. In: arXiv preprint arXiv:1804.09791 (2018). [108] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. “A survey of transfer learning”. In: Journal of Big data 3.1 (2016), pp. 1–40. [109] Yifan Wu, Ezra Winston, Divyansh Kaushik, and Zachary Lipton. “Domain adaptation with asymmetrically-relaxed distribution alignment”. In: arXiv preprint arXiv:1903.01689 (2019). [110] Neeraja J. Yadwadkar, Bharath Hariharan, Joseph E. Gonzalez, and Randy Katz. “Multi-Task Learning for Straggler Avoiding Predictive Job Scheduling”. In: Journal of Machine Learning Research 17.106 (2016), pp. 1–37. url: http://jmlr.org/papers/v17/15-149.html. 116 [111] Min Ye and Emmanuel Abbe. “Communication-Computation Efficient Gradient Coding”. In: arXiv preprint arXiv:1802.03475 (2018). [112] Kaichao You, Ximei Wang, Mingsheng Long, and Michael Jordan. “Towards accurate model selection in deep unsupervised domain adaptation”. In: International Conference on Machine Learning. 2019, pp. 7124–7133. [113] Mingchao Yu, Zhifeng Lin, Krishna Narra, Songze Li, Youjie Li, Nam Sung Kim, Alexander Schwing, Murali Annavaram, and Salman Avestimehr. “GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training”. In: Advances in Neural Information Processing Systems. 2018, pp. 5129–5139. [114] Q. Yu, S. Li, M. A. Maddah-Ali, and A. S. Avestimehr. “How to optimally allocate resources for coded distributed computing?” In: 2017 IEEE International Conference on Communications (ICC). May 2017, pp. 1–7. doi: 10.1109/ICC.2017.7996730. 
[115] Qian Yu, Songze Li, Netanel Raviv, Seyed Mohammad Mousavi Kalan, Mahdi Soltanolkotabi, and A Salman Avestimehr. “Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy”. In: arXiv preprint arXiv:1806.00939 (2018). [116] Qian Yu, Mohammad Maddah-Ali, and Salman Avestimehr. “Polynomial Codes: an Optimal Design for High-Dimensional Coded Matrix Multiplication”. In: Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017, pp. 4406–4416. [117] Qian Yu, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. “Polynomial Codes: an Optimal Design for High-Dimensional Coded Matrix Multiplication”. In: NIPS (2017), pp. 4406–4416. [118] Qian Yu, Mohammad Ali Maddah-Ali, and Amir Salman Avestimehr. “Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding”. In: arXiv preprint arXiv:1801.07487 (2018). [119] Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. “Learning One-hidden-layer ReLU Networks via Gradient Descent”. In: arXiv preprint arXiv:1806.07808 (2018). [120] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael I Jordan. “Bridging theory and algorithm for domain adaptation”. In: arXiv preprint arXiv:1904.05801 (2019). [121] Han Zhao, Remi Tachet des Combes, Kun Zhang, and Geoffrey J Gordon. “On learning invariant representation for domain adaptation”. In: arXiv preprint arXiv:1901.09453 (2019). [122] Yong Zhuang, Wei-Sheng Chin, Yu-Chin Juan, and Chih-Jen Lin. “A fast parallel SGD for matrix factorization in shared memory systems”. In: Proceedings of the 7th ACM conference on Recommender systems. ACM. 2013, pp. 249–256. 
Chapter 7

Appendices

7.1 Appendix for Chapter 1

7.1.1 Calculating the Generalization Errors (Proof of Proposition 1.1)

• Linear model: By expanding the expression we get
\[
\begin{aligned}
\mathbb{E}_{Q_{\theta_T}}\big[\|\widehat{y}_T-y_T\|_{\ell_2}^2\big]
&=\mathbb{E}\big[\|\widehat{W}_T x_T-W_T x_T-w_T\|_{\ell_2}^2\big]\\
&=\mathbb{E}\big[\|\widehat{W}_T x_T-W_T x_T\|_{\ell_2}^2\big]+k\sigma^2\\
&=\mathbb{E}\big[x_T^T(W_T-\widehat{W}_T)^T(W_T-\widehat{W}_T)x_T\big]+k\sigma^2\\
&=\mathbb{E}\big[\operatorname{trace}\big(x_T^T(W_T-\widehat{W}_T)^T(W_T-\widehat{W}_T)x_T\big)\big]+k\sigma^2\\
&=\mathbb{E}\big[\operatorname{trace}\big((W_T-\widehat{W}_T)^T(W_T-\widehat{W}_T)x_T x_T^T\big)\big]+k\sigma^2\\
&=\operatorname{trace}\big((W_T-\widehat{W}_T)^T(W_T-\widehat{W}_T)\,\mathbb{E}[x_T x_T^T]\big)+k\sigma^2\\
&=\operatorname{trace}\big((W_T-\widehat{W}_T)^T(W_T-\widehat{W}_T)\Sigma_T\big)+k\sigma^2\\
&=\|\Sigma_T^{1/2}(W_T-\widehat{W}_T)^T\|_F^2+k\sigma^2. \qquad (7.1)
\end{aligned}
\]

• One-hidden layer neural network model with fixed hidden-to-output layer: By expanding the expression we obtain
\[
\mathbb{E}_{Q_{\theta_T}}\big[\|\widehat{y}_T-y_T\|_{\ell_2}^2\big]
=\mathbb{E}\big[\|V\varphi(\widehat{W}_T x_T)-V\varphi(W_T x_T)\|_{\ell_2}^2\big]+k\sigma^2
\ \ge\ \sigma_{\min}^2(V)\,\mathbb{E}\big[\|\varphi(\widehat{W}_T x_T)-\varphi(W_T x_T)\|_{\ell_2}^2\big]+k\sigma^2. \qquad (7.2)
\]
Let $A=\widehat{W}_T\Sigma_T^{1/2}$, $B=W_T\Sigma_T^{1/2}$, and $x=\Sigma_T^{-1/2}x_T$, so that $x\sim\mathcal{N}(0,I_d)$. Moreover, write the rows of $A$ and $B$ as $A=[\alpha_1,\dots,\alpha_\ell]^T$ and $B=[\beta_1,\dots,\beta_\ell]^T$. Since
\[
\mathbb{E}\big[\|\varphi(\widehat{W}_T x_T)-\varphi(W_T x_T)\|_{\ell_2}^2\big]=\sum_{i=1}^{\ell}\mathbb{E}\big[|\varphi(\alpha_i^T x)-\varphi(\beta_i^T x)|^2\big],
\]
it suffices to find a lower bound for the expression
\[
\mathbb{E}\big[|\varphi(a^T x)-\varphi(b^T x)|^2\big],
\]
where $a$ and $b$ are two arbitrary vectors in $\mathbb{R}^d$, $\varphi$ is the ReLU activation function, and $x\sim\mathcal{N}(0,I_d)$. We have
\[
\mathbb{E}\big[|\varphi(a^T x)-\varphi(b^T x)|^2\big]
=\mathbb{E}\big[|\varphi(a^T x)|^2\big]+\mathbb{E}\big[|\varphi(b^T x)|^2\big]-2\,\mathbb{E}\big[\varphi(a^T x)\varphi(b^T x)\big]. \qquad (7.3)
\]
Now we calculate each term appearing on the right-hand side. Since $a^T x\sim\mathcal{N}(0,\|a\|_{\ell_2}^2)$, we have
\[
\mathbb{E}\big[|\varphi(a^T x)|^2\big]=\mathbb{E}\big[|\mathrm{ReLU}(a^T x)|^2\big]
=\int_0^{+\infty}\frac{t^2}{\sqrt{2\pi}\,\|a\|_{\ell_2}}\,e^{-\frac{t^2}{2\|a\|_{\ell_2}^2}}\,dt
=\frac{\|a\|_{\ell_2}^2}{2}.
\]
Similarly, $\mathbb{E}[|\varphi(b^T x)|^2]=\frac{\|b\|_{\ell_2}^2}{2}$. To calculate the cross term, note that $a^T x$ and $b^T x$ are jointly Gaussian with zero mean and covariance matrix
\[
\begin{pmatrix}\|a\|_{\ell_2}^2 & a^T b\\ a^T b & \|b\|_{\ell_2}^2\end{pmatrix}.
\]
Therefore (e.g. see [26]),
\[
2\,\mathbb{E}\big[\varphi(a^T x)\varphi(b^T x)\big]
=2\,\mathbb{E}\big[\mathrm{ReLU}(a^T x)\,\mathrm{ReLU}(b^T x)\big]
=\|a\|_{\ell_2}\|b\|_{\ell_2}\,\frac{\sqrt{1-\gamma^2}+(\pi-\cos^{-1}(\gamma))\gamma}{\pi}, \qquad (7.4)
\]
where $\gamma:=\frac{a^T b}{\|a\|_{\ell_2}\|b\|_{\ell_2}}$.
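As a numerical sanity check on (7.4), not part of the proof itself, the closed form for the cross term can be compared against a Monte Carlo estimate; the vectors `a`, `b` and the seed below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

a = np.array([1.0, -2.0, 0.5])
b = np.array([0.3, 1.0, -1.0])
na, nb = np.linalg.norm(a), np.linalg.norm(b)
gamma = a @ b / (na * nb)

# closed form (7.4): 2 E[relu(a'x) relu(b'x)] for x ~ N(0, I_d)
cross_closed = na * nb * (np.sqrt(1 - gamma**2)
                          + (np.pi - np.arccos(gamma)) * gamma) / np.pi

# Monte Carlo estimate of the same quantity
x = rng.standard_normal((1_000_000, a.size))
cross_mc = 2 * np.mean(relu(x @ a) * relu(x @ b))
```

The same samples also confirm the second-moment identity $\mathbb{E}[|\mathrm{ReLU}(a^Tx)|^2]=\|a\|_{\ell_2}^2/2$ used just above.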
Plugging these results into (7.3), we can conclude that
\[
\begin{aligned}
\mathbb{E}\big[|\varphi(a^T x)-\varphi(b^T x)|^2\big]
&=\frac{\|a\|_{\ell_2}^2}{2}+\frac{\|b\|_{\ell_2}^2}{2}
-\|a\|_{\ell_2}\|b\|_{\ell_2}\,\frac{\sqrt{1-\gamma^2}+(\pi-\cos^{-1}(\gamma))\gamma}{\pi}\\
&=\frac{1}{2}\|a-b\|_{\ell_2}^2-\|a\|_{\ell_2}\|b\|_{\ell_2}\,\frac{\sqrt{1-\gamma^2}-\gamma\cos^{-1}(\gamma)}{\pi}. \qquad (7.5)
\end{aligned}
\]
We are interested in finding a universal constant $0<c<\frac12$ such that $\mathbb{E}[|\varphi(a^T x)-\varphi(b^T x)|^2]\ge c\,\|a-b\|_{\ell_2}^2$. Using (7.5) and dividing by $\|a\|_{\ell_2}\|b\|_{\ell_2}$, this is equivalent to finding $0<c<\frac12$ such that
\[
\Big(\frac12-c\Big)\frac{\|a\|_{\ell_2}^2+\|b\|_{\ell_2}^2-2a^T b}{\|a\|_{\ell_2}\|b\|_{\ell_2}}
+\frac{\gamma\cos^{-1}(\gamma)-\sqrt{1-\gamma^2}}{\pi}\ \ge\ 0.
\]
Next, note that by the AM-GM inequality we have
\[
\begin{aligned}
\Big(\frac12-c\Big)\frac{\|a\|_{\ell_2}^2+\|b\|_{\ell_2}^2-2a^T b}{\|a\|_{\ell_2}\|b\|_{\ell_2}}
+\frac{\gamma\cos^{-1}(\gamma)-\sqrt{1-\gamma^2}}{\pi}
&\ \ge\ \Big(\frac12-c\Big)\frac{2\|a\|_{\ell_2}\|b\|_{\ell_2}-2a^T b}{\|a\|_{\ell_2}\|b\|_{\ell_2}}
+\frac{\gamma\cos^{-1}(\gamma)-\sqrt{1-\gamma^2}}{\pi}\\
&=\Big(\frac12-c\Big)(2-2\gamma)+\frac{\gamma\cos^{-1}(\gamma)-\sqrt{1-\gamma^2}}{\pi}\\
&=(1-\gamma)\Big[(1-2c)+\frac{1}{\pi}\cdot\frac{\gamma\cos^{-1}(\gamma)-\sqrt{1-\gamma^2}}{1-\gamma}\Big].
\end{aligned}
\]
Therefore, it suffices to find $0<c<\frac12$ such that the right-hand side of the above is positive. It is easy to verify that
\[
h(\gamma):=\frac{\gamma\cos^{-1}(\gamma)-\sqrt{1-\gamma^2}}{1-\gamma}\ \ge\ -\frac{\pi}{2}
\qquad\text{for }-1\le\gamma<1.
\]
This in turn implies that the right-hand side above is positive with $c=\frac14$. In the case when $\|a\|_{\ell_2}=0$ or $\|b\|_{\ell_2}=0$ (let us assume $\|b\|_{\ell_2}=0$), (7.3) reduces to
\[
\mathbb{E}\big[|\varphi(a^T x)-\varphi(b^T x)|^2\big]=\mathbb{E}\big[|\varphi(a^T x)|^2\big]=\frac{\|a\|_{\ell_2}^2}{2}
\ \ge\ \frac12\|a-b\|_{\ell_2}^2\ \ge\ \frac14\|a-b\|_{\ell_2}^2.
\]
Plugging the latter into (7.2) we arrive at
\[
\mathbb{E}_{Q_{\theta_T}}\big[\|\widehat{y}_T-y_T\|_{\ell_2}^2\big]
\ \ge\ \sigma_{\min}^2(V)\,\mathbb{E}\big[\|\varphi(\widehat{W}_T x_T)-\varphi(W_T x_T)\|_{\ell_2}^2\big]+k\sigma^2
\ \ge\ \frac14\,\sigma_{\min}^2(V)\,\|\Sigma_T^{1/2}(\widehat{W}_T-W_T)^T\|_F^2+k\sigma^2,
\]
concluding the proof.

• One-hidden layer neural network model with fixed input-to-hidden layer: By expanding the expression we get
\[
\mathbb{E}_{Q_{\theta_T}}\big[\|\widehat{y}_T-y_T\|_{\ell_2}^2\big]
=\mathbb{E}\big[\|\widehat{V}_T\varphi(Wx_T)-V_T\varphi(Wx_T)\|_{\ell_2}^2\big]+k\sigma^2.
\]
If we denote $\mathbb{E}[\varphi(Wx_T)\varphi(Wx_T)^T]=\widetilde{\Sigma}_T$, then similarly to (7.1) we obtain
\[
\mathbb{E}_{Q_{\theta_T}}\big[\|\widehat{y}_T-y_T\|_{\ell_2}^2\big]
=\|\widetilde{\Sigma}_T^{1/2}(\widehat{V}_T-V_T)^T\|_F^2+k\sigma^2. \qquad (7.6)
\]
Therefore, it suffices to calculate $\widetilde{\Sigma}_T$. Let $W\Sigma_T^{1/2}=[a_1,\dots,a_\ell]^T$ and $x=\Sigma_T^{-1/2}x_T$ (so $x\sim\mathcal{N}(0,I_d)$).
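Before evaluating $\widetilde{\Sigma}_T$ entry-wise via (7.4), the bound $\mathbb{E}[|\varphi(a^Tx)-\varphi(b^Tx)|^2]\ge\frac14\|a-b\|_{\ell_2}^2$ just established can be probed numerically through the closed form (7.5). This is an illustrative check over random pairs, not part of the proof; the dimension and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu_gap_sq(a, b):
    """Closed form (7.5) for E|relu(a'x) - relu(b'x)|^2 with x ~ N(0, I_d)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        # degenerate case handled separately in the proof
        return 0.5 * np.sum((a - b) ** 2)
    g = np.clip(a @ b / (na * nb), -1.0, 1.0)
    return (0.5 * np.sum((a - b) ** 2)
            - na * nb * (np.sqrt(1 - g ** 2) - g * np.arccos(g)) / np.pi)

# ratio E|relu(a'x) - relu(b'x)|^2 / ||a - b||^2 over random pairs;
# the proof says it never drops below 1/4 (attained as b -> -a)
worst = min(relu_gap_sq(a, b) / np.sum((a - b) ** 2)
            for a, b in (rng.standard_normal((2, 6)) for _ in range(10_000)))
```

The constant $\frac14$ is tight in the anti-parallel direction: taking $b=-a$ gives $\gamma=-1$ and ratio exactly $\frac14$, matching $h(-1)=-\frac{\pi}{2}$.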
By (7.4) we obtain
\[
\big[\widetilde{\Sigma}_T\big]_{ij}
=\frac12\,\|a_i\|_{\ell_2}\|a_j\|_{\ell_2}\,\frac{\sqrt{1-\gamma_{ij}^2}+(\pi-\cos^{-1}(\gamma_{ij}))\gamma_{ij}}{\pi}, \qquad (7.7)
\]
where $\gamma_{ij}:=\frac{a_i^T a_j}{\|a_i\|_{\ell_2}\|a_j\|_{\ell_2}}$.

7.1.2 Calculating KL-Divergences for the Linear Model (Proof of Lemma 1.1)

First we compute the KL-divergence between the distributions $P_{W_S^{(i)}}(x_S,y_S)$ and $P_{W_S^{(j)}}(x_S,y_S)$:
\[
D_{KL}\big(P_{W_S^{(i)}}(x_S,y_S),P_{W_S^{(j)}}(x_S,y_S)\big)
=D_{KL}\big(P_{W_S^{(i)}}(x_S),P_{W_S^{(j)}}(x_S)\big)
+\mathbb{E}\big[D_{KL}\big(P_{W_S^{(i)}}(y_S|x_S),P_{W_S^{(j)}}(y_S|x_S)\big)\big].
\]
The marginal distributions $P_{W_S^{(i)}}(x_S)$ and $P_{W_S^{(j)}}(x_S)$ are equal, so their KL-divergence is zero. The conditional distributions $P_{W_S^{(i)}}(y_S|x_S)$ and $P_{W_S^{(j)}}(y_S|x_S)$ are normally distributed with covariance matrix $\sigma^2 I_k$ and with means $W_S^{(i)}x_S$ and $W_S^{(j)}x_S$, respectively. Therefore,
\[
D_{KL}\big(P_{W_S^{(i)}}(y_S|x_S),P_{W_S^{(j)}}(y_S|x_S)\big)
=\frac{\|W_S^{(i)}x_S-W_S^{(j)}x_S\|_{\ell_2}^2}{2\sigma^2}.
\]
This in turn implies that
\[
D_{KL}\big(P_{W_S^{(i)}}(x_S,y_S),P_{W_S^{(j)}}(x_S,y_S)\big)
=\frac{\mathbb{E}\big[\|W_S^{(i)}x_S-W_S^{(j)}x_S\|_{\ell_2}^2\big]}{2\sigma^2}
=\frac{\|\Sigma_S^{1/2}(W_S^{(i)}-W_S^{(j)})^T\|_F^2}{2\sigma^2},
\]
where the last equality follows from the same computation as in the linear case (7.1). A similar calculation also yields
\[
D_{KL}\big(Q_{W_T^{(i)}}(x_T,y_T),Q_{W_T^{(j)}}(x_T,y_T)\big)
=\frac{\|\Sigma_T^{1/2}(W_T^{(i)}-W_T^{(j)})^T\|_F^2}{2\sigma^2}.
\]

7.1.3 Lower Bound for the Minimax Risk When $\Delta\ge\sqrt{\frac{\sigma^2 D\log 2}{n_T}}$ and When $\Delta<\sqrt{\frac{\sigma^2 D\log 2}{n_T}}$ (Proof of Lemma 1.2)

Consider the set
\[
\Big\{\eta:\ \eta=\Sigma_T^{1/2}W_T^T\ \text{for some }W_T\in\mathbb{R}^{k\times d}\ \text{and}\ \|\eta\|_F\le 4\delta\Big\},
\]
where $\delta>0$ is a value to be determined later in the proof. Furthermore, let $\{\eta_1,\dots,\eta_N\}$ be a $2\delta$-packing of this set in the Frobenius norm. Since $\dim(\mathrm{range}(\Sigma_T^{1/2}W_T^T))=rk$, in which $W_T$ is regarded as an input and $r=\mathrm{rank}(\Sigma_T)$, this set sits in a space of dimension $rk$. Hence we can find such a packing with $\log N\ge rk\log 2$ elements.
Therefore, we have a collection of matrices of the form $\eta_i=\Sigma_T^{1/2}(W_T^{(i)})^T$ for some $W_T^{(i)}\in\mathbb{R}^{k\times d}$ such that
\[
\|\Sigma_T^{1/2}(W_T^{(i)})^T\|_F\le 4\delta\quad\text{for each }i\in[N]
\]
and
\[
2\delta\ \le\ \|\Sigma_T^{1/2}(W_T^{(i)}-W_T^{(j)})^T\|_F\ \le\ 8\delta\quad\text{for each }i\ne j\in[N]\times[N].
\]
So by Lemma 1.1 we get
\[
D_{KL}\big(Q_{W_T^{(i)}},Q_{W_T^{(j)}}\big)\ \le\ \frac{32\delta^2}{\sigma^2}\quad\text{for each }i\ne j\in[N]\times[N].
\]

Figure 7.1: Configuration of the parameters of source and target distributions in Lemma 1.2.

Then set $\delta\le\frac{\Delta}{8}$. We can choose $P_{W_S^{(1)}}=\dots=P_{W_S^{(N)}}$ with $W_S^{(1)}=\dots=W_S^{(N)}$ and $\|\Sigma_T^{1/2}(W_S^{(1)})^T\|_F=4\delta$, so that the condition $\rho(W_S^{(i)},W_T^{(i)})\le 8\delta\le\Delta$ is satisfied for each $i\in[N]$. Figure 7.1 illustrates this configuration. Hence, having samples from $P_{W_S^{(i)}}$ does not convey any information about the true index in $[N]$ in the hypothesis testing problem, and $E$ is independent of $J$, so we get $I(J;E)=0$. Therefore, using (1.11) and (7.15) we get
\[
R_T(\mathcal{P}_\Delta;\phi\circ\rho)\ \ge\ \phi(\delta)\Bigg(1-\frac{\frac{32\, n_T\,\delta^2}{\sigma^2}+\log 2}{rk\log 2}\Bigg)
=\delta^2\Bigg(1-\frac{\frac{32\, n_T\,\delta^2}{\sigma^2}+\log 2}{rk\log 2}\Bigg)
\]
for any $0\le\delta\le\frac{\Delta}{8}$. We need to check the boundary and stationary points to solve the above optimization problem. If $\Delta\ge\sqrt{\frac{\sigma^2(rk-1)\log 2}{n_T}}=\sqrt{\frac{\sigma^2 D\log 2}{n_T}}$ holds, then
\[
R_T(\mathcal{P}_\Delta;\phi\circ\rho)\ \ge\ \frac{\sigma^2(rk-1)^2\log 2}{128\, n_T\, rk}.
\]
Since $D=rk-1\ge 20$, we have $\frac{\log 2}{rk}\ge\frac{1}{2(rk-1)}$, so
\[
R_T(\mathcal{P}_\Delta;\phi\circ\rho)\ \ge\ \frac{\sigma^2 D}{256\, n_T}, \qquad (7.8)
\]
and if $\Delta<\sqrt{\frac{\sigma^2(rk-1)\log 2}{n_T}}=\sqrt{\frac{\sigma^2 D\log 2}{n_T}}$, then
\[
R_T(\mathcal{P}_\Delta;\phi\circ\rho)\ \ge\ \Big(\frac{\Delta}{8}\Big)^2\Bigg[1-\frac{\frac{n_T\Delta^2}{2\sigma^2}+\log 2}{rk\log 2}\Bigg]
\ \ge\ \Big(\frac{\Delta}{8}\Big)^2\Bigg[1-\frac{\frac{n_T\Delta^2}{2\sigma^2}+\log 2}{D\log 2}\Bigg].
\]
Since $D\ge 20$ we get
\[
R_T(\mathcal{P}_\Delta;\phi\circ\rho)\ \ge\ \frac{1}{100}\,\Delta^2\Big[1-0.8\,\frac{n_T\Delta^2}{\sigma^2 D}\Big]. \qquad (7.9)
\]

7.1.4 Lower Bound for the Minimax Risk When $\Delta\le\frac{1}{45}\sqrt{\frac{\sigma^2 D}{n_S r_S+n_T r_T}}$ (Proof of Lemma 1.3)

Let $\delta'=\Delta+\underbrace{u\Delta}_{=\delta}$, for $u>0$ to be determined. Consider the set
\[
\Big\{\eta:\ \eta=\Sigma_T^{1/2}W_S^T\ \text{for some }W_S\in\mathbb{R}^{k\times d}\ \text{and}\ \|\eta\|_F\le 4\delta'\Big\},
\]
let $\{\eta_1,\dots,\eta_N\}$ be a $2\delta'$-packing in the Frobenius norm, and consider each $\eta_i$ as a single point. Since $\dim(\mathrm{range}(\Sigma_T^{1/2}W_T^T))=rk$, in which $W_T$ is regarded as an input, this set sits in a space of dimension $rk$, where $r=\mathrm{rank}(\Sigma_T)$.
Therefore, we can find such a packing with logN≥rk log 2 elements. 126 Hence, we have a collection of matrices of the formη i = Σ 1 2 T (W (i) S ) T for someW (i) S ∈R k×d such that ||Σ 1 2 T (W (i) S ) T || F ≤ 4δ ′ for each i∈ [N] 2δ ′ ≤||Σ 1 2 T (W (i) S −W (j) S ) T || F ≤ 8δ ′ for each i̸=j∈ [N]× [N]. Figure 7.2: Configuration of the parameters of source and target distributions in Lemma 1.3. We choose eachW (i) T such that ρ(W (i) T ,M (i) S ) =||Σ 1 2 T (W (i) T −W (i) S ) T || F = ∆ . So ρ(W (i) T ,W (j) T )≥ 2δ for each i̸=j∈ [N]× [N]. Moreover, ρ(W (i) T ,W (j) T )≤ρ(W (i) T ,W (i) S ) +ρ(W (i) S ,M (j) S ) +ρ(W (j) S ,W (j) T ) ≤ 2∆ + 8(∆ + u∆) . 127 Figure 7.2 illustrates this configuration. By Lemma 1.1 we have D KL (Q W (i) T ,Q W (j) T )≤ 2∆ 2 (5 + 4u) 2 σ 2 for each i̸=j∈ [N]× [N]. Also we have ||Σ 1 2 S (W (i) S −W (j) S ) T || F =||Σ 1 2 S Σ − 1 2 T Σ 1 2 T (W (i) S −W (j) S ) T || F ≤ 8 Σ 1 2 S Σ − 1 2 T δ ′ = 8 Σ 1 2 S Σ − 1 2 T (∆ + u∆) Hence, by Lemma 1.1 we have D KL (P W (i) S ,P W (j) S )≤ 32 Σ 1 2 S Σ − 1 2 T 2 ∆ 2 (u + 1) 2 σ 2 for each i̸=j∈ [N]× [N]. Therefore, using (1.11) and (7.15) we arrive at R T (P;ϕ◦ρ)≥ϕ(u∆)[1 − n S 32 Σ 1 2 S Σ − 1 2 T 2 ∆ 2 (u+1) 2 σ 2 +n T 2∆ 2 (5+4u) 2 σ 2 + log 2 rk log 2 ] = (u∆) 2 [1− n S r S 32∆ 2 (u+1) 2 σ 2 +n T r T 2∆ 2 (5+4u) 2 σ 2 + log 2 rk log 2 ]. The above inequality holds for every u ≥ 0. Maximizing the expression above over u we can conclude if ∆ ≤ q σ 2 (rk−1) log 2 32n S r S +50n T r T then R T (P;ϕ◦ρ)≥ (u∆) 2 [1− n S r S 32∆ 2 (u+1) 2 σ 2 +n T r T 2∆ 2 (5+4u) 2 σ 2 + log 2 rk log 2 ] 128 where u = 3∆(4 n S r S +n T r T )+ √ ∆ 2 [16(n S r S ) 2 +25(n T r T ) 2 +32n S r S n T r T ]+4(n S r S +n T r T )(rk−1)σ 2 log 2 16∆( n P r P +n Q ) . Now, we need to simplify the above expressions. First note that (u∆) ≥ ( 3∆ 16 + q ∆ 2 + 4 Dσ 2 log 2 n S r S +n T r T 16 ), so (u∆) 2 ≥ ∆ 2 + 2.7 Dσ 2 n S r S +n T r T 256 . 
(7.10)

Moreover,

$1-\dfrac{n_S r_S\frac{32\Delta^2(u+1)^2}{\sigma^2}+n_T r_T\frac{2\Delta^2(5+4u)^2}{\sigma^2}+\log 2}{rk\log 2}\;\ge\;1-\dfrac{[n_S r_S+n_T r_T]\frac{32\Delta^2(\frac54+u)^2}{\sigma^2}+\log 2}{D\log 2}$

and

$\Delta\big(\tfrac54+u\big)\le 2\Delta+\tfrac{1}{16}\sqrt{25\Delta^2+\frac{4\log 2\,D\sigma^2}{n_S r_S+n_T r_T}}.$

Since $\Delta\le\frac{1}{45}\sqrt{\frac{\sigma^2 D}{n_S r_S+n_T r_T}}$,

$\Delta^2\big(\tfrac54+u\big)^2\le\Big(4+\big(\tfrac{5}{16}\big)^2\Big)\Delta^2+\tfrac14\,\Delta\sqrt{25\Delta^2+\frac{4\log 2\,D\sigma^2}{n_S r_S+n_T r_T}}\le\Big(4+\big(\tfrac{5}{16}\big)^2\Big)\frac{1}{45^2}\,\frac{D\sigma^2}{n_S r_S+n_T r_T}+\frac{1}{4\times 45^2}\sqrt{25^2+45^2\times 4\log 2}\;\frac{D\sigma^2}{n_S r_S+n_T r_T}\le 0.012\,\frac{D\sigma^2}{n_S r_S+n_T r_T}.$

Hence,

$1-\dfrac{[n_S r_S+n_T r_T]\frac{32\Delta^2(\frac54+u)^2}{\sigma^2}+\log 2}{D\log 2}\ge 1-0.56-\frac1D\ge 0.39.$

Therefore, we arrive at

$R_T(\mathcal P;\phi\circ\rho)\ge\frac{\Delta^2}{1000}+\frac{6}{1000}\,\frac{D\sigma^2}{n_S r_S+n_T r_T}.$ (7.11)

7.1.5 Proof of Theorem 1.1 (One-hidden layer neural network with fixed hidden-to-output layer)

By Proposition 1.1, the generalization error is bounded from below as

$\mathbb E_{Q_{\theta_T}}\big[\|\widehat y_T-y_T\|_{\ell_2}^2\big]\ge\frac14\,\sigma_{\min}^2(V)\,\|\Sigma_T^{1/2}(\widehat W_T-W_T)^T\|_F^2+k\sigma^2.$

Therefore, it suffices to find a lower bound for the quantity

$R_T(\mathcal P_\Delta;\phi\circ\rho):=\inf_{\widehat W_T}\;\sup_{(P_{W_S},Q_{W_T})\in\mathcal P_\Delta}\;\mathbb E_{S_{P_{W_S}}\sim P_{W_S}^{1:n_P}}\,\mathbb E_{S_{Q_{W_T}}\sim Q_{W_T}^{1:n_Q}}\,\phi\big(\rho(\widehat W_T(S_{P_{W_S}},S_{Q_{W_T}}),W_T)\big),$

where $\phi(x)=x^2$ for $x\in\mathbb R$ and $\rho$ is defined per Definition 1.1. The rest of the proof is similar to the linear case, as the corresponding transfer distance metrics are the same; we only need to upper bound the corresponding KL-divergences, which we do via the following lemma.

Lemma 7.1. Suppose that $P_{W_S^{(i)}}$ and $P_{W_S^{(j)}}$ are the joint distributions of features and labels in a source task, and $Q_{W_T^{(i)}}$ and $Q_{W_T^{(j)}}$ are the joint distributions of features and labels in a target task, as defined in Section 1.2.1 under the one-hidden layer neural network model with fixed hidden-to-output layer. Then

$D_{KL}\big(P_{W_S^{(i)}}\|P_{W_S^{(j)}}\big)\le\frac{\|V\|^2\,\|\Sigma_S^{1/2}(W_S^{(i)}-W_S^{(j)})^T\|_F^2}{2\sigma^2}\qquad\text{and}\qquad D_{KL}\big(Q_{W_T^{(i)}}\|Q_{W_T^{(j)}}\big)\le\frac{\|V\|^2\,\|\Sigma_T^{1/2}(W_T^{(i)}-W_T^{(j)})^T\|_F^2}{2\sigma^2}.$
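Lemma 7.1 rests on two elementary facts: ReLU is entrywise 1-Lipschitz, and $\|Vz\|_{\ell_2}\le\|V\|\,\|z\|_{\ell_2}$ for the operator norm of $V$. Both, and the resulting expectation bound, can be checked numerically. The following Python sketch is an illustration only; it takes $\Sigma_S=I$ for simplicity, and all dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
k, ell, d = 3, 4, 6
V  = rng.standard_normal((k, ell))
W1 = rng.standard_normal((ell, d))
W2 = rng.standard_normal((ell, d))
relu = lambda z: np.maximum(z, 0.0)

# Pointwise: ReLU is entrywise 1-Lipschitz, so for every x,
# ||V relu(W1 x) - V relu(W2 x)|| <= ||V||_op * ||(W1 - W2) x||.
for _ in range(1000):
    x = rng.standard_normal(d)
    lhs = np.linalg.norm(V @ relu(W1 @ x) - V @ relu(W2 @ x))
    rhs = np.linalg.norm(V, 2) * np.linalg.norm((W1 - W2) @ x)
    assert lhs <= rhs + 1e-9

# In expectation over x ~ N(0, I): E||(W1 - W2) x||^2 = ||W1 - W2||_F^2,
# which combined with the pointwise bound gives the KL estimate
# ||V||^2 ||W1 - W2||_F^2 / (2 sigma^2) of Lemma 7.1 (here Sigma_S = I).
n = 100_000
X = rng.standard_normal((n, d))
mc = np.mean(np.sum(((W1 - W2) @ X.T) ** 2, axis=0))
fro2 = np.linalg.norm(W1 - W2, 'fro') ** 2
assert abs(mc - fro2) / fro2 < 0.03
```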
We also note that since in this case $W_S,W_T\in\mathbb R^{\ell\times d}$, the definition of $D$ differs slightly from that in the linear case: here $D=\mathrm{rank}(\Sigma_T)\,\ell-1$.

7.1.6 Bounding the KL-Divergences in the Neural Network Model (Proof of Lemma 7.1)

First we decompose the KL-divergence between the distributions $P_{W_S^{(i)}}(x_S,y_S)$ and $P_{W_S^{(j)}}(x_S,y_S)$ by the chain rule:

$D_{KL}\big(P_{W_S^{(i)}}(x_S,y_S),P_{W_S^{(j)}}(x_S,y_S)\big)=D_{KL}\big(P_{W_S^{(i)}}(x_S),P_{W_S^{(j)}}(x_S)\big)+\mathbb E\big[D_{KL}\big(P_{W_S^{(i)}}(y_S|x_S),P_{W_S^{(j)}}(y_S|x_S)\big)\big].$

The marginal distributions $P_{W_S^{(i)}}(x_S)$ and $P_{W_S^{(j)}}(x_S)$ are equal, so their KL-divergence is zero. The conditional distributions $P_{W_S^{(i)}}(y_S|x_S)$ and $P_{W_S^{(j)}}(y_S|x_S)$ are normally distributed with covariance matrix $\sigma^2 I_k$ and with means $V\varphi(W_S^{(i)}x_S)$ and $V\varphi(W_S^{(j)}x_S)$, respectively. Therefore, we obtain

$D_{KL}\big(P_{W_S^{(i)}}(y_S|x_S),P_{W_S^{(j)}}(y_S|x_S)\big)=\frac{\big\|V\varphi(W_S^{(i)}x_S)-V\varphi(W_S^{(j)}x_S)\big\|_{\ell_2}^2}{2\sigma^2}.$

Since ReLU is a 1-Lipschitz function, we then have

$D_{KL}\big(P_{W_S^{(i)}}(x_S,y_S),P_{W_S^{(j)}}(x_S,y_S)\big)=\frac{\mathbb E\big\|V\varphi(W_S^{(i)}x_S)-V\varphi(W_S^{(j)}x_S)\big\|_{\ell_2}^2}{2\sigma^2}\le\frac{\|V\|^2\,\|\Sigma_S^{1/2}(W_S^{(i)}-W_S^{(j)})^T\|_F^2}{2\sigma^2}.$

Similarly, we get

$D_{KL}\big(Q_{W_T^{(i)}}(x_T,y_T),Q_{W_T^{(j)}}(x_T,y_T)\big)=\frac{\mathbb E\big\|V\varphi(W_T^{(i)}x_T)-V\varphi(W_T^{(j)}x_T)\big\|_{\ell_2}^2}{2\sigma^2}\le\frac{\|V\|^2\,\|\Sigma_T^{1/2}(W_T^{(i)}-W_T^{(j)})^T\|_F^2}{2\sigma^2}.$

7.1.7 Proof of Theorem 1.1 (One-hidden layer neural network model with fixed input-to-hidden layer)

By Proposition 1.1, the generalization error is $\mathbb E_{Q_{\theta_T}}\big[\|\widehat y_T-y_T\|_{\ell_2}^2\big]=\|\widetilde\Sigma_T^{1/2}(\widehat V_T-V_T)^T\|_F^2+k\sigma^2$, so it suffices to find a lower bound for the quantity

$R_T(\mathcal P_\Delta;\phi\circ\rho):=\inf_{\widehat V_T}\;\sup_{(P_{V_S},Q_{V_T})\in\mathcal P_\Delta}\;\mathbb E_{S_{P_{V_S}}\sim P_{V_S}^{1:n_P}}\,\mathbb E_{S_{Q_{V_T}}\sim Q_{V_T}^{1:n_Q}}\,\phi\big(\rho(\widehat V_T(S_{P_{V_S}},S_{Q_{V_T}}),V_T)\big),$

where $\phi(x)=x^2$ for $x\in\mathbb R$ and $\rho$ is defined per Definition 1.1. Inherently, this case is the same as the linear model, except that the distribution of the features has changed; that distribution was calculated in (7.7).
The rest of the proof is similar to the linear case, with the difference that $\Sigma_S,\Sigma_T$ should be replaced by $\widetilde\Sigma_S,\widetilde\Sigma_T$.

7.2 Appendix for Chapter 2

7.2.1 Calculating the Error of the Estimated Classifier (Proof of Proposition 2.3.2)

$\mathrm{Prob}\big(C_{\hat\theta_T}(x_T)\ne C_{\theta_T}(x_T)\big)=\mathrm{Prob}\big(x_T^T\hat\theta_T>0,\;x_T^T\theta_T<0\big)+\mathrm{Prob}\big(x_T^T\hat\theta_T<0,\;x_T^T\theta_T>0\big).$

Let $w_1=x_T^T\hat\theta_T$ and $w_2=x_T^T\theta_T$. Since $x_T\sim\mathcal N(0,I_d)$, the pair $(w_1,w_2)$ is jointly Gaussian with mean zero and covariance matrix $\begin{pmatrix}\|\hat\theta\|_{\ell_2}^2 & \hat\theta^T\theta\\ \hat\theta^T\theta & \|\theta\|_{\ell_2}^2\end{pmatrix}$. Hence,

$\mathrm{Prob}\big(C_{\hat\theta_T}(x_T)\ne C_{\theta_T}(x_T)\big)=\mathrm{Prob}(w_1>0,w_2<0)+\mathrm{Prob}(w_1<0,w_2>0)=1-2\,\mathrm{Prob}(w_1>0,w_2>0)=1-2\Big(\frac14+\frac{1}{2\pi}\arcsin\frac{\hat\theta^T\theta}{\|\hat\theta\|_{\ell_2}\|\theta\|_{\ell_2}}\Big)=\frac1\pi\arccos\frac{\hat\theta^T\theta}{\|\hat\theta\|_{\ell_2}\|\theta\|_{\ell_2}}=\frac1\pi\arccos\big(\hat\theta^T\theta\big),$

where the last equality uses the fact that $\hat\theta$ and $\theta$ lie on the unit sphere.

7.2.2 Proof of Theorem 2.1

Using Proposition 2.3.2, we can write the minimax risk as

$R_T^\Delta=\inf_{\hat\theta_T}\;\sup_{\rho(\theta_S,\theta_T)\le\Delta}\;\mathbb E_{S_{P_{\theta_S}}\sim P_{\theta_S}^{1:n_S},\,S_{Q_{\theta_T}}\sim Q_{\theta_T}^{1:n_T}}\,\mathrm{Prob}\big(C_{\hat\theta_T}(x_T)\ne C_{\theta_T}(x_T)\big)=\inf_{\hat\theta_T}\;\sup_{\rho(\theta_S,\theta_T)\le\Delta}\;\mathbb E\,\rho\big(\hat\theta_T(S_{P_{\theta_S}},S_{Q_{\theta_T}}),\theta_T\big),$

where $\hat\theta_T=\hat\theta_T(S_{P_{\theta_S}},S_{Q_{\theta_T}})$ is a function of the source and target samples. Note that the distance $\rho$, which is the geodesic distance on the unit sphere, is a metric; this will be needed in the sequel. We then follow the usual technique of reducing the minimax risk to hypothesis testing, inspired by the proof in [81] (see also [105, Chapter 15] for non-transfer-learning minimax lower bounds). [81] provides lower bounds for the minimax risk in regression problems; in this work we adapt some of its ideas to obtain minimax lower bounds for classification.
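The closed form of Proposition 2.3.2 is easy to verify by Monte Carlo simulation. The following Python sketch is an illustration only; the dimension, sample size, and seed are arbitrary choices.

```python
import numpy as np

def disagreement_prob(theta_hat, theta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of Prob[sign(x^T theta_hat) != sign(x^T theta)]
    for x ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, theta.shape[0]))
    return np.mean(np.sign(x @ theta_hat) != np.sign(x @ theta))

rng = np.random.default_rng(1)
d = 5
theta = rng.standard_normal(d);      theta /= np.linalg.norm(theta)
theta_hat = rng.standard_normal(d);  theta_hat /= np.linalg.norm(theta_hat)

mc = disagreement_prob(theta_hat, theta)
closed_form = np.arccos(theta_hat @ theta) / np.pi   # Proposition 2.3.2
assert abs(mc - closed_form) < 0.01
```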
Since our goal is to estimate the target parameter using the source and target data, we need to pick $N$ pairs of distributions $(P_{\theta_S^{(1)}},Q_{\theta_T^{(1)}}),\dots,(P_{\theta_S^{(N)}},Q_{\theta_T^{(N)}})$ such that

$\rho(\theta_T^{(i)},\theta_T^{(j)})\ge 2\delta$ for each $i\ne j\in[N]\times[N]$ (7.12)

and

$\rho(\theta_S^{(i)},\theta_T^{(i)})\le\Delta$ for each $i\in[N]$. (7.13)

Condition (7.12) ensures that the target parameters are $2\delta$-separated in the $\rho$ distance defined in Section 2.1, and (7.13) ensures that the source and target distributions belong to the class of transfer learning problems over which the supremum in the minimax risk is taken.

Now consider the following hypothesis testing problem:

• Let $J$ be a random sample from the uniform distribution over $[N]:=\{1,2,\dots,N\}$.
• Given $J=j$, sample $S_{P_{\theta_S^{(j)}}}\sim P_{\theta_S^{(j)}}^{1:n_S}$ and $S_{Q_{\theta_T^{(j)}}}\sim Q_{\theta_T^{(j)}}^{1:n_T}$.

We aim to identify the true index from the $n_S+n_T$ samples by means of a testing function. Following the proof in [81], one obtains the lower bound

$R_T^\Delta\ge\delta\Big(1-\frac{n_S\,I(J;E)+n_T\,I(J;F)+\log 2}{\log N}\Big),$ (7.14)

where $E$ and $F$ are random variables such that $E\,|\,\{J=j\}\sim P_{\theta_S^{(j)}}$ and $F\,|\,\{J=j\}\sim Q_{\theta_T^{(j)}}$, and $I$ denotes the mutual information. By the convexity of the KL-divergence, the mutual information terms appearing in (7.14) can be bounded as

$I(J;E)\le\frac{1}{N^2}\sum_{i,j}D_{KL}\big(P_{\theta_S^{(i)}}\|P_{\theta_S^{(j)}}\big),\qquad I(J;F)\le\frac{1}{N^2}\sum_{i,j}D_{KL}\big(Q_{\theta_T^{(i)}}\|Q_{\theta_T^{(j)}}\big).$ (7.15)

The following lemma bounds the KL-divergences in (7.15).

Lemma 7.2. Let $P_{\theta_S},P_{\theta'_S}$ be two joint distributions of the features and labels in the source task, and let $Q_{\theta_T},Q_{\theta'_T}$ be those in the target task, according to the model defined in Section 2.3.1. Then

$D_{KL}(P_{\theta_S}\|P_{\theta'_S})+D_{KL}(P_{\theta'_S}\|P_{\theta_S})\le\|\theta_S-\theta'_S\|_{\ell_2}^2,\qquad D_{KL}(Q_{\theta_T}\|Q_{\theta'_T})+D_{KL}(Q_{\theta'_T}\|Q_{\theta_T})\le\|\theta_T-\theta'_T\|_{\ell_2}^2.$

Proof of Lemma 7.2.
D KL P θ S (x S ,y S )||P θ ′ S (x S ,y S ) =D KL P θ S (x S )||P θ ′ S (x S ) +E D KL P θ S (y S |x S )||P θ ′ S (y S |x S ) =E D KL P θ S (y S |x S )||P θ ′ S (y S |x S ) If p a = 1 1+e a and p b = 1 1+e b be two Bernoulli distributions, by simply calculating the KL- divergence one can get D KL (p a ||p b ) +D KL (p b ||p a )≤ (a−b) 2 Hence, 136 D KL (P θ S ||P θ ′ S ) +D KL (P θ ′ S ||P θ S )≤E D KL P θ S (y S |x S )||P θ ′ S (y S |x S ) +E D KL P θ ′ S (y S |x S )||P θ S (y S |x S ) ≤E x T S (θ S −θ ′ S ) 2 =||θ S −θ ′ S || 2 ℓ 2 In the following, we use the well-known Gilbert-Varshamov lemma for constructing packing sets. For convenience, we state this lemma here. Lemma 7.3. (See Lemma 4.10 [78], and Lemma 1 [70]) Let M ={u∈{−1, +1} p s.t.||u|| 0 = s} where p is the dimension of the vectors u and s is the sparsity level with 1≤ s < p 4 . Then there exist the vectors u 1 ,...,u N such that Ham(u i ,u j )> s 2 for all 1≤i<j≤N and logN≥ s 5 log p s where Ham(u i ,u j ) denotes the Hamming distance of the vectors u i and u j . Based on the distance of source and target, ∆ , we divide the proof of Theorem 3.1 into two parts and one can conclude the proof using the following two lemmas. Lemma 7.4. Assume that ∆ ≥ 1 π arccos q 1− d 200n T [ 1 4 − 100 log 2 d ] where n T and d are the number of target samples and the dimension. Then we would have R ∆ T ≥c d n T 137 Furthermore, if ∆ < 1 π arccos q 1− d 200n T [ 1 4 − 100 log 2 d ] , then R ∆ T ≥ (1− cos 2 (∆)) 1− n T (1− cos 2 (∆)) + log 2 .04d Lemma 7.5. Suppose that there are n S and n T number of source and target samples and ∆ < . Then R ∆ T ≥c d n S +n T Proof. of Lemma 7.4. By using Lemma 7.3 we choose the set of target parameters as M Target ={θ T | ( p 1−sα 2 ,αψ 1 ,...,αψ d−1 )∈R d } (7.16) where ψ∈{−1, +1},||(ψ 1 ,...,ψ d−1 )|| 0 =s, and s = d−1 6 . Furthermore log|M Target |≥ s 5 d− 1 s ≥ 4d 100 and we will determine α≤ 1 √ s later. As for the source parameters, we chooseθ (1) S =... 
=θ (|M Target |) S = (1, 0,..., 0). Next, we need to bound the corresponding KL-divergences. Since all the source parameters are the same, D KL (P θ (i) S ||P θ (j) S ) = 0 for each i̸=j∈ [N]× [N] (7.17) 138 With respect to the target parameters, I(J;F )≤ 1 N 2 X i,t D KL (Q θ (i) T ||Q θ (j) T ) ≤ 1 2 max i,j D KL (Q θ (i) T ||Q θ (j) T ) +D KL (Q θ (j) T ||Q θ (i) T ) ≤ 1 2 max i,j ||θ (i) T −θ (j) T || 2 ℓ 2 ≤sα 2 (7.18) Now, we claim that⟨θ (i) T ,θ (j) T ⟩ ≤ 1− sα 2 4 for each i̸=j∈ [N]× [N] Without loss of generality assume that the non-zero elements of the vector θ (i) T lie in the first s + 1 components, i.e. θ (i) T = ( √ 1−sα 2 ,β 1 ,...,β s , 0, 0,..., 0) where β i ∈{−1, +1}. With respect to the vectorθ (j) T , the first component is √ 1−sα 2 and suppose that there are x number of indices, i.e. i 1 ,...,i x such that β i k = θ (j) T (i k ) for k = 1,...,x where θ (j) T (i k ) is the i k th element of θ (j) T . Furthermore, suppose thatθ (j) T hasγ number of non-zero elements in thes+1th todth components. Hence, the Hamming distance of θ (i) T and θ (j) T is at most s−x +γ which has to be larger than s 2 because of the construction of the set M Target . On the other hand, we must have γ≤ s−x. Therefore, 2(s−x)> s 2 andx< 3s 4 . Then we have⟨θ (i) T ,θ (j) T ⟩ ≤ 1−sα 2 +xα 2 < 1−sα 2 + 3sα 2 4 = 1− sα 2 4 Finally, we can conclude that ρ(θ (i) T ,θ (j) T )≥ 1 π arccos(1− sα 2 4 ) (7.19) Then since for each i̸=j∈ [N]×[N] 7.19 holds, we can set 2δ = 1 π arccos(1− sα 2 4 ) andM Target would be a 2δ-separated set with respect to the metric ρ. 
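The construction of $M_{\mathrm{Target}}$ and the separation claim above can be sanity-checked numerically. The following Python sketch substitutes a simple greedy search for the existence argument of Lemma 7.3, and the values of $d$, $s$, $\alpha$, and the packing size are illustrative choices only.

```python
import numpy as np
from itertools import combinations

def greedy_packing(p, s, n_target, seed=0):
    """Greedily collect s-sparse {-1,+1}-valued vectors in R^p whose pairwise
    Hamming distance exceeds s/2, as guaranteed to exist by the
    Gilbert-Varshamov-type Lemma 7.3."""
    rng = np.random.default_rng(seed)
    pack = []
    while len(pack) < n_target:
        v = np.zeros(p, dtype=int)
        support = rng.choice(p, size=s, replace=False)
        v[support] = rng.choice([-1, 1], size=s)
        if all(np.sum(v != u) > s // 2 for u in pack):
            pack.append(v)
    return pack

d, s = 31, 5                       # ambient dimension d, sparsity s = (d-1)/6
alpha = 1 / np.sqrt(2 * s)         # any alpha <= 1/sqrt(s) works
psis = greedy_packing(d - 1, s, n_target=10)
thetas = [np.concatenate(([np.sqrt(1 - s * alpha**2)], alpha * psi))
          for psi in psis]         # unit vectors of the form (7.16)

# Pairwise Hamming separation of the psis translates into angular separation:
# <theta_i, theta_j> <= 1 - s*alpha^2/4 for all i != j.
for ti, tj in combinations(thetas, 2):
    assert ti @ tj <= 1 - s * alpha**2 / 4 + 1e-12
```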
139 Next, the distance of source and target parameters must satisfy ρ(θ (i) S ,θ (i) T )≤ ∆ for each i∈ [N] By construction, we have ρ(θ (i) S ,θ (i) T ) = 1 π arccos √ 1−sα 2 for each i∈ [N] If ∆ ≥ √ 1−sα 2 (we will specify α later) then by 7.14 we have R ∆ T ≥ 1 2π arccos(1− sα 2 4 ) 1− n T sα 2 + log 2 .04d (7.20) ≥ sα 2 8π 1− n T sα 2 + log 2 .04d (7.21) In 7.20, we use 7.17 and 7.18 for bounding the mutual information I(J;E) and I(J;F ). If we set α = s d 200n T s 1 4 − 100 log 2 d provided that d≥ 300 and d n T < 800, we get R ∆ T ≥ d 1600πn T 1 4 − 100 log 2 d 1− 1 32 − 25 log 2 2d for ∆ ≥ 1 π arccos q 1− d 200n T ( 1 4 − 100 log 2 d ) . For ∆ < 1 π arccos q 1− d 200n T ( 1 4 − 100 log 2 d ) , if we set sα 2 = 1− cos 2 (π∆) then we have ∆ = ρ(θ (i) S ,θ (i) T ) and R ∆ T ≥ (1− cos 2 (π∆)) 8π 1− n T (1− cos 2 (π∆)) + log 2 .04d Proof. of Lemma 7.5. For the set of target parameters we consider the same set M Target as defined in 7.16. Moreover, for the set of source parameters, for each i∈ [N], where N =|M Target |, we 140 chooseθ (i) S as an arbitrary point on the unit sphere in R d with ρ(θ (i) S ,θ (i) T ) = ∆ . Since ρ satisfies the triangle inequality, for i,j∈ [N] we have ρ(θ (i) S ,θ (j) S )≤ρ(θ (i) S ,θ (i) T ) +ρ(θ (i) T ,θ (j) T ) +ρ(θ (j) T ,θ (j) S ) ≤ 2∆ + ρ(θ (i) T ,θ (j) T ) ≤ 2∆ + 1 π arccos(⟨θ (i) T ,θ (j) T ⟩) ≤ 2∆ + 1 π arccos(1− sα 2 4 ) (7.22) where in the last inequality we use 7.19. Unlike the construction of the source parameters in Proof of Lemma 7.4, here they are not equal to one another and we need to bound the corresponding KL-divergences. 
Following the same idea used in 7.18, I(J;E)≤ 1 2 max i,j ||θ (i) S −θ (j) S || 2 ℓ 2 = 1−⟨θ (i) S ,θ (j) S ⟩ By 7.22 we have ⟨θ (i) S ,θ (j) S ⟩≥ cos 2π∆ + arccos(1 − sα 2 4 ) Hence, I(J;E)≤ 1− cos 2π∆ + arccos(1 − sα 2 4 ) (7.23) 141 By plugging 7.18 and 7.23 into 7.14 we get R ∆ T ≥ 1 2π arccos(1− sα 2 4 ) 1− n T sα 2 +n S 1− cos(2π∆ + arccos(1 − sα 2 4 )) + log 2 .04d ≥ sα 2 8π 1− n T sα 2 +n S 1− cos(2π∆ + arccos(1 − sα 2 4 )) + log 2 .04d (7.24) If we expand the coefficient of n S in 7.24, we get 1− cos(2π∆)(1 − sα 2 4 ) + sin(2π∆) sin(arccos(1 − sα 2 4 )) By plugging the following inequalities into 7.24 sin(arccos(1− sα 2 4 ))≤ 1 sin(2π∆) ≤ 2π∆ cos(2π∆) ≥ 1− 2π 2 ∆ 2 we obtain R ∆ T ≥ sα 2 8π 1− n T sα 2 +n S 1− (1− 2π 2 ∆ 2 )(1− sα 2 4 ) + 2π∆] + log 2 .04d ≥ sα 2 8π 1− n T sα 2 +n S 2π 2 ∆ 2 + 2π∆ + sα 2 4 − sα 2 π 2 ∆ 2 2 + log 2 .04d For ∆ < 1 π , we have ∆ 2 π 2 < ∆ π. Hence, 142 R ∆ T ≥ sα 2 8π 1− (n T +n S ) 4π∆ + sα 2 + log 2 .04d If ∆ < .04d−log 2 16π(n S +n T ) , choose sα 2 = 3(.04d−log 2) 8(n S +n T ) , then we get R ∆ T ≥ 3(.04d− log 2) 64π(n S +n T ) 1− 5(.04d− log 2) .32d − log 2 .04d ≥c· d n S +n T 7.3 Appendix for Chapter 3 7.3.1 Proof of Theorem 3.1 We also use the following famous result in information theory known as Gilbert-Varhsamov’s bound for packing argument. (Lemma 2.9 of [104]) Let d≥ 8. Then there exists a subset{w (0) ,...,w (M) } of Ω = {−1, 1} d such that w (0) = (1, 1,..., 1), dist(w (j) ,w (k) )≥ d 8 , ∀ 0≤j <k≤M and M≥ 2 d/8 , where dist(w,w ′ ) = P d k=1 I(w k ̸= w ′ k ) is the Hamming distance between binary sequences w = (w 1 ,...,w d ) and w ′ = (w ′ 1 ,...,w ′ d ). We will also use the following lemma proved in [43]. We would like to mention that some ideas of the proof are similar to those in [43]. However, as discussed in section 3.2, the problem setting of [43] is different from that of this work which results in constructing a different set of distributions. 143 Lemma 7.6. Let 0<ϵ< 1/2 and z∈{−1, 1}. 
Then

$D_{kl}\big(\mathrm{Ber}(1/2+(z/2)\epsilon),\,\mathrm{Ber}(1/2-(z/2)\epsilon)\big)\le c_0\cdot\epsilon^2$

for some $c_0\le 4$ independent of $\epsilon$.

Now we are in place to provide the proof of Theorem 3.1. Let $d=d_{\mathcal H}-2$ and pick $x_{-1},x_0,\dots,x_d$ from $\chi$ shattered by $\mathcal H$. Next, we construct a family of pairs of distributions $(P^w,Q^w)$ indexed by $w\in\{-1,1\}^d$, where $\{-1,1\}^d$ is the parameter space playing the role of $\Theta$ in Proposition 3.6. In what follows, fix $\epsilon=c_1\cdot\epsilon(n_S,n_T,d_{\mathcal H},\Delta)\le\frac12$ for some constant $c_1$ to be determined later in the proof, where $\epsilon(n_S,n_T,d_{\mathcal H},\Delta)$ is defined in Theorem 3.1.

Distribution $Q^w$: $Q^w$ is composed of a marginal and a conditional distribution, namely $Q^w=Q^w_x\times Q^w_{y|x}$. We define the marginal distribution as follows:

$Q^w_x(x=x_{-1})=\Delta,\qquad Q^w_x(x=x_0)=0.99-\Delta,\qquad Q^w_x(x=x_i)=\frac{1}{100d}$ for $i=1,\dots,d$.

For the conditional distribution:

$Q^w_{y|x}(y=1|x=x_{-1})=Q^w_{y|x}(y=1|x=x_0)=1,\qquad Q^w_{y|x}(y=1|x=x_i)=1/2+w_i\epsilon$ for $i=1,\dots,d$.

Distribution $P^w$: $P^w$ is composed of a marginal and a conditional distribution, namely $P^w=P^w_x\times P^w_{y|x}$.
We define the marginal distribution as follows:

$P^w_x(x=x_{-1})=P^w_x(x=x_0)=\frac12\Big(1-\frac{d}{d+n_S\Delta}\Big),\qquad P^w_x(x=x_i)=\frac{1}{d+n_S\Delta}$ for $i=1,\dots,d$.

For the conditional distribution:

$P^w_{y|x}(y=1|x=x_{-1})=0,\qquad P^w_{y|x}(y=1|x=x_0)=1,\qquad P^w_{y|x}(y=1|x=x_i)=1/2+w_i\epsilon$ for $i=1,\dots,d$.

Verifying $\rho(P^w,Q^w)\le\Delta$: The Bayes classifier of the domain generated by $P^w$ is

$h_S^*(x_{-1})=0$, $h_S^*(x_0)=1$, and $h_S^*(x_i)=1$ if $w_i=1$, otherwise $h_S^*(x_i)=0$, for $i=1,\dots,d$.

Similarly, for the domain generated by $Q^w$ we have

$h_T^*(x_{-1})=h_T^*(x_0)=1$, and $h_T^*(x_i)=1$ if $w_i=1$, otherwise $h_T^*(x_i)=0$, for $i=1,\dots,d$.

So $h_S^*$ and $h_T^*$ disagree only on $x_{-1}$, which implies that

$\rho(P^w,Q^w)=Q[h_S^*(x_T)\ne y_T]-Q[h_T^*(x_T)\ne y_T]=\Delta.$

Since we want to derive a lower bound for the minimax risk stated in Theorem 3.1, note that among the hypotheses that agree on $x_i$ for $i=1,\dots,d$, the hypothesis that labels both $x_{-1}$ and $x_0$ as $1$ attains the smaller target error. Hence, we can restrict ourselves to $\widetilde{\mathcal H}$, the projection of $\mathcal H$ onto $\{-1,1\}^d$ with the constraint that $h(x_{-1})=h(x_0)=1$ for all $h\in\widetilde{\mathcal H}$. Furthermore, for any $w,w'\in\{-1,1\}^d$ we have

$\mathcal E_T(h_{w'})=\frac{\mathrm{dist}(w,w')}{100d}\cdot\epsilon$ for all $h_{w'}\in\widetilde{\mathcal H}$

when the target domain is generated by $Q^w$.

Reduction to a packing: By Proposition 7.3.1, we can obtain a subset $\Sigma$ of $\{-1,1\}^d$ whose cardinality is $M\ge 2^{d/8}$ and such that $\mathrm{dist}(w,w')\ge d/8$ for any $w,w'\in\Sigma$. Consequently, for any $w,w'\in\Sigma$ we have

$\mathcal E_T(h_{w'})\ge\frac{d}{8}\cdot\frac{\epsilon}{100d}=\frac{\epsilon}{800}.$

On the other hand, there is a bijection between $\{-1,1\}^d$ and the elements of $\widetilde{\mathcal H}$, and any classifier $\hat h:\{x_i\}\to\{0,1\}$ with $\hat h(x_{-1})=\hat h(x_0)=1$ can be reduced to a $w\in\{-1,1\}^d$. So we can choose $\Sigma$ as the set of indices in Proposition 3.6, with the Hamming distance as the semi-metric, and the expression $P_w(\mathrm{dist}(\hat w,w)>d/8)$ translates into $P_w(\mathcal E_T(h_{\hat w})>c\cdot\epsilon)$.

KL divergence bound (part (ii) of Proposition 3.6): Define $\mathbb P_w=P_w^{n_S}\times Q_w^{n_T}$.
For anyw,w ′ ∈ Σ we have D kl (P w |P w ′) =n S ·D kl (P w |P ′ w ) +n T ·D kl (Q w |Q w ′) =n S ·E Px D kl (P w y|x |P w ′ y|x ) +n T · E Qx D kl (Q w y|x |Q w ′ y|x ) =n S · d X i=1 1 d +n S ∆ D kl (P w y|x i |P w ′ y|x i ) +n T · d X i=1 1 100d D kl (Q w y|x i |Q w ′ y|x i ) ≤n S · d d +n S ∆ c 0 ϵ 2 +n T · 1 100 c 0 ϵ 2 ≤c 0 c 2 1 ·d if c 1 < 1 6 then c 0 c 2 1 < 1 8 and we can apply Proposition 3.6. 7.3.2 Proof of Theorem 3.2 Proof of Theorem 3.2 is similar to that of Theorem 3.1. However, we construct different target and source probability distributions. Let d =d H −N− 1 and pick x −M ,...,x 0 ,x 1 ,...,x d from χ shattered byH. Then we construct a family of distributions (P (1) w ,...,P (N) w ,Q w ) indexed by w∈{−1, 1} d . Let ϵ = c 1 ·ϵ(n S 1 ,...,n S N ,n T ,d H , ∆ 1 ,..., ∆ N ) for some constant c 1 < 1 to be determined later in proof. Furthermore, without loss of generality assume that 1≥ ∆ 1 ≥ ∆ 2 ≥...≥ ∆ N ≥ 0. 147 Distribution Q w : Q w is composed of a marginal and a conditional distribution, namely Q w = Q w x ×Q w y|x . We define the marginal distributions as follows: Q w x (x =x −i ) = ∆ i − ∆ i+1 for i = 1,...,N− 1 and Q w x (x =x −N ) = ∆ N Q w x (x =x 0 ) = 0.99− ∆ 1 Q w x (x =x i ) = 1 100d for i = 1,..,d For the conditional distributions: Q w y|x (y = 1|x =x −i ) = 1 for i = 1,...,N Q w y|x (y = 1|x =x 0 ) = 1 Q w y|x (y = 1|x =x i ) = 1/2 + (w i )ϵ for i = 1,...,d Distribution P (i) w : P (i) w is composed of a marginal and a conditional distribution, namely P (i) w = P w x (i) ×P w y|x (i) . 
We define the marginal distributions as follows: P w x (i) (x =x −j ) = 1 N + 1 1− d d +n S i ∆ i for j = 1,...,N P w x (i) (x =x 0 ) = 1 N + 1 1− d d +n S i ∆ i P w x (i) (x =x j ) = 1 d +n S i ∆ i for j = 1,..,d 148 For the conditional distributions: P w y|x (y = 1|x =x −j ) = 0 if j≥i, otherwise P w y|x (y = 1|x =x −j ) = 1 for j = 1,...,N P w y|x (y = 1|x =x 0 ) = 1 P w y|x (y = 1|x =x j ) = 1/2 + (w j )ϵ for j = 1,..,d Verifying ρ(P (i) w ,Q w )≤ ∆ i : Bayes classifier of the domain generated by P (i) w is as follows: h ∗ S i (x −j ) = 0 if j≥i, otherwise h ∗ S i (x −j ) = 1 for j = 1,...,N h ∗ S i (x 0 ) = 1 h ∗ S i (x j ) = 1 if w j = 1, otherwise h ∗ S i (x j ) = 0 for j = 1,..,d Similarly for the domain generated by Q w , we have h ∗ T (x −j ) = 1 for j = 1,...,N h ∗ T (x 0 ) = 1 h ∗ T (x j ) = 1 if w j = 1 ,otherwise h ∗ T (x j ) = 0 for j = 1,..,d So h ∗ S i and h ∗ T disagree onx −i ,..,x −N which implies that ρ(P (i) w ,Q w ) =Q[h ∗ S i (x T )̸=y T ]−Q[h ∗ T (x T )̸=y T ] = ∆ i 149 With the same argument we used in the proof of Theorem 3.1 we can restrict ourselves to ˜ H which is the projection ofH with the constraint that h(x −N ) = ... = h(x −1 ) = h(x 0 ) = 1 for all h∈ ˜ H. The rest of the proof is exactly the same except the part regarding the KL divergence bound. KL divergence bound: Define P w =P (1) w n S 1 ×...×P (N) w n S N ×Q n T w . D kl (P w |P w ′) = N X j=1 n S j ·D kl (P (j) w |P (j) w ′ ) +n T ·D kl (Q w |Q w ′) = N X j=1 n S j · E P (j) x D kl (P w y|x (j) |P w ′ y|x (j) ) +n T · E Qx D kl (Q w y|x |Q w ′ y|x ) = N X j=1 n S j · d X i=1 1 d +n S j ∆ j D kl (P w y|x i (j) |P w ′ y|x i (j) ) +n T · d X i=1 1 100d D kl (Q w y|x i |Q w ′ y|x i ) ≤ N X j=1 n S j · d d +n S j ∆ j c 0 ϵ 2 +n T · 1 100 c 0 ϵ 2 ≤c 0 c 2 1 ·d for small enough c 1 we can apply Proposition 3.6. 
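The KL-divergence bounds in the proofs of Theorems 3.1 and 3.2 both reduce to the per-point Bernoulli estimate of Lemma 7.6, which has the closed form $\epsilon\log\frac{1+\epsilon}{1-\epsilon}$ (in nats) and can be checked directly. A short Python sketch:

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence D(Ber(p) || Ber(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# D(Ber(1/2 + eps/2) || Ber(1/2 - eps/2)) = eps * log((1+eps)/(1-eps)),
# and Lemma 7.6 asserts this is at most 4 * eps^2 for eps < 1/2.
for eps in np.linspace(1e-4, 0.499, 500):
    p, q = 0.5 + eps / 2, 0.5 - eps / 2
    dkl = kl_bernoulli(p, q)
    assert abs(dkl - eps * np.log((1 + eps) / (1 - eps))) < 1e-12
    assert dkl <= 4 * eps ** 2
```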
7.3.3 Proof of Theorem 3.3 (Extension to Multiclass Classification)

We reduce the multiclass transfer learning problem to the binary one and then deduce Theorem 3.3 from Theorems 3.1 and 3.2. To do so, we follow the idea used in [27], which deduces multiclass PAC learning from binary PAC learning. For completeness, we first summarize the proof idea for multiclass PAC learning used in [27], and then adapt it to the transfer learning setting.

If $\mathcal D$ is a distribution over $\mathcal X\times\mathcal Y$, where $\mathcal X$ and $\mathcal Y$ are the feature and label spaces, and $f:\mathcal X\to\mathcal Y$ is a function from the hypothesis class $\mathcal H$, then we denote the error of $f$ with respect to $\mathcal D$ by

$\mathrm{Err}_{\mathcal D}(f)=\mathrm{Prob}_{(x,y)\sim\mathcal D}\big(f(x)\ne y\big)$

and the best achievable error by

$\mathrm{Err}_{\mathcal D}(\mathcal H):=\inf_{f\in\mathcal H}\mathrm{Err}_{\mathcal D}(f).$

We also define a learning algorithm $A$ for the class $\mathcal H$ as a map

$A:\cup_{n=0}^{\infty}(\mathcal X\times\mathcal Y)^n\to\mathcal Y^{\mathcal X}.$

We then define the sample complexity of the learning algorithm $A$ as the minimal integer $m_{A,\mathcal H}(\epsilon,\delta)$, for every $\epsilon,\delta>0$, such that for every $m\ge m_{A,\mathcal H}(\epsilon,\delta)$ and every distribution $\mathcal D$ on $\mathcal X\times\mathcal Y$ we have

$\mathrm{Prob}_{S_m\sim\mathcal D^m}\big(\mathrm{Err}_{\mathcal D}(A(S_m))>\mathrm{Err}_{\mathcal D}(\mathcal H)+\epsilon\big)\le\delta,$

where $S_m=\{(x_1,y_1),\dots,(x_m,y_m)\}$ is a training sequence consisting of $m$ i.i.d. samples from the distribution $\mathcal D$. Moreover, if $\mathcal D$ is realizable by $\mathcal H$ we denote the sample complexity by $m^r_{A,\mathcal H}(\epsilon,\delta)$; otherwise we write $m^a_{A,\mathcal H}(\epsilon,\delta)$ for the agnostic sample complexity of the algorithm $A$. Furthermore, we define the agnostic and realizable sample complexities of $\mathcal H$ by $m^a_{\mathrm{PAC},\mathcal H}(\epsilon,\delta):=\inf_A m^a_{A,\mathcal H}(\epsilon,\delta)$ and $m^r_{\mathrm{PAC},\mathcal H}(\epsilon,\delta):=\inf_A m^r_{A,\mathcal H}(\epsilon,\delta)$.

Theorem 5 in [27] shows that $m^r_{\mathrm{PAC},\mathcal H}(\epsilon,\delta)\ge C_1\,\frac{d_N(\mathcal H)+\ln\frac1\delta}{\epsilon}$ and $m^a_{\mathrm{PAC},\mathcal H}(\epsilon,\delta)\ge C_2\,\frac{d_N(\mathcal H)+\ln\frac1\delta}{\epsilon^2}$. To prove these lower bounds, the problem is reduced to binary PAC learning. Let $\mathcal H\subseteq\mathcal Y^{\mathcal X}$ be a hypothesis class of Natarajan dimension $d$. We define a binary hypothesis class $\mathcal H_d:=\{0,1\}^{[d]}$. To deduce the lower bounds we just need to show $m^a_{\mathrm{PAC},\mathcal H}\ge m^a_{\mathrm{PAC},\mathcal H_d}$ and $m^r_{\mathrm{PAC},\mathcal H}\ge m^r_{\mathrm{PAC},\mathcal H_d}$.

Let $A$ be a learning algorithm for $\mathcal H$; we construct from it a learning algorithm $\bar A$ for $\mathcal H_d$.
Let the set $U=\{u_1,\dots,u_d\}\subseteq\mathcal X$ be N-shattered by $\mathcal H$ and let $f_0,f_1$ be the corresponding functions in Definition 3.2. We map every sample $(x,y)\in[d]\times\{0,1\}$ to $(u_x,f_y(u_x))\in\mathcal X\times\mathcal Y$, so every distribution $\mathcal D$ on $[d]\times\{0,1\}$ is mapped to a corresponding distribution $\mathcal D'$ on $\mathcal X\times\mathcal Y$. Moreover, if $h_d^*$ is the optimal hypothesis on the domain $[d]\times\{0,1\}$ with respect to the distribution $\mathcal D$, then the optimal hypothesis $h^*$ on the domain $\mathcal X\times\mathcal Y$ with respect to the distribution $\mathcal D'$ satisfies $h^*(u_i)=f_1(u_i)$ if $h_d^*(i)=1$, and $h^*(u_i)=f_0(u_i)$ if $h_d^*(i)=0$. We then define the learning algorithm $\bar A$ for $\mathcal H_d$ as follows. Given a sample set $(x_i,y_i)_{i=1}^m\subseteq[d]\times\{0,1\}$, let the function $g$ be the output of the algorithm $A$ on the sample set $(u_{x_i},f_{y_i}(u_{x_i}))\subseteq\mathcal X\times\mathcal Y$. The output of $\bar A$ is then $f:[d]\to\{0,1\}$ such that $f(i)=1$ iff $g(u_i)=f_1(u_i)$. Hence, if $\mathrm{Prob}_{S_m\sim\mathcal D'^m}\big(\mathrm{Err}_{\mathcal D'}(A(S_m))>\mathrm{Err}_{\mathcal D'}(\mathcal H)+\epsilon\big)\le\delta$, then $\mathrm{Prob}_{S_m\sim\mathcal D^m}\big(\mathrm{Err}_{\mathcal D}(\bar A(S_m))>\mathrm{Err}_{\mathcal D}(\mathcal H_d)+\epsilon\big)\le\delta$. Subsequently, we can conclude that $m^a_{A,\mathcal H}\ge m^a_{\bar A,\mathcal H_d}$ as well as $m^r_{A,\mathcal H}\ge m^r_{\bar A,\mathcal H_d}$, and then $m^a_{\mathrm{PAC},\mathcal H}\ge m^a_{\mathrm{PAC},\mathcal H_d}$ and $m^r_{\mathrm{PAC},\mathcal H}\ge m^r_{\mathrm{PAC},\mathcal H_d}$.

We follow the same idea for the transfer learning problem. Let $A$ be the transfer learning algorithm for $\mathcal H$ on the domain $\mathcal X\times\mathcal Y$, which outputs a function $\hat h$; we construct the learning algorithm $\bar A$ for $\mathcal H_d$ on the domain $[d]\times\{0,1\}$, which outputs a function $\hat h_d$. Again, assume that the set $U=\{u_1,\dots,u_d\}\subseteq\mathcal X$ is N-shattered by $\mathcal H$ with the functions $f_0,f_1$. Given the source samples $(x_i^S,y_i^S)_{i=0}^{n_S}\subseteq[d]\times\{0,1\}$ and the target samples $(x_i^T,y_i^T)_{i=0}^{n_T}\subseteq[d]\times\{0,1\}$, consider the samples mapped into the domain $\mathcal X\times\mathcal Y$, namely $(u_{x_i^S},f_{y_i^S}(u_{x_i^S}))_{i=0}^{n_S}$ and $(u_{x_i^T},f_{y_i^T}(u_{x_i^T}))_{i=0}^{n_T}$. We denote the source and target distributions on the domain $[d]\times\{0,1\}$ by $P$ and $Q$, and the corresponding source and target distributions on the domain $\mathcal X\times\mathcal Y$ by $P'$ and $Q'$.
Note that under the described map, if $\rho(P,Q)\le\Delta$, then $\rho(P',Q')\le\Delta$. Now we define the algorithm $\bar A$ as follows. Let the function $\hat h$ be the output of the learning algorithm $A$ on the samples $(u_{x_i^S},f_{y_i^S}(u_{x_i^S}))_{i=0}^{n_S}$ and $(u_{x_i^T},f_{y_i^T}(u_{x_i^T}))_{i=0}^{n_T}$. Then the output of $\bar A$ is $\hat h_d:[d]\to\{0,1\}$ such that $\hat h_d(i)=1$ iff $\hat h(u_i)=f_1(u_i)$. Our goal is to show that there exist source and target distributions $\mathcal D_S,\mathcal D_T$ with $\rho(\mathcal D_S,\mathcal D_T)\le\Delta$ over the domain $\mathcal X\times\mathcal Y$ such that the output $\hat h$ of any learning algorithm satisfies

$\mathrm{Prob}_{\mathcal D_S^{n_S},\mathcal D_T^{n_T}}\bigg(\mathcal E_T(\hat h)>c\cdot\sqrt{\frac{1}{\frac{n_T}{d_N(\mathcal H)}+\frac{n_S}{d_N(\mathcal H)+n_S\Delta}}}\,\bigg)\ge\frac{3-2\sqrt2}{8}.$

By contradiction, suppose that the learning algorithm $A$ outputs a function $\hat h$ which satisfies

$\mathrm{Prob}_{\mathcal D_S^{n_S},\mathcal D_T^{n_T}}\bigg(\mathcal E_T(\hat h)>c\cdot\sqrt{\frac{1}{\frac{n_T}{d_N(\mathcal H)}+\frac{n_S}{d_N(\mathcal H)+n_S\Delta}}}\,\bigg)<\frac{3-2\sqrt2}{8}$

for all distributions $\mathcal D_S,\mathcal D_T$ with $\rho(\mathcal D_S,\mathcal D_T)\le\Delta$ over the domain $\mathcal X\times\mathcal Y$. Then in the corresponding binary domain, the function $\hat h_d$, which is the output of the algorithm $\bar A$, satisfies

$\mathrm{Prob}_{P^{n_S},Q^{n_T}}\bigg(\mathcal E_T(\hat h_d)>c\cdot\sqrt{\frac{1}{\frac{n_T}{d}+\frac{n_S}{d+n_S\Delta}}}\,\bigg)<\frac{3-2\sqrt2}{8}$

for all distributions $P,Q$ with $\rho(P,Q)\le\Delta$ over the domain $[d]\times\{0,1\}$. However, by Theorem 3.1 we know this is false. Hence, we can conclude Theorem 3.3. The multiple-sources scenario follows the same idea.

7.3.4 Additional Experimental Results

In Section 3.5 we fixed the number of source samples and varied the number of target samples. Here, in order to investigate the effect of source samples on the target generalization error, we fix the number of target samples at $n_T=3$ and vary the number of source samples. Fig. 7.3 depicts the theoretical lower bounds along with the upper bounds obtained by empirical risk minimization for image classification. We use the same source/target pairs as in Section 3.5.2. Fig. 7.3 demonstrates that Source1 is more helpful in reducing the target generalization error because it has a low distance from the target.
Furthermore, it shows that increasing the number of source samples is useful up to a point and beyond that point the error saturates and does not decrease further as discussed in Remark 3.10. 153 Figure 7.3: Depicts the lower bounds along with the upper bounds obtained via weighted empirical risk minimization. In this setting the number of target samples is fixed at n T = 3 7.4 Appendix for Chapter 5 7.4.1 The MDS property of U bottom Lemma 7.7. The matrix U bottom is an MDS matrix. Proof. First, let V ∈ T×N be V i,j = Y ℓ∈[T ]\{i} α j −β ℓ+K β i+K −β ℓ+K . It follows from the resiliency property of LCC that by having ( ˜ X 1 ,..., ˜ X N ) = (X 1 ,...,X T )·V, the master can obtain the values of X 1 ,...,X T from any T of the ˜ X i ’s. This is one of the alternative definitions for an MDS code, and hence, V is an MDS matrix. 154 To show that U bottom is an MDS matrix, it is shown that U bottom can be obtained from V by multiplying rows and columns by nonzero scalars. Let [T : K]≜{T + 1,T + 2,...,T +K}, and notice that for (s,r)∈ [T ]× [N], entry (s,r) of U bottom can be written as Y t∈[K+T ]\{s+K} α r −β t β s+K −β t = Y t∈[K] α r −β t β s+K −β t · Y t∈[K:T ]\{s+K} α r −β t β s+K −β t . Hence, U bottom can be written as U bottom = diag Y t∈[K] 1 β s+K −β t s∈[T ] ·V· diag Y t∈[K] (α r −β t ) r∈[N] , (7.25) where V is a T×N matrix such that V i,j = Y t∈[T ]\{i} α j −β t+K β i+K −β t+K . Since{β t } K t=1 ∩{α r } N r=1 =∅, and since all the β i ’s are distinct, it follows from (7.25) that U bottom can be obtained from V by multiplying each row and each column by a nonzero element, and hence U bottom is an MDS matrix as well. 7.4.2 The Uncoded Version of LCC In Section 5.4.2, we have described the LCC scheme, which provides anS-resilient,A-secure, andT- private scheme as long as (K +T− 1) degf +S + 2A + 1≤N. 
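To make the resiliency mechanism concrete, the following Python sketch implements a toy, straggler-only instance of Lagrange coded computing over the reals (so $A=T=0$; the actual scheme operates over a finite field). The function $f$, the parameters $K$ and $N$, and the evaluation points are illustrative choices.

```python
import numpy as np
from itertools import combinations

# Encode K inputs with the degree-(K-1) Lagrange interpolant u (u(beta_i) = X_i);
# worker j stores u(alpha_j) and returns f(u(alpha_j)). Since f(u(z)) has degree
# (K-1)*deg(f), any (K-1)*deg(f)+1 surviving results determine it, so the master
# tolerates S = N - (K-1)*deg(f) - 1 stragglers.
f = lambda x: x ** 2                      # deg f = 2
K, N = 2, 5
S = N - (K - 1) * 2 - 1                   # = 2 stragglers tolerable
X = np.array([1.0, -2.0])                 # the K inputs
betas = np.array([0.0, 1.0])
alphas = np.array([2.0, 3.0, 4.0, 5.0, 6.0])

u = np.polynomial.Polynomial.fit(betas, X, deg=K - 1)   # Lagrange interpolant
results = f(u(alphas))                    # what the N workers send back

# Decoding: from any N - S workers, interpolate f(u(.)) and evaluate at the betas.
for alive in combinations(range(N), N - S):
    fu = np.polynomial.Polynomial.fit(alphas[list(alive)], results[list(alive)],
                                      deg=(K - 1) * 2)
    assert np.allclose(fu(betas), f(X), atol=1e-8)
```

Note the match with the recovery threshold $(K-1)\deg f+1$ quoted above: with $K=2$, $\deg f=2$, and $N=5$, any $3$ of the $5$ worker results suffice.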
Instead of explicitly following the same construction, a variation of LCC can be obtained by instead selecting the values of the $\alpha_i$'s from the set $\{\beta_j\}_{j\in[K]}$ (not necessarily distinctly). We refer to this approach as the uncoded version of LCC, which essentially recovers the uncoded repetition scheme that simply replicates each $X_i$ onto multiple workers. By replicating every $X_i$ between $\lfloor N/K\rfloor$ and $\lceil N/K\rceil$ times, it can tolerate at most $S$ stragglers and $A$ adversaries whenever

$S+2A\le\lfloor N/K\rfloor-1,$ (7.26)

which achieves the optimum resiliency and security when the number of workers is small (specifically, $N<K\deg f-1$; see Theorems 5.2a and 5.2b). However, uncoded repetition does not support the privacy requirement.

When privacy is taken into account (i.e., $T>0$), an alternative approach in place of repetition is to instead store each input variable using Shamir's secret sharing scheme [97] over $\lfloor N/K\rfloor$ to $\lceil N/K\rceil$ machines. This approach achieves any $(S,A,T)$ tuple whenever $N\ge K(S+2A+\deg f\cdot T+1)$. However, it does not improve upon LCC.

7.4.3 Proof of Theorem 5.2a

In this appendix, we prove that LCC achieves optimum resiliency for general polynomial functions. The proof consists of two steps. In Step 1, we prove the converse for the special case where $f$ is a multilinear function. In Step 2, we generalize this result to arbitrary polynomial functions by proving that for any function $f$ that admits $S$-resilient designs, there exists a multilinear function of the same degree for which a computation scheme can be found that also tolerates $S$ stragglers.

For Step 1, we consider the computing scenario where $f$ is a non-zero multilinear function defined on $\mathbb V$ with degree $d$. To simplify the proof for this scenario, we prove a slightly stronger version of the converse: Lagrange Coded Computing is optimal even if we consider computing designs with arbitrary (possibly non-linear) decoding functions. We formally state this as the following lemma.

Lemma 7.8.
For any multilinear function $f$ of degree $d\in\mathbb{N}^+$, any linear encoding scheme can tolerate at most $S=N-(K-1)d-1$ stragglers when $N\ge Kd-1$, and $S=\lfloor N/K\rfloor-1$ stragglers when $N<Kd-1$.

Lemma 7.8 is proved in Appendix 7.4.6. The main proof idea is to show that for any computing strategy that tries to tolerate more stragglers than specified, there would be scenarios where all available computing results are degenerate (i.e., constants) while the computing results needed by the master are variable, violating the decodability requirement.

Now in Step 2, we prove the matching converse for any polynomial function. Specifically, given any function $f$ with degree $d$, we provide an explicit construction of a multilinear function, denoted by $f'$, that enables tolerating at least the same number of stragglers. The construction satisfies certain properties that ensure this fact; both the construction and the properties are formally stated in the following lemma (which is proved in Appendix 7.4.7).

Lemma 7.9. Given any function $f$ of degree $d$, let $f'$ be the map from $\mathbb{V}^d$ to $\mathbb{U}$ defined by $f'(Z_1,\dots,Z_d)=\sum_{S\subseteq[d]}(-1)^{|S|}f\big(\sum_{j\in S}Z_j\big)$ for any $\{Z_j\}_{j\in[d]}\in\mathbb{V}^d$. Then $f'$ is multilinear with respect to the $d$ inputs. Moreover, if the characteristic of the base field $\mathbb{F}$ is $0$ or greater than $d$, then $f'$ is non-zero.

Assuming the correctness of Lemma 7.9, it suffices to prove that $f'$ enables computation designs that tolerate at least the same number of stragglers as $f$ does. We prove this fact by constructing such computing schemes for $f'$ given any design for $f$. Note that $f'$ is defined as a linear combination of functions $f\big(\sum_{j\in S}Z_j\big)$, each of which is a composition of a linear map and $f$. Given the linearity of the encoding design, any computation scheme of $f$ can be directly applied to any of these functions, tolerating the same number of stragglers. Since the decoding functions are linear, the same scheme also applies to linear combinations of them, which includes $f'$.
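The construction in Lemma 7.9 can be exercised on a toy example. The sketch below makes the illustrative choice $\mathbb{V}=\mathbb{U}=\mathbb{R}$ (the lemma is stated over general vector spaces) and verifies both the multilinearity of $f'$ and the $(-1)^d d!$ scaling of the leading term used in Appendix 7.4.7.

```python
from itertools import combinations

def multilinearize(f, d):
    """f'(Z_1,...,Z_d) = sum over subsets S of [d] of (-1)^{|S|} f(sum_{j in S} Z_j)."""
    def f_prime(*Z):
        total = 0
        for r in range(d + 1):
            for S in combinations(range(d), r):
                total += (-1) ** len(S) * f(sum(Z[j] for j in S))
        return total
    return f_prime

f = lambda x: x ** 2 + 3 * x + 7              # degree-2 polynomial on V = R
fp = multilinearize(f, 2)

assert fp(3, 5) == 2 * 3 * 5                  # only the leading term survives: 2*z1*z2
assert fp(2 + 4, 5) == fp(2, 5) + fp(4, 5)    # linear in the first input
assert fp(2, 1 + 6) == fp(2, 1) + fp(2, 6)    # ...and in the second

g = lambda x: x ** 3                          # degree-3 example
gp = multilinearize(g, 3)
z = 2
assert gp(z, z, z) == (-1) ** 3 * 6 * z ** 3  # (-1)^d * d! times the leading term
```

Note how the constant and linear parts of $f$ cancel completely, which is exactly the "only degree-$d$ terms contribute" phenomenon formalized in (7.31).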
Hence, the optimum resiliency achievable for $f$ can also be achieved by $f'$. Combining the above and Lemma 7.8, Theorem 5.2a follows immediately.

7.4.4 Proof of Theorem 5.2b

In this appendix, we prove that LCC provides security against the maximum possible number of adversaries. By comparing Lemma 7.8 and Theorem 5.2b, we essentially need to show that the maximum possible number of adversaries that any linear encoding scheme can tolerate is no greater than half of the corresponding number of stragglers. This converse can be proved by connecting the straggler mitigation problem and the adversary tolerance problem using the extended concept of Hamming distance for coded computing defined in [118]. Specifically, given any (possibly random) encoding scheme, its Hamming distance is defined as the minimum integer $d$ such that for any two instances of the input $X$ whose outputs $Y$ are different, and for any two possible realizations of the $N$ encoding functions, the encoded versions of these two inputs, using the two lists of encoding functions respectively, differ on at least $d$ workers. It was shown in [118] that this Hamming distance behaves similarly to its classical counterpart: an encoding scheme is $S$-resilient and $A$-secure whenever $S+2A\le d-1$. Hence, any encoding scheme that is $A$-secure has a Hamming distance of at least $2A+1$, and consequently it can tolerate $2A$ stragglers. Combining the above and Lemma 7.8, we have completed the proof of Theorem 5.2b.

7.4.5 Proof of Theorem 5.2c

In this appendix we prove that LCC uses the minimum possible randomness among all linear schemes that achieve the optimum tradeoff stated in Theorem 5.1 for linear $f$. The proof is taken almost verbatim from [45], Chapter 3.
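The relation $S+2A\le d-1$ mirrors the classical erasures-versus-errors tradeoff and can be illustrated with the uncoded repetition scheme of Appendix 7.4.2, where decoding each input is a majority vote over its responsive replicas. The simulation below is a toy sketch; the worker assignment, attack pattern, and data values are illustrative choices, not part of the original proof.

```python
import random

def simulate_repetition(N, K, S, A, seed=0):
    """Replicate K inputs round-robin over N workers; concentrate all S stragglers
    and A adversaries on the smallest group (the worst case for repetition), then
    decode each input by majority vote over its responsive replicas."""
    rng = random.Random(seed)
    groups = [[w for w in range(N) if w % K == k] for k in range(K)]
    truth = [rng.randrange(100) for _ in range(K)]
    attacked = min(range(K), key=lambda k: len(groups[k]))
    stragglers = set(groups[attacked][:S])
    adversaries = set(groups[attacked][S:S + A])
    for k, g in enumerate(groups):
        votes = [truth[k] + 1 if w in adversaries else truth[k]
                 for w in g if w not in stragglers]
        if not votes or max(set(votes), key=votes.count) != truth[k]:
            return False
    return True

# N=12, K=3: each input sits on floor(N/K)=4 workers, so (7.26) allows S+2A <= 3
assert simulate_repetition(12, 3, S=1, A=1) is True    # S+2A = 3: decodes
assert simulate_repetition(12, 3, S=1, A=2) is False   # S+2A = 5: majority flips
```

The failing case shows the factor of two concretely: one adversary "costs" as much as two stragglers, since it both removes an honest vote and adds a corrupt one.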
In what follows, an $(n,k,r,z)_{\mathbb{F}_q^t}$ secure RAID scheme is a storage scheme over $\mathbb{F}_q^t$ (where $\mathbb{F}_q$ is a field with $q$ elements) in which $k$ message symbols are coded into $n$ storage servers such that the $k$ message symbols are reconstructible from any $n-r$ servers, and any $z$ servers are information-theoretically oblivious to the message symbols. Further, such a scheme is assumed to use $v$ random entries as keys and, by [45], Proposition 3.1.1, must satisfy $n-r\ge k+z$.

Theorem 7.1 ([45], Theorem 3.2.1). A linear rate-optimal $(n,k,r,z)_{\mathbb{F}_q^t}$ secure RAID scheme uses at least $zt$ keys over $\mathbb{F}_q$ (i.e., $v\ge z$).

Clearly, in our scenario $\mathbb{V}$ can be seen as $\mathbb{F}_q^{\dim\mathbb{V}}$ for some $q$. Further, by setting $N=n$, $T=z$, and $t=\dim\mathbb{V}$, it follows from Theorem 7.1 that any encoding scheme which guarantees information-theoretic privacy against sets of $T$ colluding workers must use at least $T$ random entries $\{Z_i\}_{i\in[T]}$.

7.4.6 Proof of Lemma 7.8

We start by defining the following notation. For any multilinear function $f$ defined on $\mathbb{V}$ with degree $d$, let $X_{i,1},X_{i,2},\dots,X_{i,d}$ denote its $d$ input entries (i.e., $X_i=(X_{i,1},X_{i,2},\dots,X_{i,d})$ and $f$ is linear with respect to each entry). Let $\mathbb{V}_1,\dots,\mathbb{V}_d$ be the vector spaces that contain the values of the entries.

Without loss of generality, we assume both the encoding and decoding functions are deterministic in this proof, as the randomness does not help with decodability. Similar to [118], we define the recovery threshold, denoted by $K^*_f(K,N)$, as the minimum number of workers that the master has to wait for in order to guarantee decodability. Then we essentially need to prove that $K^*_f(K,N)\ge(K-1)d+1$ when $N\ge Kd-1$, and $K^*_f(K,N)\ge N-\lfloor N/K\rfloor+1$ when $N<Kd-1$.

Obviously $K^*_f(K,N)$ is a non-decreasing function of $N$. Hence, it suffices to prove that $K^*_f(K,N)\ge N-\lfloor N/K\rfloor+1$ when $N\le Kd-1$. We prove this converse bound by induction.

(a) If $d=1$, then $f$ is a linear function, and we aim to prove $K^*_f(K,N)\ge N+1$ for $N\le K-1$.
This essentially means that no valid computing scheme can be found when $N<K$. Assuming the opposite, suppose we can find a valid computation design using at most $K-1$ workers; then there is a decoding function that computes all the $f(X_i)$'s given the results from these workers. Because the encoding functions are linear, we can find a non-zero vector $(a_1,\dots,a_K)\in\mathbb{F}^K$ such that when $X_i=a_iV$ for any $V\in\mathbb{V}$, the coded variable $\tilde{X}_i$ stored by any worker equals $0$. This leads to a fixed output from the decoder. On the other hand, because $f$ is assumed to be non-zero, the collection of computing results $\{f(X_i)\}_{i\in[K]}$ varies with the value of $V$, which leads to a contradiction. Hence, we have proved the converse bound for $d=1$.

(b) Suppose we have a matching converse for any multilinear function with $d=d_0$. We now prove the lower bound for any multilinear function $f$ of degree $d_0+1$. Similar to part (a), it is easy to prove that $K^*_f(K,N)\ge N+1$ for $N\le K-1$; hence, we focus on $N\ge K$. The proof idea is to construct a multilinear function $f'$ of degree $d_0$ based on the function $f$, and to lower bound the minimum recovery threshold of $f$ using that of $f'$. More specifically, this is done by showing that given any computation design for the function $f$, a computation design can also be developed for the corresponding $f'$ which achieves a recovery threshold related to that of the scheme for $f$.

In particular, for any non-zero function $f(X_{i,1},X_{i,2},\dots,X_{i,d_0+1})$, we can find $V\in\mathbb{V}_{d_0+1}$ such that $f(X_{i,1},X_{i,2},\dots,X_{i,d_0},V)$, as a function of $(X_{i,1},X_{i,2},\dots,X_{i,d_0})$, is non-zero. We define $f'(X_{i,1},X_{i,2},\dots,X_{i,d_0})=f(X_{i,1},X_{i,2},\dots,X_{i,d_0},V)$, which is a multilinear function of degree $d_0$. Given parameters $K$ and $N$, we now develop a computation strategy for $f'$ for a dataset of $K$ inputs and a cluster of $N'\triangleq N-K$ workers, which achieves a recovery threshold of $K^*_f(K,N)-(K-1)$.
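The null-space argument in part (a) can be made concrete with a tiny example: with $N<K$ workers and linear encoding, some nonzero vector $a$ annihilates every encoding function, so setting $X_i=a_iV$ makes every worker's stored value (and hence every computed result) constant in $V$. The encoding matrix below is an illustrative choice, not taken from the original proof.

```python
K, N = 3, 2
G = [[1, 0],
     [0, 1],
     [1, 1]]                      # K x N linear encoding: worker j stores sum_i X_i*G[i][j]
a = [1, 1, -1]                    # nonzero left null vector of G
assert all(sum(a[i] * G[i][j] for i in range(K)) == 0 for j in range(N))

for V in [2.5, -7.0, 100.0]:      # whatever value the input takes...
    X = [ai * V for ai in a]      # ...choose X_i = a_i * V
    coded = [sum(X[i] * G[i][j] for i in range(K)) for j in range(N)]
    assert coded == [0.0, 0.0]    # every worker stores 0, independently of V
    # yet for a nonzero linear f, e.g. f(x) = x, the needed results f(X_i) = a_i*V
    # vary with V, so no decoder can recover them from the constant worker outputs
```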
We construct this computation strategy based on an encoding strategy of $f$ that achieves the recovery threshold $K^*_f(K,N)$. Because the encoding functions are linear, we consider the encoding matrix, denoted by $G\in\mathbb{F}^{K\times N}$ and defined as the coefficients of the encoding functions $\tilde{X}_i=\sum_{j=1}^K X_jG_{ji}$. Following the same arguments we used in the $d=1$ case, the left null space of $G$ must be $\{0\}$. Consequently, the rank of $G$ equals $K$, and we can find a subset $\mathcal{K}$ of $K$ workers such that the corresponding columns of $G$ form a basis of $\mathbb{F}^K$. We construct a computation scheme for $f'$ with $N'\triangleq N-K$ workers, each of whom stores the coded version of $(X_{i,1},X_{i,2},\dots,X_{i,d_0})$ that is stored by a unique respective worker in $[N]\setminus\mathcal{K}$ in the computation scheme of $f$.

Now it suffices to prove that the above construction achieves a recovery threshold of $K^*_f(K,N)-(K-1)$. Equivalently, we need to prove that given any subset $\mathcal{S}$ of $[N]\setminus\mathcal{K}$ of size $K^*_f(K,N)-(K-1)$, the values of $f(X_{i,1},X_{i,2},\dots,X_{i,d_0},V)$ for $i\in[K]$ are decodable from the computing results of the workers in $\mathcal{S}$.

We now exploit the decodability of the computation design for the function $f$. For any $j\in\mathcal{K}$, the set $\mathcal{S}\cup\mathcal{K}\setminus\{j\}$ has size $K^*_f(K,N)$. Consequently, for any vector $a=(a_1,\dots,a_K)\in\mathbb{F}^K$, by letting $X_{i,d_0+1}=a_iV$, we have that $\{a_if(X_{i,1},X_{i,2},\dots,X_{i,d_0},V)\}_{i\in[K]}$ is decodable given the computing results from the workers in $\mathcal{S}\cup\mathcal{K}\setminus\{j\}$. Moreover, for any $j\in\mathcal{K}$, let $a^{(j)}\in\mathbb{F}^K$ be a non-zero vector that is orthogonal to all columns of $G$ with indices in $\mathcal{K}\setminus\{j\}$; then the workers in $\mathcal{K}\setminus\{j\}$ store $0$ for the $X_{i,d_0+1}$ entry and return the constant $0$ due to the multilinearity of $f$. Consequently, every $\{a^{(j)}_if(X_{i,1},X_{i,2},\dots,X_{i,d_0},V)\}_{i\in[K]}$ is decodable from the computing results of the workers in $\mathcal{S}$. Because the columns of $G$ with indices in $\mathcal{K}$ form a basis of $\mathbb{F}^K$, the vectors $a^{(j)}$ for $j\in\mathcal{K}$ also form a basis.
Consequently, $f'(X_{i,1},X_{i,2},\dots,X_{i,d_0})$, which equals $f(X_{i,1},X_{i,2},\dots,X_{i,d_0},V)$, is also decodable given the results from the workers in $\mathcal{S}$ for any $i\in[K]$. On the other hand, note that the computing results of each worker in $\mathcal{S}$ given each $a^{(j)}$ can also be computed using the results from the same workers when computing $f'$. Hence, the decoder for the function $f'$ can first recover the computing results of the workers in $\mathcal{S}$ for the function $f$, and then proceed to decoding the final result. Thus we have completed the proof of decodability.

To summarize, we have essentially proved that $K^*_f(K,N)-(K-1)\ge K^*_{f'}(K,N-K)$ whenever a valid scheme computing $f$ can be found given $N$ and $K$ (i.e., $K^*_f(K,N)\le N$). One can verify that the converse bound $K^*_f(K,N)\ge N-\lfloor N/K\rfloor+1$ under the condition $N\le Kd-1$ can be derived from the above result and the induction assumption, for any function $f$ of degree $d_0+1$.

(c) Thus, a matching converse holds for any $d\in\mathbb{N}^+$, which proves Lemma 7.8.

7.4.7 Proof of Lemma 7.9

We first prove that $f'$ is multilinear with respect to the $d$ inputs. Recall that by definition $f$ is a linear combination of monomials, and $f'$ is constructed from $f$ through a linear operation. By exploiting the commutativity of these two linear relations, we only need to show that each monomial in $f$ is individually transformed into a multilinear function. More specifically, let $f$ be the sum of monomials $h_k\triangleq U_k\cdot\prod_{\ell=1}^{d_k}h_{k,\ell}(\cdot)$, where $k$ belongs to a finite set, $U_k\in\mathbb{U}$, $d_k\in\{0,1,\dots,d\}$, and each $h_{k,\ell}$ is a linear map from $\mathbb{V}$ to $\mathbb{F}$. Let $h'_k$ denote the contribution of $h_k$ to $f'$; then for any $Z=(Z_1,\dots,Z_d)\in\mathbb{V}^d$ we have
$$h'_k(Z)=\sum_{S\subseteq[d]}(-1)^{|S|}h_k\Big(\sum_{j\in S}Z_j\Big)=\sum_{S\subseteq[d]}(-1)^{|S|}U_k\cdot\prod_{\ell=1}^{d_k}h_{k,\ell}\Big(\sum_{j\in S}Z_j\Big).$$
(7.27)

By utilizing the linearity of each $h_{k,\ell}$, we can write $h'_k$ as
$$h'_k(Z)=U_k\cdot\sum_{S\subseteq[d]}(-1)^{|S|}\prod_{\ell=1}^{d_k}\sum_{j\in S}h_{k,\ell}(Z_j)=U_k\cdot\sum_{S\subseteq[d]}(-1)^{|S|}\prod_{\ell=1}^{d_k}\sum_{j=1}^{d}\mathbb{1}(j\in S)\cdot h_{k,\ell}(Z_j).\qquad(7.28)$$
Then, by viewing each subset $S$ of $[d]$ as a map from $[d]$ to $\{0,1\}$, we have
$$h'_k(Z)=U_k\sum_{s\in\{0,1\}^d}\Big(\prod_{m=1}^{d}(-1)^{s_m}\Big)\cdot\prod_{\ell=1}^{d_k}\sum_{j=1}^{d}s_j\cdot h_{k,\ell}(Z_j)=U_k\sum_{\boldsymbol{j}\in[d]^{d_k}}\sum_{s\in\{0,1\}^d}\Big(\prod_{m=1}^{d}(-1)^{s_m}\Big)\cdot\prod_{\ell=1}^{d_k}\big(s_{j_\ell}\cdot h_{k,\ell}(Z_{j_\ell})\big).\qquad(7.29)$$
Note that the product $\prod_{\ell=1}^{d_k}s_{j_\ell}$ can alternatively be written as $\prod_{m=1}^{d}s_m^{\#(m\text{ in }\boldsymbol{j})}$, where $\#(m\text{ in }\boldsymbol{j})$ denotes the number of elements of $\boldsymbol{j}$ that equal $m$. Hence (with the convention $0^0=1$),
$$h'_k(Z)=U_k\cdot\sum_{\boldsymbol{j}\in[d]^{d_k}}\Big(\sum_{s\in\{0,1\}^d}\prod_{m=1}^{d}(-1)^{s_m}s_m^{\#(m\text{ in }\boldsymbol{j})}\Big)\cdot\prod_{\ell=1}^{d_k}h_{k,\ell}(Z_{j_\ell})=U_k\cdot\sum_{\boldsymbol{j}\in[d]^{d_k}}\prod_{m=1}^{d}\Big(\sum_{s\in\{0,1\}}(-1)^{s}s^{\#(m\text{ in }\boldsymbol{j})}\Big)\cdot\prod_{\ell=1}^{d_k}h_{k,\ell}(Z_{j_\ell}).\qquad(7.30)$$
The sum $\sum_{s\in\{0,1\}}(-1)^{s}s^{\#(m\text{ in }\boldsymbol{j})}$ is non-zero only if $m$ appears in $\boldsymbol{j}$. Consequently, among all terms that appear in (7.30), only the ones with degree $d_k=d$ and distinct elements in $\boldsymbol{j}$ have a non-zero contribution. More specifically,
$$h'_k(Z)=(-1)^d\cdot\mathbb{1}(d_k=d)\cdot U_k\cdot\sum_{g\in S_d}\prod_{j=1}^{d}h_{k,g(j)}(Z_j),\qquad(7.31)$$
where $S_d$ denotes the symmetric group of degree $d$. Recall that $f'$ is a linear combination of the $h'_k$'s; consequently, it is a multilinear function.

Now we prove that $f'$ is non-zero. From equation (7.31), we can show that when all the elements $Z_j$ are identical, $f'(Z)$ equals the evaluation of the highest-degree terms of $f$ at $Z_j$, multiplied by the constant $(-1)^d d!$. Given that the highest-degree terms cannot be zero, and that $(-1)^d d!$ is non-zero as long as the characteristic of the field is greater than $d$, we have proved that $f'$ is non-zero.

7.5 Appendix for Chapter 6

7.5.1 Convergence Analysis for Fitting ReLUs via SGD (Proof of Theorem 6.1)

To show
$$\mathbb{E}_{w_t}\big[\|w_t-w^*\|_2^2\big]\le\rho\,\|w_{t-1}-w^*\|_2^2,\qquad(7.32)$$
we begin with a useful definition.
In particular, we denote the mini-batch empirical loss by $L_{I_m(t)}(w)\triangleq\frac{1}{m}\sum_{j=1}^m\ell_{i_t^{(j)}}(w)$, where $I_m(t)=\{i_t^{(1)},i_t^{(2)},\dots,i_t^{(m)}\}$ is the set of indices chosen at iteration $t$. Therefore, we can rewrite the mini-batch SGD update as $w_t-w^*=w_{t-1}-w^*-\eta\nabla L_{I_m(t)}(w_{t-1})$.

To upper bound $\mathbb{E}\|w_t-w^*\|_2^2$ in terms of $\mathbb{E}\|w_{t-1}-w^*\|_2^2$, we expand $\mathbb{E}\|w_t-w^*\|_2^2$ in the form
$$\mathbb{E}_{I_m(t)}\|w_t-w^*\|_2^2=\|w_{t-1}-w^*\|_2^2+\mathbb{E}_{I_m(t)}\Big[-2\eta\big\langle w_{t-1}-w^*,\nabla L_{I_m(t)}(w_{t-1})\big\rangle+\eta^2\|\nabla L_{I_m(t)}(w_{t-1})\|_2^2\Big].\qquad(7.33)$$
To draw the desired conclusion of the theorem (i.e., (7.32)), it suffices to upper bound the second expectation, which consists of two terms. For the first term, we apply the expectation to the inner product to conclude that
$$\mathbb{E}_{I_m(t)}\big[\big\langle w_{t-1}-w^*,\nabla L_{I_m(t)}(w_{t-1})\big\rangle\big]=\big\langle w_{t-1}-w^*,\nabla L(w_{t-1})\big\rangle.\qquad(7.34)$$
To lower bound this inner product we utilize the following lemma, proved in [99].

Lemma 7.10. For the loss function defined in (6.1), we have
$$\big\langle u,\,w-w^*-\nabla L(w)\big\rangle\le 2\left(\delta+\sqrt{1+\delta}\,\Big(1+2(1-\epsilon)^2\Big)\Big(\delta+\sqrt{\tfrac{21}{20}}\,\epsilon\Big)\right)\cdot\|w-w^*\|_2,$$
holding for all $u\in\mathcal{B}^d$ (the unit ball) and all $w\in E(\epsilon)=\big\{w\in\mathbb{R}^d:\|w-w^*\|_2\le\epsilon\|w^*\|_2\big\}$ with probability at least $1-16e^{-\gamma\delta^2 n}-(n+10)e^{-\gamma n}$.

Specifically, for $\delta=10^{-4}$ and $\epsilon=7/200$ we obtain
$$\big\langle u,\,w_{t-1}-w^*-\nabla L(w_{t-1})\big\rangle\le\frac{1}{4}\|w_{t-1}-w^*\|_2.\qquad(7.35)$$
In [99], $u$ needs to belong to the intersection of the unit ball and the descent cone of the regularizer. Since here we do not have any regularization, this intersection reduces to the unit ball. Using Lemma 7.10 we have $\langle u,w_{t-1}-w^*-\nabla L(w_{t-1})\rangle\le\frac{1}{4}\|w_{t-1}-w^*\|_2$ for all $u$ with $\|u\|_2\le 1$. By choosing $u=\frac{w_{t-1}-w^*}{\|w_{t-1}-w^*\|_2}$ we can conclude that
$$\big\langle w_{t-1}-w^*,\nabla L(w_{t-1})\big\rangle\ge\frac{3}{4}\|w_{t-1}-w^*\|_2^2\qquad(7.36)$$
holds with probability at least $1-16e^{-10^{-8}\gamma n}-(n+10)e^{-\gamma n}$. Thus, by combining (7.33), (7.34), and (7.36) we can conclude that
$$\mathbb{E}_{I_m(t)}\|w_t-w^*\|_2^2\le\Big(1-\frac{3\eta}{2}\Big)\|w_{t-1}-w^*\|_2^2+\eta^2\,\mathbb{E}_{I_m(t)}\|\nabla L_{I_m(t)}(w_{t-1})\|_2^2.$$
(7.37)

Next, in order to upper bound the second term on the right-hand side of (7.37), we use the following lemma, proved in Section 7.5.1.1.

Lemma 7.11. The inequality
$$\mathbb{E}_{I_m(t)}\|\nabla L_{I_m(t)}(w_{t-1})\|_2^2\le\Big(\frac{36d}{m}+\frac{25}{16}\Big)\|w_{t-1}-w^*\|_2^2\qquad(7.38)$$
holds with probability at least $1-2(n+1)e^{-\gamma d}-(n+25)e^{-\gamma n}$.

Hence, by (7.37) and (7.38), with $\eta^*=\frac{3}{4\left(\frac{36d}{m}+\frac{25}{16}\right)}$ we arrive at
$$\mathbb{E}_{I_m(t)}\|w_t-w^*\|_2^2\le\Big(1-\frac{3}{2}\eta+\Big(\frac{36d}{m}+\frac{25}{16}\Big)\eta^2\Big)\|w_{t-1}-w^*\|_2^2\le\Big(1-\frac{9}{16\big(\frac{36d}{m}+\frac{25}{16}\big)}\Big)\mathbb{E}\|w_{t-1}-w^*\|_2^2=\rho\cdot\mathbb{E}\|w_{t-1}-w^*\|_2^2.\qquad(7.39)$$
This holds with probability at least $1-2(n+1)e^{-\gamma d}-2(n+25)e^{-\gamma n}$.

We note, however, that the proof is not yet complete: we have not shown that all subsequent iterates lie in the local neighborhood dictated by Lemma 7.10, only that on average they belong to this neighborhood. To overcome this challenge we utilize some techniques from the theory of stochastic processes to obtain a conditional convergence result. Our argument heavily borrows from [101], with some parts directly adapted.

Conditional linear convergence. In order to make our notation compatible with the notation typically used in stochastic process theory, we denote by $W_k$ the random vector of the estimated solution at iteration $k$ and by $w_k$ a realization of that random vector. Using these conventions, the result we have proven so far can be rewritten as
$$\mathbb{E}\big[\|W_{k+1}-w^*\|_2^2\,\big|\,W_k=w_k\big]\le\rho\,\|w_k-w^*\|_2^2.$$
Let $\mathcal{F}_k$ denote the $\sigma$-algebra generated by the indices chosen in steps $1$ to $k$. Also let $B\subset\mathbb{R}^d$ be the region consisting of all points in $E(\epsilon)=\big\{w\in\mathbb{R}^d:\|w-w^*\|_2\le\epsilon\|w^*\|_2\big\}$, where $\epsilon=\frac{7}{200}$. Finally, assume a fixed initial estimate obeying $w_0\in B$ and $\|w_0-w^*\|_2\le\sqrt{\delta_1}\,\epsilon\|w^*\|_2$. Now define the stopping time $\tau:=\min\{k:W_k\notin B\}$.
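Before proceeding with the stopping-time argument, the geometric contraction in (7.39) can be observed in a small simulation of the mini-batch SGD recursion for fitting a ReLU, i.e. $y_i=\mathrm{ReLU}(\langle w^*,x_i\rangle)$ with Gaussian features. The sketch below is illustrative only: the problem sizes, seed, and hand-picked step size stand in for the theorem's $\eta^*$.

```python
import random

rng = random.Random(1)
n, d, m, eta, steps = 500, 8, 25, 0.05, 300
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
w_star = [rng.gauss(0, 1) for _ in range(d)]
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
relu = lambda z: max(z, 0.0)
y = [relu(dot(x, w_star)) for x in X]        # noiseless ReLU labels

def grad(w, batch):
    # gradient of (1/|batch|) * sum (ReLU(<w,x_i>) - y_i)^2
    g = [0.0] * d
    for i in batch:
        z = dot(X[i], w)
        if z > 0:                            # ReLU'(z) = 1 on the active set
            r = 2 * (z - y[i]) / len(batch)
            for j in range(d):
                g[j] += r * X[i][j]
    return g

w = [ws + 0.03 * rng.gauss(0, 1) for ws in w_star]   # start inside the local region E(eps)
err0 = sum((a - b) ** 2 for a, b in zip(w, w_star))
for t in range(steps):
    g = grad(w, rng.sample(range(n), m))
    w = [wj - eta * gj for wj, gj in zip(w, g)]
errT = sum((a - b) ** 2 for a, b in zip(w, w_star))
assert errT < 1e-3 * err0    # squared error shrinks geometrically, as in (7.32)/(7.39)
```

Because the problem is noiseless and realizable, every per-sample gradient vanishes at $w^*$, so SGD contracts all the way to the solution rather than to a variance floor, matching the multiplicative bound in (7.38).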
For each $k$ and $w_k\in B$, we have
$$\mathbb{E}\big[\|W_{k+1}-w^*\|_2^2\mathbb{1}_{\tau>k+1}\,\big|\,W_k=w_k\big]\le\mathbb{E}\big[\|W_{k+1}-w^*\|_2^2\mathbb{1}_{\tau>k}\,\big|\,W_k=w_k\big]=\mathbb{E}\big[\|W_{k+1}-w^*\|_2^2\mathbb{1}_{\tau>k}\,\big|\,W_k=w_k,\mathcal{F}_k\big]=\mathbb{E}\big[\|W_{k+1}-w^*\|_2^2\,\big|\,W_k=w_k,\mathcal{F}_k\big]\mathbb{1}_{\tau>k}\le\rho\,\|w_k-w^*\|_2^2\,\mathbb{1}_{\tau>k}.$$
Hence
$$\mathbb{E}\big[\|W_{k+1}-w^*\|_2^2\mathbb{1}_{\tau>k+1}\big]\le\rho\,\mathbb{E}\big[\|W_k-w^*\|_2^2\mathbb{1}_{\tau>k}\big],$$
and therefore we obtain
$$\mathbb{E}\big[\|W_k-w^*\|_2^2\mathbb{1}_{\tau>k}\big]\le\rho^k\|w_0-w^*\|_2^2.\qquad(7.40)$$
To show that the probability of leaving this neighborhood is low, we bound $\mathbb{P}(\tau<\infty)$. We begin by showing that $Z_k=\frac{\|W_{\tau\wedge k}-w^*\|_2^2}{\rho^{\tau\wedge k}}$ is a supermartingale:
$$\mathbb{E}[Z_{k+1}\,|\,\mathcal{F}_k]=\mathbb{E}\Big[\tfrac{\|W_{\tau\wedge(k+1)}-w^*\|_2^2}{\rho^{\tau\wedge(k+1)}}\mathbb{1}_{\{\tau\le k\}}\,\Big|\,\mathcal{F}_k\Big]+\mathbb{E}\Big[\tfrac{\|W_{\tau\wedge(k+1)}-w^*\|_2^2}{\rho^{\tau\wedge(k+1)}}\mathbb{1}_{\{\tau>k\}}\,\Big|\,\mathcal{F}_k\Big]=\mathbb{E}\Big[\tfrac{\|W_{\tau\wedge k}-w^*\|_2^2}{\rho^{\tau\wedge k}}\mathbb{1}_{\{\tau\le k\}}\,\Big|\,\mathcal{F}_k\Big]+\mathbb{E}\Big[\tfrac{\|W_{k+1}-w^*\|_2^2}{\rho^{k+1}}\mathbb{1}_{\{\tau>k\}}\,\Big|\,\mathcal{F}_k\Big]\le Z_k\mathbb{1}_{\{\tau\le k\}}+\rho\cdot\tfrac{1}{\rho^{k+1}}\mathbb{E}\big[\|W_k-w^*\|_2^2\,\big|\,\mathcal{F}_k\big]\mathbb{1}_{\{\tau>k\}}=Z_k\mathbb{1}_{\{\tau\le k\}}+Z_k\mathbb{1}_{\{\tau>k\}}=Z_k.$$
Using the fact that $Z_k$ is a supermartingale, we have
$$Z_0\ge\mathbb{E}[Z_k\,|\,\mathcal{F}_0]\ge\mathbb{E}\Big[\tfrac{\|W_{k\wedge\tau}-w^*\|_2^2}{\rho^{k\wedge\tau}}\mathbb{1}_{k\ge\tau}\,\Big|\,\mathcal{F}_0\Big]\ge\mathbb{E}\Big[\tfrac{\|W_{\tau}-w^*\|_2^2}{\rho^{\tau}}\mathbb{1}_{k\ge\tau}\,\Big|\,\mathcal{F}_0\Big].$$
Now, by the definition of the stopping time, $\|W_\tau-w^*\|_2^2\ge\epsilon^2\|w^*\|_2^2$; using $\|w_0-w^*\|_2\le\epsilon\sqrt{\delta_1}\|w^*\|_2$, we arrive at
$$\epsilon^2\delta_1\|w^*\|_2^2\ge\mathbb{E}\Big[\tfrac{\epsilon^2\|w^*\|_2^2}{\rho^{\tau}}\mathbb{1}_{\{k\ge\tau\}}\,\Big|\,\mathcal{F}_0\Big].$$
This in turn implies that
$$\delta_1\ge\mathbb{E}\Big[\tfrac{\mathbb{1}_{\{k\ge\tau\}}}{\rho^{\tau}}\,\Big|\,\mathcal{F}_0\Big]\ge\mathbb{E}\big[\mathbb{1}_{\{k\ge\tau\}}\,\big|\,\mathcal{F}_0\big]=\mathbb{P}\{k\ge\tau\}.$$
Whence,
$$\delta_1=\lim_{k\to\infty}\delta_1\ge\lim_{k\to\infty}\mathbb{P}\{k\ge\tau\}=\mathbb{P}\{\tau<\infty\}.$$
Hence we conclude that $\mathbb{P}\{\tau<\infty\}\le\delta_1\le 1/2$. Thus
$$\mathbb{E}\big[\|W_t-w^*\|_2^2\mathbb{1}_{\tau=\infty}\big]=\mathbb{E}\big[\|W_t-w^*\|_2^2\,\big|\,\tau=\infty\big]\mathbb{P}(\tau=\infty)+0\cdot\mathbb{P}(\tau<\infty)\ge\frac{1}{2}\,\mathbb{E}\big[\|W_t-w^*\|_2^2\,\big|\,\tau=\infty\big].$$
By (7.40) we have $\mathbb{E}\big[\|W_t-w^*\|_2^2\,\big|\,\tau=\infty\big]\le 2\rho^t\|w_0-w^*\|_2^2$. Thus, using Markov's inequality we conclude that
$$\mathbb{P}\big(\|W_t-w^*\|_2^2>\epsilon\|w_0-w^*\|_2^2\,\big|\,\tau=\infty\big)\le\frac{\mathbb{E}\big[\|W_t-w^*\|_2^2\,\big|\,\tau=\infty\big]}{\epsilon\|w_0-w^*\|_2^2}\le\frac{2\rho^t}{\epsilon},$$
which is bounded by $\delta_2$.

7.5.1.1 Proof of Lemma 7.11

We use the following identity, shown in [74].
$$\mathbb{E}_{I_m(t)}\|\nabla L_{I_m(t)}(w_{t-1})\|_2^2=\frac{m-1}{m}\|\nabla L(w_{t-1})\|_2^2+\frac{1}{m}\,\mathbb{E}_{I_1(t)}\big[\|\nabla L_{I_1(t)}(w_{t-1})\|_2^2\big].\qquad(7.41)$$
To upper bound the first term, using (7.35) with $u=\frac{-\nabla L(w_{t-1})}{\|\nabla L(w_{t-1})\|_2}$ we conclude that
$$\|\nabla L(w_{t-1})\|_2\le\Big\langle\frac{\nabla L(w_{t-1})}{\|\nabla L(w_{t-1})\|_2},\,w_{t-1}-w^*\Big\rangle+\frac{1}{4}\|w_{t-1}-w^*\|_2\le\|w_{t-1}-w^*\|_2+\frac{1}{4}\|w_{t-1}-w^*\|_2=\frac{5}{4}\|w_{t-1}-w^*\|_2$$
holds with probability at least $1-(n+25)e^{-\gamma n}$. In order to upper bound the second term on the right-hand side of (7.41), we use the following chain of inequalities:
$$\mathbb{E}_{I_1(t)}\|\nabla L_{I_1(t)}(w_{t-1})\|_2^2=\frac{4}{n}\sum_{i=1}^{n}\big(\mathrm{ReLU}(\langle w_{t-1},x_i\rangle)-\mathrm{ReLU}(\langle w^*,x_i\rangle)\big)^2\cdot\big(1+\mathrm{sgn}(\langle w_{t-1},x_i\rangle)\big)^2\|x_i\|_2^2\le\frac{16}{n}\sum_{i=1}^{n}\|x_i\|_2^2\,|\langle w_{t-1}-w^*,x_i\rangle|^2\overset{(a)}{\le}16\cdot\frac{3d}{2}\,(w_{t-1}-w^*)^T\Big(\frac{1}{n}\sum_{i=1}^{n}x_ix_i^T\Big)(w_{t-1}-w^*)\overset{(b)}{\le}16\cdot\frac{3d}{2}\cdot\frac{3}{2}\,\|w_{t-1}-w^*\|_2^2=36d\,\|w_{t-1}-w^*\|_2^2.$$
This holds with probability at least $1-2(n+1)e^{-\gamma d}$. In (a) we used the fact that a high-dimensional Gaussian random vector is well concentrated on the sphere of radius $\sqrt{d}$ with high probability (so that $\|x_i\|_2^2\le\frac{3d}{2}$ for all $i$), and in (b) we used the well-known upper bound on the spectral norm of the sample covariance matrix, $\big\|\frac{1}{n}\sum_{i=1}^{n}x_ix_i^T\big\|\le\frac{3}{2}$, which holds with high probability.

7.5.2 Convergence of Quantized SGD (Theorem 6.2)

The proof of this theorem is similar to its counterpart, Theorem 6.1. We begin by defining $L_{I_{m_k}(t)}(w)\triangleq\frac{1}{m_k}\sum_{j=1}^{m_k}\ell_{i_t^{(j)}}(w)$. We can repeat the argument of the proof of Theorem 6.1 and rewrite (7.33) using the updates (6.3). We begin by taking expectations with respect to the randomness in the quantization procedure (denoted by $\mathbb{E}_{Q_s}$). This allows us to conclude that
$$\mathbb{E}_{Q_s}\|w_t-w^*\|_2^2=\|w_{t-1}-w^*\|_2^2-\frac{2\eta}{K}\,\mathbb{E}_{Q_s}\Big\langle w_{t-1}-w^*,\sum_{k=1}^{K}Q_s\big(\nabla L_{I_{m_k}(t)}(w_{t-1})\big)\Big\rangle+\frac{\eta^2}{K^2}\,\mathbb{E}_{Q_s}\Big\|\sum_{k=1}^{K}Q_s\big(\nabla L_{I_{m_k}(t)}(w_{t-1})\big)\Big\|_2^2.$$
To continue, we use a lemma from [3] which shows that the stochastic quantizer is unbiased and bounds its variance.

Lemma 7.12. For any vector $v\in\mathbb{R}^d$, (i) $\mathbb{E}[Q_s(v)]=v$.
(ii) $\mathbb{E}\big[\|Q_s(v)\|_2^2\big]\le\Big(1+\min\big(\tfrac{d}{s^2},\tfrac{\sqrt{d}}{s}\big)\Big)\|v\|_2^2$.

Using Lemma 7.12, we conclude that
$$\mathbb{E}_{Q_s}\Big\langle w_{t-1}-w^*,\frac{1}{K}\sum_{k=1}^{K}Q_s\big(\nabla L_{I_{m_k}(t)}(w_{t-1})\big)\Big\rangle=\Big\langle w_{t-1}-w^*,\frac{1}{K}\sum_{k=1}^{K}\nabla L_{I_{m_k}(t)}(w_{t-1})\Big\rangle=\big\langle w_{t-1}-w^*,\nabla L_{I_m(t)}(w_{t-1})\big\rangle,$$
and
$$\mathbb{E}_{Q_s}\Big\|\sum_{k=1}^{K}Q_s\big(\nabla L_{I_{m_k}(t)}(w_{t-1})\big)\Big\|_2^2=\sum_{k=1}^{K}\mathbb{E}_{Q_s}\big\|Q_s\big(\nabla L_{I_{m_k}(t)}(w_{t-1})\big)\big\|_2^2+\sum_{i\neq j}\big\langle\nabla L_{I_{m_i}(t)}(w_{t-1}),\nabla L_{I_{m_j}(t)}(w_{t-1})\big\rangle\le\sum_{k=1}^{K}\Big(1+\min\big(\tfrac{d}{s^2},\tfrac{\sqrt{d}}{s}\big)\Big)\big\|\nabla L_{I_{m_k}(t)}(w_{t-1})\big\|_2^2+\sum_{i\neq j}\big\langle\nabla L_{I_{m_i}(t)}(w_{t-1}),\nabla L_{I_{m_j}(t)}(w_{t-1})\big\rangle.$$
Therefore,
$$\mathbb{E}_{Q_s,I_m(t)}\Big\|\sum_{k=1}^{K}Q_s\big(\nabla L_{I_{m_k}(t)}(w_{t-1})\big)\Big\|_2^2\le\sum_{k=1}^{K}\Big(1+\min\big(\tfrac{d}{s^2},\tfrac{\sqrt{d}}{s}\big)\Big)\mathbb{E}\big\|\nabla L_{I_{m_k}(t)}(w_{t-1})\big\|_2^2+\sum_{i\neq j}\big\langle\mathbb{E}\nabla L_{I_{m_i}(t)}(w_{t-1}),\mathbb{E}\nabla L_{I_{m_j}(t)}(w_{t-1})\big\rangle\le K\Big(1+\min\big(\tfrac{d}{s^2},\tfrac{\sqrt{d}}{s}\big)\Big)\Big(\frac{36dK}{m}+\frac{25}{16}\Big)\|w_{t-1}-w^*\|_2^2+(K^2-K)\|\nabla L(w_{t-1})\|_2^2\le K\Big(1+\min\big(\tfrac{d}{s^2},\tfrac{\sqrt{d}}{s}\big)\Big)\Big(\frac{36dK}{m}+\frac{25}{16}\Big)\|w_{t-1}-w^*\|_2^2+\frac{25}{16}(K^2-K)\|w_{t-1}-w^*\|_2^2\le K^2\Big(\Big(1+\min\big(\tfrac{d}{s^2},\tfrac{\sqrt{d}}{s}\big)\Big)\Big(\frac{36d}{m}+\frac{25}{16}\Big)+\frac{25}{16}\Big)\cdot\|w_{t-1}-w^*\|_2^2.$$
Whence,
$$\mathbb{E}_{Q_s,I_m(t)}\|w_t-w^*\|_2^2\le\|w_{t-1}-w^*\|_2^2-2\eta\big\langle w_{t-1}-w^*,\nabla L(w_{t-1})\big\rangle+\eta^2\Big(\Big(1+\min\big(\tfrac{d}{s^2},\tfrac{\sqrt{d}}{s}\big)\Big)\Big(\frac{36d}{m}+\frac{25}{16}\Big)+\frac{25}{16}\Big)\cdot\|w_{t-1}-w^*\|_2^2.$$
The remainder of the proof is exactly the same as that of Theorem 6.1.

7.5.3 Ensemble Method

In the statements of the two theorems, the probability of success depends on the proximity of the initial estimate to the solution. To alleviate this dependency we directly adapt the ensemble algorithm used in [101]; see Algorithm 1 for the details of this approach. We now state a result regarding the performance of the ensemble method. The proof of this result is essentially identical to its counterpart in [101] and requires only minor modifications.

Proposition 7.1 (Guarantees for ensemble methods). Consider the setup and assumptions of Theorem 6.1, and furthermore assume that $\mathrm{Prob(failure)}\le 1/3$.
Then, for any $\delta'>0$ there is an absolute constant $C$ such that if $L\ge C\log(1/\delta')$, the estimate $\hat{w}$ obtained via Algorithm 1 satisfies
$$\|\hat{w}-w^*\|_2\le 9\epsilon\|w_0-w^*\|_2.$$

Algorithm 1: Ensemble Method
Input: feature vectors $x_1,x_2,\dots,x_n$, labels $y_1,y_2,\dots,y_n$, relative error tolerance $\epsilon$, iteration count $K$ to achieve the desired error $\epsilon$, trial count $L$.
Output: an estimate $\hat{w}$ for $w^*$.
1: Initialize $w_0$ to satisfy the conditions of Theorems 6.1 and 6.2.
2: for $l=1,\dots,L$: run $K$ SGD updates from the initial point $w_0$ to obtain $w_K^{(l)}$.
3: for $l=1,\dots,L$ do
4:  if $\big|B(w_K^{(l)},2\sqrt{\epsilon})\cap\{w_K^{(1)},\dots,w_K^{(L)}\}\big|\ge L/2$ then
5:   return $\hat{w}:=w_K^{(l)}$
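The selection rule on line 4 of Algorithm 1 can be sketched generically: among the $L$ trial outcomes, return one whose ball of the prescribed radius captures at least half of all outcomes, so that (with per-trial failure probability at most $1/3$) a successful run is chosen with high probability. The radius and the cluster of points below are illustrative, not the theorem's constants.

```python
import random

def ensemble_pick(estimates, radius):
    """Return an estimate whose ball of the given radius contains at least
    half of all estimates (the test on line 4 of Algorithm 1), or None."""
    L = len(estimates)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    for w in estimates:
        if sum(dist(w, v) <= radius for v in estimates) >= L / 2:
            return w
    return None

rng = random.Random(0)
# seven "successful" trials clustered near the truth (1, 2)...
good = [(1 + 0.01 * rng.uniform(-1, 1), 2 + 0.01 * rng.uniform(-1, 1))
        for _ in range(7)]
# ...and three failed trials scattered far away
bad = [(9.0, -4.0), (5.0, 5.0), (-3.0, 0.0)]
w_hat = ensemble_pick(good + bad, radius=0.1)
assert w_hat in good            # a clustered (i.e. successful) run is returned
```

The design choice here is the same median-like robustness idea as in [101]: a minority of arbitrarily bad trials cannot form a half-population cluster, so they can never be selected.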
Asset Metadata
Creator: Mousavi Kalan, Mohammadreza (author)
Core Title: Theoretical foundations for dealing with data scarcity and distributed computing in modern machine learning
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Degree Conferral Date: 2022-08
Publication Date: 07/26/2022
Defense Date: 05/10/2022
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: data scarcity, distributed computing, machine learning, OAI-PMH Harvest, transfer learning
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Avestimehr, Salman (committee chair), Luo, Haipeng (committee member), Soltanolkotabi, Mahdi (committee member)
Creator Email: mm_mousavi_kalan@yahoo.com, mmousavi@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111375222
Unique identifier: UC111375222
Legacy Identifier: etd-MousaviKal-10985
Document Type: Dissertation
Rights: Mousavi Kalan, Mohammadreza
Type: texts
Source: 20220728-usctheses-batch-962 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu