Quickly solving new tasks, with meta-learning and without

by Sébastien M. R. Arnold

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2023

Copyright 2023 Sébastien M. R. Arnold

Table of Contents

List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Background and motivation: learning from little data
  1.2 Thesis outline

Chapter 2: A Peek Under the Hood of Meta-Learning
  2.1 Introduction
  2.2 Background and notation
  2.3 Overview of our approach
  2.4 Theoretical analysis
    2.4.1 Setup
    2.4.2 SHALLOW fails; DEEP meta-learns
    2.4.3 Analysis of the SHALLOW model
    2.4.4 Analysis of the DEEP model
    2.4.5 Insights from SHALLOW versus DEEP
  2.5 How to be (more) meta-learnable
    2.5.1 Linear layers improve shallow linear models
    2.5.2 Linear layers improve deep nonlinear models
    2.5.3 Meta-Optimizer for fast adaptation
  2.6 Related work
  2.7 Conclusion

Chapter 3: When Do We Need Meta-Learning?
  3.1 Introduction
  3.2 Related works
  3.3 Background and notation
  3.4 Taskset generation: easy or hard?
  3.5 Experiments
    3.5.1 Setup
    3.5.2 Controlling information overlap
    3.5.3 Comparing semantic vs embedding clusters
    3.5.4 Is a good embedding really enough?
  3.6 Conclusion

Chapter 4: Solving New Reinforcement Learning Tasks without Meta-Learning
  4.1 Introduction
  4.2 Related works and background
  4.3 Understanding when to freeze and when to finetune
    4.3.1 MSR Jump [216]
      4.3.1.1 Pretraining setup
      4.3.1.2 Transfer setup
    4.3.2 DeepMind Control [217]
      4.3.2.1 Pretraining setup
      4.3.2.2 Transfer setup
    4.3.3 Habitat [189]
      4.3.3.1 Pretraining setup
      4.3.3.2 Transfer setup
    4.3.4 When freezing or finetuning works
    4.3.5 Freezing fails even when learned representations are useful
    4.3.6 Finetuning improves learnability and robustness to noise
    4.3.7 When and why is representation finetuning required?
  4.4 Finetuning with a policy-induced self-supervised objective
  4.5 Pseudocode
  4.6 Experiments
    4.6.1 Partial freezing and policy-induced supervision improve RL finetuning
    4.6.2 Policy-induced supervision improves upon contrastive predictive coding
    4.6.3 Selecting frozen layers in Habitat
    4.6.4 DeepMind Control transfer from ImageNet
    4.6.5 De Novo finetuning with PiSCO
    4.6.6 PiSCO without projection layers
    4.6.7 Comparing and combining with SPR
  4.7 Conclusion

Chapter 5: Sampling Tasks for Better Meta-Learning
  5.1 Introduction
  5.2 Preliminaries
    5.2.1 Episodic sampling and training
    5.2.2 Few-shot algorithms
    5.2.3 Episode difficulty
  5.3 Methodology
    5.3.1 Importance sampling for episodic training
    5.3.2 Modeling the proposal distribution
    5.3.3 Modeling the target distribution
  5.4 Related Works
  5.5 Experiments
    5.5.1 Experimental setup
    5.5.2 Understanding episode difficulty
      5.5.2.1 Episode difficulty is approximately normally distributed
      5.5.2.2 Independence from modeling choices
    5.5.3 Comparing episode sampling methods
    5.5.4 Online approximation of the proposal distribution
    5.5.5 Better sampling improves cross-domain transfer
    5.5.6 Better sampling improves few-shot classification
  5.6 Conclusion

Chapter 6: Variance-Reduced Optimization for Better Meta-Learning
  6.1 Introduction
  6.2 Momentum and other approaches to dealing with variance
    6.2.1 Momentum and variance
    6.2.2 SAG and Hessian modelling
  6.3 Converging optimization through implicit gradient transport
    6.3.1 Implicit gradient transport
    6.3.2 Combining increasing momentum and implicit gradient transport
    6.3.3 IGT as a plug-in gradient estimator
  6.4 IGT and Anytime Tail Averaging
  6.5 Impact of IGT on bias and variance in the ideal case
  6.6 Experiments
    6.6.1 Supervised learning
    6.6.2 Reinforcement learning
    6.6.3 Meta-learning
  6.7 Conclusion and open questions

Chapter 7: Concluding Remarks
  7.1 Avenues for future work

References
Appendices
  B Proofs for Implicit Gradient Transport
    B.1 Transport formula
    B.2 Proof of Proposition 6.3.1
    B.3 Proof of Proposition 6.3.2 and Proposition 6.3.3
      B.3.1 Proof of Proposition 6.3.2
      B.3.2 Proof of Proposition 6.3.3

List of Tables

2.1 Accuracy improves by adding linear layers.
2.2 Accuracy improves on ANIL-trained CNN(2).
2.3 Meta-optimizers outperform MAML on CNN(2).
3.1 5-way 5-shot classification accuracy of metric- and gradient-based methods when transfer is most challenging. In this regime, methods that adapt their embedding function (Finetune, MAML) outperform those that do not, and which were thought to be sufficient for few-shot learning.
3.2 Correlation between divergence and accuracy for different choices of divergence D. Measuring the Euclidean distance between centroids performs worst, while the symmetrized KL divergence (used in our other experiments) is best.
3.3 Average hop distance between WordNet and hierarchies created from ImageNet-1k embeddings (via hierarchical clustering). Regardless of the network architecture, trees constructed from class embeddings are more similar to each other than to WordNet, indicating that class partitioning relies on attributes different from semantic relationships.
3.4 Comparing classification accuracy of different tasksets for the same dataset across popular few-shot learning methods. Our proposed method, ATG, is capable of generating simple tasksets (close to random partitioning) as well as challenging ones. In particular, it is often more challenging than tasksets built with class semantics (denoted with a †), but unlike those it does not require additional information. Bolded results indicate the most challenging taskset for a given method.
3.5 Slope of the regression line between divergence and accuracy (in % points) for different methods. MAML degrades at a slower rate than metric-based methods, suggesting that it is better suited when transfer is challenging.
5.1 Few-shot accuracies on benchmark datasets for 5-way few-shot episodes in the offline setting. The mean accuracy and the 95% confidence interval are reported for evaluation done over 1k test episodes. The first row in every scenario denotes baseline sampling. Best results for a fixed scenario are shown in bold. Results where a sampling technique is better than or comparable to baseline sampling are denoted by †. Overall, UNIFORM is among the best sampling methods in 19/24 scenarios.
5.2 Few-shot accuracies on benchmark datasets for 5-way few-shot episodes in the offline and online settings. The mean accuracy and the 95% confidence interval are reported for evaluation done over 1k test episodes. The first row in every scenario denotes baseline sampling. Best results for a fixed scenario are shown in bold. Results where a sampling technique is better than or comparable to baseline sampling are denoted by †. UNIFORM (Online) retains most of the performance of the offline formulation while being significantly easier to implement (online is competitive in 15/24 scenarios vs 16/24 for offline).
5.3 Few-shot accuracies on benchmark datasets after training on Mini-ImageNet for 5-way few-shot episodes in the offline and online settings. The mean accuracy and the 95% confidence interval are reported for evaluation done over 1,000 test episodes. Best results for a fixed scenario are shown in bold. The first row in every scenario denotes baseline sampling. Compared to baseline sampling, online UNIFORM does statistically better in 49/64 scenarios, comparable in 12/64 scenarios, and worse in only 3/64 scenarios.
5.4 Few-shot accuracies on benchmark datasets for 5-way few-shot episodes using FEAT. The mean accuracy and the 95% confidence interval are reported for evaluation done over 10k test episodes with a ResNet-12. The first row in every scenario denotes baseline sampling. Best results for a fixed scenario are shown in bold.
UNIFORM (Online) improves FEAT's accuracy in 3/4 scenarios, demonstrating that sampling matters even for state-of-the-art few-shot methods.

List of Figures

1.1 How to construct few-shot image classification tasks. Given a base image classification dataset, we first randomly sample n classes. Next, we sample k images for each of those n classes, which form the support set – the dataset with limited labelled data used to adapt and quickly solve the task. Finally, we also sample a query set of k′ images per class. The query set is used to measure how well the adapted model generalizes to unseen samples from the same task.
1.2 Structure of meta-learning solutions. Our studies let us characterize the solution found by MAML for deep networks as follows. Typically, early layers encode generic information that is task-agnostic; thus, they do not need adaptation. They are followed by task-specific layers for which adaptation is crucial – performance degrades significantly if those layers are frozen. They are often followed by optimization layers which enable fast adaptation of task-specific layers by modifying the back-propagated residual error. The values in the parameter weights of optimization layers are crucial for fast adaptation but they need not be adapted for high performance.
2.1 Simple failure modes for MAML. SHALLOW models for regression (left) and classification (right) fail, but overparameterized DEEP models are able to meta-learn.
2.2 Meta-learning of a 1D linear regression model (§2.4.1). (Left) MAML loss of DEEP, showing multiple (local) minima with deep valleys. (Right) 4 meta-training trajectories (of parameters) converging to each of the 4 solutions.
2.3 Meta-training logistic regression models with MAML on Omniglot, CIFAR-FS, and mini-ImageNet led to poor performances. Adding linear nets improves meta-learning significantly, without changing the model's capacity.
2.4 The effect of the number of convolutional layers on adaptation performance. First, as the model size increases, the performances of both methods improve. Besides better meta-learning, the improvement can also be caused by the model's increased capacity to learn the target tasks. Secondly, the "net gain" from META-KFO has a diminishing trend as the size increases. In other words, the benefits of directly transforming gradients with an external meta-optimizer reduce as the upper layers of the larger models have more capacity to meta-learn to control their own bottom layers.
3.1 ATG, a method to generate tasksets of varying difficulty. First, we compute each class embedding by averaging the embeddings f(x) of all images x associated with that class. Then, we partition those class embeddings using a penalized clustering objective. If we want easy tasksets, we find clusters such that train and test classes are pulled together; for hard tasksets, we push those distributions apart.
3.2 Accuracy of a multiclass-trained network as we increase the divergence between train and test class distributions. As the divergence increases, accuracy drops, suggesting that the divergence can be used to generate tasksets of varying difficulty.
3.3 Low-dimensional projections of class embeddings and taskset centroids. As we increase the divergence penalty, the centroids spread further apart.
4.1 When should we freeze or finetune pretrained representations in visual RL? Reward and success weighted by path length (SPL) transfer curves on MSR Jump, DeepMind Control, and Habitat tasks. Freezing pretrained representations can underperform no pretraining at all (Figures 4.1a and 4.1b) or outperform it (Figures 4.1c and 4.1d). Finetuning representations is always competitive, but fails to significantly outperform freezing on visually complex domains (Figures 4.1c and 4.1d). Solid lines indicate mean over 5 random seeds, shades denote 95% confidence interval. See Section 4.3.4 for details.
4.2 (a) Annotated observation from MSR Jump. On this simple game, the optimal policy consists of moving right until the agent (white) is 14 pixels away from the obstacle (gray), at which point it should jump. (b) Diagram for our proposed PiSCO consistency objective. It encourages the policy's actions to be consistent for perturbations of state s, as measured through the actions induced by the policy. See Section 4.4 for details.
4.3 (a) Transfer can still fail even though frozen representations are informative enough. On MSR Jump tasks, we can perfectly regress the optimal actions (accuracy = 1.0) and agent-obstacle distance (mean square error < 1.0) with frozen or finetuned representations. Combined with Figure 4.1a, those results indicate that capturing the right information is not enough for RL transfer. (b) Finetuned representations yield purer clusters. Given a state, we measure the expected purity of the cluster consisting of the 5 closest neighbours of that state in representation space. For Finetuned representations, this metric is significantly higher (98.75%) than for Frozen (91.41%) or Random (82.98%), showing that states which beget the same actions are closer to one another and thus easier to learn. (c) Finetuned representations are more robust to perturbations. For source and downstream tasks, the classification error (1 − action accuracy) degrades significantly more slowly for Finetuned representations than for Frozen ones under increasing data augmentation noise, suggesting that robustness to noise improves learnability. See Section 4.3.5 for details.
4.4 Frozen layers that retain the same information as finetuned layers can be kept frozen. (a) Mean squared error when linearly regressing action values from representations computed by a given layer. Some early layers are equally good at predicting action values before or after finetuning. This suggests those layers can be kept frozen, potentially stabilizing training. See Section 4.3.7 for details. (b) Area under the reward curve when freezing up to a given layer and finetuning the rest. Freezing early layers does not degrade performance.
4.5 Frozen layers that retain the same information as finetuned layers can be kept frozen. Replicates Figure 4.4a on Walker, Cartpole, and Hopper DeepMind Control domains. (top) Mean squared error when linearly regressing action values from representations computed by a given layer.
(bottom) Area under the reward curve when freezing up to a given layer and finetuning the rest. Freezing early layers sometimes improves performance (e.g., up to Conv2 on Walker).
4.6 Partial freezing improves convergence; adding our policy-induced consistency objective improves further. As suggested by Section 4.3.7, we freeze the early layers of the feature extractor and finetune the rest of the parameters without (Frozen+Finetuned) or with our policy-induced consistency objective (Frozen+PiSCO). On challenging tasks (e.g., Habitat), partial freezing dramatically boosts downstream performance, while Frozen+PiSCO further improves upon Frozen+Finetuned across the board.
4.7 Policy-induced supervision is a better objective than representation alignment for finetuning. We compare our policy-induced consistency objective as a measure of representation similarity against popular representation learning objectives (e.g., SimSiam, CURL). Policy supervision provides more useful similarities for RL finetuning, which accelerates convergence and reaches higher rewards.
4.8 Identifying which layers to freeze on Habitat tasks. We replicate the layer-by-layer linear probing experiments on Habitat with the ImageNet-pretrained feature extractor. Although the downward trend is less evident than in Figure 4.4a, we clearly see that layers Conv8 and Conv7 yield the lowest value prediction error on Gibson and Matterport3D, respectively.
4.9 Pretraining on large and diverse data can hurt transfer when the generalization gap is too large. When transferring representations that are pretrained on ImageNet to DeepMind Control tasks, we see a significant decrease in convergence rate. We hypothesize this is due to the lack of visual similarity between ImageNet and DeepMind Control.
4.10 PiSCO can also improve representation learning on source tasks but shines with pretrained representations. When benchmarking PiSCO combined with fully finetunable features (De Novo+PiSCO), we observe that it marginally outperforms DrQ-v2 on the downstream tasks (De Novo). However, the best performance is obtained when also transferring task-agnostic features (Frozen+PiSCO).
4.11 Is the projector h(·) necessary in PiSCO's formulation? Yes. Removing the projection layer when finetuning task-specific layers (and freezing task-agnostic layers) drastically degrades performance on all DeepMind Control tasks.
4.12 Combining SPR with PiSCO significantly improves the performance of SPR. Swapping the cosine similarity objective in SPR [200] for the policy-induced objective suggested by our analysis significantly improves finetuning. Still, finetuning with PiSCO (based on SimSiam [39]) yields the best performance, while remaining easier to implement and faster in terms of wall-clock time.
5.1 Episode difficulty is approximately normally distributed. Density (left) and Q-Q (right) plots of the episode difficulty computed by conv(64)_4 networks on Mini-ImageNet (1-shot 5-way), trained using ProtoNets (cosine) and MAML (depicted in the legends). The values are computed over 10k test episodes. The density plots follow a bell curve, with the density peak in the middle, which quickly drops off on either side of the peak. The Q-Q plots are close to the identity line (in black). The closer the curve is to the identity line, the closer the distribution is to a normal. Both suggest that the episode difficulty distribution can be normally approximated.
5.2 Episode difficulty transfers across training algorithms. Scatter plots (with regression lines) of the episode difficulty computed on 1k Mini-ImageNet test episodes (1-shot 5-way) by conv(64)_4 networks trained using different algorithms. The positive correlation suggests that an episode that is difficult for one training algorithm will be difficult for another.
5.3 Episode difficulty transfers across network architectures. Scatter plots (with regression lines) of the episode difficulty computed by conv(64)_4 and ResNet-12 networks trained using different algorithms. This is computed for 10k 1-shot 5-way test episodes from Mini-ImageNet. We observe a strong positive correlation between the computed values for both network architectures.
5.4 Episode difficulty is transferred across model parameters during training. We select the 50 easiest and hardest episodes and track their difficulty during training. This is done for conv(64)_4 networks trained on Mini-ImageNet (1-shot 5-way) with different algorithms. The average difficulty of the episodes decreases over time, until convergence (vertical line), after which the model overfits. Additionally, easier episodes remain easy while harder episodes remain hard, indicating that episode difficulty transfers from one set of parameters to the next.
6.1 Variance over time and total variance for the stochastic gradient. Variance induced over time by the noise from three different datapoints (i = 1, i = 25, and i = 50) as well as the total variance for SG (γ = 0, top left), momentum with fixed γ = 0.9 (top right), momentum with increasing γ_t = 1 − 1/t without (bottom left) and with (bottom right) transport. The impact of the noise of each gradient e_i increases for a few iterations then decreases. Although a larger γ reduces the maximum impact of a given datapoint, the total variance does not decrease. With transport, noises are now equal and total variance decreases. The y-axis is on a log scale.
6.2 Analysis of IGT on quadratic loss functions. (a) Comparison of convergence curves for multiple algorithms. As expected, the IGT family of algorithms converges to the solution while stochastic gradient algorithms cannot. (b) The blue and orange curves show the norm of the noise component in the SGD and IGT gradient estimates, respectively. The noise component of SGD remains constant, while it decreases at a rate 1/√t for IGT. The green curve shows the norm of the IGT gradient estimate. (c) Cosine similarity between the full gradient and the SGD/IGT estimates.
6.3 ResNet-56 on CIFAR-10. Left: Train loss. Center: Train accuracy. Right: Test accuracy.
6.4 ResNet-50 on ImageNet. Left: Train loss. Center: Train accuracy. Right: Test accuracy.
6.5 Validation curves for different large-scale machine learning settings. Shading indicates one standard deviation computed over three random seeds. Left: Reinforcement learning via policy gradient on an LQR system. Right: Meta-learning using MAML on Mini-ImageNet.
Abstract

The success of modern machine learning (ML) stems from the unreasonable effectiveness of large data. But what about niche tasks with limited data? Some methods are able to quickly solve those tasks by first pretraining ML models on many generic tasks in a way that lets them quickly adapt to unseen new tasks. Those methods are said to "learn how to learn" and thus fall under the umbrella of meta-learning. While meta-learning can be successful, the inductive biases that enable fast adaptation remain poorly understood.

This thesis takes a first step towards an understanding of meta-learning, and reveals a set of guidelines which help design novel and improved methods for fast adaptation. Our core contribution is a study of the solutions found by meta-learning. We uncover the working principles that let them adapt so quickly: their parameters partition into three groups, one to compute task-agnostic features, another for task-specific features, and a third that accelerates adaptation to new tasks.

Building on those insights, we introduce several methods to drastically speed up adaptation. We propose Kronecker-factored meta-optimizers which significantly improve the post-adaptation performance of models that are otherwise too small to meta-learn. We also show how to apply our insights to a visual reinforcement learning setting where meta-learning is impractical. Freezing task-agnostic parameters and adapting task-specific ones with policy-induced self-supervision enables adaptation to unseen tasks with large feature extractors pretrained on generic vision datasets.

Next, we investigate when we need adaptation and when representation transfer is sufficient. Our results intuitively suggest that transfer suffices if a new task is similar to tasks seen during pretraining; otherwise, adaptation is required.

We conclude this thesis with methods designed to improve training on multiple tasks. First, we show that sampling tasks uniformly with respect to difficulty improves performance on unseen tasks for a wide range of model architectures, (meta-)learning methods, and pretraining datasets. Second, we observe that we can reduce gradient variance by reusing information across tasks during multi-task pretraining. Our implicit gradient transport estimator accelerates convergence and improves the generalization of meta-learning solutions. We hope this thesis can inspire future work on quickly and effectively solving real-world niche tasks.

Chapter 1
Introduction

Machine learning (ML) systems have become excellent on a wide range of tasks that impact nearly all humans. They ease our interactions with computers through smart keyboards and speech-to-text interfaces, help us sift through large amounts of information, and shape the online content we consume. One day, they might even drive us to work. Their successes directly stem from the unreasonable effectiveness of data [86]: since most of us already solve these tasks on a daily basis, it is easy to collect large amounts of data to train sophisticated models that learn from our demonstrations. In contrast, these learning systems haven't met the same success on niche tasks where the amount of data is limited. For example, we have yet to see ML skiers, ML numismatists, or ML geocachers.
The reason is simple: collecting data for these tasks is prohibitively expensive because human experts are few and far between, and they prefer skiing, studying coins, or hiding treasures rather than annotating datasets. This lack of data significantly hinders the application of machine learning, forcing practitioners to collect and deal with data from heterogeneous sources, e.g., snowboarding, philately, or hiking. Because of those challenges and their (perceived) limited impact, niche tasks remain understudied in favor of other application domains where data are abundant, such as computer vision, natural language understanding, or speech processing.

Yet, niche tasks matter. Failing to solve them hinders the potential of learning systems because, when aggregated, these tasks account for a large fraction of our collective time. This is evidenced by the heavy-tailed distribution of Twitter followers, YouTube subscribers, and subreddits' sizes, indicating that human attention spreads thin over a wide variety of interests [31, 135, 143]. In other words, we fail to reap all the benefits of ML systems if we only improve them for the few tasks that affect many people and discard the many tasks that each affect few people. Naturally, the main difficulty is keeping our improvements generic enough that they apply to a majority of tasks – we cannot possibly devise bespoke solutions for every niche.

Niche tasks also point to a core challenge of modern machine learning; namely that, regardless of the number of tasks seen during training, there will always be more new and interesting tasks we wish our ML systems could solve. This challenge arises from the infinite richness and complexity of the real world, while our ML systems have finite modelling capacity. One solution is to pretrain a generic system on a large set of training tasks and finetune it (i.e., adapt it) with data from the task of interest. Shedding light on how and when we can adapt effectively is of scientific interest to improve learning in machines and humans. Together, these challenges pose the following question:

How should we design generic learning systems that can quickly adapt to new tasks?

This thesis takes a step in answering this question through the study and design of systems that are specifically trained to adapt quickly. Because they have "learned how to learn fast", these systems are said to be meta-learned. Our main contribution is to characterize the inductive biases found by meta-learning. We show that the parameter weights of meta-learning solutions can be categorized in three groups: task-agnostic parameters, task-specific parameters, and optimization parameters. Optimization parameters are of special interest because they only appear with meta-learning. We build on this characterization to design novel methods for fast adaptation, both when meta-learning is feasible and when it is not. Then, we study when meta-learning is required for fast adaptation. Intuitively, our results show that fast adaptation (and thus, meta-learning) is not required when the seen and new tasks are similar – in those scenarios, simpler methods work just as well. We conclude with the design of methods that improve learning with many pretraining tasks. We show that we can significantly improve adaptation to new tasks by sampling tasks carefully during pretraining or with optimization algorithms designed for multi-task learning.
1.1 Background and motivation: learning from little data

To understand the challenges of learning with limited labelled data and the properties of meta-learning methods, the majority of our studies leverage the setting of few-shot image classification. The idea behind few-shot image classification is to artificially break down existing large computer vision datasets into a large number of small tasks – also known as episodes – each with a handful of labelled samples. This process is illustrated in Figure 1.1 and proceeds in three steps (a code sketch of the full recipe closes this section):

1. Randomly pick n classes from the given base image classification dataset.
2. Randomly pick k images for each of the n classes. The learning model should use this support set of n·k labelled images to solve the task as quickly and effectively as possible.
3. Randomly pick another k′ images for each of the n classes. This query set of n·k′ images is used to evaluate generalization accuracy, and measures how successful the model was in solving the task.

Tasks constructed with the above recipe are known as n-way (the number of classes), k-shot (the number of labelled samples per class). In practice, we typically partition the classes of the base dataset into train, validation, and test sets to ensure that test data are never used to build train or validation tasks. It may seem counterintuitive to study niche tasks with datasets from computer vision, where data is plentiful; as we'll see in Chapter 3, this turns out to be instructive, as we can interpolate between the small and large data regimes and establish strong baselines trained on more data than typically available.

[Figure 1.1: How to construct few-shot image classification tasks. Given a base image classification dataset, we first randomly sample n classes. Next, we sample k images for each of those n classes, which form the support set – the dataset with limited labelled data used to adapt and quickly solve the task. Finally, we also sample a query set of k′ images per class. The query set is used to measure how well the adapted model generalizes to unseen samples from the same task.]

How can we solve few-shot image classification tasks? There exists a plethora of methods, but we are specifically interested in those that learn to adapt fast. From this family of methods, the representative one is the Model-Agnostic Meta-Learning (MAML) of Finn et al. [67]. Accordingly, it is the center of our focus for the first part of this thesis. The main idea behind MAML training is to find parameters of a statistical model such that they perform well on any task after a few updates of gradient descent. The constraint that parameters ought to perform well after adaptation (not necessarily before) lets MAML learn how to adapt quickly; we will formalize this intuition in Chapter 2. At test time, we only need to adapt these meta-trained parameters with gradient descent on the support set, and evaluate the quality of the solution on the query set.
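The following minimal sketch makes the recipe above concrete. It is our illustration, not the thesis's code: `dataset_by_class` is assumed to map each class label to a list of image tensors, and `loss_fn` would be, e.g., the cross-entropy loss.

```python
import copy
import random

import torch

def sample_episode(dataset_by_class, n=5, k=1, k_query=15):
    """Steps 1-3 of the recipe: an n-way, k-shot episode with a query set."""
    classes = random.sample(list(dataset_by_class), n)
    support, query = [], []
    for label, cls in enumerate(classes):
        images = random.sample(dataset_by_class[cls], k + k_query)
        support += [(img, label) for img in images[:k]]
        query += [(img, label) for img in images[k:]]
    return support, query

def adapt_and_evaluate(model, loss_fn, support, query, alpha=0.01, steps=5):
    """Test-time adaptation: a few gradient steps on the support set,
    then generalization accuracy on the query set."""
    adapted = copy.deepcopy(model)  # leave the meta-trained weights intact
    opt = torch.optim.SGD(adapted.parameters(), lr=alpha)
    xs = torch.stack([x for x, _ in support])
    ys = torch.tensor([y for _, y in support])
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(adapted(xs), ys).backward()
        opt.step()
    xq = torch.stack([x for x, _ in query])
    yq = torch.tensor([y for _, y in query])
    return (adapted(xq).argmax(dim=1) == yq).float().mean().item()
```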
While MAML has been successful on a wide range of tasks [65], its inner workings remain poorly understood. In particular, why is it so effective at solving new tasks? We answer this question through empirical and theoretical studies of surprisingly simple failure modes of MAML, namely, linear regression and binary classification tasks. Our studies reveal the stringent requirement for deep architectures in meta-learning: intuitively, the deeper layers learn to modify the residual error during back-propagation, thus providing a better update direction for earlier layers. At a high level, we observe the following picture. The weights of a deep network found through MAML can be categorized into three groups:

• task-agnostic parameters, which do not need to be adapted,
• task-specific parameters, which absolutely need to be adapted, and
• optimization parameters, which modify the residual error, thus enabling fast adaptation of task-specific parameters.

This understanding is illustrated in Figure 1.2.

[Figure 1.2: Structure of meta-learning solutions. Early (Conv) layers encode generic, task-agnostic information and need no adaptation; they are followed by task-specific (FC) layers for which adaptation is crucial, and by optimization layers which enable fast adaptation of task-specific layers by modifying the back-propagated residual error. The values in the parameter weights of optimization layers are crucial for fast adaptation, but they need not be adapted for high performance.]

Our analysis begs the question: how can characterizing the structure of MAML solutions inform better methods for fast adaptation? First, we observe that the end result of optimization parameters can be achieved by learning a separate neural network whose sole purpose is to model optimization. We call this network a meta-optimizer, devise an efficient formulation to compute its forward pass, and demonstrate that it unlocks meta-learning for models which, alone, don't have enough parameters to successfully meta-learn. As an added benefit, the meta-optimizer can be discarded post-adaptation and so does not incur parameter bloat at inference time. Second, we show that freezing task-agnostic parameters (i.e., preventing their adaptation) is crucial in visual reinforcement learning (RL), where meta-learning is infeasible. We propose a method to cheaply identify those task-agnostic parameters based on their ability to predict action values. Combined with an auxiliary objective which stabilizes gradients in visual RL, this simple change accelerates convergence by factors of 2x–5x and significantly outperforms state-of-the-art alternatives.
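As a sketch of the freeze-then-finetune recipe (how to pick the split point is the subject of Chapter 4), assuming the feature extractor is a PyTorch `nn.Sequential`; `partially_freeze` and `freeze_up_to` are hypothetical names:

```python
import torch.nn as nn

def partially_freeze(feature_extractor: nn.Sequential, freeze_up_to: int):
    """Freeze the first `freeze_up_to` modules (task-agnostic layers) and
    keep the remaining (task-specific) layers trainable."""
    for i, module in enumerate(feature_extractor):
        for p in module.parameters():
            p.requires_grad = i >= freeze_up_to
    # Hand only the trainable parameters to the downstream RL optimizer.
    return [p for p in feature_extractor.parameters() if p.requires_grad]
```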
Taking a step back, we ask: is meta-learning always necessary to adapt quickly and effectively? We return to few-shot image classification to answer this question, and pitch MAML against various transfer and meta-learning alternatives. Unlike prior studies, we pay particular attention to how train and test tasksets are constructed. We propose a method to interpolate different levels of similarity between train and test tasks. Our large empirical study yields an intuitive interpretation: transfer learning methods are sufficient when train and test tasks are similar, since little new knowledge needs to be acquired; however, meta-learning methods like MAML shine with dissimilar train and test tasks, as new knowledge needs to be quickly picked up. Those results contribute towards guidelines for practitioners who may wish to pursue meta-learning approaches for their application.

Stepping back further, we investigate how to best train transfer and meta-learning methods on a large number of tasks. First, we home in on how to "randomly pick" classes and images. We build on importance sampling to implement various sampling strategies with respect to task difficulty. Our results show that sampling uniformly with respect to task difficulty during pretraining improves accuracy on unseen tasks. Again, these results admit an intuitive explanation: the uniform distribution is the least informative prior and "prepares the model for all test task distributions", whether similar to train tasks or not. Second, we study multi-task pretraining through the lens of online stochastic optimization – i.e., where we have an infinite number of tasks which are independent and identically distributed. Under simplifying assumptions (namely, all tasks are quadratics with identical Hessians), we show that we can reuse the information from tasks seen early during learning by biasing the gradient estimate of the current task. This implicit gradient transport mechanism accelerates convergence and improves the generalization of MAML. Together, those results demonstrate the importance of carefully setting up the pretraining stage to ensure the best performance on the new task.

1.2 Thesis outline

This thesis is structured as follows.

Chapter 1 introduces the challenges associated with niche tasks, provides high-level background, and outlines the scope of the thesis.

Chapter 2 provides detailed background and an in-depth analysis of MAML. It characterizes the solution found by MAML and introduces meta-optimizers, which decouple modelling from optimization. This chapter is based on our work "When MAML Can Adapt Fast and How to Assist When It Cannot" [11].

Chapter 3 covers relevant background on transfer learning methods, and discusses when those methods should be preferred over meta-learning. It introduces Automatic Taskset Generation (ATG), a method to automatically devise train-test task splits of desired difficulty. This chapter is based on our work "Embedding Adaptation is Still Needed for Few-Shot Learning" [7].

Chapter 4 shows how the insights of Chapter 2 apply to the visual reinforcement learning setting. It teases apart the dynamics of representation finetuning in visual RL, and introduces one method to identify which layers to freeze and another to compute high-quality gradients that accelerate RL finetuning. This chapter is based on our work "Policy-Induced Self-Supervision Improves Representation Finetuning in Visual RL" [8].

Chapter 5 investigates the best way to sample tasks during pretraining. It introduces a simple scheme to approximate a large number of task sampling methods, and reveals sampling uniformly with respect to task difficulty as the best performer. This chapter is based on our work "Uniform Sampling over Episode Difficulty" [10].

Chapter 6 shows how to reuse information from tasks seen early during training and introduces implicit gradient transport, a low-variance gradient estimator which builds on this information
8 Chapter 2 A Peak Under the Hood of Meta-Learning 2.1 Introduction Meta-learning or learning to learn has been an appealing idea for addressing several important challenges in machine learning [193, 21, 224, 67]. In particular, learning from prior tasks but being able to adapt quickly to new tasks improves learning efficiency with fewer samples, i.e., few-shot learning [227]. A promising set of techniques, Model-Agnostic Meta-Learning or MAML [67] and its variants – often referred as Gradient-based Meta-Learning (GBML) – have attracted a lot of interest [151, 124, 83, 73]. In GBML, the learning model is “meta”-trained on a set of meta-training tasks and is expected to perform well on meta-testing (i.e., post-training-adaptation) tasks. In the phase of meta-training, the model parameters are optimized so that when applied to meta-testing tasks, a few gradient- based parameter updates lead to a significant reduction in the learning losses, a desideratum re- ferred as “fast adaptation”. To this end, MAML optimizes what is called MAML loss (§3.3). In this chapter, we take an unexplored direction to understand how MAML and its alike work: we investigate what types of model can meta-learn. Our work answers a few questions inspired by existing work. First, most research work in the literature focuses on deep learning models — presumably one can posit that a sufficiently large deep learning model should be able to learn the right inductive bias to meta-learn as neural networks are universal approximators. While the argument is patently 9 valid, our research work aims to refine it: what sense do we mean with sufficiently large? Is there a regime where the model is not sufficiently large such that it cannot meta-learn? Second, the recently proposed ANIL algorithm suggests that for deep learning models, there is almost no need to use the MAML loss to optimize the bottom layers of the neural network [169]. This observation is closely related to multitask learning [18, 34] but does not explain what the special roles of the models’ heads are in ensuring the bottom layers are updated as effectively as the original MAML. Third, preconditioning methods introduce additional parameters to control the gradients during the meta-testing to improve fast adaptation [127, 161, 124, 73]. They assume those additional parameters, after being meta-trained, generalize to new tasks. However, those works do not explain why the original model can adapt fast without those parameters. Moreover, if given a model that is not “sufficiently large” to meta-learn, how effective would those methods be? For example, imagine those methods were given the bottom layers of a deep neural network model, could they update those layers to match the performance of the original bigger neural network? To answer these questions, we need a way to measure how large a model is and a metric to measure how effective meta-learning is. For the former, we use the depth, ie, the number of layers in deep models as it is one of the most frequently cited quantity to characterize the size of a model. For the latter, we use the performance metrics (error rates or accuracies) on meta-testing tasks, a common practice in existing literature. We concentrate on few-shot learning tasks and leave other application scenarios of MAML and its alike to future study. We use a theoretical analysis (of mathematically tractable models) to gain insights and to gen- erate hypotheses around how meta-learning is enabled in deep learning models. 
We then use empirical studies to validate those hypotheses and inspire a new algorithm, dubbed META-KFO, for meta-learning.

We summarize the key findings from our research. We conclude that the depth of a deep model is important to meta-learning. Even if a task is solvable with a shallow or linear model, a deeper model with at least one hidden layer is required for meta-learning. The reason is that the meta-learner needs to use the upper layers of a deep model to control how the bottom layers' parameters are to be updated. This control can be achieved in three ways. The first, which is the default and implicit strategy in existing work, is to use a sufficiently large deep neural network. The second is to add linear layers to the output of a shallower network to increase the depth. This has the advantage that the adapted model is smaller than in the first approach, as the linear layers can be absorbed into the shallower network. The third is to use specially designed preconditioning methods that directly control the gradients that update the shallower network. Those methods, including previous work [127, 161, 124] and the proposed META-KFO algorithm (§2.5.3), improve the meta-learnability of shallower networks that otherwise do not meta-learn well.

Moreover, the proposed algorithm and its empirical behavior yield new insights: we surmise that in a deep neural network, the upper layers have the equivalent effect of transforming the gradients of the bottom layers, as if the upper layers were an external meta-optimizer operating on a smaller network composed only of the bottom layers. While it is plainly correct to state that the upper layers of deep models affect the bottom layers' parameter updates (as in any gradient-based learning), our work is the first to refine this argument by pointing out that this influence is crucial for enabling meta-learning. This is established through a mix of theoretical analysis (§2.4) and empirical studies (§2.5) carefully designed to reveal how increasing the depth enables meta-learning. The proposed META-KFO algorithm is motivated by our findings and also contributes to the work on meta-learning by enabling shallower models to meta-learn, attaining state-of-the-art performance on several benchmarks.

2.2 Background and notation

In MAML and its many variants, we have a model whose parameters are denoted by $\theta$. We would like to optimize $\theta$ such that the resulting model can adapt to new and unseen tasks fast. We are given a set of meta-training tasks, indexed by $\tau$. To each such task, we associate a loss $\ell_\tau(\theta)$. Distinctively, MAML minimizes the expected task loss after an adaptation phase, consisting of a few steps of gradient descent from the model's current parameters. Since we do not have access to how the target tasks are distributed, we use the expected loss over the meta-training tasks,

$$\mathcal{L}^{\text{MAML}}(\theta) = \mathbb{E}_{\tau \sim p(\tau)}\left[\ell_\tau\big(\theta - \alpha \nabla \ell_\tau(\theta)\big)\right] \tag{2.1}$$

where the expectation is taken with respect to the distribution of the training tasks. Here $p(\tau)$ is a shorthand for the distribution of the task: $p(\tau) = p(\theta_\tau)\,p(\mathbf{x}, y; \theta_\tau)$ for a set of conditional regression tasks where the data $(\mathbf{x}, y)$ follows a distribution parameterized by $\theta_\tau$, and $\alpha$ is the learning rate for the adaptation phase. The right-hand side of eq. (2.1) uses only one step of gradient descent, such that the aim is to adapt fast: in one step, we would like to reduce the loss as much as possible. In practice, a few more steps are often used in the meta-training phase.
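To make eq. (2.1) concrete, here is a minimal PyTorch sketch of ours (not the thesis's code) for the one-step MAML loss; `task_losses` is a hypothetical batch of per-task loss functions:

```python
import torch

def maml_loss(theta, task_losses, alpha=0.01):
    """The one-step MAML loss of eq. (2.1), averaged over a batch of tasks.
    Each element of `task_losses` maps parameters to a scalar loss."""
    total = 0.0
    for ell in task_losses:
        # Inner adaptation step; create_graph=True keeps the dependence of
        # the adapted parameters on theta, so the meta-gradient used by the
        # meta-training update of eq. (2.3) below can flow through this step.
        grad, = torch.autograd.grad(ell(theta), theta, create_graph=True)
        total = total + ell(theta - alpha * grad)
    return total / len(task_losses)

# Meta-training then follows eq. (2.3), with beta the meta-update rate:
# theta <- theta - beta * autograd.grad(maml_loss(theta, batch), theta)[0]
```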
We use

$$\theta^{\text{MAML}} = \operatorname*{argmin}_\theta\; \mathcal{L}^{\text{MAML}}(\theta) \tag{2.2}$$

to denote the minimizer of this loss, i.e., the MAML solution. Note that it is most likely different from each $\ell_\tau$'s minimizer. If we use gradient descent during meta-training, the parameter is updated as follows:

$$(\text{META-TRAINING}) \quad \theta \leftarrow \theta - \beta\, \frac{\partial \mathcal{L}^{\text{MAML}}(\theta)}{\partial \theta} \tag{2.3}$$

where the step size $\beta$ is called the meta-update learning rate. During meta-testing, the MAML solution is used as an initialization for solving new tasks with regular (stochastic) gradient descent:

$$(\text{META-TESTING}) \quad \theta \leftarrow \theta - \alpha\, \frac{\partial \ell_{\tau'}(\theta)}{\partial \theta} \tag{2.4}$$

where $\tau'$ denotes a new task and $\alpha$ is the adaptation learning rate.

2.3 Overview of our approach

To understand the relationship between depth and meta-learnability, we start by creating a failure scenario: we identify a base model and task setup where MAML fails to meta-learn. Then we employ a strategy of increasing the depth of the base model such that it becomes meta-learnable. Finally, we elucidate what the increased depth achieves and how it relates to existing methods for improving meta-learnability.

Achieving these 3 desiderata, however, is challenging with deep learning models. In essence, when the depth is increased, the improvement in performance metrics could be caused by several entangled factors. To see this, consider a base model $\mathcal{M}_1$ and a bigger model $\mathcal{M}_2$ which has more layers and is thus at least as powerful, if not strictly more so. Suppose the adaptation performances (say, classification accuracies) are such that $\mathcal{M}^{\text{MAML}}_1 \leq \mathcal{M}^{\text{MAML}}_2$. With respect to these models' Bayes optimal performances, what we would like to have first is the following relation:

$$\mathcal{M}^{\text{MAML}}_1 \leq \mathcal{M}^{\text{MAML}}_2 \leq \mathcal{M}^{\text{BAYES}}_1 \leq \mathcal{M}^{\text{BAYES}}_2 \tag{2.5}$$

where we can identify that the increase in performance metrics is solely due to improved meta-learning when the depth is increased.¹

It is hard to guarantee $\mathcal{M}^{\text{MAML}}_2 \leq \mathcal{M}^{\text{BAYES}}_1$ on real-world data, as we do not know the true underlying distributions. However, in theoretical analysis, this can be achieved by analyzing problem settings where the (base) model $\mathcal{M}^{\text{BAYES}}_1$ (and thus $\mathcal{M}^{\text{BAYES}}_2$ also) achieves 100% accuracy. §2.4 follows this design thinking by applying (correctly specified) models of linear regression and logistic regression to data. The design also enables us to recognize the "failure mode" of MAML, when the base model $\mathcal{M}^{\text{MAML}}_1$ is significantly worse than $\mathcal{M}^{\text{BAYES}}_1$, say, at chance level for classification.

¹ Consider the alternative relation $\mathcal{M}^{\text{MAML}}_1 \leq \mathcal{M}^{\text{BAYES}}_1 \leq \mathcal{M}^{\text{MAML}}_2 \leq \mathcal{M}^{\text{BAYES}}_2$. Then the observed increase in performance has several possible explanations: the increased depth makes meta-learning more effective, improves the model's power in solving the tasks, or results in a combination of both. Our design needs to rule the latter out.

Secondly, to ensure $\mathcal{M}_2$ does not increase the power of $\mathcal{M}_1$ in solving the tasks, our design of the theoretical analysis in §2.4 and the empirical studies in §2.5.1 and §2.5.2 increases the depth by adding linear layers to the outputs of the base models. We refer to this as the "LinNet" strategy for adaptation. While linear layers are often cited for implicit regularization that improves the generalization of models [80, 190], in our settings there is no overfitting. So those linear layers are indeed the only explanation for why meta-learning is improved (not increased representational capacity).

To hypothesize how depth facilitates meta-learning, our design goes beyond the standard argument that the upper layers of a deep neural net or added linear nets influence the bottom layers' gradients.
We specifically design an algorithm called META-KFO (§2.5.3), where a separate neural network learns to explicitly transform the gradients and enables meta-adaptation on shallower models that otherwise adapt poorly. The most important feature of this algorithm is that it keeps the base model's modeling capacity unchanged. This allows us to compare directly to deep models. Our empirical observations support the hypothesis that, in a deep neural network, the upper layers transform the gradients of the bottom layers as if the bottom layers alone were being meta-trained.

2.4 Theoretical analysis

We conduct a theoretical analysis on mathematically tractable models and task setups, following the design outlined in the previous section. We start by creating a failure mode of MAML, employing a 1-D regression as a base model that is not meta-learnable but can nonetheless solve the meta-testing tasks. We then increase the size of the base model to make it meta-learnable by overparameterizing it (i.e., adding a "linear layer"). Albeit unrealistic, this setup is of didactic value and also admits a simple analytical solution.² We describe the setup in §2.4.1 and empirical observations of the base and overparameterized models in §2.4.2. We analyze them in §2.4.3 and §2.4.4 and contrast the difference in parameter updates for both meta-training and meta-testing. The insights are discussed in §2.4.5, which motivates our empirical studies in §2.5.

² Others have studied the multi-dimensional version of this setup under different perspectives [16, 188, 49].

2.4.1 Setup

We consider the task of one-dimensional linear regression. Let the task parameter $\theta_\tau \sim N(0,1)$ be a normally distributed scalar and, likewise, the covariate $x \sim N(0,1)$. The observed outcome is $y \sim N(\theta_\tau x, 1)$. We investigate two models for their meta-learning performance:

$$\text{SHALLOW:}\quad \hat{y} = cx \tag{2.6}$$
$$\text{DEEP:}\quad \hat{y} = abx \tag{2.7}$$

Note that the "deep" model is overparameterized and can be seen as a two-layer neural net with weights $a$ and $b$ respectively. For each task, we use the least-squares loss

$$\ell_\tau(c) = \mathbb{E}_{p(x,y\mid\theta_\tau)}\,(y - cx)^2 \tag{2.8}$$
$$\ell_\tau(ab) = \mathbb{E}_{p(x,y\mid\theta_\tau)}\,(y - abx)^2 \tag{2.9}$$

Note that the data of these tasks are generated according to the models used for meta-learning.

2.4.2 SHALLOW fails; DEEP meta-learns

Fig. 2.1 (left) contrasts the two models' surprisingly different meta-learning performance. While DEEP (the red curve) quickly reduces the MSE on meta-testing tasks, the black curve demonstrates the poor performance of the MAML algorithm on the SHALLOW model. The results are unexpected, as both models are fully capable of solving the problem given enough data – in particular, the SHALLOW model has only one parameter $c$ to learn.

[Figure 2.1: Simple failure modes for MAML. SHALLOW models for regression (left: post-adaptation MSE vs. iterations, for Shallow and Deep linear regression) and classification (right: post-adaptation accuracy vs. iterations, for LogR, LogR+LinNet(2), LogR+LinNet(3), and an upper bound) fail, but overparameterized DEEP models are able to meta-learn.]

In Fig. 2.1 (right), we show a similar study of meta-learning linear classifiers, where the data is generated according to the models. The base model is $\hat{y} = \text{Bernoulli}(\sigma(c^\top x))$ and its overparameterized version is $\hat{y} = \text{Bernoulli}(\sigma(a^\top B x))$, where $a$, $B$, and $c$ are matrices or vectors and $x$ is a vector. The base model attains an accuracy at chance level, while the overparameterized one reaches near-perfect classification accuracy. As in the 1-D regression, the overparameterization (equivalent to adding two or three linear layers) enables meta-learning and fast adaptation. What roles could those additional parameters have played?
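The 1-D experiment is easy to reproduce in simulation. Below is a minimal sketch of ours (hyperparameters illustrative) that meta-trains both models on the §2.4.1 tasks; it uses the closed-form expected task loss $\ell(w) = (w - \theta_\tau)^2 + 1$ (derived in §2.4.3 below), and its gradient constants may differ from the text's convention by a factor of 2.

```python
import torch

alpha, beta, n_tasks = 0.1, 0.05, 16

def post_adaptation_loss(w_fn, params, theta_task):
    """Expected task loss ell(w) = (w - theta)^2 + 1 after one inner step."""
    loss = (w_fn(params) - theta_task) ** 2 + 1.0
    grads = torch.autograd.grad(loss, params, create_graph=True)
    adapted = [p - alpha * g for p, g in zip(params, grads)]
    return (w_fn(adapted) - theta_task) ** 2 + 1.0

def meta_train(params, w_fn, iters=2000):
    opt = torch.optim.SGD(params, lr=beta)
    for _ in range(iters):
        thetas = torch.randn(n_tasks)  # theta_tau ~ N(0, 1)
        opt.zero_grad()
        loss = sum(post_adaptation_loss(w_fn, params, t)
                   for t in thetas) / n_tasks
        loss.backward()
        opt.step()
    return params

# SHALLOW: y_hat = c * x          DEEP: y_hat = a * b * x
shallow = meta_train([torch.randn((), requires_grad=True)], lambda p: p[0])
deep = meta_train([torch.randn((), requires_grad=True),
                   torch.randn((), requires_grad=True)],
                  lambda p: p[0] * p[1])
# Evaluating post_adaptation_loss on fresh tasks should reproduce the
# qualitative gap of Fig. 2.1 (left): DEEP adapts well, SHALLOW does not.
```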
The base model attains accuracy at the chance level, while the overparameterized one reaches near-perfect classification accuracy. As in the 1-D regression, the overparameterization (equivalent to adding two or three linear layers) enables meta-learning and fast adaptation. What roles could those additional parameters have played?

2.4.3 Analysis of the SHALLOW model

The MAML solution. It is easy to see that the MAML solution is the origin, $c^{\text{MAML}} = 0$, given the symmetries in both $x$ and $\theta_\tau$. We state the following results:

$\ell_\tau(c - \alpha \nabla \ell_\tau) = (1 - \alpha)^2 (c - \theta_\tau)^2 + \text{const}$   (2.10)
$\mathcal{L}^{\text{MAML}}_{\text{SHALLOW}} = 2(1 - \alpha)^2 c^2 + \text{const}$   (2.11)

The MAML loss is a convex function with its minimizer at $c^{\text{MAML}} = 0$, in accordance with our intuition. The gradient for the $\tau$th task is given by

$\frac{\partial \ell_\tau(c - \alpha \nabla \ell_\tau)}{\partial c} \propto (1 - \alpha)^2 (c - \theta_\tau)$   (2.12)

Note that the gradient is proportional to the deviation from the "ground-truth" parameter $\theta_\tau$. The parameter updates during meta-training and adaptation are given by (cf. eqs. (2.3) and (2.4))

(META-TRAINING)  $c \leftarrow c - \beta (1 - \alpha)^2 (c - \theta_\tau)$   (2.13)
(META-TESTING)  $c \leftarrow c - \alpha (c - \theta_\tau)$   (2.14)

No one-step adaptation. Suppose we would like to adapt to a new task whose parameter is $\theta'$. Starting from the MAML solution $c^{\text{MAML}} = 0$, we get

$c \leftarrow c^{\text{MAML}} - \alpha (c^{\text{MAML}} - \theta') = \alpha \theta'$   (2.15)

Thus, unless $\alpha$ happens to be 1, the optimal solution cannot be reached in one step of adaptation. However, when $\alpha = 1$, the gradient of $\mathcal{L}^{\text{MAML}}_{\text{SHALLOW}}$ is zero (cf. eq. (2.12)), so meta-learning cannot occur.

2.4.4 Analysis of the DEEP model

The MAML solution. Unfortunately, for the DEEP model, both the gradients and the losses are very complicated. We state the following:

• The origin of the parameter space $(a = 0, b = 0)$ is a stationary point, and the Hessian there is $H = -4\alpha I$. The origin is therefore a local maximum, and thus not a MAML solution.

• The 4 pairs $(a^{\text{MAML}} = \pm 1/\sqrt{\alpha},\ b^{\text{MAML}} = 0)$ and $(a^{\text{MAML}} = 0,\ b^{\text{MAML}} = \pm 1/\sqrt{\alpha})$ are local minima, with the Hessian given by $\text{DIAG}(8\alpha, 6\alpha^3)$ (up to symmetry), and are thus MAML solutions.

[Figure 2.2: Meta-learning of a 1-D linear regression model (§2.4.1). (Left) MAML loss of DEEP, showing multiple (local) minima with deep valleys. (Right) 4 meta-training trajectories (of parameters) converging to each of the 4 solutions.]

We visualize $\mathcal{L}^{\text{MAML}}_{\text{DEEP}}$ in Fig. 2.2, where we can clearly see the 4 local minima (as well as the deep valleys, or "ravines") and how trajectories of parameter updates converge to them.

The gradients involve high-order polynomials of $a$ and $b$. To gain insight, in what follows we hold $a$ fixed and examine the gradient with respect to $b$ during meta-training. This is reminiscent of ANIL [169]. The resulting form of the gradient is greatly simplified yet remains insightful:

$\frac{\partial\, \ell_\tau\big(a (b - \alpha \nabla_b \ell_\tau)\big)}{\partial b} \propto a (1 - \alpha a^2)^2 (ab - \theta_\tau)$   (2.16)

The symbol $\nabla_b$ indicates that only $b$ is meta-learned, with $a$ held fixed. This leads to the following:

(META-TRAINING)  $b \leftarrow b - \beta a (1 - \alpha a^2)^2 (ab - \theta_\tau)$   (2.17)
(META-TESTING)  $b \leftarrow b - \alpha a (ab - \theta_\tau)$   (2.18)
  $a \leftarrow a - \alpha b (ab - \theta_\tau)$   (2.19)

One-step fast adaptation. The DEEP model has qualitatively very different adaptation behavior from the SHALLOW model. As before, at the MAML solution $(a^{\text{MAML}} = 1/\sqrt{\alpha},\ b^{\text{MAML}} = 0)$, we perform an adaptation step on a new task with ground-truth parameter $\theta'$.
Holding $a^{\text{MAML}} = 1/\sqrt{\alpha}$ fixed, the update to $b$ is

$b^{\text{NEW}} \leftarrow b^{\text{MAML}} - \alpha\, a^{\text{MAML}} (a^{\text{MAML}} b^{\text{MAML}} - \theta') = \sqrt{\alpha}\, \theta'$   (2.20)

Note that $(a^{\text{MAML}} = 1/\sqrt{\alpha},\ b^{\text{NEW}} = \sqrt{\alpha}\theta')$ is precisely the optimal solution to the task, as $a^{\text{MAML}} b^{\text{NEW}} = \theta'$. In other words, we need only one parameter update to arrive at the optimal solution! In fact, this fast adaptation does not depend on what $\alpha$ is, and does not even depend on whether we adapt from the right $b^{\text{MAML}}$: for any random $b^{\text{RANDOM}}$, the update in eq. (2.20) immediately brings $b^{\text{RANDOM}}$ to $b^{\text{NEW}} = \sqrt{\alpha}\theta'$!

2.4.5 Insights from SHALLOW versus DEEP

We examine the gradient updates of the two models. First, for adaptation during meta-testing, both eq. (2.18) and eq. (2.14) share the same element proportional to the error signal: $(ab - \theta_\tau)$ for DEEP and $(c - \theta_\tau)$ for SHALLOW. However, the additional factor $a$ in the DEEP model enables the one-step fast adaptation that the SHALLOW model cannot attain, as shown in the previous section.

Turning to meta-training, we also notice the different scaling factors applied to the error signals. Contrasting eq. (2.17) with eq. (2.13), the effective step size for the former depends on $a(1 - \alpha a^2)^2$ and cannot be absorbed into the meta-learning rate $\beta$ if $a$ is also updated. Namely, in the meta-training phase, the step size for updating the DEEP model's parameters is dynamically adjusted, while the step size for the SHALLOW model is fixed.

While fully characterizing how $a$'s dynamics change the parameter updates is left for future work, we concentrate our analysis on the neighborhood of the solutions, say $(a^{\text{MAML}} = 1/\sqrt{\alpha},\ b^{\text{MAML}} = 0)$ (the other 3 are symmetric to this one). Note that the farther $a$ is from $a^{\text{MAML}}$, the bigger the step size is (in magnitude) to amplify the error signal $(ab - \theta_\tau)$. This has the effect of moving $b$ more quickly toward the solution $\theta_\tau / a$ for the $\tau$th task, or toward the MAML solution $b^{\text{MAML}} = 0$ (as the expectation with respect to the task distribution is 0).

Furthermore, at the MAML solution $(a^{\text{MAML}} = 1/\sqrt{\alpha},\ b^{\text{MAML}} = 0)$, the update in eq. (2.17) is stationary for any task. Imagine a new task $\tau'$ with ground-truth parameter $\theta'$ is randomly sampled for meta-training. Even if $a^{\text{MAML}} b^{\text{MAML}} \ne \theta'$, the gradient-based update will not change $b^{\text{MAML}}$ from 0. On the other hand, for the SHALLOW model, even if the model is at the solution $c^{\text{MAML}} = 0$, a randomly sampled task will move the parameter to a nonzero value proportional to $\theta'$ (cf. eq. (2.13)), drifting away from the MAML solution. In other words, the MAML solution is more stable in DEEP than in SHALLOW when (as is common practice) stochastic gradient descent is used.

We leave a more comprehensive characterization of the local and global dynamics of the parameter updates to future work. Here, our focus is: how do these observations shed light on the more complicated models used in practice? The first insight is that even with the same modeling capacity, depth plays an important role in enabling meta-learning, even if the depth takes the form of an additional scalar parameter or additional linear layers.

The second insight comes from extending the 1-D linear regression to multi-dimensional regression, where the base model is $\hat{y} = Cx$ and its overparameterized version is $\hat{y} = a^\top B x$. It is not hard to see that the forms of the gradients suggest the additional parameters affect not only scaling factors (i.e., magnitude) but also gradient directions, transforming them through those additional parameters. While hard to analyze mathematically, our empirical studies below provide strong evidence.
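The closed-form updates above are also easy to verify numerically. The following minimal simulation is ours (NumPy; not part of the dissertation's experiments): it meta-trains the SHALLOW model with eq. (2.13), places the DEEP model's $a$ at the MAML solution $1/\sqrt{\alpha}$, and then takes a single adaptation step with eqs. (2.14) and (2.18).

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.1, 0.05   # adaptation / meta-update learning rates

# SHALLOW: meta-train c with eq. (2.13); its fixed point is c = 0.
c = rng.normal()
for _ in range(5000):
    theta = rng.normal()                       # task parameter theta_tau ~ N(0, 1)
    c -= beta * (1 - alpha) ** 2 * (c - theta)

# DEEP: place a at the MAML solution a = 1 / sqrt(alpha), pick b at random;
# eq. (2.20) says one adaptation step on b suffices for *any* starting b.
a = 1.0 / np.sqrt(alpha)
b = rng.normal()

# Meta-test on a new task theta_prime: one gradient step each.
theta_prime = rng.normal()
c_new = c - alpha * (c - theta_prime)            # eq. (2.14): lands at ~alpha * theta_prime
b_new = b - alpha * a * (a * b - theta_prime)    # eq. (2.18): lands at sqrt(alpha) * theta_prime
print("shallow post-adaptation error:", abs(c_new - theta_prime))     # large unless alpha == 1
print("deep    post-adaptation error:", abs(a * b_new - theta_prime)) # zero up to float error
```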
2.5 How to be (more) meta-learnable

The analysis of the highly idealized models in the previous section needs to be verified empirically on real-world datasets. We first validate the findings of §2.4.2 by showing that adding linear layers ("LinNet"), as a general strategy for increasing the depth of models, improves meta-learning in both shallow models (§2.5.1) and deep models (§2.5.2). To further clarify the role of the upper layers of deep models in meta-learning, we propose a new meta-learning algorithm in §2.5.3 and conduct additional empirical studies. While this algorithm, META-KFO, is primarily used in this work to investigate how MAML works, it also attains state-of-the-art performance. We perform our study on common benchmark datasets of meta-learning (for few-shot classification).

[Figure 2.3: Meta-training logistic regression models with MAML on Omniglot, CIFAR-FS, and mini-ImageNet led to poor performances. Adding linear nets improves meta-learning significantly, without changing the model's capacity.]

Datasets and settings. In the following study, we use the standard 5-ways 5-shots setting on the Omniglot [114], CIFAR-FS [24], and mini-ImageNet [227] datasets. We denote by CNN(X) the convolutional network with X convolutional layers; for example, CNN(4) corresponds to the baseline network also used by Finn et al. [67] and Raghu et al. [169], among many others. To ensure fair comparison, we independently reimplemented each algorithmic variant and found the best hyper-parameter values for each architecture-algorithm pair via grid search.

2.5.1 Linear layers improve shallow linear models

Fig. 2.3 displays the results of meta-learning using MAML with logistic/softmax regression models on standard benchmark datasets. The light blue horizontal lines denote the best performance achievable if the models are trained with sufficient data from the meta-testing tasks. The black lines are the meta-learning performance, which only slightly improves upon chance level (20% on Omniglot and CIFAR-FS).

Table 2.1: Accuracy improves by adding linear layers.

                        MAML                      MAML w/ LinNet
CNN Layers       2      3      4      6      2      3      4      6
Omniglot         66.8   93.5   98.5   97.6   88.1   95.5   98.1   97.6
CIFAR-FS         62.2   68.9   70.9   71.3   66.1   71.1   74.4   71.9
mini-ImageNet    52.6   54.0   64.1   64.6   60.5   60.2   64.9   64.1

Table 2.2: Accuracy improves on ANIL-trained CNN(2).

Dataset          w/o LinNet   w/ LinNet
Omniglot         91.00        93.02
CIFAR-FS         66.10        67.55
mini-ImageNet    56.42        56.64

However, as in Fig. 2.1, when linear layers are added to these linear models, the meta-learning performances are significantly improved.

2.5.2 Linear layers improve deep nonlinear models

Table 2.1 lists the positive results of adding two linear layers to different CNN architectures. When the number of CNN layers is less than 6, the addition improves meta-learning performance. At CNN(6), there are degradations in performance for the original MAML on Omniglot and CIFAR-FS, such that LinNet does not improve further. On mini-ImageNet, while MAML improves, LinNet decreases, though its performance at CNN(4) is still the best.
When degradation occurs, it is marginal (0.4% for CNN(4) on Omniglot and 0.5% for CNN(6) on mini-ImageNet) and not alarming: for Omniglot, the standard deviation on accuracies is 0.76%, and on mini-ImageNet it is 1.12%.

Table 2.2 generalizes the positive findings of Table 2.1 to ANIL [169]. Thus, we believe LinNet is a broadly applicable strategy for improving meta-learning.

2.5.3 Meta-Optimizer for fast adaptation

Main idea. It is straightforward to see that the added linear layers (LinNets) function similarly to the upper layers of deep learning models: the parameter updates for the bottom layers beneath such layers are modulated by the parameters in the upper layers or the LinNets. However, in what ways does this modulation help meta-learning?

Related to this question is meta-learning via learning to optimize, i.e., transforming the gradients of the models [127, 161, 124, 73]. Those types of preconditioning techniques could also be used to make a (smaller) model (more) meta-learnable. Thus, are the parameter updates in deep models equivalent to transformed gradient updates produced by such techniques? Note that there is a subtle difference: in some of these techniques (such as T-Nets and WarpGrad), the loss function used to compute the gradients of the bottom layers prior to transformation actually contains the transformation parameters themselves; cf. eq. (2.25) for an example. This type of "inline" transformation de facto increases the model capacity by injecting more parameters. Our goal, however, is different: we aim to disentangle the increase in model capacity from the ability to transform gradients. The empirical observations from this approach will enable us to answer the aforementioned question more clearly.

In the following we give a brief account of various approaches for learning to optimize, including our proposed META-KFO algorithm. META-KFO merely transforms the gradients of a smaller model, without increasing its modeling capacity, yet still results in better meta-learnability. Furthermore, the improvement diminishes as the smaller model gets bigger. We surmise this is why sufficiently large deep models can meta-learn: the upper layers have the equivalent effect of transforming the gradients of the bottom layers, as if the upper layers were an external meta-optimizer operating on a smaller network composed only of the bottom layers.

META-KFO and other meta-optimization methods. A meta-optimizer is a parameterized function $U_\xi$ defining the model's parameter updates. For example, a linear meta-optimizer might be defined as

$U_\xi(g) = Ag + b,$   (2.21)

where $\xi = (A, b)$ is the set of parameters of the linear transformation. The objective is to jointly learn the model's parameters and the optimizer's parameters $\xi$ to accelerate optimization. Motivated by the analysis of meta-learning in deep nets, we propose to use such an optimizer to transform the gradient updates:

$\mathcal{L}^{\text{MAML}}_{\text{MO}}(\theta) = \mathbb{E}_{\tau \sim p(\tau)} \big[ \ell_\tau\big(\theta - \alpha\, U_\xi(\nabla \ell_\tau(\theta))\big) \big]$   (2.22)

so that $U_\xi$ takes a role similar to that of the upper layers in deep nets in minimizing the MAML loss:

$\theta \leftarrow \theta - \beta\, \frac{\partial \mathcal{L}^{\text{MAML}}_{\text{MO}}(\theta)}{\partial \theta}, \qquad \xi \leftarrow \xi - \beta\, \frac{\partial \mathcal{L}^{\text{MAML}}_{\text{MO}}(\theta)}{\partial \xi}$   (2.23)

where $\beta$ is the meta-update learning rate. In this notation, Meta-Curvature [161] implements

(MC)  $U_\xi(\nabla \ell(\theta)) = M \nabla \ell(\theta)$   (2.24)

where $M$ is a (block-diagonal, tensor-factorized) matrix. When $M$ is diagonal, this becomes Meta-SGD [127]. Furthermore, when $M$ is the identity, this becomes MAML.
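As a concrete illustration, a diagonal meta-optimizer (the Meta-SGD-style diagonal case of eq. (2.24)) can be trained jointly with the model by differentiating through the inner step. The sketch below is ours (PyTorch); `sample_task` is a hypothetical helper returning support and query data for one task.

```python
import torch

W = torch.randn(5, 64, requires_grad=True)       # theta: weights of a small linear model
log_m = torch.zeros(5, 64, requires_grad=True)   # xi: log of the diagonal of M in eq. (2.24)
alpha, beta = 0.5, 1e-3
meta_opt = torch.optim.Adam([W, log_m], lr=beta)

def loss_fn(weights, x, y):
    return torch.nn.functional.cross_entropy(x @ weights.t(), y)

for _ in range(10000):
    x_s, y_s, x_q, y_q = sample_task()           # hypothetical task sampler
    # Inner step of eq. (2.22): theta' = theta - alpha * U_xi(grad), with U_xi(g) = M * g.
    grad, = torch.autograd.grad(loss_fn(W, x_s, y_s), W, create_graph=True)
    W_adapted = W - alpha * log_m.exp() * grad
    # Outer step of eq. (2.23): update theta and xi by backpropagating through the inner step.
    meta_opt.zero_grad()
    loss_fn(W_adapted, x_q, y_q).backward()
    meta_opt.step()
```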
For T-Nets (to be used with the MAML loss), the model parameters are expanded with affine transformations:

(T-NETS)  $\ell_\tau\big(\mathcal{A}(\theta) - \alpha \nabla \ell_\tau(\mathcal{A}(\theta))\big)$   (2.25)

where the transformation $\mathcal{A}(\cdot)$ contains two components $(W, T)$: $T$ is shared by all the tasks and $W$ is task-specific. Since $\mathcal{A}$ is linear, it can be absorbed into the original model after adaptation. For WarpGrad [73], the transformation $\mathcal{A}$ is defined with nonlinear layers, thus strictly increasing the size of the original model (and is therefore not considered in this work).

Our method takes the form

(META-KFO)  $U_\xi(\nabla \ell(\theta)) = f(\nabla \ell(\theta); \phi)$   (2.26)

where $f(\cdot)$ is a nonlinear function parameterized by a set of parameters $\phi$ that is independent of the model's parameters $\theta$. This approach generalizes MC (eq. (2.24)) and is more adaptive, since the gradient $\nabla \ell(\theta)$ itself is used as the input. For models with a large number of parameters, the transformation $U$ (i.e., $\mathcal{A}$, $M$, or $f(\cdot)$) could contain many parameters and incur a high computational cost; for details, please refer to the cited references. Essentially, $f(\cdot)$ is parameterized with a neural network in which the gradients $\nabla \ell(\theta)$ are manipulated with Kronecker products.

[Figure 2.4: The effect of the number of convolutional layers on adaptation performance. First, as the model size increases, the performances of both methods improve. Besides better meta-learning, the improvement can also be caused by the model's increased capacity to learn the target tasks. Second, the "net gain" from META-KFO has a diminishing trend as the size increases. In other words, the benefits of directly transforming gradients with an external meta-optimizer shrink as the upper layers of the larger models gain more capacity to meta-learn to control their own bottom layers.]

Results. Table 2.3 contrasts different approaches for improving meta-learning by MAML on CNN(2) without increasing the size of the model after adaptation. All methods improve on the original MAML, with META-KFO improving the most on Omniglot and CIFAR-FS; on mini-ImageNet, all methods improve by about the same amount. Again, all methods improve over the ANIL baseline, with META-KFO improving the most.

Table 2.3: Meta-optimizers outperform MAML on CNN(2).

                          MAML w/
Dataset          MAML     MSGD     MC      T-Nets   META-KFO
Omniglot         66.6     74.07    94.63   92.27    96.62
CIFAR-FS         62.2     62.82    68.37   66.42    69.64
mini-ImageNet    52.6     59.90    58.95   58.47    59.08

Fig. 2.4 examines the improvement of META-KFO over MAML with respect to network size. As expected, META-KFO improves the most when the model is small, and the improvement shrinks once the model is sufficiently large. In other words, when the model is deep enough to meta-learn by itself, using its top layers to control the gradients of the bottom layers, there is less advantage in using an external meta-optimizer to learn the bottom layers. We view this as strong evidence for the theory that, in deep neural networks that meta-learn well, the upper layers have the equivalent effect of transforming the gradients of the bottom layers, as if the upper layers were an external meta-optimizer operating on a smaller network composed only of the bottom layers. They are "external meta-optimizers that work from the inside".
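To illustrate how a gradient can be "manipulated with Kronecker products" without touching the base model, here is a schematic sketch (ours; the actual META-KFO parameterization is more elaborate, see the cited reference). Because the transform only acts on gradients, the adapted model keeps the base architecture and capacity; the transform is discarded after adaptation.

```python
import torch
import torch.nn as nn

class KroneckerGradTransform(nn.Module):
    """Nonlinear gradient transform in the spirit of eq. (2.26); simplified sketch.

    For a layer gradient G (an m x n matrix), computes f(G) = L @ tanh(G) @ R,
    i.e., vec(f(G)) = (R^T kron L) vec(tanh(G)): a Kronecker-structured map whose
    parameters phi = (L, R) are independent of the base model's parameters.
    """
    def __init__(self, m: int, n: int):
        super().__init__()
        self.L = nn.Parameter(torch.eye(m))   # left factor, initialized to identity
        self.R = nn.Parameter(torch.eye(n))   # right factor, initialized to identity

    def forward(self, grad: torch.Tensor) -> torch.Tensor:
        return self.L @ torch.tanh(grad) @ self.R

# Adaptation step: theta' = theta - alpha * f(grad; phi).
m, n, alpha = 5, 64, 0.5
W = torch.randn(m, n, requires_grad=True)
kfo = KroneckerGradTransform(m, n)
grad = torch.randn_like(W)        # stand-in for an autograd-computed gradient
W_adapted = W - alpha * kfo(grad)
```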
2.6 Related work

Understanding how MAML and similar methods work continues to draw research interest [66, 59, 169, 188]. Many such studies have left open questions to be carefully analyzed and hypotheses to be tested.

Finn and Levine [66] showed that, when combined with deep architectures, GBML is able to approximate arbitrary meta-learning schemes. That work assumes the model is meta-learnable to begin with, relying on the argument that deep models are universal approximators. Fallah et al. [59] provided convergence guarantees for MAML. Other analyses have attempted to explain the generalization ability of GBML [84, 151], the bias induced by restricting the number of adaptation steps [244], or the effect of higher-order terms in the meta-gradient estimation [74, 181]. Those works do not directly investigate what elements in deep models make them meta-learn well. Raghu et al. [169] suggested that the bottom layers of a neural network learn representations while the upper layers are responsible for the inductive bias to adapt fast. This observation echoes the success of some other approaches for meta-learning [210, 122]. But that work does not explain what the "magic" of the top layers is that enables meta-learning.

We also investigate how adaptation could be provided by a meta-optimizer. Meta-SGD meta-learns per-parameter learning rates [127], while Alpha MAML adapts those learning rates during fast adaptation [19]. Meta-Curvature learns a block-diagonal preconditioning matrix to compute fast-adaptation updates [161], and T-Nets extend that by decomposing all weight matrices of the model into two separate components [124]. WarpGrad further extends T-Nets by allowing both components to be nonlinear functions [73].

The most salient difference between our work and existing ones is our focus on studying what makes deep models meta-learnable. Not only do we conclude that being sufficiently deep is essential for meta-learning to succeed, but we also theorize that the upper layers in deep models essentially function as "embedded meta-optimizers". Our extensive empirical studies complement the theoretical work of Saunshi et al. [188], which suggests that deep models might attain a lower loss than shallow ones.

2.7 Conclusion

Are deep architectures necessary for meta-learning, even if the tasks can be solved with shallow (linear) networks? Our analysis suggests so. How does depth benefit meta-learning? Our studies theorize that the upper layers of deep models learn to transform the gradients of a smaller network composed of only the bottom layers. Thus, appending a few linear layers to a shallower network is a simple yet surprisingly effective way to boost its ability to adapt. A more powerful but more involved alternative is to resort to external meta-optimizers. We hope our observations can inspire future algorithms and studies.

Chapter 3
When Do We Need Meta-Learning?

3.1 Introduction

Few-shot learning, the ability to learn from limited supervision, is essential for the real-world deployment of adaptive machines. Although proposed more than 20 years ago [144, 62], this field has recently been the focus of vast research efforts, and a plethora of methods have been proposed to tackle many of its challenges, including knowledge transfer and adaptation.

However, the fundamental problem of evaluating those methods remains largely unaddressed. Although many standardized benchmarks exist, they follow one of two recipes to generate classification tasks: they either partition classes at random (e.g.,
[24, 227]) or leverage class semantic relationships (e.g., [155, 176, 221]). The former implicitly assumes that train and test tasks come from the same distribution, leading to overly optimistic evaluation. The latter, although more realistic, requires additional human knowledge, which can be expensive to gather, when available at all.

This status quo is unsatisfactory because different applications call for different benchmarking schemes: a model that performs best when train and test tasks are similar does not necessarily achieve top accuracy when the two tasksets significantly differ. In other words, the quality of a few-shot learning algorithm depends on both train and test tasks, and on their relative similarity. Without more fine-grained benchmarks, we might miss important properties of our algorithms and hinder their deployment to real-world scenarios. For example, recent work suggests that simply learning a good feature extractor might be all we need for few-shot classification [219, 169].

Table 3.1: 5-ways 5-shots classification accuracy of metric- and gradient-based methods when transfer is most challenging. In this regime, methods that adapt their embedding function (Finetune, MAML) outperform those that do not, which were thought to be sufficient for few-shot learning.

             CIFAR100   mini-IN   tiered-IN   LFW10    EMNIST
ANIL         49.27%     54.42%    73.24%      77.86%   86.54%
ProtoNet     50.46%     60.36%    78.64%      85.48%   89.81%
Multiclass   53.02%     61.86%    81.90%      82.26%   90.96%
MAML         55.73%     62.25%    n/a         86.36%   91.92%
Finetune     70.98%     70.47%    83.57%      83.35%   93.51%

Notably, such methods match and often surpass the performance of gradient-based methods by sharing a feature extractor across all tasks and only adapting a final classification layer. But it stands to reason that in the extreme case where train tasks contain no information relevant to the test ones (i.e., transfer is impossible), those methods will underperform methods that are allowed to adapt their feature extractor. Table 3.1 displays results for such instances, where gradient-based methods (MAML [68]) dominate on tasksets carefully designed to challenge transfer.

[Figure 3.1: ATG, a method to generate tasksets of varying difficulty. First, we compute each class embedding by averaging the embedding f(x) of all images x associated with that class. Then, we partition those class embeddings using a penalized clustering objective. If we want easy tasksets, we find clusters such that train and test classes are pulled together; for hard tasksets, we push those distributions apart.]

In this chapter we propose Automatic Taskset Generation (ATG), a method to automatically generate tasksets from existing datasets, with fine-grained control over the transfer difficulty between train and test tasks (cf. Figure 3.1). Our method can be understood as a penalized clustering objective that enforces a desired similarity between train and test tasks. Importantly, it does not require additional human knowledge and is thus amenable to settings where this information is not available.

We use ATG to study and evaluate the two main families of few-shot classification algorithms: gradient-based and metric-based methods. Our results on 5 tasksets, including two new ones, show that gradient-based methods become particularly compelling when transfer is most challenging.

Contributions. We make the following contributions:

• ATG, a method to automatically generate tasksets that does not require additional human knowledge.
• Extensive validation and study of our method, showing it can effectively control the degree of transfer difficulty between train and test tasks.

• An empirical analysis of popular few-shot learning methods, suggesting that gradient-based methods outperform metric-based ones in the most challenging transfer regimes.

Our implementations and tasksets are publicly available at: http://seba1511.net/projects/atg

3.2 Related works

Few-shot learning. The goal of few-shot learning is to produce a model able to solve new tasks with access to only limited amounts of data [62, 144]. It is closely related to meta-learning, i.e., devising models that learn to learn [21, 192], but with a particular emphasis on the small quantity of available data. This research direction has received a lot of interest in recent years, due to numerous applications in natural language processing [149, 259, 249], medicine [33, 4, 166], and more [248, 23, 28, 233]. In the computer vision domain, a wide range of approaches have been proposed to tackle the few-shot image classification challenge. Those include learning fast optimization schemes [172], weight-imprinting [167], memory-augmented neural networks [187], casting the problem as stochastic process learning [77], or approaching it from a Bayesian perspective [157]. This chapter focuses on two major families of few-shot learning algorithms, which we review below; for a more exhaustive account of few-shot and meta-learning, we refer the reader to the surveys of [232] and [223].

Gradient-based few-shot learning. The main idea behind gradient-based few-shot learning algorithms is to discover a model whose parameters can be adapted with just a few steps of gradient descent [64]. The representative algorithm of this family, MAML [68], does so by learning an initialization end-to-end while adapting all weights of the model. Due to its flexible formulation, MAML was successfully applied to vision [68], robotics [44], lifelong learning [70], and more [270]. Variations of MAML improve upon its adaptation ability, reduce its computational footprint, or both. Notable attempts to improve MAML's performance include meta-learning parameters dedicated to optimization, implicit [123, 72] or explicit [126, 162, 20], probabilistic regularization schemes [83, 257], as well as various training refinements [6]. To reduce the burden of computing the second-order derivatives induced by the MAML objective, its authors suggested omitting those derivatives altogether, at the cost of decreased performance. When taking several gradient steps, one can leverage the implicit function theorem to obtain the model updates more cheaply [170]. Other options to mitigate the expense of second-order derivatives consist of conditioning the model on a latent embedding and only updating that embedding during adaptation [184, 171, 272]. In the same vein, the authors of ANIL [169] suggest that it is sufficient to update the very last layer of the neural network to reap the benefits of MAML; that is, the model learns a feature extractor shared across all tasks and simply updates the linear classifier a few times for each new task.

Metric-based few-shot learning. Similar to ANIL, metric-based methods also share a feature extractor across tasks to extract high-dimensional embedding representations. But rather than adapting a linear classifier, they compute the distance of new instances' embeddings to the few-shot set of reference embeddings, akin to nearest-neighbour classification.
Matching Networks [227], which were proposed to tackle the special case of one-shot learning, use the (negative) cosine similarity as a distance measure and rely on two different networks, one for the query and one for the reference embedding. Prototypical Networks (ProtoNet) [209] generalize Matching Networks to the few-shot setting and measure similarity between embeddings with the Euclidean distance. Some extensions of ProtoNet center around learning and improving a specialized distance metric for a given task, typically by solving a convex optimization problem [24, 121, 264]. Others attempt to improve the embedding representations of a task [155, 180, 254, 128]. Finally, ProtoMAML [221] combines MAML with ProtoNet such that the classification layer is initialized with the reference embedding values and the model is adapted for a few steps of gradient descent.

A recent line of work puts many of those advances into question, showing that a simple baseline matches, and often surpasses, state-of-the-art performance in few-shot classification [219, 231, 53, 41]. Rather than learning by sampling many different tasks, they suggest aggregating the classes from all tasks and training a model via standard one-vs-rest multiclass classification. At test time, the last linear layer of the model is removed to obtain a trained feature extractor, which can be used à la ProtoNet. Directly [219] or not, those results beg the question: do we still need to adapt our features for few-shot learning?

Tasksets for few-shot learning. This chapter tackles that question when transfer to new tasks is especially challenging. In this context, few-shot classification tasksets can be broadly categorized in two groups: those that leverage additional human information and those that do not. In the latter case, the standard approach consists of taking an existing classification dataset and randomly partitioning its classes into train and test sets. Typical examples include CIFAR-FS [24], mini-ImageNet [227, 172], TCGA [186], MultiMNIST [185, 201], CU Birds [234], FGVC Planes [139] and Fungi [211, 197], VGG Flowers [152], and Omniglot [113, 116]. Tasksets that do take advantage of human knowledge (usually in the form of semantic class relationships) attempt to minimize information overlap in order to challenge transfer. To the best of our knowledge, there are two such tasksets: FC100 [155] and tiered-ImageNet [176]. The former is built on CIFAR100 [108] and leverages its superclass structure, while tiered-ImageNet is a subset of ImageNet-1k [183] and uses the WordNet [145] database. Building on those datasets (and some more) is Meta-Dataset [221]: a large collection of tasksets aimed at mimicking transfer in real-world scenarios. For example, one benchmarking scenario consists of training a model on ImageNet-1k and evaluating its performance on the remaining 9 tasksets. Similar in spirit is VTAB [260], which also consists of a large collection of synthetic and natural images aimed at evaluating representation learning and transfer.

In contrast to prior work, we study the question raised by Tian et al. [219]: is a good embedding sufficient for few-shot learning? We approach this question from the standpoint where transfer is particularly challenging. Since training gradient-based methods on large benchmarks such as Meta-Dataset and VTAB is prohibitively expensive, we devise a method to automatically generate tasksets with explicit control over transfer difficulty.
3.3 Background and notation

We now present the few-shot learning setting, introduce notation, and review the few-shot algorithms of interest in the remainder of this chapter.

Datasets, tasksets, and tasks. We designate by dataset a set of input-class pairs $(x, y)$, where the dependent variable $y$ takes values from a finite set of classes $y_1, \dots, y_M$. This dataset induces a classification task, which consists of finding a predictive function mapping $x$ to $y$. One common approach to construct a set of such tasks, i.e., a taskset, is to first sample a subset of $N$ classes, and then sample $K$ input-class pairs for each of those classes. This setting is usually referred to as N-ways K-shots.

As we often wish to evaluate the transferability of few-shot models from one taskset to another, we first partition the $M$ classes of the dataset into train and test splits and construct tasksets from those splits. As mentioned in Section 3.1, partitioning can either be random, in which case train and test classes come from the same distribution, or it can leverage additional human knowledge. This additional information typically defines a hierarchy over the classes (e.g., semantic class relationships), which can be used for partitioning. For example, the 100 classes of CIFAR100 are grouped into 20 superclasses, and FC100 uses 12 of those superclasses (60 classes) for training and 4 (20 classes) each for validation and testing.

Once tasksets are defined, the goal of a few-shot learning algorithm is to find a model of the train tasks that generalizes to the test tasks. To achieve this goal, we associate to each task $\tau$ a loss $\mathcal{L}_\tau$ and a parameterized model $p_\tau(y \mid x)$. Without loss of generality, we write this model as the composition of a linear layer $w$ and a feature extractor $\phi$ (e.g., a neural network). Then, finding a maximum likelihood solution for $\phi$ and $w$ boils down to minimizing the average task loss $\mathbb{E}_\tau[\mathcal{L}_\tau(\phi, w)]$ over the training tasks.

Algorithms. Gradient-based methods, and MAML in particular, compute the likelihood after adapting $\phi$ and $w$ for one (or a few) gradient steps. For the $i$th class, this likelihood takes the form:

$p_\tau(y = y_i \mid x) = \text{softmax}(\phi'(x)^\top w'_i)$   (MAML)
s.t. $w' = w - \alpha \nabla_w \mathcal{L}_\tau, \quad \phi' = \phi - \alpha \nabla_\phi \mathcal{L}_\tau$

where $\alpha$ denotes the adaptation learning rate, and $w'_i$ is the $i$th column of the adapted linear layer. In the adaptation step, the gradient of the loss is computed over the set of few-shot examples; differentiating through it requires Hessian-gradient products, which makes MAML an expensive method for large feature extractors $\phi$. For this reason, MAML is rarely used with architectures larger than the standard 4-layer CNN. To scale MAML up to the 12-layer ResNet of our experiments, we implement a data-parallel version of MAML where different GPUs are responsible for different subsets of the few-shot inputs.

To alleviate this computational burden, the authors of ANIL suggest that adapting $w$ alone is sufficient to claim many of MAML's benefits:

$p_\tau(y = y_i \mid x) = \text{softmax}(\phi(x)^\top w'_i)$   (ANIL)
s.t. $w' = w - \alpha \nabla_w \mathcal{L}_\tau.$

Contrasting with MAML, the difference lies in the feature extractor $\phi$ being shared, not adapted, across tasks. In that sense, ANIL resembles metric-based methods, despite its original motivation of approximating MAML. Since $\phi$ is responsible for the bulk of the computation, ANIL becomes increasingly more efficient as the feature extractor grows in size.¹

A representative metric-based algorithm, ProtoNet, also shares $\phi$ across tasks and replaces the final linear classifier with a nearest-neighbour one.
It measures the nearest-neighbour distance between the query embedding $\phi(x)$ and the $i$th class embedding $\phi_i$:

$p_\tau(y = y_i \mid x) = \text{softmax}(-d(\phi(x), \phi_i))$   (ProtoNet)

where $\phi_i = \mathbb{E}_{x \sim p_\tau(\cdot \mid y_i)}[\phi(x)]$ is the average embedding of the few samples with class $y_i$, and $d(\cdot, \cdot)$ is a distance function; common choices include the Euclidean norm and the (negative) cosine similarity. Since the ProtoNet classifier is nonparametric, it can be used with any feature extractor $\phi$. Thus, a simple yet effective baseline, Multiclass, consists of collapsing all train tasks into a single large task and learning a feature extractor via standard one-vs-rest multiclass classification. At test time, we can readily use this learned feature extractor with the ProtoNet classifier to solve unseen tasks.

¹ In our experiments, we saw speed-ups as large as 9.25x over MAML.

3.4 Taskset generation: easy or hard?

This section describes ATG, a method for generating tasksets that requires no additional human knowledge and provides fine-grained control over the transfer difficulty between train and test tasks.

Our goal is to partition classes into train and test sets such that we can control the difficulty of transferring a model trained on one set to the other. At a high level, ATG finds those class partitions based on their class embeddings $f_1, \dots, f_M$. Each class embedding $f_i$ is obtained by averaging the embeddings of all samples from its corresponding class $y_i$:

$f_i = \mathbb{E}_{x \sim p(\cdot \mid y = y_i)}[f(x)],$

where $f$ is a pretrained feature extractor. In practice, we find that pretraining $f$ on the dataset to be partitioned leads to satisfying results, but simpler approaches might also work (e.g., an off-the-shelf model pretrained on ImageNet).

The key feature of ATG is that train and test clusters are encouraged to stay at a prespecified distance from each other in the class embedding space. Intuitively, the farther and tighter the test set, the more difficult it will be for few-shot methods to discriminate between those classes. On the other hand, a test task is easily solved if its classes are dissimilar (i.e., poorly clustered) and each class is fairly similar to classes seen during training (i.e., train and test clusters are adjacent).

To formalize this intuition, we pretend that the class embeddings were sampled from a mixture of two distributions, $p_{\text{train}}(y)$ and $p_{\text{test}}(y)$. We model $p_{\text{train}}$ and $p_{\text{test}}$ as multinomials whose density for class $y_i$ depends on its distance to the distribution's centroid:

$p_{\text{train}}(y = y_i) = \text{softmax}(-\lVert f_i - \mu_{\text{train}} \rVert_2), \quad \text{and} \quad p_{\text{test}}(y = y_i) = \text{softmax}(-\lVert f_i - \mu_{\text{test}} \rVert_2),$

where the centroids $\mu_{\text{train}}$ and $\mu_{\text{test}}$ are learnable parameters. Maximizing the log-likelihood of the mixture $\frac{1}{2}(p_{\text{train}} + p_{\text{test}})$ would be identical to a soft K-means objective if $p_{\text{train}}$ and $p_{\text{test}}$ were Gaussians [137, Chapter 22]. In practice, we preferred multinomials because Gaussians led to numerical instabilities when optimizing the penalized objective (see below) by gradient descent.

We can incentivize $p_{\text{train}}$ and $p_{\text{test}}$ to lie at a given distance $R \ge 0$ from each other by including a penalty $(D(p_{\text{train}} \,\|\, p_{\text{test}}) - R)^2$ based on our choice of statistical divergence $D$. Combining this penalty with the mixture's log-likelihood yields our final penalized clustering objective:

$J = -\sum_{i=1}^{M} \log \tfrac{1}{2}\big(p_{\text{train}}(y_i) + p_{\text{test}}(y_i)\big) + \lambda \big(D(p_{\text{train}} \,\|\, p_{\text{test}}) - R\big)^2,$

where $\lambda \ge 0$ balances the value of clustering against the penalty.
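For concreteness, here is a minimal PyTorch sketch of $J$ (our code; variable names are ours), using the symmetrized KL divergence that serves as $D$ in the experiments below. It assumes the class embeddings have been precomputed and only fits the two centroids.

```python
import torch

def atg_objective(f, mu_train, mu_test, R, lam=1.0):
    """Penalized clustering objective J (sketch).

    f: (M, d) matrix of class embeddings; mu_train, mu_test: learnable centroids.
    The softmax runs over the M classes, matching the multinomial model above.
    """
    p_train = torch.softmax(-torch.norm(f - mu_train, dim=1), dim=0)
    p_test = torch.softmax(-torch.norm(f - mu_test, dim=1), dim=0)
    nll = -torch.log(0.5 * (p_train + p_test)).sum()    # negative mixture log-likelihood
    kl = lambda p, q: (p * (p / q).log()).sum()
    div = kl(p_train, p_test) + kl(p_test, p_train)     # symmetrized KL, our default D
    return nll + lam * (div - R) ** 2

# Fit the centroids by gradient descent; the embeddings f stay fixed.
M, dim = 100, 512
f = torch.randn(M, dim)                                 # placeholder class embeddings
mu_train = torch.randn(dim, requires_grad=True)
mu_test = torch.randn(dim, requires_grad=True)
opt = torch.optim.Adam([mu_train, mu_test], lr=1e-2)
for _ in range(1000):
    opt.zero_grad()
    atg_objective(f, mu_train, mu_test, R=0.64).backward()
    opt.step()
```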
Assuming train and test classes cluster tightly, controlling the transfer difficulty amounts to controlling the distance $D(p_{\text{train}} \,\|\, p_{\text{test}})$ between the two distributions, which we can achieve by adjusting $R$.

Train, validation, and test assignments. Once $J$ is minimized with respect to $\mu_{\text{train}}$ and $\mu_{\text{test}}$, we obtain a solution we can use to partition $y_1, \dots, y_M$. One approach is to assign class $y_i$ to the train set if the ratio $p_{\text{train}}(y_i) / p_{\text{test}}(y_i)$ is greater than 1, and to the test set otherwise.² Partitioning according to this decision rule is theoretically sound but might lead to slightly degenerate solutions in the context of few-shot learning. For example, we observed instances where only 3 classes were assigned to the test set, which prevents evaluation in the 5-ways setting. The rule also does not prescribe how to devise a validation set from the test distribution. To resolve those issues, we select the top 60% scoring classes and assign them to the train set. The remaining 40% of classes are split among validation and test sets in turn, according to their score: the lowest-scoring class is assigned to the test set, the second-lowest to the validation set, and so on. Thus, validation and test classes retain roughly equal probability under $p_{\text{test}}$, and the resulting tasks are of similar difficulty.

² Ties are broken at random, so that if $D(p_{\text{train}} \,\|\, p_{\text{test}}) = 0$, assignments are also random.

3.5 Experiments

Our experiments focus on three aspects. First, we empirically validate our proposed method's ability to control the information overlap between train and test classes. To provide an intuitive picture of our method, we also perform an ablation study on different measures of information and visualize low-dimensional projections of the resulting partitions. Second, we compare the partitions obtained from our method to the ones obtained using semantic class relationships. Our results suggest that the two types of partitions are significantly different; moreover, they confirm that our method is capable of generating partitions that are more challenging than those resting on semantics. Finally, we leverage our partitions to compare gradient-based and metric-based classification methods. Those experiments indicate that metric-based methods, which were recently thought to be sufficient for few-shot learning, tend to underperform their gradient-based counterparts in challenging transfer settings.

3.5.1 Setup

Our experiments build on architectures and datasets widely used in the computer vision and few-shot learning literature. We denote by CNN4 the 4-layer CNN with 64 hidden units described in [209], which we use for few-shot learning experiments on FC100, CIFAR-FS, EMNIST, and LFW10. (The latter two datasets are described below.) On mini-ImageNet and tiered-ImageNet, we use the 12-layer residual network (ResNet12) described in [146], and add the DropBlock layers proposed in [121] for regularization. (Unlike Lee et al., however, we keep the final average pooling layers.) For experiments involving ImageNet-1k, including Birds, Planes, Flowers, and Fungi, we use 3 architectures: a 121-layer DenseNet (DenseNet121) [96], an 18-layer residual network (ResNet18) [89], and a GoogLeNet [214]. For all models, we define the feature extractor to be the architecture up to the last fully-connected layer.

All tasksets built with ATG follow an identical recipe. We pretrain $f$ on the same data used to compute the class embeddings $f_i$.
We minimize $J$ by gradient descent with $\lambda = 1$ and, unless specified otherwise, use the symmetrized version of the KL divergence for $D$.

With this chapter, we also contribute tasksets for two datasets previously not used in the few-shot learning setting: EMNIST and LFW10.

EMNIST. The Extended MNIST dataset (EMNIST; [46]) is a variant of the MNIST dataset consisting of 814,255 grayscale images of 62 handwritten characters (digits, plus the lowercase and uppercase alphabets). Each image contains a single character scaled to 28x28 pixels. Using ATG, we partition its classes into 37 characters for training, 12 for validation, and 13 for testing, filling a niche in the few-shot dataset landscape: EMNIST is lightweight (roughly 610Mb) with few training classes, but plenty of data (>10k samples) available per class.

LFW10. We bootstrap LFW10 from the Labelled Faces in the Wild dataset [97], which contains 13,233 pictures of 5,749 famous personalities. Of those personalities, we select all 158 that have at least 10 images in the dataset (fewer would constrain the few-shot learning setup) and partition them into tasksets of 94 train, 32 validation, and 32 test classes. Each image is rescaled to 62x47 pixels with RGB colors. LFW10 is an example of a dataset for which collecting class relationships is difficult, if not impossible, as it would require imposing a semantic hierarchy over each individual in the dataset.

3.5.2 Controlling information overlap

We first test whether our method is effective in producing tasksets of varying difficulty. To that end, we use ATG to generate tasksets of increasing difficulty on the classes of 5 datasets: CIFAR100, mini-ImageNet, tiered-ImageNet, LFW10, and EMNIST. First, we train a convolutional network over the entire set of classes, using standard cross-entropy minimization. We use a 121-layer DenseNet for CIFAR100, mini-ImageNet, and tiered-ImageNet, and the CNN4 for LFW10 and EMNIST. We then remove the last fully-connected layer of the network and use this feature extractor to compute the mean embedding of each class. Finally, we create train, validation, and test class partitions for various target divergences between train and test classes.

[Figure 3.2: Accuracy of a Multiclass-trained network as we increase the divergence between train and test class distributions. As the divergence increases, accuracy drops, suggesting that the divergence can be used to generate tasksets of varying difficulty.]

We measure transfer difficulty with the 5-ways 5-shots classification accuracy of a model trained with Multiclass on the train taskset only. We use the ResNet12 for ImageNet-based tasks, and the CNN4 for the others. Figure 3.2 reports train, validation, and test accuracies as a function of the divergence for all datasets. Across all datasets, test accuracy decreases as the divergence between train and test class distributions increases. Thus, we conclude that our method is effective in finding partitions of desired transfer difficulty.
We also observe that validation and test accuracies similarly challenge the transfer ability of the Multiclass model, which is expected as both sets of classes come from the same distribution.

PCA visualizations. We perform two more ablative studies to provide further understanding of ATG. In Figure 3.3, we plot the 2-dimensional PCA projection of the CIFAR100 mean class embeddings, together with their assignments and the assigned centroids. We observe that the centroids separate further as the divergence increases from 0.04 to 1.28: the greater the divergence between $p_{\text{train}}$ and $p_{\text{test}}$, the further apart $\mu_{\text{train}}$ and $\mu_{\text{test}}$.

[Figure 3.3: Low-dimensional projections of class embeddings and taskset centroids. As we increase the divergence penalty, the centroids spread further apart. Panels: $D(p_{\text{train}} \| p_{\text{test}})$ = 0.04, 0.32, 0.64, 0.96, 1.28 with $\lVert \mu_{\text{train}} - \mu_{\text{test}} \rVert$ = 0.17, 0.29, 0.32, 0.40, 0.42, respectively.]

Comparing divergences. Although intuitive, this description is not sufficient to explain why ATG works, because it lacks a measure of how difficult it is to discriminate between the classes of a task. To show this, we compare different methods for measuring the distance between train and test class distributions for ImageNet-1k and 4 downstream datasets (Birds, Planes, Fungi, and Flowers). We take a 121-layer DenseNet pretrained on ImageNet-1k and compute the mean class embedding for all 1,000 classes, as well as the ImageNet centroid. Then, for each downstream dataset, we sample 100 tasks, each consisting of 5 classes and 5 samples. Finally, we compute the distance between the train distribution (defined by the ImageNet centroid) and the task distribution (defined by the centroid of the task) over all 1,005 classes.

Table 3.2: Correlation between divergence and accuracy for different choices of divergence D. Measuring the Euclidean distance between centroids performs worst, while the symmetrized KL divergence (used in our other experiments) performs best.

                      Birds   Planes   Fungi   Flowers
Euclidean Distance    -0.50   -0.08    -0.17   -0.33
Wasserstein-2         -0.57   -0.24    -0.23   -0.30
Kullback-Leibler      -0.63   -0.41    -0.51   -0.29
Symmetrized KL        -0.73   -0.43    -0.49   -0.43

Table 3.2 reports the Pearson correlation coefficient between those distance measurements and the task accuracies for all datasets. The symmetrized KL, which we use in our remaining experiments, performs best, while naively measuring the Euclidean distance between centroids performs worst. As a final remark, let us observe that the correlations have fairly low magnitudes, indicating that our method is ill-suited to accurately measuring individual task difficulty. Assessing task similarity and difficulty is an open research question; we refer the reader to the related literature for more details [2, 63, 53, 242].

3.5.3 Comparing semantic vs embedding clusters

We continue the study of our method and zero in on the difference between using class embeddings and using semantics to encode class similarities.

Comparison to WordNet. The first question we ask is whether class embeddings capture information similar to semantics. To that end, we compare hierarchies created by clustering class embeddings to WordNet, the hierarchy over ImageNet-1k classes induced by class semantics. We construct three sets of ImageNet-1k embeddings with three different pretrained architectures.
For each set, we obtain a tree over the classes via Ward clustering. To compare those trees against the WordNet graph of classes, we define the hop distance, the average difference of distances between two classes $a, b$:

$\sum_{a, b} \lvert d_{\text{Clustering}}(a, b) - d_{\text{WordNet}}(a, b) \rvert,$

where $d_{\text{Clustering}}$ and $d_{\text{WordNet}}$ are the normalized minimum numbers of nodes separating $a$ from $b$ in the clustering tree and the WordNet graph, respectively. We use such a cumbersome metric because WordNet contains cycles as well as several vertices of degree 2; intuitively, this metric can be understood as the average difference between the two graphs in the number of "hops" necessary to reach $b$ starting from $a$.

Table 3.3: Average hop distance between WordNet and hierarchies created from ImageNet-1k embeddings (via hierarchical clustering). Regardless of the network architecture, trees constructed from class embeddings are more similar to each other than to WordNet, indicating that class partitioning relies on attributes different from semantic relationships.

              ResNet18   DenseNet121   GoogLeNet
ResNet18      0.0        1.99          2.18
DenseNet121   1.99       0.0           2.26
GoogLeNet     2.18       2.26          0.0
WordNet       8.60       8.35          8.59

Table 3.3 reports the hop distance between WordNet and the clustering trees, when the class embeddings are computed using DenseNet121 and GoogLeNet (embedding sizes 1024) and ResNet18 (embedding size 512). Regardless of embedding size or network architecture, the clustering structures differ from the WordNet structure by a factor of roughly 4, indicating that class embeddings encode attributes significantly different from class semantics.

Anecdotally, we manually inspected the trees generated from embeddings in an attempt to shed some light on what those properties might be. We find that the mean class embeddings tend to encode visual properties. For example, the traffic light and theatre classes are clustered close by (3 hops), due to both containing many red pixels. On the other hand, the orangutan and macaque classes (fairly close semantically) are separated by 9 hops, likely due to the difference in color and texture of the two animals' fur.

Comparison across methods. Having established that ATG can generate tasksets different from those built on semantics, we answer the next natural question: how do those tasksets compare to existing tasksets? Accordingly, we select the easy ($D = 0.04$) and hard ($D = 0.96$) tasksets generated with ATG and train networks with 4 popular few-shot methods (MAML, ANIL, ProtoNet, Multiclass) in the 5-ways 1-shot and 5-ways 5-shots settings on the tasksets of CIFAR100, mini-ImageNet, and tiered-ImageNet. To ensure fair comparison, we do not augment the data and use the exact same architecture for all methods within a dataset (CNN4 for CIFAR100, ResNet12 for mini/tiered-ImageNet³). Each (method, taskset) pair is tuned independently, using a logarithmically spaced grid search. For MAML and ANIL, we measure accuracy after 5 adaptation steps.

Table 3.4 reports the test accuracy obtained at the best validation iteration. In almost all scenarios, ATG is able to generate tasksets that are as, or even more, challenging than the tasksets constructed from semantic relationships. This makes ATG a compelling solution for taskset generation in settings where additional inter-class information is hard or even impossible to obtain. Conversely, we see that the easier tasksets generated by our method (the ones with 0.04 divergence) are significantly easier, sometimes even approaching purely random assignments (i.e.,
with a di- vergence of 0) 3.5.4 Is a good embedding really enough? A closer inspection of the results in Table 3.4 hints at an unexpected trend: MAML becomes in- creasingly more competitive as the transfer from train to test classes become more challenging. To verify this hypothesis, we train all 4 few-shot algorithms on all tasksets of 4 datasets highlighted in Section 3.5.2; then, we compute the slope of the regression line between divergence and test accuracies. Table 3.5 reports those slopes, and confirms our hypothesis: gradient-based methods degrade slower than metric-based ones. In turn, those experiments suggest a follow-up hypothesis: when train-test transfer is challeng- ing enough, methods that adapt their embedding function should outperform those that do not. We verify this hypothesis in Table 3.1 where we compare accuracies of each algorithms on the most challenging taskset of each dataset. We also include a Finetune entry, which corresponds to Multiclass with the embedding function updated by 5 gradient steps on test tasks. Confirming our hypothesis, MAML and Finetune dominate on all datasets. 3 Training a ResNet12 with MAML on tiered-ImageNet was still too computationally expensive, despite our data- parallel implementation. 45 3.6 Conclusion This chapter focuses on the evaluation and analysis of few-shot classification methods. Accord- ingly, we propose ATG to generate tasksets of desired transfer difficulty, and with no requirement for additional human information about classes or their relationships. After empiricaly validating ATG, we generate tasksets to study the two main families of few-shot classification algorithms: gradient-based and metric-based. As opposed to recent work suggesting that a good feature ex- tractor might be enough for few-shot classification [219, 169], we find that gradient-based methods outperform metric-based ones, especially when transfer is challenging. Although seemingly contradicting, we believe the two hypotheses are compatible: when ap- plied out of domain the metric implicitly learned by ProtoNet and Multiclass will likely require adaptation, e.g. to adjust for domain shift or to discover new features. For similar reasons and as highlighted in our experiments, ANIL’s approximation of MAML will break since it does not adapt its feature extractor to avoid expensive second-order derivatives. On the other hand, those methods don’t have to pay the price of adaptation when knowledge transfer is sufficient to reach good performance, and can thus be scald to much larger datasets. In fact, when test tasks are similar to training, MAML’s adaptation by gradient descent can even lead to overfitting especially when working with limited labelled data. We hope our contribution can help researchers analyse few-shot learning methods and answer some of the above questions. Future research directions for ATG include extensions to regression and multi-label classification tasks. 46 Table 3.4: Comparing classification accuracy of different tasksets for a same dataset across popular few-shot learning methods. Our proposed method, ATG, is capable of generating simple tasksets (close to random partitioning) as well as challenging ones. In particular, it is often more challenging than tasksets built with class semantics (denoted with a † ), but unlike those it does not require additional information. Bolded results indicate most challenging taskset for a given method. 
                                          5-ways 1-shot                            5-ways 5-shots
Dataset          Taskset       Backbone   MAML    ANIL    ProtoNet  Multiclass     MAML    ANIL    ProtoNet  Multiclass
CIFAR100         CIFAR-FS      CNN4       56.96%  54.47%  54.97%    54.82%         72.99%  69.44%  72.00%    68.83%
                 FC100†        CNN4       36.99%  35.54%  36.25%    36.59%         51.48%  50.12%  51.16%    51.22%
                 Random        CNN4       56.66%  51.43%  51.25%    52.39%         73.16%  69.81%  71.12%    69.24%
                 Easy (ours)   CNN4       48.35%  46.03%  46.68%    46.92%         65.19%  59.43%  63.66%    63.97%
                 Hard (ours)   CNN4       35.86%  33.55%  35.59%    35.44%         55.73%  49.27%  50.46%    53.02%
mini-ImageNet    Original      ResNet12   58.80%  55.02%  56.68%    57.12%         72.56%  64.74%  71.05%    71.88%
                 Random        ResNet12   53.12%  50.34%  51.58%    51.88%         72.25%  62.77%  69.70%    72.03%
                 Easy (ours)   ResNet12   53.75%  51.22%  52.25%    52.53%         67.92%  61.98%  66.64%    70.27%
                 Hard (ours)   ResNet12   44.87%  42.54%  41.83%    44.62%         62.25%  54.42%  60.36%    61.86%
tiered-ImageNet  Original†     ResNet12   n/a     56.99%  61.59%    66.75%         n/a     74.81%  80.02%    82.53%
                 Random        ResNet12   n/a     62.69%  69.35%    70.89%         n/a     80.60%  85.46%    86.98%
                 Easy (ours)   ResNet12   n/a     59.53%  64.39%    68.16%         n/a     76.57%  84.33%    85.50%
                 Hard (ours)   ResNet12   n/a     56.19%  62.91%    65.48%         n/a     73.24%  78.64%    81.90%

Table 3.5: Slope of the regression line between divergence and accuracy (in % points) for different methods. MAML degrades at a slower rate than metric-based methods, suggesting that it is better suited when transfer is challenging.

             CIFAR100   mini-ImageNet   LFW10   EMNIST
MAML         -11.53     -6.68           -3.10   -2.55
ANIL         -12.44     -7.32           -3.61   -5.70
ProtoNet     -14.88     -7.18           -4.45   -4.71
Multiclass   -12.41     -7.41           -4.91   -3.23

Chapter 4
Solving New Reinforcement Learning Tasks without Meta-Learning

4.1 Introduction

Learning representations via pretraining is a staple of modern transfer learning. Typically, a feature encoder is pretrained on one or a few source task(s). Then it is either frozen (i.e., the encoder stays fixed) or finetuned (i.e., the parameters of the encoder are updated) when solving a new downstream task [258]. While the choice between freezing and finetuning is application-specific, finetuning generally outperforms freezing when there are sufficient (labeled) data and compute. This pretrain-then-transfer recipe has led to many success stories in vision [175, 40], speech [5, 3], and NLP [30, 42].

For reinforcement learning (RL), however, finetuning is a costly option, as the learning agent needs to collect its own data specific to the downstream task. Moreover, when the source tasks are very different from the downstream task, the first few updates of finetuning destroy the representations learned on the source tasks, cancelling all potential benefits of transferring from pretraining. For those reasons, practitioners often choose to freeze representations, thus completely preventing finetuning.

But representation freezing has its own shortcomings, especially pronounced in visual RL, where the (visual) feature encoder can be pretrained on existing image datasets such as ImageNet [50] and even collections of web images. Such generic but easier-to-annotate datasets are not constructed with downstream (control) tasks in mind, and the pretraining does not necessarily capture important attributes needed to solve those tasks. For example, the downstream embodied AI task of navigating around household items [189, 105] requires knowing the precise size of the objects in the scene. Yet this information is not required when pretraining on visual object categorization tasks, resulting in what is called negative transfer, where a frozen representation hurts downstream performance.
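The three transfer modes discussed in this chapter are easy to picture in code. The sketch below is ours (PyTorch), not the dissertation's implementation; `make_encoder` is a hypothetical constructor for the visual feature encoder, and the 512-dimensional feature size is an assumption.

```python
import torch.nn as nn
from torch.optim import Adam

def build_agent(n_actions, feat_dim=512, pretrained_encoder=None, freeze=False):
    """Assemble a policy under three transfer modes (names ours):
    de novo (pretrained_encoder=None), frozen (freeze=True), finetuned (freeze=False).
    """
    encoder = pretrained_encoder if pretrained_encoder is not None else make_encoder()
    if freeze:
        encoder.requires_grad_(False)          # frozen: encoder excluded from RL updates
    head = nn.Linear(feat_dim, n_actions)      # task-specific policy head, always trained
    trainable = list(head.parameters()) + ([] if freeze else list(encoder.parameters()))
    return nn.Sequential(encoder, head), Adam(trainable, lr=3e-4)
```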
More seriously, even when the (visual) representation needed for the downstream task is known a priori, it is unclear whether learning it from the source tasks and then freezing it should be preferred to finetuning, as shown in Figure 4.1. On the two left plots, freezing representations (Frozen) underperforms learning representations using only downstream data (De Novo). On the two right plots, we observe the opposite outcome. Finetuning representations (Finetuned) performs well overall, but fails to unequivocally outperform freezing on the rightmost plots.

Contributions. When should we freeze representations, when do they require finetuning, and why? This chapter answers those questions through several empirical studies on visual RL tasks, ranging from simple game and robotic tasks [216, 217] to photo-realistic Habitat domains [189]. Our studies highlight properties of finetuned representations which improve learnability. First, they are more consistent in clustering states according to the actions they induce on the downstream task; second, they are more robust to noisy state observations. Inspired by these empirical findings, we propose PiSCO, a representation finetuning objective which encourages robustness and consistency with respect to the actions the representations induce.

We also show that visual feature encoders first compute task-agnostic information and then refine this information to be task-specific (i.e., predictive of rewards and / or dynamics) — a well-known lesson from computer vision [258] but, to the best of our knowledge, never demonstrated for RL so far. We suspect that finetuning with RL destroys the task-agnostic (and readily transferable) information found in the lower layers of the feature encoder, thus cancelling the benefits of transfer. To retain this information, we show how to identify transferable layers, and propose to freeze those layers while adapting the remaining ones with PiSCO. This combination yields excellent results on all testbeds, outperforming both representation freezing and finetuning.

[Figure 4.1: four panels of transfer curves for De Novo, Frozen, and Finetuned — (a) MSR Jump (rewards, with an Upper Bound line), (b) DeepMind Control Walker-Run (rewards), (c) Habitat Gibson (SPL), (d) Habitat Matterport3D (SPL).]

Figure 4.1: When should we freeze or finetune pretrained representations in visual RL? Reward and success weighted by path length (SPL) transfer curves on MSR Jump, DeepMind Control, and Habitat tasks. Freezing pretrained representations can underperform no pretraining at all (Figures 4.1a and 4.1b) or outperform it (Figures 4.1c and 4.1d). Finetuning representations is always competitive, but fails to significantly outperform freezing on visually complex domains (Figures 4.1c and 4.1d). Solid lines indicate the mean over 5 random seeds; shaded areas denote 95% confidence intervals. See Section 4.3.4 for details.

4.2 Related works and background

Learning representations for visual reinforcement learning (RL) has come a long way since its early days. Testbeds have evolved from simple video games [107, 147] to self-driving simulators [202, 57], realistic robotics engines [217, 141], and embodied AI platforms [189, 105].
Visual RL algorithms have similarly progressed, and can now match human efficiency and performance on the simpler video games [85, 256], and control complex simulated robots in a handful of hours [253, 117, 87]. In spite of this success, visual representations remain challenging to transfer in RL [118]. Prior work shows that learned representations can be surprisingly brittle, and fail to generalize to minor changes in pixel observations [240]. This perspective is often studied under the umbrella of generalization [261, 45, 159] or adversarial RL [163, 103]. In part, those issues arise due to our lack of understanding of what can be transferred and how to learn it in RL. Others have argued for a plethora of representation pretraining objectives, ranging from capturing policy values [125, 131] and summarizing states [142, 200, 1, 129] to disentanglement [93] and bi-simulation metrics [262, 35]. Those objectives can also be aggregated in the hope of learning more generic representations [78, 251]. Nonetheless, it remains unclear which of these methods should be preferred to learn generic transferable representations.

Those issues are exacerbated when we pretrain representations on generic vision datasets (e.g., ImageNet [50]), as in our Habitat experiments. Most relevant to this setting are the recent works of Xiao et al. [247] and Parisi et al. [160]: both pretrain their feature encoder on real-world images and freeze it before transferring to robotics or embodied AI tasks. However, neither reports satisfying results, as both struggle to outperform a non-visual baseline built on top of proprioceptive states. This observation indicates that more work is needed to successfully apply the pretrain-then-transfer pipeline to visual RL. A potentially simple source of gains comes from finetuning those generic representations on the downstream task, as suggested by Yamada et al. [250]. While this is computationally expensive, prior work has hinted at the utility of finetuning in visual RL [200, 230].

In this work, we zero in on representation freezing and finetuning in visual RL. Unlike prior representation learning work, our goal is not to learn generic representations that transfer; rather, we would like to specialize those representations to a specific downstream task. We contribute in Section 4.3 an extensive empirical analysis highlighting important properties of finetuning, and take the view that finetuning should emphasize features that easily discriminate between good and bad actions on the downstream task — in other words, finetuning should ease decision making. Building on this analysis, we propose a novel method to simplify finetuning, and demonstrate its effectiveness on challenging visual RL tasks where naive finetuning fails to improve upon representation freezing.

[Figure 4.2: (a) an annotated MSR Jump observation; (b) a diagram of the PiSCO objective — two data-augmented views of a state pass through the encoder (and projector) before their induced action distributions are compared.]

Figure 4.2: (a) Annotated observation from MSR Jump. In this simple game, the optimal policy consists of moving right until the agent (white) is 14 pixels away from the obstacle (gray), at which point it should jump. (b) Diagram of our proposed PiSCO consistency objective. It encourages the policy's actions to be consistent under perturbations of state s, as measured through the actions induced by the policy. See Section 4.4 for details.

4.3 Understanding when to freeze and when to finetune

We present an empirical analysis of representation freezing and finetuning.
In Section 4.3.4, we detail the results from Figure 4.1 mentioned above. Then, we show in Section 4.3.5 a surprising result on the MSR Jump tasks: although freezing the pretrained representations results in negative transfer, those representations are sufficiently informative to perfectly solve the downstream task. This result indicates that capturing the relevant visual information through pretraining is not always effective for RL transfer. Instead, Section 4.3.6 shows that representations need to emphasize task-relevant information to improve learnability – an insight we explore in Section 4.3.7 and build upon in Section 4.4 to motivate a simple and effective approach to finetuning.

4.3.1 MSR Jump [216]

The agent (a gray block) needs to cross the screen from left to right by jumping over an obstacle (a white block). The agent's observations are video game frames displaying the entire world state (see Figure 4.2a), based on which it can choose to "move right" or "jump". We generate 140 tasks by changing the position of the obstacle vertically and horizontally; half of them are used for pretraining, half for evaluation. We train the convolutional actor-critic agent from Mnih et al. [148] with PPO [199] for 500 iterations on all pretraining tasks, and transfer its 4-layer feature encoder to jointly solve all evaluation tasks. As in pretraining, we train the actor and critic heads of the agent with PPO for 500 iterations during transfer.

The agent always starts on the left-hand side and has to jump over a single gray obstacle. The agent observes the entire screen from pixel values, and the only factors of variation are the x and y coordinates of the obstacle (i.e., its horizontal position and the floor height). Its only possible actions are to move right or jump. The jump dynamics are shared across tasks, and mid-air actions don't affect the subsequent states until the agent has landed (i.e., all jumps go up by 15 pixels diagonally, and down by 15 pixels diagonally). Our experiments carefully replicate the reward and dynamics of Tachet des Combes et al. [216], save for one aspect: we increase the screen size from 64 pixels to 84 pixels, as this significantly accelerated training with the DQN [147] feature encoder.

4.3.1.1 Pretraining setup

Source tasks. We pretrain on 70 source tasks, varying both the obstacle x (from 15 to 50 pixels) and y (from 5 to 50) coordinates, every 5 pixels. We use neither frame stacking nor action repeats.

Learning agent. We train with PPO [199] for 500 iterations, where each iteration collects 80 episodes from randomly sampled source tasks. We use those episodes to update the agent 320 times with a batch size of 64, a discount factor of 0.99, and policy and value clipping of 0.2. The learning rate starts at 0.0005 and is decayed by 0.99 once every 320 updates. The agent consists of the 4-layer DQN feature encoder and linear policy and value heads. All models use GELU [92] activations.

4.3.1.2 Transfer setup

Downstream tasks. At transfer time, we replace the 70 source tasks with 70 unseen downstream tasks. Those downstream tasks are obtained by shifting the obstacle position (x and y coordinates) of each source task by 2 pixels. For example, if a source task has an obstacle at x = 20 and y = 30, there is a corresponding downstream task at x = 22 and y = 32.

Learning agent. The learning agent is identical to the one used on the source tasks, but rather than randomly initializing the weights of the feature encoder, we use the final weights from pretraining. As mentioned above, we also introduce one additional hyper-parameter, as we decouple the learning rate of the feature encoder from that of the policy / value heads. For Finetuned, these learning rates are 0.001 for the heads and 0.0001 for the feature encoder; both learning rates are decayed during finetuning.
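The decoupled learning rates above map naturally onto optimizer parameter groups. The snippet below is a minimal sketch of this setup, assuming hypothetical `encoder`, `policy_head`, and `value_head` modules; it illustrates the mechanism rather than our exact implementation.

    import torch
    import torch.nn as nn

    # Stand-in modules for the agent described above: a small convolutional
    # encoder and linear policy / value heads.
    encoder = nn.Sequential(nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.GELU(),
                            nn.Flatten(), nn.LazyLinear(512))
    policy_head = nn.Linear(512, 2)  # "move right" or "jump"
    value_head = nn.Linear(512, 1)

    # Decoupled learning rates via parameter groups: the heads learn 10x
    # faster (0.001) than the pretrained encoder (0.0001).
    optimizer = torch.optim.Adam([
        {"params": encoder.parameters(), "lr": 1e-4},
        {"params": list(policy_head.parameters())
                 + list(value_head.parameters()), "lr": 1e-3},
    ])

    # Both learning rates are decayed during finetuning.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)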
4.3.2 DeepMind Control [217]

This robotics suite is based on the MuJoCo [220] physics engine. We use visual observations as described by Yarats et al. [252], and closely replicate their training setup. The learning agent consists of action-value and policy heads, together with a shared convolutional feature encoder whose representations are used in lieu of the image observations. We pretrain with DrQ-v2 [253] on a single task from a given domain, and transfer only the feature encoder to a different task from the same domain. For example, our Walker agent is pretrained on Walker-Walk and transfers to Walker-Run.

The DeepMind Control tasks build on top of the MuJoCo physics engine [220] and implement a suite of robotic domains. Our experiments focus on Walker, Cartpole, and Hopper. Walker and Hopper are robot locomotion domains, while Cartpole is the classic cartpole environment. On all domains, the agent receives an agent-centric 84×84 RGB visual observation, and controls its body with continuous actions. Our task implementation exactly replicates the one from DrQ [252].

4.3.2.1 Pretraining setup

Source task. On Walker, the source task is Walk, where the agent is tasked to walk in the forward direction. On Cartpole, it is Balance, where the pole starts in a standing position and the goal is to balance it. And on Hopper, it is Stand, where the agent needs to reach a (still) standing position. For all tasks, we keep a stack of the last 3 images and treat those as observations.

Learning agent. Our learning agent is largely inspired by DrQ-v2 [253], and comprises a 4-layer convolutional feature encoder, a policy head, and a twin action-value head. Both the policy and action-value heads consist of a projection layer, followed by LayerNorm [13] normalization, and a 2-layer MLP with 64 hidden units and GELU [92] activations. We train with DrQ-v2 for 200k iterations, starting the policy standard deviation at 1.0 and decaying it by 0.999954 every iteration until it reaches 0.1. We use a batch size of 512, and use neither n-step bootstrapping nor delayed updates. For pretraining, we always use a learning rate of 3e-4 and default Adam [104] hyper-parameters.

4.3.2.2 Transfer setup

Downstream task. On Walker, we transfer to Run, where the agent needs to run in the forward direction. On Cartpole, the downstream task is Swingup, where the pole starts in a downward position and should be swung and balanced upright. On Hopper, we use Hop, where the agent moves in the forward direction by hopping.

Learning agent. We use the same learning agent as for pretraining, but initialize the feature encoder with the weights obtained from pretraining. We also add a learning rate hyper-parameter when finetuning the feature encoder, which is tuned independently for each setting and method.
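To make the head architecture just described concrete, here is a minimal PyTorch sketch of one such head (projection, LayerNorm, then a 2-layer MLP with 64 hidden units and GELU); the representation and output dimensions are illustrative assumptions, not values from our experiments.

    import torch.nn as nn

    def make_head(repr_dim: int, out_dim: int, hidden: int = 64) -> nn.Sequential:
        """Projection -> LayerNorm -> 2-layer MLP with GELU, as described above."""
        return nn.Sequential(
            nn.Linear(repr_dim, hidden),   # projection layer
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    # e.g., a policy head over 6 action dimensions from a 512-d representation
    policy_head = make_head(repr_dim=512, out_dim=6)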
4.3.3 Habitat [189]

In this setting, we pretrain a ConvNeXt (tiny) [133] to classify images from the ImageNet [50] dataset. We use the "trunk" of the ConvNeXt as a feature encoder (i.e., we discard the classification head) and transfer it to either Gibson [206] or Matterport3D [36] scenes simulated with the Habitat embodied AI simulator. The tasks in Habitat consist of navigating to a point in an indoor scene (i.e., PointNav) from visual observations, aided with GPS and compass sensing. Our implementation replaces the residual network from "Habitat-Lab" [215] with our ConvNeXt¹, and uses the provided DDPPO [237] implementation for transfer training. We train for 5M steps on Gibson and 7M steps on Matterport3D, include all scenes from the respective datasets, and report success weighted by path length (SPL).

On Habitat, the goal of the agent is to navigate indoor environments to reach a desired location. To solve this point navigation task, it relies on two sensory inputs: a GPS with compass, and visual observations. The visual observations are 256×256 RGB images (n.b., without depth information). Our codebase builds on the task and agent definitions of "Habitat-Lab" [215], and only modifies them to implement our proposed methods.

4.3.3.1 Pretraining setup

Source task. We use the ImageNet [50] dataset as the source task to pretrain our feature encoder.

Learning agent. The learning agent is a ConvNeXt-tiny [133] classifier, trained to discriminate between ImageNet-1K classes. We directly use the pretrained weights provided by the ConvNeXt authors, available at: https://github.com/facebookresearch/ConvNeXt

4.3.3.2 Transfer setup

Downstream tasks. For transfer, we load maps from the Gibson [246] and Matterport3D [36] datasets. Both are freely available for academic, non-commercial use at:
• Gibson: https://github.com/StanfordVL/GibsonEnv/
• Matterport3D: https://github.com/niessner/Matterport/

¹ We choose a ConvNeXt because it removes batch normalization and dropout layers after pretraining.

[Figure 4.3: three panels on MSR Jump evaluation tasks — (a) action accuracy and distance estimation error for Frozen vs. Finetuned; (b) cluster purity for Random, Frozen, and Finetuned; (c) classification error under increasing noise magnitude.]

Figure 4.3: (a) Transfer can still fail even though frozen representations are informative enough. On MSR Jump tasks, we can perfectly regress the optimal actions (accuracy = 1.0) and the agent-obstacle distance (mean squared error < 1.0) with frozen or finetuned representations. Combined with Figure 4.1a, those results indicate that capturing the right information is not enough for RL transfer. (b) Finetuned representations yield purer clusters. Given a state, we measure the expected purity of the cluster consisting of the 5 closest neighbours of that state in representation space. For Finetuned representations, this metric is significantly higher (98.75%) than for Frozen (91.41%) or Random (82.98%) ones, showing that states which beget the same actions are closer to one another and thus easier to learn. (c) Finetuned representations are more robust to perturbations. For source and downstream tasks, the classification error (1 − action accuracy) degrades significantly more slowly for Finetuned representations than for Frozen ones under increasing data augmentation noise, suggesting that robustness to noise improves learnability. See Section 4.3.5 for details.

Learning agent. We take the pretrained ConvNeXt classifier, discard the linear classification head, and only keep its trunk as our feature encoder.
We train a 2-layer LSTM policy head and a linear value head on top of this feature encoder with DDPPO [237] for 5M iterations on Gibson and 10M on Matterport3D. For both tasks, we collect 128 steps with 4 workers for every update, which consists of 2 PPO epochs with a batch size of 4. We set PPO's entropy coefficient to 0.01, the value loss weight to 0.5, GAE's τ to 0.95, the discount factor to 0.99, and clip gradients to an ℓ2-norm of 0.2. The learning rate for the policy and value heads is set to 2.4e-4, and the feature encoder's learning rate to 1e-4 for Frozen+Finetuned. For Frozen+Varnish, we use a single learning rate of 1e-4 for the feature encoder and policy / value heads (but 2.5e-4 gave similar results). We never decay learning rates.

4.3.4 When freezing or finetuning works

As a first experiment, we take the pretrained feature encoders from the previous sections and transfer them to their respective downstream tasks. We consider three scenarios (see the sketch at the end of this subsection):

• De Novo: the feature encoder is randomly initialized (not pretrained) and its representations are learned on the downstream task only. The feature encoder might use different hyper-parameters (e.g., learning rate, momentum, etc.) from the rest of the agent.
• Frozen: the feature encoder is pretrained and its representations are frozen (i.e., we never update the weights of the feature encoder).
• Finetuned: the feature encoder is pretrained and its representations are finetuned (i.e., we update the weights of the feature encoder using the data and the algorithm of the downstream task). As for De Novo, the hyper-parameters of the feature encoder are tuned separately from the other hyper-parameters of the model.

Figure 4.1 displays those results. Comparing De Novo and Frozen shows that freezing representations works well on Habitat (see Figures 4.1c and 4.1d), likely because those tasks demand significantly richer visual observations. However, freezing representations completely fails on the simpler MSR Jump and DeepMind Control tasks (see Figures 4.1a and 4.1b). The pretrained feature encoders don't provide any benefit over randomly initialized ones, and even significantly hurt transfer on MSR Jump. The latter is especially surprising because there is very little difference between pretraining and evaluation tasks (only the obstacle position changes). Those results beg the question: why does transfer fail so badly on MSR Jump?

On the other hand, Finetuned performs reasonably well across all benchmarks, but those results hide an important caveat: finetuning can sometimes collapse to representation freezing. This was the case on Habitat, where our hyper-parameter search resulted in a tiny learning rate for the feature encoder (1e-6, the smallest in our grid search), effectively preventing the encoder from adapting. We hypothesize that on those large-scale and visually challenging tasks, Finetuned offers no improvement over Frozen because RL updates are too noisy for finetuning. Thus, we ask: how can we help stabilize finetuning in RL?

We answer both questions in the remainder of this chapter. We take a closer look at MSR Jump tasks and explain why transfer failed. We show that a core role of finetuning is to emphasize task-specific information, and that layers which compute task-agnostic information can be frozen without degrading transfer performance. In turn, this suggests an intuitive strategy to help stabilize RL training: freezing task-agnostic layers, and only finetuning task-specific ones.
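The three scenarios above differ only in how the encoder is initialized and whether its weights receive gradients. Here is a simplified sketch under those assumptions; the checkpoint path and module construction are hypothetical.

    import torch
    import torch.nn as nn

    def build_encoder(scenario: str) -> nn.Module:
        """Configure the feature encoder for De Novo, Frozen, or Finetuned."""
        encoder = nn.Sequential(  # stand-in encoder
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.GELU()
        )
        if scenario in ("Frozen", "Finetuned"):
            # hypothetical checkpoint from pretraining
            encoder.load_state_dict(torch.load("pretrained_encoder.pt"))
        if scenario == "Frozen":
            for p in encoder.parameters():
                p.requires_grad = False  # never updated on the downstream task
        # De Novo and Finetuned keep requires_grad=True; their encoder
        # learning rate is tuned separately from the policy / value heads.
        return encoder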
4.3.5 Freezing fails even when learned representations are useful

We now show that, on MSR Jump, the pretrained representations are informative enough to perfectly solve all evaluation tasks. Due to the simplicity of the dynamics, and because the size of the agent, the size of the obstacle, and the jump height are shared across all tasks, we can manually devise the following optimal policy:

$$\pi^\star = \begin{cases} \text{Jump}, & \text{if the agent is 14 pixels to the left of the obstacle,} \\ \text{Move right}, & \text{otherwise.} \end{cases}$$

We empirically confirmed that this policy reaches the maximal reward (81) on all pretraining and evaluation tasks. Immediately, we see that the distance from the agent to the obstacle is the only information that matters to solve MSR Jump tasks. This observation suggests the following experiment.

To measure the informativeness of pretrained representations, we regress the distance between the agent and the obstacle from the pretrained representations. If the estimation error (i.e., mean squared error) is lower than 1.0 pixel, the representations are informative enough to perfectly solve MSR Jump tasks. To further validate this scenario, we also regress the optimal actions of the simple policy above from the pretrained representations. Again, if the accuracy is perfect, the representations are good enough. We repeat this experiment for finetuned representations, and compare the results against one another.

Figure 4.3a reports results for pretrained and finetuned (i.e., after transfer) representations. We regress the distance and optimal actions with linear models from evaluation task observations, using mean squared error and binary cross-entropy losses. Surprisingly, both pretrained and finetuned representations can perfectly regress the distance and optimal actions: they get sub-pixel estimation errors (0.94 and 0.79, respectively) and perfect accuracy (both 100%). Combined with the previous section, these results indicate that capturing the right visual information is not enough for successful transfer in RL. They also put forward the following hypothesis: one role of finetuning is to emphasize information that eases decision making. This hypothesis is supported by the lower distance error for Finetuned in the right panel of Figure 4.3a.

4.3.6 Finetuning improves learnability and robustness to noise

How is information refined for decision making? Our next set of experiments answers this question by highlighting differences between pretrained and finetuned representations.

First, we measure the ability of each feature encoder to cluster together states assigned to identical actions. Intuitively, we expect states that require similar actions to have similar representations. To that end, we compute the average purity when clustering a state with its 5 closest neighbors in representation space. In other words, given a state s, we find the 5 closest neighbors of s in representation space and count how many of them are assigned the same optimal action as s. Figure 4.3b reports the average cluster purity for random, pretrained, and finetuned representations. We find that Finetuned representations yield clusters with significantly higher purity (98.75%) than Frozen (91.41%) or random (82.98%) ones, which explains why PPO converges more quickly when representations are finetuned: it is easier to discriminate between the states requiring "jump" vs. "move right" if the representations for those states cluster homogeneously.
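The purity metric itself is easy to state in code. The sketch below assumes `reps` is an (N, d) array of state representations and `actions` the corresponding optimal-action labels; it is an illustration of the 5-nearest-neighbor purity described above, not our exact implementation.

    import numpy as np

    def knn_purity(reps: np.ndarray, actions: np.ndarray, k: int = 5) -> float:
        """Average fraction of a state's k nearest neighbors (in representation
        space) that share its optimal action."""
        dists = np.linalg.norm(reps[:, None] - reps[None, :], axis=-1)
        np.fill_diagonal(dists, np.inf)  # exclude the state itself
        neighbors = np.argsort(dists, axis=1)[:, :k]
        same_action = actions[neighbors] == actions[:, None]
        return same_action.mean()

    # Toy usage: 100 random 8-d representations with binary actions.
    reps = np.random.randn(100, 8)
    actions = np.random.randint(0, 2, size=100)
    print(knn_purity(reps, actions))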
Second, we uncover another property of finetuned representations: they are more robust to perturbations. We replicate the optimal action classification setup from Figure 4.3a, but perturb states by randomly rotating them. For each image, the degree of the rotation is sampled uniformly from the range (−x, x). Figure 4.3c shows how classification accuracy degrades as we let x increase from 0 to 12. Both Frozen and Finetuned degrade with more noise, but Finetuned is more robust and degrades more slowly.

Those results point to finetuned representations that consistently map to similar actions and are robust to perturbations. They set the stage for Section 4.4, where we introduce PiSCO, a policy consistency objective which builds on these insights. In the next section, we continue our analysis and further investigate how information gets refined in each layer of the feature encoder.

4.3.7 When and why is representation finetuning required?

As the previous section uncovered important pitfalls of representation freezing, we now turn to finetuning in the hope of understanding why it succeeds when it does. To that end, we dive into the representations of individual layers in Finetuned feature encoders and compare them to the Frozen and random representations. We take inspiration from the previous section and hypothesize that "purpose drives adaptation"; that is, a layer only needs finetuning if it computes task-specific information that eases decision making on the downstream task. Conversely, layers that compute task-agnostic information can be frozen, thus potentially stabilizing RL training. To test this hypothesis, we conduct two layer-by-layer experiments: linear probing and incremental freezing.

With layer-by-layer linear probing, we aim to measure how relevant each layer is when choosing an action [56, 268]. To start, we collect a dataset of 10,000 state-action pairs and their corresponding action values with the Finetuned policy on the downstream task. The action values are computed with an action-value function trained to minimize the temporal difference error on the collected state-action pairs. We then proceed as follows for each layer l in the feature encoder. First, we store the representations computed by layer l, and extract their most salient features by projecting them to a 50-dimensional vector with PCA. Second, we regress action values from the projected representations and their corresponding actions using a linear regression model. Finally, we report the mean squared error of this linear model at layer l. Intuitively, the lower the linear probing error, the easier it is to predict how good an action is for a given state². We repeat this entire process for the Frozen and Finetuned feature encoders, as well as for one with randomly initialized weights (i.e., Random).
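The probing step just described can be summarized with a few lines of scikit-learn. This is a sketch under assumed inputs: `layer_reps` a hypothetical (10000, d) array of layer-l representations, `actions` the taken actions, and `action_values` the regression targets.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def probe_layer(layer_reps: np.ndarray, actions: np.ndarray,
                    action_values: np.ndarray) -> float:
        """Project layer representations with PCA, then linearly regress
        action values from the projections and the taken actions."""
        proj = PCA(n_components=50).fit_transform(layer_reps)
        features = np.concatenate([proj, actions.reshape(len(actions), -1)], axis=1)
        model = LinearRegression().fit(features, action_values)
        return mean_squared_error(action_values, model.predict(features))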
[Figure 4.4: two panels on MSR Jump evaluation tasks — (a) per-layer value estimation error for Random, Frozen, and Finetuned; (b) area under the reward curve when freezing up to a given layer.]

Figure 4.4: Frozen layers that retain the same information as finetuned layers can be kept frozen. (a) Mean squared error when linearly regressing action values from representations computed by a given layer. Some early layers are equally good at predicting action values before or after finetuning. This suggests those layers can be kept frozen, potentially stabilizing training. See Section 4.3.7 for details. (b) Area under the reward curve when freezing up to a given layer and finetuning the rest. Freezing early layers does not degrade performance.

Results for this experiment are shown in Figure 4.4a. On all testbeds, both the Frozen and Finetuned curves trend downwards and lie significantly lower than Random, indicating, as expected, that representations specialize for decision making as we move up the feature encoder. We also notice that the first few layers of Frozen and Finetuned track each other closely before Frozen starts to stagnate. On MSR Jump, this happens after the last convolutional layer (Conv3), and on DeepMind Control after the second one (Conv2). This evidence further supports our previous insight that early layers compute task-agnostic information, and suggests a new hypothesis: can we freeze the early layers without degrading finetuning performance? If so, this strategy might help stabilize finetuning in visually complex tasks like Habitat.

We confirm this hypothesis by combining freezing and finetuning in our second layer-by-layer experiment. For each target layer l, we take the pretrained feature encoder and freeze all layers up to and including l; the remaining layers are finetuned using the same setup as in Figure 4.1.

² We found action values to be better targets to measure ease of decision making than, say, action accuracy, which is upper bounded by 1.0 as in Figure 4.3a.

[Figure 4.5: six panels replicating Figure 4.4 on Walker-Run, Cartpole-Swingup, and Hopper-Hop — (top) per-layer value estimation error for Random, Frozen, and Finetuned; (bottom) area under the reward curve when freezing up to a given layer.]

Figure 4.5: Frozen layers that retain the same information as finetuned layers can be kept frozen. Replicates Figure 4.4a on the Walker, Cartpole, and Hopper DeepMind Control domains. (top) Mean squared error when linearly regressing action values from representations computed by a given layer. (bottom) Area under the reward curve when freezing up to a given layer and finetuning the rest. Freezing early layers sometimes improves performance (e.g., up to Conv2 on Walker).

Figure 4.4b summarizes this experiment with the (normalized) area under the reward curve (AURC). We preferred this metric over "highest reward achieved" since the latter does not consider how quickly the agent learns. On MSR Jump, the pretrained layers that have similar value estimation errors in Figure 4.4a can be frozen without degrading adaptation. But freezing yields a lower AURC when the value estimation error stagnates (as for Linear). Similarly, freezing the last two layers on DeepMind Control (Conv3 and Conv4, which did not match Finetuned's value estimation error) also degrades performance. We also see that adapting too many layers (i.e., when Conv1 is frozen but not Conv2) reduces the AURC.
The training curves show this is due to slower convergence, suggesting that Conv2 already computes useful representations which can be readily used for the downstream task.

Merging the observations from our layer-by-layer experiments, we conclude that pretrained layers which extract task-agnostic information can be frozen. We show in Section 4.6 that this conclusion significantly helps stabilize training when finetuning struggles, as in Habitat.

4.4 Finetuning with a policy-induced self-supervised objective

Section 4.3.6 suggests that an important outcome of finetuning is that states which are assigned the same actions cluster together. This section builds on this insight and introduces PiSCO, a self-supervised objective which attempts to accelerate the discovery of representations that cluster together for similar actions.

At a high level, PiSCO works as follows: given a policy π, it ensures that π(a | s₁) and π(a | s₂) are similar if states s₁ and s₂ should yield similar actions (e.g., s₁ and s₂ are perturbed versions of a state s). This objective is self-supervised because it applies to any policy π (not just the optimal one) and requires knowledge of neither rewards nor dynamics.

More formally, we assume access to a batch of states B from which we can sample a state s. Further, we compute two embedding representations for state s. The first, z = f(s), is obtained by passing s through an encoder f(·); the second, p = h(z), applies a projector h(·) on top of the representation z. Both representations z and p have identical dimensionality and can thus be used to condition the policy π. The core contribution behind PiSCO is to measure the dissimilarity between z and p in terms of the distributions they induce through π:

$$D(z, p) = \mathrm{KL}\big(\perp(\pi(\cdot \mid z)) \,\|\, \pi(\cdot \mid p)\big),$$

where KL(· ‖ ·) is the Kullback-Leibler divergence, and ⊥(x) is the stop-gradient operator, which sets all partial derivatives with respect to x to 0. The choice of the Kullback-Leibler divergence is arbitrary — other statistical divergences are valid alternatives.

The final ingredient in PiSCO is a distribution T(s′ | s) over perturbed states s′. This distribution is typically implemented by randomly applying a set of data augmentation transformations to the state s. The PiSCO objective (for Policy-induced Self-Consistency Objective) is then to minimize the dissimilarity between the representations of states sampled from T:

$$\mathcal{L}_{\text{PiSCO}} = \mathbb{E}_{\substack{s \sim \mathcal{B} \\ s_1, s_2 \sim T(\cdot \mid s)}}\left[\frac{1}{2}\big(D(z_1, p_2) + D(z_2, p_1)\big)\right],$$

where s₁ and s₂ are two different perturbations of state s (with zᵢ = f(sᵢ) and pᵢ = h(zᵢ)), and the objective uses a symmetrized version of the dissimilarity measure D. In practice, the PiSCO objective is added as an auxiliary term to the underlying RL objective and optimized with respect to the encoder f and the projector h. Pseudocode is available in Section 4.5.
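For concreteness, here is a minimal PyTorch sketch of the symmetrized loss above for a discrete-action policy; the encoder `f`, projector `h`, and policy head `policy_logits` are hypothetical stand-ins, and `.detach()` plays the role of the stop-gradient ⊥.

    import torch
    import torch.nn.functional as F

    def pisco_loss(f, h, policy_logits, s1, s2):
        """Symmetrized policy-induced consistency loss between two
        augmented views s1, s2 of the same batch of states."""
        z1, z2 = f(s1), f(s2)          # encoder representations
        p1, p2 = h(z1), h(z2)          # projector representations

        def D(z, p):
            # KL( stop-grad(pi(.|z)) || pi(.|p) ) for a discrete policy.
            target = F.softmax(policy_logits(z).detach(), dim=-1)
            log_pred = F.log_softmax(policy_logits(p), dim=-1)
            return F.kl_div(log_pred, target, reduction="batchmean")

        return 0.5 * (D(z1, p2) + D(z2, p1))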
Remarks. PiSCO is reminiscent of SimSiam [39] and only differs in how the dissimilarity between embeddings is measured. Our proposed policy-induced similarity measure is crucial for best performance, as shown in Section 4.6.2. In fact, similar representation learning algorithms for RL could be derived by replacing the embedding similarity measure in existing self-supervised algorithms, such as SimCLR [38], MoCo [91], or VICReg [17]. We chose SimSiam for its simplicity – no target encoder nor negative samples required – which suits the requirements of RL training. Alone, PiSCO is not a useful objective to solve RL tasks; rather, its utility stems from assisting the underlying RL algorithm in learning representations that are robust to perturbations. Could we obtain the same benefits by learning the policy with augmented data (and thus side-stepping the need for PiSCO)? In principle, yes. However, the policy objective is typically a function of the rewards, which are known to be excessively noisy for robustly learning representations. For example, DrQ avoids learning representations through the policy objective, and instead relies on Bellman error minimization. We hypothesize that PiSCO succeeds in learning robust representations because its self-supervised objective is less noisy than the policy objective.

4.5 Pseudocode

Here we provide pseudocode for implementing PiSCO on top of two popular reinforcement learning algorithms. Algorithm 1 adds PiSCO to PPO [199], while Algorithm 2 adds it to DrQ-v2 [253].

Algorithm 1 PPO with PiSCO

    # sample transitions from the replay buffer
    s, a, r, s' = replay.sample(batch_size)
    z = f(s)

    # compute the value loss
    L_V = 0.5 * (V(z) - discount(r))^2

    # compute the policy loss
    Δπ = log π(a | z) - log π_old(a | z)
    A = GAE(V(z), r, γ, τ)
    L_π = min(exp(Δπ) * A, clip(exp(Δπ) * A, 1 - ε, 1 + ε))

    # compute the PiSCO loss
    z1, z2 = f(data_augment(s)), f(data_augment(s))
    p1, p2 = h(z1), h(z2)
    L_PiSCO = 0.5 * KL(⊥(π(· | z1)) || π(· | p2)) + 0.5 * KL(⊥(π(· | z2)) || π(· | p1))

    adam.optimize(L_π + ν * L_V + λ * L_PiSCO - β * H(π(· | z)))  # optimizes V, π, h, and f

4.6 Experiments

We complete our analysis with a closer look at PiSCO and partially frozen feature encoders. First, we check whether partial freezing and policy-induced supervision can help accelerate finetuning with RL; second, we compare policy-induced supervision as an RL finetuning objective against more common similarity measures from self-supervised learning.

4.6.1 Partial freezing and policy-induced supervision improve RL finetuning

This section shows that partially freezing a pretrained feature encoder can significantly help stabilize downstream training. We revisit the experimental setup of Figure 4.1, this time including two new transfer variations.

Algorithm 2 DrQ-v2 with PiSCO

    # sample transitions from the replay buffer
    s, a, r, s' = replay.sample(batch_size, n_steps, γ)

    # compute the policy loss
    z = ⊥(f(data_augment(s)))
    â = π(· | z).rsample()
    L_π = -min(Q1(z, â), Q2(z, â))
    adam.optimize(L_π)  # only optimizes π

    # compute the action-value loss
    z = f(data_augment(s))
    z' = f(data_augment(s'))
    a' = π(· | z').sample()
    q1, q2 = Q1(z, a), Q2(z, a)
    q' = ⊥(r + γ * min(Q1(z', a'), Q2(z', a')))
    L_Q = 0.5 * (q1 - q')^2 + 0.5 * (q2 - q')^2

    # compute the PiSCO loss
    z1, z2 = f(data_augment(s)), f(data_augment(s))
    p1, p2 = h(z1), h(z2)
    L_PiSCO = 0.5 * KL(⊥(π(· | z1)) || π(· | p2)) + 0.5 * KL(⊥(π(· | z2)) || π(· | p1))

    adam.optimize(L_Q + λ * L_PiSCO)  # only optimizes Q1, Q2, h, and f

[Figure 4.6: transfer curves for Frozen, Finetuned, Frozen+Finetuned, and Frozen+PiSCO on MSR Jump (with an Upper Bound line), DMC Walker-Run, Habitat Gibson, and Habitat Matterport3D.]

Figure 4.6: Partial freezing improves convergence; adding our policy-induced consistency objective improves further.
As suggested by Section 4.3.7, we freeze the early layers of the feature extractor and finetune the rest of the parameters without (Frozen+Finetuned) or with our policy-induced consistency objective (Frozen+PiSCO). On challenging tasks (e.g., Habitat), partial freezing dramatically boosts downstream performance, while Frozen+PiSCO further improves upon Frozen+Finetuned across the board.

The first variation is Frozen+Finetuned, where we freeze early layers and finetune the remaining ones as with Finetuned. The second, Frozen+PiSCO, additionally includes the PiSCO objective as an auxiliary loss to help finetune representations. Frozen+PiSCO involves one extra hyper-parameter, namely the auxiliary loss weight, which we set to 0.01 for all benchmarks. In both cases, we identify which layers to freeze and which to finetune following a linear probing setup similar to that of Figure 4.4a. For each layer, we measure the action-value estimation error of pretrained and finetuned representations, and only freeze the first pretrained layers that closely match the finetuned ones. On MSR Jump, we freeze up to Conv3; on DeepMind Control (Walker), up to Conv2; and on Habitat, we freeze up to Conv8 for Gibson and up to Conv7 for Matterport3D.

We report convergence curves in Figure 4.6. As suggested by our analysis of Figure 4.4b, freezing those early layers does not degrade performance; in fact, we see significant gains on Habitat, both in terms of convergence rate and in asymptotic rewards. Those results are particularly noticeable since Finetuned struggled to outperform Frozen on those tasks. We also note that Frozen+PiSCO improves upon Frozen+Finetuned across the board: it accelerates convergence on MSR Jump, and also improves asymptotic performance on DeepMind Control and Habitat. These results tie all our analyses together: they show that (a) freezing early layers can help stabilize transfer in visual RL, (b) which layers to freeze is predicted by how well they encode action-value information, and (c) policy-induced supervision (and PiSCO) is a useful objective to accelerate RL finetuning.

[Figure 4.7: convergence curves for Frozen+Finetuned, Frozen+CURL, Frozen+SimSiam, and Frozen+PiSCO on Walker-Run, Cartpole-Swingup, and Hopper-Hop.]

Figure 4.7: Policy-induced supervision is a better objective than representation alignment for finetuning. We compare our policy-induced consistency objective as a measure of representation similarity against popular representation learning objectives (e.g., SimSiam, CURL). Policy supervision provides more useful similarities for RL finetuning, which accelerates convergence and reaches higher rewards.

4.6.2 Policy-induced supervision improves upon contrastive predictive coding

As a final experiment, we answer how important it is to evaluate similarity through the policy, rather than with representation alignment as in contrastive predictive coding. In other words, could we swap PiSCO for SimSiam and obtain similar results in Figure 4.6? Our ablation studies focus on DeepMind Control, as it is more challenging than MSR Jump yet computationally tractable, unlike Habitat. We include Frozen+SimSiam and Frozen+CURL in addition to Frozen+Finetuned and Frozen+PiSCO. Frozen+SimSiam is akin to Frozen+PiSCO but uses the negative cosine similarity (sketched below) to measure the dissimilarity between embeddings z and p. For Frozen+CURL, we implement the contrastive auxiliary loss of CURL [117], a self-supervised method specifically designed for RL. Both of these alternatives use DrQ as the underlying RL algorithm.
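For reference, this is the kind of embedding-space dissimilarity the Frozen+SimSiam baseline uses in place of the policy-induced KL — a sketch, assuming `z` and `p` are batches of encoder and projector embeddings.

    import torch.nn.functional as F

    def simsiam_dissimilarity(z, p):
        """Negative cosine similarity between a stop-gradiented embedding z
        and a projected embedding p (averaged over the batch)."""
        z = F.normalize(z.detach(), dim=-1)  # stop-gradient branch
        p = F.normalize(p, dim=-1)
        return -(z * p).sum(dim=-1).mean()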
All DeepMind Control tasks use the same pretraining and transfer setups. In Cartpole, we pretrain on the Balance task, where the pole starts in an upright position and our goal is to keep it balanced. We then transfer to Swingup, where the pole starts upside down and the goal is to swing and balance it upright. With Hopper, the pretraining task is Stand, where the agent is asked to stand up still; the downstream task is Hop, where the agent should move in the forward direction as fast as it can. Note that Hopper-Hop is considered more challenging than the Cartpole and Walker tasks; DrQ-v2 fails to solve it even after 3M iterations [253].

Figure 4.7 reports convergence curves on Walker Walk→Run and Hopper Stand→Hop³. Including the SimSiam objective typically improves upon vanilla finetuning, but including CURL does not. We hypothesize that most of the benefits of CURL are already captured by DrQ, as shown in prior work [252]. PiSCO significantly outperforms the standard self-supervised methods, thus demonstrating the efficacy of using the policy when measuring similarity between embeddings. We hypothesize that representation alignment objectives naturally map two similarly looking states to similar representations, even when they shouldn't (e.g., when the agent is 15 and 14 pixels away from the obstacle in MSR Jump). Instead, PiSCO is free to assign different representations, since those two states might induce different actions ("move right" and "jump", respectively).

³ Hopper-Hop is especially difficult; e.g., DrQ-v2 fails to solve it after 3M iterations [253].

4.6.3 Selecting frozen layers in Habitat

We decided to freeze up to layers Conv8 and Conv7 for the Gibson and Matterport3D experiments in Section 4.6.1. This section explains how our analysis informs the choice of those layers, without resorting to a finetuned policy nor layer-by-layer freezing experiments as in Figure 4.4b.

Building on our analysis, we obtained 10,000 state-action pairs from a policy trained directly on Gibson and Matterport3D. This policy is provided by "Habitat-Lab", but in practice a couple of expert demonstrations might suffice. As in Figure 4.4a, we fit an action-value function and save the action values for the observation-action pairs from the collected state-action pairs. Then, we replicate the layer-by-layer linear probing experiment of Section 4.3.7 with a randomly initialized feature encoder and the one pretrained on ImageNet. Two differences from Section 4.3.7 stem from the large and visually rich observations of Habitat. First, we project each layer's representations to a PCA embedding of dimension 600 rather than 50⁴; second, we omit the first convolutional layer, as PCA is prohibitively slow on such large representations. Figure 4.8 displays those results. As expected, the downward trend is less evident than in Figure 4.4a, since the transfer gap between ImageNet and Habitat is larger than between tasks in MSR Jump or DeepMind Control.
Nonetheless, we can easily identify the representations in layers Conv8 and Conv7 as the most predictive of action values on Gibson and Matterport3D, thus justifying our choice.

[Figure 4.8: layer-by-layer value estimation error for Random and Frozen (ImageNet-pretrained) feature extractors, from Conv2 through the final LayerNorm, on Gibson (left) and Matterport3D (right).]

Figure 4.8: Identifying which layers to freeze on Habitat tasks. We replicate the layer-by-layer linear probing experiments on Habitat with the ImageNet-pretrained feature extractor. Although the downward trend is less evident than in Figure 4.4a, we clearly see that layers Conv8 and Conv7 yield the lowest value prediction error on Gibson and Matterport3D, respectively.

4.6.4 DeepMind Control transfer from ImageNet

Given the promising results of large feature extractors pretrained on large amounts of data (Figures 4.1c and 4.1d), we investigate whether similarly trained feature extractors would also improve transfer on DeepMind Control tasks. To that end, we use the same ConvNeXt as in our Habitat experiments, freeze its weights, and learn policy and action-value heads on each of the three DeepMind Control tasks (replicating the setup of Figure 4.1).

Figure 4.9 provides convergence curves comparing the 4-layer CNN pretrained on the source DeepMind Control task (i.e., either Walker-Walk, Cartpole-Balance, or Hopper-Stand) against the ConvNeXt pretrained on ImageNet. The CNN drastically outperforms the ConvNeXt, despite having fewer parameters and being trained on less data. We explain those results as follows. The smaller network is trained on data that is more relevant to the downstream task, and thus has a smaller generalization gap to bridge. In other words, more data is useful insofar as it is relevant to the downstream task. More parameters might help (i.e., pretraining the ConvNeXt on DeepMind Control tasks), but it is notoriously difficult to train very deep architectures from scratch with reinforcement learning, as shown with De Novo on the Habitat tasks.

⁴ Dimensions ranging from 500 to 750 work equally well. Smaller dimensions yield underfitting, as PCA doesn't retain enough information from the representations. Larger dimensions yield overfitting, and pretrained representations don't do much better than random ones.

[Figure 4.9: convergence curves comparing Frozen (source-task pretraining) against Frozen (ImageNet) on Walker-Run, Cartpole-Swingup, and Hopper-Hop.]

Figure 4.9: Pretraining on large and diverse data can hurt transfer when the generalization gap is too large. When transferring representations that are pretrained on ImageNet to DeepMind Control tasks, we see a significant decrease in convergence rate. We hypothesize this is due to the lack of visual similarity between ImageNet and DeepMind Control.

4.6.5 De Novo finetuning with PiSCO

As an additional ablation, we investigate the ability of PiSCO to improve reinforcement learning from scratch. In Figure 4.10, we compare the benefits of using PiSCO when representations are (partly) frozen vs. fully finetunable (De Novo).
We see that PiSCO always improves upon learning from scratch (De Novo+PiSCO outperforms De Novo), but that it is most beneficial when the task-agnostic features are frozen (Frozen+PiSCO outperforms De Novo+PiSCO).

[Figure 4.10: convergence curves for De Novo, De Novo+PiSCO, Frozen, and Frozen+PiSCO on Walker-Run, Cartpole-Swingup, and Hopper-Hop.]

Figure 4.10: PiSCO can also improve representation learning on source tasks, but shines with pretrained representations. When benchmarking PiSCO combined with fully finetunable features (De Novo+PiSCO), we observe that it marginally outperforms DrQ-v2 on the downstream tasks (De Novo). However, the best performance is obtained when also transferring task-agnostic features (Frozen+PiSCO).

4.6.6 PiSCO without projection layers

The original SimSiam formulation includes a projector layer h(·), and Chen and He [39] show that this projector is crucial to prevent representation collapse. Does this also hold true for RL? To answer this question, we compare in Figure 4.11 the performance of a (partially) frozen feature extractor finetuned with PiSCO, with and without a projection layer. The results clearly answer our question in the affirmative: the projector is also required, and removing it drastically degrades performance on the downstream task.

4.6.7 Comparing and combining with SPR

To illustrate future avenues for extending the ideas motivating PiSCO, we combine them with the self-predictive representation (SPR) objective of Schwarzer et al. [200]. We replace SimSiam with SPR as the auxiliary self-supervised objective used during finetuning. Concretely, we reimplemented SPR for DeepMind Control tasks and compare the original implementation (which uses the cosine similarity cos(p₁, p₂) to compare representations) against a variant (PiSPR) which uses the policy-induced objective ½(D(p₁, p₂) + D(p₂, p₁))⁵.

[Figure 4.11: convergence curves for Frozen, Frozen+PiSCO (no proj.), and Frozen+PiSCO (w/ proj.) on Walker-Run, Cartpole-Swingup, and Hopper-Hop.]

Figure 4.11: Is the projector h(·) necessary in PiSCO's formulation? Yes. Removing the projection layer when finetuning task-specific layers (and freezing task-agnostic layers) drastically degrades performance on all DeepMind Control tasks.
In that sense, finetuning withPiSCO offers an compelling alternative: it is substantially cheaper ( ˜ 3x faster in wall-clock time), doesn’t require learning a transition model nor keeping track of momentum models, and is simpler to tune (fewer hyper-parameters). More in- terestingly, we find that replacing the cosine similarity inSPR for the policy-induced objective (i.e., PiSPR) significantly improves performance overSPR. Those results further validate the generality of our analysis insights. 5 Note: p 1 and p 2 correspond to ˆ y t+k and ˜ y t+k in the originalSPR notation. 75 0.5 1.0 1.5 2.0 Steps × 10 5 100 200 300 400 500 600 Rewards (↑ better) DMC Frozen+Finetuned Frozen+PiSCO Frozen+SPR Frozen+PiSPR Walker-Run 0.25 0.50 0.75 1.00 Steps × 10 5 200 300 400 500 600 700 800 900 Rewards (↑ better) DMC Frozen+Finetuned Frozen+PiSCO Frozen+SPR Frozen+PiSPR Cartpole-Swingup 0.5 1.0 1.5 2.0 Steps × 10 5 0 50 100 150 200 250 300 Rewards (↑ better) DMC Frozen+Finetuned Frozen+PiSCO Frozen+SPR Frozen+PiSPR Hopper-Hop Figure 4.12: CombiningSPR withPiSCO significantly improves the performance ofSPR. Swap- ping the cosine similarity objective inSPR [200] for the policy-induced objective suggested by our analysis significantly improves finetuning. Still, finetuning with PiSCO (based on SimSiam [39]) yields the best performance, while remaining easier to implement and faster in terms of wall-clock time. 4.7 Conclusion Our analysis of representation learning for transfer reveals several new insights on the roles of finetuning: similarity between representations should reflect whether they induce the same or sim- ilar distribution of actions; not all layers in encoders of states need to be adapted. Building on those insights, we develop a hybrid approach which partially freezes the bottom (and readily trans- ferrable) layers while finetuning the top layers with a policy-induced self-supervised objective. This approach is especially effective for hard downstream tasks, as it alleviates the challenges of finetuning rich visual representations with reinforcement learning. 76 Chapter 5 Sampling Tasks for Better Meta-Learning 5.1 Introduction Large amounts of high-quality data have been the key for the success of deep learning algorithms. Furthermore, factors such as data augmentation and sampling affect model performance signifi- cantly. Continuously collecting and curating data is a resource (cost, time, storage, etc.) intensive process. Hence, recently, the machine learning community has been exploring methods for per- forming transfer-learning from large datasets to unseen tasks with limited data. A popular genre of these approaches is called meta-learning few-shot approaches, where, in addition to the limited data from the task of interest, a large dataset of disjoint tasks is available for (pre-)training. These approaches are prevalent in the area of computer vision [115] and rein- forcement learning [27]. A key component of these methods is the notion of episodic training, which refers to sampling tasks from the larger dataset for training. By learning to solve these tasks correctly, the model can generalize to new tasks. However, sampling for episodic training remains surprisingly understudied despite numer- ous methods and applications that build on it. To the best of our knowledge, only a handful of works [271, 212, 130] explicitly considered the consequences of sampling episodes. 
In comparison, stochastic [178] and mini-batch [25] sampling alternatives have been thoroughly analyzed from the perspectives of optimization [75, 26], information theory [102, 47], and stochastic processes [269, 263], among many others. Building a similar understanding of sampling for episodic training will help theoreticians and practitioners develop improved sampling schemes, and is thus of crucial importance to both.

In this chapter, we explore many sampling schemes to understand their impact on few-shot methods. Our work revolves around the following fundamental question: what is the best way to sample episodes? Our focus will be restricted to image classification in the few-shot learning setting – where "best" is taken to mean "higher transfer accuracy on unseen episodes" – and we leave analyses and applications in other areas for future work. Contrary to prior work, our experiments indicate that sampling uniformly with respect to episode difficulty yields higher classification accuracy – a scheme originally proposed to regularize metric learning [241]. To better understand these results, we take a closer look at the properties of episodes and what makes them difficult. Building on this understanding, we propose a method to approximate different sampling schemes, and demonstrate its efficacy on several standard few-shot learning algorithms and datasets. Concretely, we make the following contributions:

• We provide a detailed empirical analysis of episodes and their difficulty. When episodes are sampled randomly, we show that episode difficulty (approximately) follows a normal distribution and that the difficulty of an episode is largely independent of several modeling choices, including the training algorithm, the network architecture, and the training iteration.
• Leveraging our analysis, we propose simple and universally applicable modifications to the episodic sampling pipeline to approximate any sampling scheme. We then use this scheme to thoroughly compare episode sampling schemes – including easy/hard-mining, curriculum learning, and uniform sampling – and report that sampling uniformly over episode difficulty yields the best results.
• Finally, we show that sampling matters for few-shot classification, as it improves transfer accuracy for a diverse set of popular [210, 79, 69, 168] and state-of-the-art [255] algorithms on standard and cross-domain benchmarks.

5.2 Preliminaries

5.2.1 Episodic sampling and training

We define episodic sampling as subsampling few-shot tasks (or episodes) from a larger base dataset [37]. Assuming the base dataset admits a generative distribution, we sample an episode in two steps¹. First, we sample the episode classes C_τ from a class distribution p(C_τ); second, we sample the episode's data from the data distribution p(x, y | C_τ) conditioned on C_τ. This gives rise to the following log-likelihood for a model l_θ parameterized by θ:

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau \sim q(\cdot)}\left[\log l_\theta(\tau)\right], \qquad (5.1)$$

where q(τ) is the episode distribution induced by first sampling classes, then data. In practice, this expectation is approximated by sampling a batch of episodes B, each with their set τ_Q of query samples. To enable transfer to unseen classes, it is also common to include a small set τ_S of support samples to provide statistics about τ. This results in the following Monte-Carlo estimator:

$$\mathcal{L}(\theta) \approx \frac{1}{|\mathcal{B}|} \sum_{\tau \in \mathcal{B}} \frac{1}{|\tau_Q|} \sum_{(x, y) \in \tau_Q} \log l_\theta(y \mid x, \tau_S), \qquad (5.2)$$

where the data in τ_Q and τ_S are both distributed according to p(x, y | C_τ).
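The two-step sampling above is straightforward to sketch in code. The snippet below is an illustrative assumption, with `base_dataset` a hypothetical mapping from class labels to lists of examples.

    import random

    def sample_episode(base_dataset, n_ways=5, k_shots=1, n_queries=15):
        """Two-step episode sampling: first classes, then data per class."""
        classes = random.sample(list(base_dataset.keys()), n_ways)  # C_tau ~ p(C_tau)
        support, query = [], []
        for label in classes:
            examples = random.sample(base_dataset[label], k_shots + n_queries)
            support += [(x, label) for x in examples[:k_shots]]     # tau_S
            query += [(x, label) for x in examples[k_shots:]]       # tau_Q
        return support, query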
5.2.2 Few-shot algorithms

We briefly present a few representative episodic learning algorithms. A more comprehensive treatment of few-shot algorithms is presented in Wang et al. [232] and Hospedales et al. [95]. A core question in few-shot learning lies in evaluating (and maximizing) the model likelihood $l_\theta$. These algorithms can be divided into two major families: gradient-based methods, which adapt the model's parameters to the episode, and metric-based methods, which compute similarities between support and query samples in a learned embedding space.

Gradient-based few-shot methods are best illustrated through Model-Agnostic Meta-Learning [69] (MAML). The intuition behind MAML is to learn a set of initial parameters which can quickly specialize to the task at hand. To that end, MAML computes $l_\theta$ by adapting the model parameters $\theta$ via one (or more) steps of gradient ascent, and then computes the likelihood using the adapted parameters $\theta'$. Concretely, we first compute the likelihood $p_\theta(y \mid x)$ using the support set $\tau^S$, adapt the model parameters, and then evaluate the likelihood:

$l_\theta(y \mid x; \tau^S) = p_{\theta'}(y \mid x)$ s.t. $\theta' = \theta + \alpha \nabla_\theta \sum_{(x, y) \in \tau^S} \log p_\theta(y \mid x)$,

where $\alpha > 0$ is known as the adaptation learning rate. A major drawback of training with MAML lies in back-propagating through the adaptation phase, which requires higher-order gradients. To alleviate this computational burden, Almost No Inner Loop [168] (ANIL) proposes to adapt only the last classification layer of the model architecture while tying the rest of the layers across episodes. They empirically demonstrate little drop in classification accuracy while accelerating training times four-fold.

Akin to ANIL, metric-based methods also share most of their parameters across tasks; however, their aim is to learn a metric space where classes naturally cluster. To that end, metric-based algorithms learn a feature extractor $\phi_\theta$ parameterized by $\theta$ and classify according to a non-parametric rule. A representative of this family is Prototypical Network [210] (ProtoNet), which classifies query points according to their distance to class prototypes – the average embedding of a class in the support set:

$l_\theta(y \mid x; \tau^S) = \frac{\exp\left(-d(\phi_\theta(x), \phi_\theta^y)\right)}{\sum_{y' \in \mathcal{C}_\tau} \exp\left(-d(\phi_\theta(x), \phi_\theta^{y'})\right)}$ s.t. $\phi_\theta^c = \frac{1}{k} \sum_{(x, y) \in \tau^S,\, y = c} \phi_\theta(x)$,

where $d(\cdot, \cdot)$ is a distance function such as the Euclidean distance or the negative cosine similarity, and $\phi_\theta^c$ is the class prototype for class $c$. Other classification rules include support vector clustering [122], neighborhood component analysis [112], and the earth-mover distance [265].
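As a concrete illustration of the metric-based family, here is a minimal sketch of the ProtoNet classification rule with a squared Euclidean distance. The function name and `phi` (the feature extractor) are assumptions for illustration, not the chapter's actual implementation:

import torch

def protonet_log_likelihood(phi, support_x, support_y, query_x, n_way):
    """Returns log l_theta(y | x; tau^S) for every query point."""
    z_support = phi(support_x)              # shape: (n_way * k_shot, d)
    z_query = phi(query_x)                  # shape: (n_query, d)
    # Class prototypes: mean support embedding of each class.
    prototypes = torch.stack([z_support[support_y == c].mean(dim=0)
                              for c in range(n_way)])
    # Negative squared distances act as classification logits.
    logits = -torch.cdist(z_query, prototypes) ** 2
    return torch.log_softmax(logits, dim=1)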
5.2.3 Episode difficulty

Given an episode $\tau$ and likelihood function $l_\theta$, we define episode difficulty to be the negative log-likelihood incurred on that episode:

$\Omega_{l_\theta}(\tau) = -\log l_\theta(\tau)$,

which is a surrogate for how hard it is to classify the samples in $\tau^Q$ correctly, given $l_\theta$ and $\tau^S$. By definition, this choice of episode difficulty is tied to the choice of the likelihood function $l_\theta$. Dhillon et al. [52] use a similar surrogate as a means to systematically report few-shot performances. We use this definition because it is equivalent to the loss associated with the likelihood function $l_\theta$ on episode $\tau$, which is readily available at training time.

5.3 Methodology

In this section, we describe the core assumptions and methodology used in our study of sampling methods for episodic training. Our proposed method builds on importance sampling [81] (IS), which we found compelling for three reasons: (i) IS is well understood and solidly grounded from a theoretical standpoint, (ii) IS is universally applicable and thus compatible with all episodic training algorithms, and (iii) IS is simple to implement, with little requirement for hyper-parameter tuning.

Algorithm 3 Episodic training with Importance Sampling
Input: target (p) and proposal (q) distributions, likelihood function $l_\theta$, optimizer OPT.
  Randomly initialize model parameters $\theta$.
  repeat
    Sample a mini-batch $\mathcal{B}$ of episodes from $q(\tau)$.
    for each episode $\tau$ in mini-batch $\mathcal{B}$ do
      Compute episode likelihood: $l_\theta(\tau)$.
      Compute importance weight: $w(\tau) = \frac{p(\tau)}{q(\tau)}$.
    end for
    Aggregate: $\mathcal{L}(\theta) \leftarrow \sum_{\tau \in \mathcal{B}} w(\tau) \log l_\theta(\tau)$.
    Compute the effective sample size $\mathrm{ESS}(\mathcal{B})$.
    Update model parameters: $\theta \leftarrow \mathrm{OPT}\left(\frac{\mathcal{L}(\theta)}{\mathrm{ESS}(\mathcal{B})}\right)$.
  until parameters $\theta$ have converged.

Why should we care about episodic sampling? A back-of-the-envelope calculation² suggests that there are on the order of $10^{162}$ different training episodes for the smallest-scale experiments in Section 5.5. Since iterating through each of them is infeasible, we ought to express some preference over which episodes to sample. In the following, we describe a method that allows us to specify this preference.

² For a base dataset with N classes and K input-output pairs per class, there are a total of $\binom{N}{n}\binom{K}{k}^n$ possible episodes that can be created when sampling k pairs each from n classes.

5.3.1 Importance sampling for episodic training

Let us assume that the sampling scheme described in Section 5.2.1 induces a distribution $q(\tau)$ over episodes. We call it the proposal distribution, and assume knowledge of its density function. We wish to estimate the expectation in Equation (5.1) when sampling episodes according to a target distribution $p(\tau)$ of our choice, rather than $q(\tau)$. To that end, we can use an importance sampling estimator, which simply re-weights the observed values for a given episode $\tau$ by $w(\tau) = \frac{p(\tau)}{q(\tau)}$, the ratio of the target and proposal densities:

$\mathbb{E}_{\tau \sim p(\cdot)}\left[\log l_\theta(\tau)\right] = \mathbb{E}_{\tau \sim q(\cdot)}\left[w(\tau) \log l_\theta(\tau)\right]$.

The importance sampling identity holds whenever $q(\tau)$ has non-zero density over the support of $p(\tau)$, and effectively allows us to sample from any target distribution $p(\tau)$. A practical issue of the IS estimator arises when some values of $w(\tau)$ become much larger than others; in that case, the likelihoods $l_\theta(\tau)$ associated with mini-batches containing heavier weights dominate the others, leading to disparities. To account for this effect, we can replace the mini-batch average in the Monte-Carlo estimate of Equation (5.2) by the effective sample size $\mathrm{ESS}(\mathcal{B})$ [106, 132]:

$\mathbb{E}_{\tau \sim p(\cdot)}\left[\log l_\theta(\tau)\right] \approx \frac{1}{\mathrm{ESS}(\mathcal{B})} \sum_{\tau \in \mathcal{B}} w(\tau) \log l_\theta(\tau)$ s.t. $\mathrm{ESS}(\mathcal{B}) = \frac{\left(\sum_{\tau \in \mathcal{B}} w(\tau)\right)^2}{\sum_{\tau \in \mathcal{B}} w(\tau)^2}$,   (5.3)

where $\mathcal{B}$ denotes a mini-batch of episodes sampled according to $q(\tau)$. Note that when $w(\tau)$ is constant, we recover the standard mini-batch average setting, as $\mathrm{ESS}(\mathcal{B}) = |\mathcal{B}|$. Empirically, we observed that normalizing with the effective sample size avoided instabilities. This method is summarized in Algorithm 3.
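The ESS-normalized estimator of Equation (5.3) reduces to a few lines of code. Below is a minimal sketch, assuming `losses` holds the per-episode negative log-likelihoods of a mini-batch and `weights` holds the corresponding importance ratios $w(\tau)$:

import torch

def is_weighted_loss(losses, weights):
    """ESS-normalized importance-sampled mini-batch loss (Eq. 5.3)."""
    ess = weights.sum() ** 2 / (weights ** 2).sum()  # effective sample size
    return (weights * losses).sum() / ess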
5.3.2 Modeling the proposal distribution

A priori, we do not have access to the proposal distribution $q(\tau)$ (nor its density) and thus need to estimate it empirically. Our main assumption is that sampling episodes from $q(\tau)$ induces a normal distribution over episode difficulty. With this assumption, we model the proposal distribution by this induced distribution, therefore replacing $q(\tau)$ with $\mathcal{N}(\Omega_{l_\theta}(\tau) \mid \mu, \sigma^2)$, where $\mu, \sigma^2$ are the mean and variance parameters. As we will see in Section 5.5.2, this normality assumption is experimentally supported on various datasets, algorithms, and architectures.

We consider two settings for the estimation of $\mu$ and $\sigma^2$: offline and online. The offline setting consists of sampling 1,000 training episodes before training, and computing $\mu, \sigma^2$ using a model pre-trained on the same base dataset. Though this setting seems unrealistic, i.e., having access to a pre-trained model, several meta-learning few-shot methods start with a pre-trained model which they further build upon. Hence, for such methods there is no overhead. For the online setting, we estimate the parameters on-the-fly using the model currently being trained. This is justified by the analysis in Section 5.5.2, which shows that episode difficulty transfers across model parameters during training. We update our estimates of $\mu, \sigma^2$ with an exponential moving average:

$\mu \leftarrow \lambda \mu + (1 - \lambda)\, \Omega_{l_\theta}(\tau)$ and $\sigma^2 \leftarrow \lambda \sigma^2 + (1 - \lambda)(\Omega_{l_\theta}(\tau) - \mu)^2$,

where $\lambda \in [0, 1]$ controls the adjustment rate of the estimates, and the initial values of $\mu, \sigma^2$ are computed in a warm-up phase lasting 100 iterations. Keeping λ = 0.9 worked well for all our experiments (Section 5.5). We opted for this simple implementation, as more sophisticated approaches like West [236] yielded little to no benefit.

5.3.3 Modeling the target distribution

Similar to the proposal distribution, we model the target distribution by its induced distribution over episode difficulty. Our experiments compare four different approaches, all of which share the parameters $\mu, \sigma^2$ with the normal model of the proposal distribution. For numerical stability, we truncate the support of all distributions to $[\mu - 2.58\sigma, \mu + 2.58\sigma]$, which gives approximately 99% coverage for the normal distribution centered around $\mu$.

The first approach (HARD) takes inspiration from hard negative mining [207], where we wish to sample only more challenging episodes. The second approach (EASY) takes a similar view but instead samples only easier episodes. We can model both distributions as follows:

$\mathcal{U}(\Omega_{l_\theta}(\tau) \mid \mu,\ \mu + 2.58\sigma)$ (HARD) and $\mathcal{U}(\Omega_{l_\theta}(\tau) \mid \mu - 2.58\sigma,\ \mu)$ (EASY),

where $\mathcal{U}$ denotes the uniform distribution. The third (CURRICULUM) is motivated by curriculum learning [22], which slowly increases the likelihood of sampling more difficult episodes:

$\mathcal{N}(\Omega_{l_\theta}(\tau) \mid \mu_t,\ \sigma^2)$ (CURRICULUM),

where $\mu_t$ is linearly interpolated from $\mu - 2.58\sigma$ to $\mu + 2.58\sigma$ as training progresses. Finally, our fourth approach, UNIFORM, resembles distance weighted sampling [241] and consists of sampling uniformly over episode difficulty:

$\mathcal{U}(\Omega_{l_\theta}(\tau) \mid \mu - 2.58\sigma,\ \mu + 2.58\sigma)$ (UNIFORM).

Intuitively, UNIFORM can be understood as a uniform prior over unseen test episodes, with the intention of performing well across the entire difficulty spectrum. This acts as a regularizer, forcing the model to be equally discriminative for both easy and hard episodes.
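Putting the two pieces together, the following is a minimal sketch of the online UNIFORM scheme: an exponential moving average tracks the proposal parameters (μ, σ²), and each episode is weighted by the ratio of a truncated-uniform target density to the normal proposal density. The class name and warm-up handling are illustrative assumptions:

import math

class OnlineUniform:
    def __init__(self, lam=0.9, mu0=0.0, var0=1.0):
        # mu0 and var0 stand in for the warm-up estimates described above.
        self.lam, self.mu, self.var = lam, mu0, var0

    def update(self, difficulty):
        """EMA updates for the proposal parameters (mu, sigma^2)."""
        self.mu = self.lam * self.mu + (1 - self.lam) * difficulty
        self.var = self.lam * self.var + (1 - self.lam) * (difficulty - self.mu) ** 2

    def weight(self, difficulty):
        """w(tau) = p(tau) / q(tau) for the UNIFORM target."""
        sigma = math.sqrt(self.var)
        lo, hi = self.mu - 2.58 * sigma, self.mu + 2.58 * sigma
        target = 1.0 / (hi - lo) if lo <= difficulty <= hi else 0.0
        proposal = math.exp(-0.5 * ((difficulty - self.mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        return target / proposal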
5.4 Related Works

This chapter studies task sampling in the context of few-shot [144, 62] and meta-learning [192, 218].

Few-shot learning. This setting has received a lot of attention over recent years [228, 173, 187, 76]. Broadly speaking, state-of-the-art methods can be categorized into two major families: metric-based and gradient-based.

Metric-based methods learn a shared feature extractor which is used to compute the distance between samples in embedding space [210, 24, 174, 112]. The choice of metric mostly differentiates one method from another; for example, popular choices include Euclidean distance [210], negative cosine similarity [79], support vector machines [122], set-to-set functions [255], or the earth-mover distance [265].

Gradient-based algorithms, such as MAML [69], propose an objective to learn a network initialization that can quickly adapt to new tasks. Due to its minimal assumptions, MAML has been extended to probabilistic formulations [83, 257], to incorporate learned optimizers – implicit [72] or explicit [162] – and simplified to avoid expensive second-order computations [150, 170]. In that line of work, ANIL [168] claims to match MAML's performance when adapting only the last classification layer – thus greatly reducing the computational burden and bringing gradient- and metric-based methods closer together.

Sampling strategies. Sampling strategies have been studied for different training regimes. Wu et al. [241] demonstrate that "sampling matters" in the context of metric learning. They propose to sample a triplet with probability proportional to the distance between its positive and negative samples, and observe stabilized training and improved accuracy. This observation was echoed by Katharopoulos and Fleuret [102] when sampling mini-batches: carefully choosing the constituent samples of a mini-batch improves the convergence rate and asymptotic performance. Like ours, their method builds on importance sampling [208, 58, 101], but whereas they compute importance weights using the magnitude of the model's gradients, we use the episode's difficulty. Similar insights were also observed in reinforcement learning, where Schaul et al. [191] suggest a scheme to sample transitions according to the temporal-difference error.

Closer to our work, Sun et al. [212] present a hard-mining scheme where the most challenging classes across episodes are pooled together and used to create new episodes. Observing that the difficulty of a class is intrinsically linked to the other classes in the episode, Liu et al. [130] propose a mechanism to track the difficulty across every class pair. They use this mechanism to build a curriculum [22, 243] of increasingly difficult episodes. In contrast to these two approaches, our proposed method makes use of importance sampling to mimic the target distribution rather than sampling from it directly. This helps achieve fast and efficient sampling without any preprocessing requirements.

5.5 Experiments

We first validate the assumptions underlying our proposed IS estimator and shed light on the properties of episode difficulty. Then, we answer the question we pose in the introduction, namely: what is the best way to sample episodes? Finally, we ask if better sampling improves few-shot classification.

5.5.1 Experimental setup

We review the standardized few-shot benchmarks.

Datasets. We use two standardized image classification datasets, Mini-ImageNet [228] and Tiered-ImageNet [177], both subsets of ImageNet [51]. Mini-ImageNet consists of 64 classes for training, 16 for validation, and 20 for testing; we use the class splits introduced by Ravi and Larochelle [173].
Tiered-ImageNet contains 608 classes split into 351, 97, and 160 for training, validation, and testing, respectively.

Network architectures. We train two model architectures: conv(64)×4, a 4-layer convolutional network with 64 channels per layer [228], and ResNet-12, a 12-layer deep residual network [90] introduced by Oreshkin et al. [156]. Both architectures use batch normalization [98] after every convolutional layer and ReLU as the non-linearity.

[Figure 5.1: density plot (left) and Q-Q plot (right) of episode difficulty for ProtoNet (cosine) and MAML.]

Figure 5.1: Episode difficulty is approximately normally distributed. Density (left) and Q-Q (right) plots of the episode difficulty computed by conv(64)×4 models on Mini-ImageNet (1-shot 5-way), trained using ProtoNet (cosine) and MAML (depicted in the legends). The values are computed over 10k test episodes. The density plots follow a bell curve, with the density peak in the middle which quickly drops off on either side of the peak. The Q-Q plots are close to the identity line (in black); the closer the curve is to the identity line, the closer the distribution is to a normal. Both suggest that the episode difficulty distribution can be approximated by a normal.

Training algorithms. For the metric-based family, we use ProtoNet with Euclidean [210] and scaled negative cosine similarity [79] measures. Additionally, we use MAML [69] and ANIL [168] as representative gradient-based algorithms.

Hyper-parameters. We tune hyper-parameters for each algorithm and dataset to work well across different few-shot settings and network architectures. Additionally, we keep the hyper-parameters the same across all sampling methods for a fair comparison. We train for 20k iterations with a mini-batch of size 16 and 32 for Mini-ImageNet and Tiered-ImageNet, respectively, and validate every 1k iterations on 1k episodes. The best performing model is finally evaluated on 1k test episodes.

5.5.2 Understanding episode difficulty

All the models in this subsection are trained using baseline sampling as described in Section 5.2.1, i.e., episodic training without importance sampling.

5.5.2.1 Episode difficulty is approximately normally distributed

We begin our analysis by verifying that the distribution over episode difficulty induced by $q(\tau)$ is approximately normal. In Figure 5.1, we use the difficulty of 10k test episodes sampled with $q(\tau)$. The difficulties are computed using conv(64)×4 models trained with ProtoNet and MAML on Mini-ImageNet for 1-shot 5-way classification. The episode difficulty density plots follow a bell curve, which is naturally modeled with a normal distribution. The Q-Q plots, typically used to assess normality, suggest the same – the closer the curve is to the identity line, the closer the distribution is to a normal.

Finally, we compute the Shapiro-Wilk test for normality [205], which tests the null hypothesis that the data is drawn from a normal distribution. Since the p-value for this test is sensitive to the sample size³, we subsample 50 values 100 times and average rejection rates over these subsets. With α = 0.05, the null hypothesis is rejected 14% and 17% of the time for Mini-ImageNet and Tiered-ImageNet, respectively, thus suggesting that episode difficulty can be reliably approximated with a normal distribution.

³ For a large sample size, the p-values are not reliable as they may detect trivial departures from normality.
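The subsampled Shapiro-Wilk procedure above is straightforward to reproduce. Here is a minimal sketch, assuming `difficulties` is an array of per-episode difficulties; the function name and fixed seed are illustrative:

import numpy as np
from scipy.stats import shapiro

def normality_rejection_rate(difficulties, n_sub=50, n_trials=100, alpha=0.05):
    """Fraction of small subsamples on which Shapiro-Wilk rejects normality."""
    rng = np.random.default_rng(0)
    rejections = 0
    for _ in range(n_trials):
        sample = rng.choice(difficulties, size=n_sub, replace=False)
        _, p_value = shapiro(sample)
        rejections += p_value < alpha
    return rejections / n_trials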
5.5.2.2 Independence from modeling choices

By definition, the notion of episode difficulty is tightly coupled to the model likelihood $l_\theta$ (Section 5.2.3), and hence to modeling variables such as the learning algorithm, the network architecture, and the model parameters. We check whether episode difficulty transfers across different choices for these variables. We are concerned with the relative ranking of episode difficulty and not the actual values. To this end, we use the Spearman rank-order correlation coefficient, a non-parametric measure of the monotonicity of the relationship between two sets of values. This value lies within $[-1, 1]$, with 0 implying no correlation, and $+1$ and $-1$ implying exact positive and negative correlations, respectively.

Training algorithm. We first check the dependence on the training algorithm. We use all four algorithms to train conv(64)×4 models for 1-shot 5-way classification on Mini-ImageNet, then compute episode difficulty over 10k test episodes. The Spearman rank-order correlation coefficients for the difficulty values computed with respect to all possible pairs of training algorithms are > 0.65. This positive correlation is illustrated in Figure 5.2 and suggests that an episode that is difficult for one training algorithm is very likely to be difficult for another.

[Figure 5.3: scatter plots of episode difficulty, conv(64)×4 versus ResNet-12, for ProtoNet (cosine) and MAML.]

Figure 5.3: Episode difficulty transfers across network architectures. Scatter plots (with regression lines) of the episode difficulty computed by conv(64)×4 and ResNet-12 models trained using different algorithms. This is computed for 10k 1-shot 5-way test episodes from Mini-ImageNet. We observe a strong positive correlation between the computed values for both network architectures.

Network architecture. Next, we analyze the dependence on the network architecture. We trained conv(64)×4 and ResNet-12 models using all training algorithms for Mini-ImageNet 1-shot 5-way classification. We compute the episode difficulties for 10k test episodes and compute their Spearman rank-order correlation coefficients across the two architectures, for a given algorithm. The correlation coefficients are 0.57 for ProtoNet (Euclidean), 0.72 for ProtoNet (cosine), 0.58 for MAML, and 0.49 for ANIL. Figure 5.3 illustrates this positive correlation, suggesting that episode difficulty transfers across network architectures with high probability.
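The rank-correlation checks used throughout this subsection take one line with SciPy; a minimal sketch (the function name is illustrative):

from scipy.stats import spearmanr

def difficulty_rank_correlation(difficulties_a, difficulties_b):
    """Spearman rank-order correlation between two sets of episode difficulties."""
    rho, _ = spearmanr(difficulties_a, difficulties_b)
    return rho  # in [-1, 1]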
[Figure 5.4: episode difficulty of all, easy, and hard episodes tracked over 20k training iterations for ProtoNet (cosine) and MAML.]

Figure 5.4: Episode difficulty is transferred across model parameters during training. We select the 50 easiest and hardest episodes and track their difficulty during training. This is done for conv(64)×4 models trained on Mini-ImageNet (1-shot 5-way) with different algorithms. The average difficulty of the episodes decreases over time until convergence (vertical line), after which the model overfits. Additionally, easier episodes remain easy while harder episodes remain hard, indicating that episode difficulty transfers from one set of parameters to the next.

Model parameters during training. Lastly, we study the dependence on model parameters during training. We select the 50 easiest and 50 hardest episodes, i.e., the episodes with the lowest and highest difficulties, respectively, from 1k test episodes. We track the difficulty of all episodes over the training phase and visualize the trend in Figure 5.4 for conv(64)×4 models trained using different training algorithms on Mini-ImageNet (1-shot 5-way). Throughout training, easy episodes remain easy and hard episodes remain hard, suggesting that episode difficulty transfers across model parameters during training. Since episode difficulty does not change drastically during the training process, we can estimate it with a running average over the training iterations. This justifies the online modeling of the proposal distribution in Section 5.3.2.

5.5.3 Comparing episode sampling methods

We compare different methods for episode sampling. To ensure fair comparisons, we use the offline formulation (Section 5.3.2), so that all sampling methods share the same pre-trained network (the network trained using baseline sampling) when computing proposal likelihoods. We compute results over 2 datasets, 2 network architectures, 4 algorithms, and 2 few-shot protocols, totaling 24 scenarios. Table 5.1 presents results for all methods.

Table 5.1: Few-shot accuracies on benchmark datasets for 5-way few-shot episodes in the offline setting. The mean accuracy and the 95% confidence interval are reported for evaluation done over 1k test episodes. The first row in every scenario denotes baseline sampling. Best results for a fixed scenario are shown in bold. Results where a sampling technique is better than or comparable to baseline sampling are denoted by †. Overall, UNIFORM is among the best sampling methods in 19/24 scenarios.

                      | Mini-ImageNet conv(64)×4  | Mini-ImageNet ResNet-12   | Tiered-ImageNet ResNet-12
                      | 1-shot (%)   5-shot (%)   | 1-shot (%)   5-shot (%)   | 1-shot (%)   5-shot (%)
ProtoNet (Euclidean)  | 49.06±0.60   65.28±0.52   | 49.67±0.64   67.45±0.51   | 59.10±0.73   76.95±0.56
+ EASY                | 48.83±0.61†  65.92±0.55†  | 51.08±0.63†  67.30±0.52†  | 57.68±0.75   78.10±0.53†
+ HARD                | 45.69±0.61   66.47±0.52†  | 52.50±0.62†  71.03±0.51†  | 54.85±0.71   76.15±0.56
+ CURRICULUM          | 48.23±0.63   65.77±0.51†  | 50.00±0.61†  70.49±0.51†  | 59.15±0.76†  78.25±0.53†
+ UNIFORM             | 48.19±0.62   66.73±0.52†  | 53.94±0.63†  70.79±0.49†  | 58.63±0.76†  78.62±0.55†
ProtoNet (cosine)     | 50.03±0.61   61.56±0.53   | 52.85±0.64   62.11±0.52   | 60.01±0.73   72.75±0.59
+ EASY                | 49.60±0.61†  65.17±0.53†  | 53.35±0.63†  63.55±0.53†  | 60.03±0.75†  74.65±0.57†
+ HARD                | 49.01±0.60   66.45±0.50†  | 52.65±0.63†  70.15±0.51†  | 55.44±0.72   75.97±0.55†
+ CURRICULUM          | 49.38±0.61   64.12±0.53†  | 53.21±0.65†  65.89±0.52†  | 60.37±0.76†  75.32±0.58†
+ UNIFORM             | 50.07±0.59†  66.33±0.52†  | 54.27±0.65†  70.85±0.51†  | 60.27±0.75†  78.36±0.54†
MAML                  | 46.88±0.60   55.16±0.55   | 49.92±0.65   63.93±0.59   | 55.37±0.74   72.93±0.60
+ EASY                | 44.52±0.60   57.36±0.59†  | 51.62±0.67†  64.33±0.61†  | 53.39±0.79   69.81±0.68
+ HARD                | 42.93±0.61   60.42±0.55†  | 49.57±0.69†  66.93±0.55†  | 50.48±0.73   71.20±0.63
+ CURRICULUM          | 45.42±0.60   61.61±0.55†  | 52.21±0.67†  66.25±0.60†  | 54.13±0.77   71.47±0.63
+ UNIFORM             | 46.67±0.63†  62.09±0.55†  | 52.65±0.65†  66.76±0.57†  | 54.58±0.77   72.00±0.66
ANIL                  | 46.59±0.60   63.47±0.55   | 49.65±0.65   59.51±0.56   | 54.77±0.76   69.28±0.67
+ EASY                | 44.83±0.63   62.23±0.56   | 49.40±0.64†  56.73±0.60   | 54.50±0.80†  65.45±0.66
+ HARD                | 43.30±0.58   59.87±0.55   | 47.91±0.62   62.05±0.59†  | 50.22±0.71   62.06±0.65
+ CURRICULUM          | 45.69±0.60   63.00±0.54†  | 50.22±0.66†  61.76±0.57†  | 55.59±0.78†  69.83±0.73†
+ UNIFORM             | 46.93±0.62†  62.75±0.60   | 49.56±0.62†  64.72±0.60†  | 54.15±0.79†  70.44±0.69†

We observe that, although not strictly dominant, UNIFORM tends to outperform other methods, as it is within the statistical confidence of the best method in 19/24 scenarios.
In the 5/24 scenarios where UNIFORM underperforms, it closely trails behind the best methods. Compared to baseline sampling, the average degradation of UNIFORM is −0.83% and at most −1.44% (ignoring the standard deviations) in 4/24 scenarios. Conversely, UNIFORM boosts accuracy by as much as 8.74% and on average by 3.86% (ignoring the standard deviations) in 13/24 scenarios. We attribute this overall good performance to the fact that uniform sampling puts a uniform distribution prior over the (unseen) test episodes, with the intention of performing well across the entire difficulty spectrum. This acts as a regularizer, forcing the model to be equally discriminative for easy and hard episodes. If we knew the test episode distribution, upweighting episodes that are most likely under that distribution would improve transfer accuracy [60]. However, this uninformative prior is the safest choice without additional information about the test episodes. Second best is baseline sampling, as it is statistically competitive in 10/24 scenarios, while EASY, HARD, and CURRICULUM only appear among the better methods in 4, 4, and 9 scenarios, respectively.

Table 5.2: Few-shot accuracies on benchmark datasets for 5-way few-shot episodes in the offline and online settings. The mean accuracy and the 95% confidence interval are reported for evaluation done over 1k test episodes. The first row in every scenario denotes baseline sampling. Best results for a fixed scenario are shown in bold. Results where a sampling technique is better than or comparable to baseline sampling are denoted by †. UNIFORM (Online) retains most of the performance of the offline formulation while being significantly easier to implement (online is competitive in 15/24 scenarios vs 16/24 for offline).

                      | Mini-ImageNet conv(64)×4  | Mini-ImageNet ResNet-12   | Tiered-ImageNet ResNet-12
                      | 1-shot (%)   5-shot (%)   | 1-shot (%)   5-shot (%)   | 1-shot (%)   5-shot (%)
ProtoNet (Euclidean)  | 49.06±0.60   65.28±0.52   | 49.67±0.64   67.45±0.51   | 59.10±0.73   76.95±0.56
+ UNIFORM (Offline)   | 48.19±0.62   66.73±0.52†  | 53.94±0.63†  70.79±0.49†  | 58.63±0.76†  78.62±0.55†
+ UNIFORM (Online)    | 48.39±0.62   67.86±0.50†  | 52.97±0.64†  70.63±0.50†  | 59.67±0.70†  78.73±0.55†
ProtoNet (cosine)     | 50.03±0.61   61.56±0.53   | 52.85±0.64   62.11±0.52   | 60.01±0.73   72.75±0.59
+ UNIFORM (Offline)   | 50.07±0.59†  66.33±0.52†  | 54.27±0.65†  70.85±0.51†  | 60.27±0.75†  78.36±0.54†
+ UNIFORM (Online)    | 50.06±0.61†  65.99±0.52†  | 53.90±0.63†  68.78±0.51†  | 61.37±0.72†  77.81±0.56†
MAML                  | 46.88±0.60   55.16±0.55   | 49.92±0.65   63.93±0.59   | 55.37±0.74   72.93±0.60
+ UNIFORM (Offline)   | 46.67±0.63†  62.09±0.55†  | 52.65±0.65†  66.76±0.57†  | 54.58±0.77   72.00±0.66
+ UNIFORM (Online)    | 46.70±0.61†  61.62±0.54†  | 51.17±0.68†  65.63±0.57†  | 57.15±0.74†  71.67±0.67
ANIL                  | 46.59±0.60   63.47±0.55   | 49.65±0.65   59.51±0.56   | 54.77±0.76   69.28±0.67
+ UNIFORM (Offline)   | 46.93±0.62†  62.75±0.60   | 49.56±0.62†  64.72±0.60†  | 54.15±0.79†  70.44±0.69†
+ UNIFORM (Online)    | 46.82±0.63†  62.63±0.59   | 49.82±0.68†  64.51±0.62†  | 55.18±0.74†  69.55±0.71†

5.5.4 Online approximation of the proposal distribution

Although the offline formulation is better suited for analysis experiments, it is expensive, as it requires a pre-training phase for the proposal network and two forward passes during episodic training (one for the episode loss, another for the proposal density). In this subsection, we show that the online formulation faithfully approximates offline sampling and can retain most of the performance improvements from UNIFORM.
We take the same 24 scenarios as in the previous subsection and compare baseline sampling against offline and online UNIFORM. Table 5.2 reports the full suite of results.

We observe that baseline sampling is statistically competitive in 9/24 scenarios; on the other hand, offline and online UNIFORM perform similarly on aggregate, as they are within the best results in 16/24 and 15/24 scenarios, respectively. Similar to its offline counterpart, online UNIFORM does better than or comparably to baseline sampling in 21 out of 24 scenarios. In the 3/24 scenarios where online UNIFORM underperforms compared to baseline, the average degradation is −0.92%, and at most −1.26% (ignoring the standard deviations). Conversely, in the remaining scenarios, it boosts accuracy by as much as 6.67% and on average by 2.24% (ignoring the standard deviations). Therefore, using online UNIFORM, while computationally comparable to baseline sampling, results in a boost in few-shot performance; when it underperforms, it trails closely. We also compute the mean accuracy difference between the offline and online formulations, which is 0.07% ± 0.35 accuracy points. This confirms that both the offline and online methods produce quantitatively similar outcomes.

5.5.5 Better sampling improves cross-domain transfer

To further validate the role of episode sampling as a way to improve generalization, we evaluate the models trained in the previous subsection on episodes from completely different domains. Specifically, we use the models trained on Mini-ImageNet with baseline and online UNIFORM sampling to evaluate on the test episodes of CUB-200 [235], Describable Textures [43], FGVC Aircraft [140], and VGG Flowers [153], following the splits of Triantafillou et al. [222]. Table 5.3 displays results for the complete set of experiments available. Out of the 64 total cross-domain scenarios, online UNIFORM does statistically better in 49/64 scenarios, comparably in 12/64 scenarios, and worse in only 3/64 scenarios. These results further show that sampling matters in episodic training.

5.5.6 Better sampling improves few-shot classification

The results in the previous subsections suggest that online UNIFORM yields a simple and universally applicable method to improve episode sampling. To validate that state-of-the-art methods can also benefit from better sampling, we take the recently proposed FEAT [255] algorithm and augment it with our IS-based implementation of online UNIFORM. Concretely, we use their open-source implementation⁴ to train with both baseline and online UNIFORM sampling. We use the prescribed hyper-parameters without any modifications. Results for ResNet-12 on Mini-ImageNet and Tiered-ImageNet are reported in Table 5.4, where online UNIFORM outperforms baseline sampling in 3/4 scenarios and matches it on the remaining one. Thus, better episodic sampling can improve few-shot classification even for the very best methods.

5.6 Conclusion

This chapter presents a careful study of sampling in the context of few-shot learning, with an eye on episodes and their difficulty. Following an empirical study of episode difficulty, we propose an importance sampling-based method to compare different episode sampling schemes. Our experiments suggest that sampling uniformly over episode difficulty performs best across datasets, training algorithms, network architectures, and few-shot protocols.
Avenues for future work include devising better sampling strategies, analysis beyond few-shot classification (e.g., regression, reinforcement learning), and a theoretical grounding explaining our observations.

⁴ Available at: https://github.com/Sha-Lab/FEAT

[Figure 5.2: pairwise scatter plots of episode difficulty across ProtoNet (Euclidean), ProtoNet (cosine), MAML, and ANIL.]

Figure 5.2: Episode difficulty transfers across training algorithms. Scatter plots (with regression lines) of the episode difficulty computed on 1k Mini-ImageNet test episodes (1-shot 5-way) by conv(64)×4 models trained using different algorithms. The positive correlation suggests that an episode that is difficult for one training algorithm will be difficult for another.

Table 5.3: Few-shot accuracies on benchmark datasets after training on Mini-ImageNet for 5-way few-shot episodes in the offline and online settings. The mean accuracy and the 95% confidence interval are reported for evaluation done over 1,000 test episodes. Best results for a fixed scenario are shown in bold. The first row in every scenario denotes baseline sampling. Compared to baseline sampling, online UNIFORM does statistically better in 49/64 scenarios, comparably in 12/64 scenarios, and worse in only 3/64 scenarios.

                      | conv(64)×4                | ResNet-12
                      | 1-shot (%)   5-shot (%)   | 1-shot (%)   5-shot (%)
CUB-200
ProtoNet (Euclidean)  | 37.24±0.53   52.07±0.53   | 36.53±0.54   51.49±0.56
+ UNIFORM (Online)    | 37.08±0.53   53.32±0.53   | 39.48±0.56   56.57±0.55
ProtoNet (cosine)     | 37.49±0.54   49.31±0.53   | 38.67±0.60   49.75±0.57
+ UNIFORM (Online)    | 41.56±0.58   54.17±0.53   | 40.55±0.60   56.30±0.55
MAML                  | 34.52±0.53   47.11±0.60   | 35.80±0.56   45.16±0.62
+ UNIFORM (Online)    | 35.84±0.54   46.67±0.55   | 37.18±0.55   46.58±0.58
ANIL                  | 35.40±0.54   38.20±0.56   | 33.20±0.54   39.26±0.58
+ UNIFORM (Online)    | 36.89±0.55   42.83±0.58   | 34.47±0.56   42.08±0.58
Describable Textures
ProtoNet (Euclidean)  | 32.05±0.45   45.03±0.44   | 31.87±0.45   44.10±0.43
+ UNIFORM (Online)    | 32.69±0.49   45.23±0.43   | 33.55±0.46   47.37±0.43
ProtoNet (cosine)     | 32.09±0.45   38.44±0.41   | 31.48±0.45   39.46±0.41
+ UNIFORM (Online)    | 33.63±0.47   43.28±0.44   | 32.69±0.48   45.56±0.42
MAML                  | 29.47±0.46   37.85±0.47   | 32.19±0.48   41.14±0.46
+ UNIFORM (Online)    | 31.84±0.49   40.81±0.44   | 31.65±0.46   43.21±0.44
ANIL                  | 29.86±0.46   40.69±0.46   | 28.85±0.41   37.04±0.44
+ UNIFORM (Online)    | 31.29±0.48   41.42±0.45   | 31.38±0.47   39.03±0.47
FGVC-Aircraft
ProtoNet (Euclidean)  | 26.03±0.37   39.41±0.48   | 25.98±0.39   36.76±0.45
+ UNIFORM (Online)    | 26.18±0.38   40.23±0.46   | 27.43±0.42   38.49±0.46
ProtoNet (cosine)     | 27.11±0.39   32.14±0.38   | 25.23±0.39   32.07±0.41
+ UNIFORM (Online)    | 27.15±0.38   37.78±0.45   | 26.89±0.39   37.42±0.44
MAML                  | 26.78±0.38   34.21±0.41   | 25.50±0.39   29.38±0.40
+ UNIFORM (Online)    | 26.62±0.39   34.41±0.44   | 26.22±0.39   30.21±0.43
ANIL                  | 25.67±0.37   27.17±0.36   | 23.27±0.31   24.52±0.29
+ UNIFORM (Online)    | 25.60±0.37   27.92±0.39   | 23.78±0.34   28.70±0.39
VGG Flowers
ProtoNet (Euclidean)  | 53.50±0.63   70.96±0.51   | 57.74±0.68   74.87±0.49
+ UNIFORM (Online)    | 54.72±0.65   73.59±0.49   | 55.94±0.67   76.62±0.50
ProtoNet (cosine)     | 52.94±0.62   66.04±0.53   | 52.98±0.65   66.79±0.51
+ UNIFORM (Online)    | 54.23±0.63   71.93±0.48   | 57.06±0.65   67.31±0.48
MAML                  | 49.70±0.60   63.69±0.54   | 50.13±0.64   61.41±0.63
+ UNIFORM (Online)    | 49.72±0.60   63.52±0.54   | 49.53±0.65   63.99±0.58
ANIL                  | 47.03±0.65   46.40±0.66   | 42.05±0.67   40.01±0.65
+ UNIFORM (Online)    | 47.48±0.67   47.08±0.67   | 38.94±0.61   50.25±0.63

Table 5.4: Few-shot accuracies on benchmark datasets for 5-way few-shot episodes using FEAT. The mean accuracy and the 95% confidence interval are reported for evaluation done over 10k test episodes with a ResNet-12. The first row in every scenario denotes baseline sampling. Best results for a fixed scenario are shown in bold. UNIFORM (Online) improves FEAT's accuracy in 3/4 scenarios, demonstrating that sampling matters even for state-of-the-art few-shot methods.
                      | Mini-ImageNet             | Tiered-ImageNet
                      | 1-shot (%)   5-shot (%)   | 1-shot (%)   5-shot (%)
FEAT                  | 66.02±0.20   81.17±0.14   | 70.50±0.23   84.26±0.16
+ UNIFORM (Online)    | 66.27±0.20   81.54±0.14   | 70.61±0.23   84.42±0.16

Chapter 6

Variance-Reduced Optimization for Better Meta-Learning

6.1 Introduction

We wish to solve the following minimization problem:

$\theta^* = \operatorname*{argmin}_{\theta} \mathbb{E}_{x \sim p}\left[f(\theta; x)\right]$,   (6.1)

where we only have access to samples $x$ and to a first-order oracle that gives us, for a given $\theta$ and a given $x$, the derivative of $f(\theta; x)$ with respect to $\theta$, i.e. $\frac{\partial f(\theta; x)}{\partial \theta} = g(\theta; x)$. It is known [179] that, when $f$ is smooth and strongly convex, there is a converging algorithm for Problem 6.1 that takes the form $\theta_{t+1} = \theta_t - \alpha_t\, g(\theta_t; x_t)$, where $x_t$ is a sample from $p$. This algorithm, dubbed stochastic gradient (SG), has a convergence rate of $O(1/t)$ (see for instance [32]), within a constant factor of the minimax rate for this problem. When one has access to the true gradient $g(\theta) = \mathbb{E}_{x \sim p}[g(\theta; x)]$ rather than just a sample, this rate dramatically improves to $O(e^{-\nu t})$ for some $\nu > 0$.

In addition to hurting the convergence speed, noise in the gradient makes optimization algorithms harder to tune. Indeed, while full gradient descent is convergent for a constant stepsize $\alpha$, and also amenable to line searches to find a good value for that stepsize, the stochastic gradient method from [179] with a constant stepsize only converges to a ball around the optimum [194]¹. Thus, to achieve convergence, one needs to use a decreasing stepsize. While this seems like a simple modification, the precise decrease schedule can have a dramatic impact on the convergence speed. While theory prescribes $\alpha_t = O(t^{-a})$ with $a \in (1/2, 1]$ in the smooth case, practitioners often use larger stepsizes like $\alpha_t = O(t^{-1/2})$ or even constant stepsizes.

¹ Under some conditions, it does converge linearly to the optimum [e.g., 226].

When the distribution $p$ has finite support, Eq. 6.1 becomes a finite sum and, in that setting, it is possible to achieve efficient variance reduction and drive the noise to zero, allowing stochastic methods to achieve linear convergence rates [120, 100, 267, 138, 203, 48]. Unfortunately, the finite-support assumption is critical to these algorithms and, while valid in many contexts, does not have the broad applicability of the standard SG algorithm. Several works have extended these approaches to the online setting by applying these algorithms while increasing the mini-batch size N [14, 94], but they need to revisit past examples multiple times and are not truly online. Another line of work reduces variance by averaging iterates [165, 111, 15, 71, 55, 54, 99]. While these methods converge for a constant stepsize in the stochastic case², their practical speed is heavily dependent on the fraction of iterates kept in the averaging, a hyperparameter that is thus hard to tune, and they are rarely used in deep learning.

² Under some conditions on $f$.
Our work combines two existing ideas and adds a third: (a) at every step, it updates the parameters using a weighted average of past gradients, like SAG [120, 196], albeit with a different weighting scheme; (b) it reduces the bias and variance induced by the use of these old gradients by transporting them to "equivalent" gradients computed at the current point, similar to [82]; (c) it does so implicitly, by computing the gradient at a parameter value different from the current one. The resulting gradient estimator can then be used as a plug-in replacement for the stochastic gradient within any optimization scheme. Experimentally, both SG using our estimator and its momentum variant outperform the most commonly used optimizers in deep learning.

6.2 Momentum and other approaches to dealing with variance

Stochastic variance reduction methods use an average of past gradients to reduce the variance of the gradient estimate. At first glance, it seems like their updates are similar to that of momentum [164], also known as the heavy ball method, which performs the following updates³:

$v_t = \gamma_t v_{t-1} + (1 - \gamma_t)\, g(\theta_t; x_t)$, with $v_0 = g(\theta_0; x_0)$,
$\theta_{t+1} = \theta_t - \alpha_t v_t$.

When $\gamma_t = \gamma$, this leads to

$\theta_{t+1} = \theta_t - \alpha_t \left(\gamma^t g(\theta_0; x_0) + (1 - \gamma) \sum_{i=1}^{t} \gamma^{t-i} g(\theta_i; x_i)\right)$.

Hence, the heavy ball method updates the parameters of the model using an average of past gradients, bearing similarity with SAG [120], albeit with exponential instead of uniform weights. Interestingly, while momentum is a popular method for training deep networks, its theoretical analysis in the stochastic setting is limited [213], except in the particular setting when the noise converges to 0 at the optimum [134]. Also surprising is that, despite the apparent similarity with stochastic variance reduction methods, current convergence rates are slower when using $\gamma > 0$ in the presence of noise [195], although this might be a limitation of the analysis.

³ This is slightly different from the standard formulation but equivalent for constant $\gamma_t$.
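The normalized heavy ball update above is short enough to state in code. A minimal sketch, where `grad_fn(theta, x)` is assumed to return $g(\theta; x)$ and the function name is illustrative:

def heavy_ball_step(theta, v, t, alpha, gamma, grad_fn, x):
    """One step of the (normalized) heavy ball method."""
    v = gamma * v + (1 - gamma) * grad_fn(theta, x)  # exponential average of past gradients
    return theta - alpha * v, v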
6.2.1 Momentum and variance

We propose here an analysis of how, on quadratics, using past gradients as done in momentum does not lead to a decrease in variance. If gradients are stochastic, then $\Delta_t = \theta_t - \theta^*$ is a random variable. Denoting by $\epsilon_i$ the noise at timestep $i$, i.e. $g(\theta_i; x_i) = g(\theta_i) + \epsilon_i$, and writing $\Delta_t - \mathbb{E}[\Delta_t] = \alpha \sum_{i=0}^{t} N_{i,t}\, \epsilon_i$, with $N_{i,t}$ the impact of the noise of the $i$-th datapoint on the $t$-th iterate, we may now analyze the total impact of each $\epsilon_i$ on the iterates. Figure 6.1 shows the impact of $\epsilon_i$ on $\Delta_t - \mathbb{E}[\Delta_t]$, as measured by $N_{i,t}^2$, for three datapoints ($i = 1$, $i = 25$, and $i = 50$) as a function of $t$, for stochastic gradient ($\gamma = 0$, left) and momentum ($\gamma = 0.9$, right). As we can see, when using momentum, the variance due to a given datapoint first increases, as the noise influences both the next iterate (through the parameter update) and the subsequent updates (through the velocity). Due to the weight $1 - \gamma$ when a point is first sampled, a larger value of $\gamma$ leads to a lower immediate impact of the noise of a given point on the iterates. However, a larger $\gamma$ also means that the noise of a given gradient is kept longer, leading to little or no decrease of the total variance (dashed blue curve). Even in the case of stochastic gradient, the noise at a given timestep carries over to subsequent timesteps, even if the old gradients are not used for the update, as the iterate itself depends on the noise.

At every timestep, the contributions to the noise of the 1st, the 25th, and the 50th points in Fig. 6.1 are unequal. If we assume that the $\epsilon_i$ are i.i.d., then the total variance would be minimal if the contribution from each point was equal. Further, one can notice that the impact of datapoint $i$ is only a function of $t - i$ and not of $t$. This guarantees that the total noise will not decrease over time. To address these two points, one can increase the momentum parameter over time. In doing so, the noise of new datapoints will have a decreasing impact on the total variance, as their gradient is multiplied by $1 - \gamma_t$. Figure 6.1c shows the impact $N_{i,t}^2$ of each noise $\epsilon_i$ for an increasing momentum $\gamma_t = 1 - \frac{1}{t}$. The peak of noise for $i = 25$ is indeed lower than that of $i = 1$. However, the variance still does not go to 0. This is because, as the momentum parameter increases, the update is an average of many gradients, including stale ones. Since these gradients were computed at iterates already influenced by the noise of previous datapoints, that past noise is amplified, as testified by the higher peak at $i = 1$ for the increasing momentum. Ultimately, increasing momentum does not lead to a convergent algorithm in the presence of noise when using a constant stepsize.

[Figure 6.1: per-datapoint noise impact over time for (a) stochastic gradient, (b) momentum with $\gamma = 0.9$, (c) momentum with $\gamma_t = 1 - 1/t$, and (d) momentum with $\gamma_t = 1 - 1/t$ with IGT.]

Figure 6.1: Variance over time and total variance for the stochastic gradient. Variance induced over time by the noise from three different datapoints ($i = 1$, $i = 25$, and $i = 50$), as well as the total variance, for SG ($\gamma = 0$, top left), momentum with fixed $\gamma = 0.9$ (top right), and momentum with increasing $\gamma_t = 1 - \frac{1}{t}$ without (bottom left) and with (bottom right) transport. The impact of the noise of each gradient $\epsilon_i$ increases for a few iterations, then decreases. Although a larger $\gamma$ reduces the maximum impact of a given datapoint, the total variance does not decrease. With transport, the noises are now equal and the total variance decreases. The y-axis is on a log scale.

6.2.2 SAG and Hessian modelling

The impact of the staleness of the gradients on convergence is not limited to momentum. In SAG, for instance, the excess error after $k$ updates is proportional to $\left(1 - \min\left\{\frac{1}{16 b_k}, \frac{1}{8N}\right\}\right)^k$, compared to the excess error of the full gradient method, which is $\left(1 - \frac{1}{\kappa}\right)^k$, where $\kappa$ is the condition number of the problem⁴. The difference between the two rates is larger when the minimum in the SAG rate is the second term. This happens either when $b_k$ is small, i.e., the problem is well conditioned and a lot of progress is made at each step, or when $N$ is large, i.e., there are many points in the training set. Both cases imply that a large distance has been travelled between two draws of the same datapoint. Recent works showed that correcting for that staleness by modelling the Hessian [229, 82] leads to improved convergence.

⁴ The $b_k$ in the convergence rate of SAG is generally larger than the $\kappa$ in the full gradient algorithm.

As momentum uses stale gradients, the velocity is an average of current and past gradients and can thus be seen as an estimate of the true gradient at a point which is not the current one, but rather a convex combination of past iterates. As past iterates depend on the noise of previous gradients, this bias in the gradients amplifies the noise and leads to a non-converging algorithm. We shall thus "transport" the old stochastic gradients $g(\theta_i; x_i)$ to make them closer to their corresponding value at the current iterate, $g(\theta_t; x_i)$.
Past works did so using the Hessian or an explicit approximation thereof, which can be expensive and difficult to compute and maintain. We will resort to implicit transport, a new method that aims at compensating for the staleness of past gradients without making explicit use of the Hessian.

6.3 Converging optimization through implicit gradient transport

Before showing how to combine the advantages of both increasing momentum and gradient transport, we demonstrate how to transport gradients implicitly. This transport is only exact under a strong assumption that will not hold in practice. However, this result will serve to convey the intuition behind implicit gradient transport. We will show in Section 6.4 how to mitigate the effect of the unsatisfied assumption.

6.3.1 Implicit gradient transport

Let us assume that we received samples $x_0, \ldots, x_t$ in an online fashion. We wish to approach the full gradient $g_t(\theta_t) = \frac{1}{t+1}\sum_{i=0}^{t} g(\theta_t; x_i)$ as accurately as possible. We also assume that (a) we have a noisy estimate $\hat{g}_{t-1}(\theta_{t-1})$ of $g_{t-1}(\theta_{t-1})$; and (b) we can compute the gradient $g(\theta; x_t)$ at any location $\theta$. We shall seek a $\theta$ such that

$\frac{t}{t+1}\, \hat{g}_{t-1}(\theta_{t-1}) + \frac{1}{t+1}\, g(\theta; x_t) \approx g_t(\theta_t)$.

To this end, we shall make the following assumption:

Assumption 6.3.1. All individual functions $f(\cdot; x)$ are quadratics with the same Hessian $H$.

This is the same assumption as [71, Section 4.1]. Although it is unlikely to hold in practice, we shall see that our method still performs well when that assumption is violated. Under Assumption 6.3.1, we then have

$g_t(\theta_t) = \frac{t}{t+1}\, g_{t-1}(\theta_t) + \frac{1}{t+1}\, g(\theta_t; x_t) \approx \frac{t}{t+1}\, \hat{g}_{t-1}(\theta_{t-1}) + \frac{1}{t+1}\, g\big(\theta_t + t(\theta_t - \theta_{t-1}); x_t\big)$.

Thus, we can transport our current estimate of the gradient by computing the gradient on the new point at the shifted location $\theta = \theta_t + t(\theta_t - \theta_{t-1})$. This extrapolation step is reminiscent of Nesterov's acceleration, with the difference that the factor in front of $\theta_t - \theta_{t-1}$, namely $t$, is not bounded.

6.3.2 Combining increasing momentum and implicit gradient transport

We now describe our main algorithm, Implicit Gradient Transport (IGT). IGT uses an increasing momentum $\gamma_t = \frac{t}{t+1}$. At each step, when updating the velocity, it computes the gradient of the new point at an extrapolated location, so that the velocity $v_t$ is a good estimate of the true gradient $g(\theta_t)$. We can rewrite the updates to eliminate the velocity $v_t$, leading to the update

$\theta_{t+1} = \frac{2t+1}{t+1}\, \theta_t - \frac{t}{t+1}\, \theta_{t-1} - \frac{\alpha}{t+1}\, g\big(\theta_t + t(\theta_t - \theta_{t-1}); x_t\big)$.   (IGT)

We see in Fig. 6.1d that IGT allows a reduction in the total variance, thus leading to convergence with a constant stepsize. This is captured by the following proposition:

Proposition 6.3.1. If $f$ is a quadratic function with positive definite Hessian $H$ with largest eigenvalue $L$ and condition number $\kappa$, and if the stochastic gradients satisfy $g(\theta; x) = g(\theta) + \epsilon$, with $\epsilon$ a random i.i.d. noise with covariance bounded by $BI$, then Eq. IGT with stepsize $\alpha = 1/L$ leads to iterates $\theta_t$ satisfying

$\mathbb{E}[\|\theta_t - \theta^*\|^2] \le \left(1 - \frac{1}{\kappa}\right)^{2t} \|\theta_0 - \theta^*\|^2 + \frac{d\, \alpha^2 B\, \nu^2}{t}$,

with $\nu = (2 + 2\log\kappa)\,\kappa$, for every $t > 2\kappa$.

The proof of Prop. 6.3.1 is provided in Section B. Despite this theoretical result, two limitations remain. First, Prop. 6.3.1 shows that IGT does not improve the dependency on the conditioning of the problem. Second, the assumption of equal Hessians is unlikely to hold in practice, leading to an underestimation of the bias. We address the conditioning issue in the next section, and the assumption on the Hessians in Section 6.4.
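To make the transport step concrete, here is a minimal sketch of one IGT update with $\gamma_t = t/(t+1)$. The function name is illustrative, and `grad_fn(theta, x)` is assumed to return $g(\theta; x)$ (e.g., on NumPy arrays):

def igt_step(theta, theta_prev, v, t, alpha, grad_fn, x):
    """One IGT step; note gamma / (1 - gamma) = t, the unbounded extrapolation factor."""
    gamma = t / (t + 1.0)
    transported = theta + t * (theta - theta_prev)          # shifted evaluation point
    v = gamma * v + (1 - gamma) * grad_fn(transported, x)   # estimate of g(theta_t)
    return theta - alpha * v, theta, v                      # (theta_{t+1}, theta_t, v_t)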
6.3.3 IGT as a plug-in gradient estimator

We demonstrated that the IGT estimator has lower variance than the stochastic gradient estimator for quadratic objectives. IGT can also be used as a drop-in replacement for the stochastic gradient in an existing, popular first-order method: the heavy ball (HB). This is captured by the following two propositions:

Proposition 6.3.2 (Non-stochastic). In the non-stochastic case, where $B = 0$, the variance is equal to 0 and Heavyball-IGT achieves the accelerated linear rate $O\left(\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^t\right)$ using the known, optimal heavy ball tuning, $\mu = \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^2$, $\alpha = (1 + \sqrt{\mu})^2 / L$.

Proposition 6.3.3 (Online, stochastic). When $B > 0$, there exist constant hyperparameters $\alpha > 0$, $\mu > 0$ such that $\|\mathbb{E}[\theta_t - \theta^*]\|^2$ converges to zero linearly, and the variance is $\tilde{O}(1/t)$.

The pseudo-code can be found in Algorithm 4.

Algorithm 4 Heavyball-IGT
Input: Stepsize $\alpha$, Momentum $\mu$, Initial parameters $\theta_0$
  $v_0 \leftarrow g(\theta_0; x_0)$;  $w_0 \leftarrow -\alpha v_0$;  $\theta_1 \leftarrow \theta_0 + w_0$
  for $t = 1, \ldots, T - 1$ do
    $\gamma_t \leftarrow \frac{t}{t+1}$
    $v_t \leftarrow \gamma_t v_{t-1} + (1 - \gamma_t)\, g\left(\theta_t + \frac{\gamma_t}{1 - \gamma_t}(\theta_t - \theta_{t-1}); x_t\right)$
    $w_t \leftarrow \mu w_{t-1} - \alpha v_t$
    $\theta_{t+1} \leftarrow \theta_t + w_t$
  end for
  return $\theta_T$

6.4 IGT and Anytime Tail Averaging

So far, IGT weighs all gradients equally. This is because, with equal Hessians, one can perfectly transport these gradients irrespective of the distance travelled since they were computed. In practice, the individual Hessians are not equal and might change over time. In that setting, the transport induces an error which grows with the distance travelled. We wish to average a linearly increasing number of gradients, to maintain the $O(1/t)$ rate on the variance, while forgetting the oldest gradients to decrease the bias. To this end, we shall use anytime tail averaging [119], named in reference to the tail averaging technique used in optimization [99].

Tail averaging is an online averaging technique where only the last points, usually a constant fraction $c$ of the total number of points seen, are kept. Maintaining the exact average at every timestep is memory inefficient, and anytime tail averaging performs an approximate averaging using

$\gamma_t = \frac{c(t-1)}{1 + c(t-1)}\left(1 - \frac{1}{c}\sqrt{\frac{1-c}{t(t-1)}}\right)$.

We refer the reader to [119] for additional details.
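Heavyball-IGT (Algorithm 4) translates directly into a short training loop. Below is a minimal sketch under the same assumptions as before (`grad_fn` returns $g(\theta; x)$ and the names are illustrative); swapping the plain increasing momentum for the anytime-tail-averaging $\gamma_t$ above yields the ITA variants used in Section 6.6:

def heavyball_igt(theta0, alpha, mu, grad_fn, samples):
    """Minimal Heavyball-IGT loop (Algorithm 4)."""
    samples = iter(samples)
    v = grad_fn(theta0, next(samples))          # v_0
    w = -alpha * v                              # w_0
    theta_prev, theta = theta0, theta0 + w      # theta_1
    for t, x in enumerate(samples, start=1):
        gamma = t / (t + 1.0)                   # increasing momentum
        transported = theta + (gamma / (1 - gamma)) * (theta - theta_prev)
        v = gamma * v + (1 - gamma) * grad_fn(transported, x)
        w = mu * w - alpha * v
        theta_prev, theta = theta, theta + w
    return theta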
6.5 Impact of IGT on bias and variance in the ideal case

To understand the behaviour of IGT when Assumption 6.3.1 is verified, we minimize a strongly convex quadratic function with Hessian $Q \in \mathbb{R}^{100 \times 100}$ with condition number 1000, where we have access to the gradient corrupted by noise $\epsilon_t \sim \mathcal{N}(0,\ 0.3\, I_{100})$. In that scenario, where all Hessians are equal and implicit gradient transport is exact, Fig. 6.2a confirms the $O(1/t)$ rate of IGT with a constant stepsize, while SGD and HB only converge to a ball around the optimum.

To further understand the impact of IGT, we study the quality of the gradient estimate. Standard stochastic methods control the variance of the parameter update by scaling it with a decreasing stepsize, which slows the optimization down. With IGT, we hope to have a low variance while maintaining a norm of the update comparable to that obtained with gradient descent. To validate the quality of our estimator, we optimized a quadratic function using IGT, collecting iterates $\theta_t$. For each iterate, we computed the squared error between the true gradient and either the stochastic or the IGT gradient. In this case, where both estimators are unbiased, this is the trace of the noise covariance of the estimators. The results in Figure 6.2b show that, as expected, this noise decreases linearly for IGT and is constant for SGD.

We also analyse the direction and magnitude of the IGT gradient on the same quadratic setup. Figure 6.2c displays the cosine similarity between the true gradient and either the stochastic or the IGT gradient, as a function of the distance to the optimum. We see that, for the same distance, the IGT gradient is much more aligned with the true gradient than the stochastic gradient is, confirming that variance reduction happens without the need for scaling the estimate.

[Figure 6.2: (a) convergence curves for SGD, HB, IGT, and HB-IGT; (b) gradient estimation error and gradient magnitude for SGD and IGT; (c) cosine similarity between the full gradient and the SGD/IGT estimates.]

Figure 6.2: Analysis of IGT on quadratic loss functions. (a) Comparison of convergence curves for multiple algorithms. As expected, the IGT family of algorithms converges to the solution, while stochastic gradient algorithms cannot. (b) The blue and orange curves show the norm of the noise component in the SGD and IGT gradient estimates, respectively. The noise component of SGD remains constant, while it decreases at a rate $1/\sqrt{t}$ for IGT. The green curve shows the norm of the IGT gradient estimate. (c) Cosine similarity between the full gradient and the SGD/IGT estimates.

6.6 Experiments

While Section 6.5 confirms the performance of IGT in the ideal case, the assumption of identical Hessians almost never holds in practice. In this section, we present results on more realistic and larger-scale machine learning settings.

6.6.1 Supervised learning

CIFAR10 image classification. We first consider the task of training a ResNet-56 model [88] on the CIFAR-10 image classification dataset [109]. We use the TF official models code and setup [12], varying only the optimizer: SGD, HB, Adam, and our algorithm with anytime tail averaging, both on its own (ITA) and combined with Heavy Ball (HB-ITA). We tuned the stepsize for each algorithm by running experiments over a logarithmic grid. To factor in ease of tuning [239], we used Adam's default parameter values and a value of 0.9 for HB's momentum parameter. We used a linearly decreasing stepsize, as it was shown to be simple and to perform well [204]. For each optimizer, we selected the hyperparameter combination that is fastest to reach a consistently attainable target train loss [204].

[Figure 6.3: train loss (left), train accuracy (center), and test accuracy (right) curves on CIFAR10 for SGD, HB, Adam, ITA, and HB-ITA.]

Figure 6.3: ResNet-56 on CIFAR10. Left: train loss. Center: train accuracy. Right: test accuracy. Selecting the hyperparameter combination reaching the lowest training loss yields qualitatively identical curves.

Figure 6.3 presents the results, showing that IGT with the exponential anytime tail average performs favourably, both on its own and combined with Heavy Ball: the learning curves show faster improvement and are much less noisy.

ImageNet image classification. We also consider the task of training a ResNet-50 model [88] on the larger ImageNet dataset [182]. The setup is similar to the one used for CIFAR10, with the difference that we trained using larger minibatches (1024 instead of 128).
In Figure 6.4, one can see that IGT is as fast as Adam on the train loss, faster on the train accuracy, and reaches the same final performance, which Adam does not. We do not see the noise reduction we observed with CIFAR10, which could be explained by the larger batch size.

IMDb sentiment analysis. We train a bi-directional LSTM on the IMDb Large Movie Review Dataset [136] for 200 epochs. We observe that while the training convergence is comparable to HB, HB-ITA performs better in terms of validation and test accuracy. In addition to the baseline and IGT methods, we also train a variant of Adam using the ITA gradients, dubbed Adam-ITA, which performs similarly to Adam.

[Figure 6.4: train loss (left), train accuracy (center), and test accuracy (right) curves for ResNet-50 on ImageNet.]

Figure 6.4: ResNet-50 on ImageNet. Left: train loss. Center: train accuracy. Right: test accuracy.

[Figure 6.5: evaluation cost on the LQR system (left) and validation accuracy for MAML on Mini-ImageNet (right).]

Figure 6.5: Validation curves for different large-scale machine learning settings. Shading indicates one standard deviation computed over three random seeds. Left: reinforcement learning via policy gradient on an LQR system. Right: meta-learning using MAML on Mini-ImageNet.

6.6.2 Reinforcement learning

Linear-quadratic regulator. We cast the classical linear-quadratic regulator (LQR) [110] as a policy learning problem to be optimized via gradient descent. Note that despite their simple linear dynamics and quadratic cost functional, LQR systems are notoriously difficult to optimize due to the non-convexity of the loss landscape [61].

The left chart in Figure 6.5 displays the evaluation cost computed along training and averaged over three random seeds. The first method (Optimal) indicates the cost attained when solving the algebraic Riccati equation of the LQR – this is the optimal solution of the problem. SGD minimizes the cost using the REINFORCE [238] gradient estimator, averaged over 600 trajectories. ITA is similar to SGD but uses the ITA gradient computed from the REINFORCE estimates. Finally, GD uses the analytical gradient by taking the expectation over the policy.

We make two observations from the above chart. First, ITA initially suffers from the stochastic gradient estimate but rapidly matches the performance of GD. Notably, both of them converge to a solution significantly better than SGD's, demonstrating the effectiveness of the variance reduction mechanism. Second, the convergence curve is smoother for ITA than for SGD, indicating that the ITA iterates are more likely to induce similar policies from one iteration to the next. This property is particularly desirable in reinforcement learning, as demonstrated by the popularity of trust-region methods in large-scale applications [198, 154].

6.6.3 Meta-learning

Model-agnostic meta-learning. We now investigate the use of IGT in the model-agnostic meta-learning (MAML) setting [69]. We replicate the 5-way classification setup with 5 adaptation steps on tasks from the Mini-ImageNet dataset [173]. This setting is interesting because of the many sources contributing to noise in the gradient estimates: the stochastic meta-gradient depends on the product of 5 stochastic Hessians computed over only 10 data samples, and is averaged over only 4 tasks. We substitute the meta-optimizer with each method, select the stepsize that maximizes the validation accuracy after 10K iterations, and use it to train the model for 100K iterations.
The right graph of Figure 6.5 compares validation accuracies for three random seeds. We observe that methods from the IGT family significantly outperform their stochastic meta-gradient counterparts, both in terms of convergence rate and final accuracy. Those results are also reflected in the final test accuracies, where Adam-ITA (65.16%) performs best, followed by HB-ITA (64.57%), then Adam (63.70%), and finally HB (63.08%).

6.7 Conclusion and open questions

We proposed a simple optimizer which, by reusing past gradients and transporting them, offers excellent performance on a variety of problems. While it introduces an additional hyperparameter, the ratio of examples to be kept in the tail averaging, it remains competitive across a wide range of such values. Further, because it provides a higher quality gradient estimate that can be plugged into any existing optimizer, we expect it to be applicable to a wide range of problems. As IGT is similar to momentum, this further raises the question of the links between variance reduction and curvature adaptation. Whether there is a way to combine the two without using momentum on top of IGT remains to be seen.

Chapter 7
Concluding Remarks

This thesis studies how meta-learning can be used to solve niche tasks, which have limited available data. We show how these algorithms are able to take advantage of a large set of pretraining tasks to quickly solve new, unseen tasks at test time. We also delineate several failure modes of these algorithms (some simple, some not so simple), as well as practical guidelines and algorithms to help practitioners employ these methods in their application domain of interest.

The first part of this thesis starts, in Chapter 2, with a characterization of the solutions found by MAML [67], using simple settings that illustrate its fundamental failure modes. Those failure modes inspire our subsequent analysis and methods. First, they directly hint at a fundamental working principle of MAML, namely its stringent requirement on depth to adapt quickly. Further analysis reveals why depth is needed: the deeper parameters are used to learn better update rules for the parameters that come before them. Second, they motivate learning parameters specialized for optimization. We instantiate such parameters through meta-optimizers and demonstrate their efficacy with model architectures where MAML struggles. Following our characterization, we provide guidelines for when to use these meta-learning methods versus transfer learning ones. Our results suggest that transfer learning is good enough when train and test tasks are similar; otherwise, meta-learning is worth the extra computational cost.

The second part of this thesis shows how the insights gleaned from our characterization apply to settings where meta-learning is impractical. Specifically, we tackle the challenge of quickly solving new visual reinforcement learning tasks given only a pretrained visual feature extractor. We show how freezing task-agnostic parameters can help stabilize finetuning in visual RL, where optimization is known to be unstable, especially for deep feature extractors. Combined with a self-supervised policy consistency objective, this insight yields 2x-5x faster adaptation on tasks ranging from simple video games to robotic control and embodied AI. We conclude the contributions of this thesis with a study of how to sample and optimize with many tasks during the pretraining stage.
Our analysis shows that sampling uniformly with respect to task difficulty yields the best generalization, which can be intuitively understood as putting a uniform prior over the difficulty of the test tasks. Further, we introduce an estimator designed to reuse information across tasks so as to reduce gradient variance. When plugged into modern optimization methods, this estimator converges faster and yields higher asymptotic accuracy on unseen test tasks.

7.1 Avenues for future work

Looking forward, we foresee several directions for future work.

A theory for meta-optimization. While our work in Chapter 2 successfully learns optimization parameters, many theoretical and practical challenges remain. First, meta-optimizers incur a large computational and memory burden during pretraining. Despite its simplifying assumptions, our KFO would require inordinate amounts of compute to scale to modern vision or language models. For meta-optimizers to become practically viable, we will need to rethink how we learn them (e.g., by completely decoupling them from model training), how we parameterize them (e.g., by including some form of memory), or our choice of simplifying assumptions (e.g., by only adapting a subset of the model parameters). Second, we have seen that meta-optimizers can unexpectedly emerge within our models (as in the deep networks trained with MAML); however, it is unclear under which circumstances they do. In other words, can we predict which weights will learn to act as optimizers? This question generalizes beyond the few-shot learning setting: complex architectures (e.g., Transformers [225]) could learn optimization algorithms within their weights, a phenomenon known as mesa-optimization. Characterizing when such weights emerge, and how they behave, remains an open question. Last but not least, meta-optimizers suffer from the same statistical pitfalls as data models: they overfit, they are biased, and they can be overconfident. They also suffer from pitfalls of their own; for example, prior work has demonstrated their bias towards predicting small and safe updates, at the cost of slowing down optimization [245]. We hope our work can motivate future studies that uncover these challenges and how best to tackle them.

Defining and measuring task similarity. Our work proposes two different approaches to measuring task similarity. In Chapter 5, we use task difficulty as a proxy for task similarity. For classification tasks, Chapter 3 measures similarity in terms of the average embedding representation associated with each class. Neither is satisfying: the former inconveniently requires solving the task to measure its difficulty, while the latter fails to account for the choice of learning algorithm used to solve the new task. We view measuring task similarity as one of the next steps required to build a basis for practical guidelines on the choice of data and algorithms for pretraining and finetuning.

Bridging the gap between single and multi-task pretraining. Our results in Table 3.1 contain a surprising finding: apparently, the best way to pretrain on multiple image classification tasks is to collapse all tasks into a single, massively multiclass task. Prior work reports similar results [53, 231, 219]. This state of affairs is perplexing because both approaches pretrain on exactly the same data, which suggests that multi-task pretraining is algorithmically less effective than single-task pretraining. How can we bridge this gap?
This question is especially relevant in settings where pooling data from multiple tasks is infeasible by design, such as in federated learning.

Niche tasks grounded in the real world. We return to our motivation of raising the tide (by designing better adaptation methods) to lift all niche tasks limited by data quantity. We note, however, that this thesis did not include a single experiment from such a niche domain: most were based on image classification or reinforcement learning, where data is plentiful. Taking a page from Brooks [29], we wish to ground and validate our insights in real-world niche tasks. Those will surely challenge and refine some of the implicit assumptions in our artificial testbeds. For example, few of the methods discussed in this thesis elegantly handle few-shot sequence modelling tasks, which are ubiquitous in the physical, climate, chemical, and biological sciences. We hope the first steps we took in this thesis can inspire such endeavours.

Bibliography

[1] David Abel, David Hershkowitz, and Michael Littman. Near optimal behavior via approximate state abstraction. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2915–2923, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/abel16.html. Cited on p. 52.

[2] Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2Vec: Task embedding for meta-learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 6430–6439, 2019. URL https://www.ics.uci.edu/~fowlkes/papers/achille_task2vec_iccv2019.pdf. Cited on p. 43.

[3] Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=RzYrn625bu8. Cited on p. 49.

[4] Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. Low data drug discovery with one-shot learning. ACS Cent Sci, 3(4):283–293, April 2017. URL http://dx.doi.org/10.1021/acscentsci.6b00367. Cited on p. 31.

[5] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, Jie Chen, Jingdong Chen, Zhijie Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Ke Ding, Niandong Du, Erich Elsen, Jesse Engel, Weiwei Fang, Linxi Fan, Christopher Fougner, Liang Gao, Caixia Gong, Awni Hannun, Tony Han, Lappi Vaino Johannes, Bing Jiang, Cai Ju, Billy Jun, Patrick LeGresley, Libby Lin, Junjie Liu, Yang Liu, Weigao Li, Xiangang Li, Dongpeng Ma, Sharan Narang, Andrew Ng, Sherjil Ozair, Yiping Peng, Ryan Prenger, Sheng Qian, Zongfeng Quan, Jonathan Raiman, Vinay Rao, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Kavya Srinet, Anuroop Sriram, Haiyuan Tang, Liliang Tang, Chong Wang, Jidong Wang, Kaifu Wang, Yi Wang, Zhijian Wang, Zhiqian Wang, Shuang Wu, Likai Wei, Bo Xiao, Wen Xie, Yan Xie, Dani Yogatama, Bin Yuan, Jun Zhan, and Zhenyao Zhu. Deep Speech 2: End-to-end speech recognition in English and Mandarin.
In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML'16, pages 173–182. JMLR.org, 2016. Cited on p. 49.

[6] Antreas Antoniou, Harrison Edwards, and Amos Storkey. How to train your MAML. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HJGven05Y7. Cited on p. 32.

[7] Sébastien M. R. Arnold and Fei Sha. Embedding adaptation is still needed for few-shot learning. ArXiv, abs/2104.07255, 2021. Cited on p. 7.

[8] Sébastien M. R. Arnold and Fei Sha. Policy-induced self-supervision improves representation finetuning in visual RL. In Submission, 2022. Cited on p. 7.

[9] Sébastien M. R. Arnold, Pierre-Antoine Manzagol, Reza Babanezhad Harikandeh, Ioannis Mitliagkas, and Nicolas Le Roux. Reducing the variance in online optimization by transporting past gradients. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 5391–5402. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/1dba5eed8838571e1c80af145184e515-Paper.pdf. Cited on p. 8.

[10] Sébastien M. R. Arnold, Guneet S. Dhillon, Avinash Ravichandran, and Stefano Soatto. Uniform sampling over episode difficulty. In Advances in Neural Information Processing Systems, volume 34, 2021. Cited on p. 7.

[11] Sébastien M. R. Arnold, Shariq Iqbal, and Fei Sha. When MAML can adapt fast and how to assist when it cannot. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 244–252. PMLR, 13–15 Apr 2021. URL http://proceedings.mlr.press/v130/arnold21a.html. Cited on p. 7.

[12] The TensorFlow Authors. TensorFlow official ResNet model. 2018. URL https://github.com/tensorflow/models/tree/master/official/resnet. Cited on p. 109.

[13] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. URL http://arxiv.org/pdf/1607.06450. Cited on p. 56.

[14] Reza Babanezhad, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, 2015. Cited on p. 100.

[15] Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013. Cited on p. 100.

[16] Maria-Florina Balcan, Mikhail Khodak, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 424–433. PMLR, 09–15 Jun 2019. URL http://proceedings.mlr.press/v97/balcan19a.html. Cited on p. 14.

[17] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=xm6YD62D1Ub. Cited on p. 66.

[18] J Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, March 2000. Cited on p. 10.

[19] Harkirat Behl, Atılım Güneş Baydin, and Philip H.S. Torr. Alpha MAML: Adaptive model-agnostic meta-learning.
In 6th ICML Workshop on Automated Machine Learning, Thirty- sixth International Conference on Machine Learning (ICML 2019), Long Beach, CA, US, 2019. Cited on p. 27. [20] Harkirat Singh Behl, Atılım Güne¸ s Baydin, and Philip H S Torr. Alpha MAML: Adap- tive Model-Agnostic Meta-Learning. May 2019. URL http://arxiv.org/abs/1905. 07435. Cited on p. 32. [21] Y Bengio, S Bengio, and J Cloutier. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume ii, pages 969 vol.2–, July 1991. Cited on p. 9, 31. [22] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learn- ing. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY , USA, June 2009. Association for Computing Ma- chinery. URLhttps://doi.org/10.1145/1553374.1553380. Cited on p. 85, 87. [23] Luca Bertinetto, João F Henriques, Jack Valmadre, P Torr, and A Vedaldi. Learning feed- forward one-shot learners. NIPS, 2016. URL https://www.semanticscholar.org/ paper/4423357dd21cc59662c6fabaf9839b15ef0fb8a8. Cited on p. 32. [24] Luca Bertinetto, Joao F Henriques, Philip Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. September 2018. URL https://openreview.net/ pdf?id=HyxnZh0ct7. Cited on p. 21, 29, 33, 86. [25] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-Dynamic programming. 27(6), January 1996. URL https://www.researchgate.net/publication/216722122_ Neuro-Dynamic_Programming. Cited on p. 77. [26] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. June 2016. URLhttp://arxiv.org/abs/1606.04838. Cited on p. 77. [27] Matthew Botvinick, Sam Ritter, Jane X Wang, Zeb Kurth-Nelson, Charles Blundell, and Demis Hassabis. Reinforcement learning, fast and slow. Trends in cognitive sciences, 23 (5):408–422, 2019. Cited on p. 77. [28] A Brock, T Lim, J M Ritchie, and N Weston. SMASH: One-Shot model architecture search through HyperNetworks. ICLR, 2018. URL https://www.semanticscholar.org/ paper/e56b10f7cd4bf037beac84da5925dc4544fab974. Cited on p. 32. 120 [29] Rodney A Brooks. Intelligence without representation. Artif. Intell., 47(1-3):139–159, January 1991. URL https://homeostasis.scs.carleton.ca/~soma/adapsec/ readings/brooks1991-representation.pdf. Cited on p. 117. [30] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc- Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few- shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Cur- ran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/ file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. Cited on p. 49. [31] Jon Bruner. Tweets loud and quiet. http://radar.oreilly.com/2013/12/ tweets-loud-and-quiet.html. URL http://radar.oreilly.com/2013/12/ tweets-loud-and-quiet.html. Accessed: 2022-11-25. Cited on p. 2. [32] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015. Cited on p. 99. 
[33] Aihua Cai, Wenxin Hu, and Jun Zheng. Few-Shot learning for medical image classifica- tion. In Artificial Neural Networks and Machine Learning – ICANN 2020, pages 441– 452. Springer International Publishing, 2020. URL http://dx.doi.org/10.1007/ 978-3-030-61609-0_35. Cited on p. 31. [34] Rich Caruana. Multitask learning. Mach. Learn., 28(1):41–75, July 1997. Cited on p. 10. [35] Pablo Samuel Castro and Doina Precup. Using bisimulation for policy transfer in MDPs. In Twenty-Fourth AAAI Conference on Artificial Intelligence, July 2010. URLhttps://www. aaai.org/ocs/index.php/AAAI/AAAI10/paper/viewPaper/1907. Cited on p. 52. [36] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niebner, M. Savva, S. Song, A. Zeng, and Y . Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV), pages 667–676, Los Alamitos, CA, USA, oct 2017. IEEE Computer Society. doi: 10.1109/3DV .2017.00081. URL https://doi. ieeecomputersociety.org/10.1109/3DV.2017.00081. Cited on p. 57. [37] Wei-Lun Chao, Han-Jia Ye, De-Chuan Zhan, Mark Campbell, and Kilian Q Weinberger. Revisiting meta-learning as supervised learning. arXiv preprint arXiv:2002.00573, 2020. Cited on p. 79. [38] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple frame- work for contrastive learning of visual representations. In Hal Daumé Iii and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR, 2020. URL https://proceedings.mlr.press/v119/chen20j.html. Cited on p. 66. 121 [39] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. November 2020. URLhttp://arxiv.org/abs/2011.10566. Cited on p. xv, 66, 74, 76. [40] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self- supervised vision transformers. In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 9640–9649, 2021. URL https://github.com/ facebookresearch/moco-v3. Cited on p. 49. [41] Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. A new Meta- Baseline for Few-Shot learning. March 2020. URL http://arxiv.org/abs/2003. 04390. Cited on p. 33. [42] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Hen- ryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M Dai, Thanu- malayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. PaLM: Scaling language modeling with pathways. April 2022. URLhttp://arxiv.org/abs/2204.02311. Cited on p. 49. [43] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. 
In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, pages 3606–3613, 2014. Cited on p. 94. [44] Ignasi Clavera, Anusha Nagabandi, Simin Liu, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In International Conference on Learning Representations, 2019. URLhttps://openreview.net/forum?id=HyztsoC5Y7. Cited on p. 32. [45] Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural gen- eration to benchmark reinforcement learning. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Pro- ceedings of Machine Learning Research, pages 2048–2056. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/cobbe20a.html. Cited on p. 52. [46] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926. IEEE, 2017. Cited on p. 40. 122 [47] Dominik Csiba and Peter Richtárik. Importance sampling for minibatches. J. Mach. Learn. Res., 19(27):1–21, 2018. URL http://jmlr.org/papers/v19/16-241.html. Cited on p. 77. [48] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014. Cited on p. 100. [49] Giulia Denevi, Carlo Ciliberto, Riccardo Grazzi, and Massimiliano Pontil. Learning-to- learn stochastic gradient descent with biased regularization. In International Conference on Machine Learning, pages 1566–1575. PMLR, 2019. Cited on p. 14. [50] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, June 2009. URL http://dx.doi.org/10.1109/ CVPR.2009.5206848. Cited on p. 50, 52, 56, 57. [51] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009. Cited on p. 87. [52] Guneet Singh Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. In International Conference on Learning Repre- sentations, 2019. Cited on p. 81. [53] Guneet Singh Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. In International Conference on Learning Repre- sentations, 2020. URL https://openreview.net/forum?id=rylXBkrYDS. Cited on p. 33, 43, 116. [54] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and markov chains. arXiv preprint arXiv:1707.06386, 2017. Cited on p. 100. [55] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017. Cited on p. 100. [56] Frances Ding, Jean-Stanislas Denain, and Jacob Steinhardt. Grounding representation sim- ilarity through statistical testing. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wort- man Vaughan, editors, Advances in Neural Information Processing Systems, 2021. 
URL https://openreview.net/forum?id=_kwj6V53ZqB. Cited on p. 62. [57] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Sergey Levine, Vincent Vanhoucke, and Ken Goldberg, editors, Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, pages 1–16. PMLR, 13–15 Nov 2017. URL https://proceedings.mlr.press/v78/dosovitskiy17a.html. Cited on p. 51. 123 [58] Arnaud Doucet, Nando de Freitas, and Neil Gordon, editors. Sequential Monte Carlo Meth- ods in Practice. Springer, New York, NY , 2001. URL https://link.springer.com/ book/10.1007/978-1-4757-3437-9. Cited on p. 86. [59] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, pages 1082–1092. PMLR, 2020. Cited on p. 26, 27. [60] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Generalization of model-agnostic meta-learning algorithms: Recurring and unseen tasks. arXiv preprint arXiv:2102.03832, 2021. Cited on p. 93. [61] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learn- ing, volume 80 of Proceedings of Machine Learning Research, pages 1467–1476, Stock- holmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URLhttp://proceedings. mlr.press/v80/fazel18a.html. Cited on p. 112. [62] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell., 28(4):594–611, April 2006. URL http://dx.doi. org/10.1109/TPAMI.2006.79. Cited on p. 29, 31, 85. [63] Christopher Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Measuring and harnessing transference in Multi-Task learning. October 2020. URL http://arxiv.org/abs/2010.15413. Cited on p. 43. [64] Chelsea Finn. Learning to Learn with Gradients. PhD thesis, UC Berkeley, July 2018. Cited on p. 32. [65] Chelsea Finn. Learning to Learn with Gradients. PhD thesis, UC Berkeley, July 2018. Cited on p. 4. [66] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In International Confer- ence on Learning Representations, 2018. URL https://openreview.net/forum?id= HyjC5yWCW. Cited on p. 26. [67] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia, 2017. PMLR. Cited on p. 3, 9, 21, 114. [68] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URLhttp://proceedings.mlr.press/v70/finn17a.html. Cited on p. 30, 32. 124 [69] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Ma- chine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017. Cited on p. 
78, 80, 86, 88, 112. [70] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta- learning. volume 97 of Proceedings of Machine Learning Research, pages 1920–1930, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings. mlr.press/v97/finn19a.html. Cited on p. 32. [71] Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. In Conference on Learning Theory, pages 658–695, 2015. Cited on p. 100, 105. [72] Sebastian Flennerhag, Andrei A Rusu, Razvan Pascanu, Hujun Yin, and Raia Hadsell. Meta- Learning with warped gradient descent. August 2019. URL http://arxiv.org/abs/ 1909.00025. Cited on p. 32, 86. [73] Sebastian Flennerhag, Andrei A. Rusu, Razvan Pascanu, Francesco Visin, Hujun Yin, and Raia Hadsell. Meta-learning with warped gradient descent. In International Confer- ence on Learning Representations, 2020. URL https://openreview.net/forum?id= rkeiQlBFPB. Cited on p. 9, 10, 23, 24, 27. [74] Jakob Foerster, Gregory Farquhar, Maruan Al-Shedivat, Tim Rocktäschel, Eric Xing, and Shimon Whiteson. Dice: The infinitely differentiable monte carlo estimator. In International Conference on Machine Learning, pages 1524–1533, 2018. Cited on p. 27. [75] Michael P Friedlander and Mark Schmidt. Hybrid Deterministic-Stochastic methods for data fitting. SIAM J. Sci. Comput., 34(3):A1380–A1405, January 2012. URL https: //doi.org/10.1137/110830629. Cited on p. 77. [76] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and S M Ali Eslami. Conditional neu- ral processes. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th Interna- tional Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1704–1713, Stockholmsmässan, Stockholm Sweden, 2018. PMLR. URL http://proceedings.mlr.press/v80/garnelo18a.html. Cited on p. 86. [77] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, S M Ali Eslami, and Yee Whye Teh. Neural processes. July 2018. URLhttp://arxiv.org/ abs/1807.01622. Cited on p. 32. [78] Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. Deepmdp: Learning continuous latent space models for representation learning. In Kama- lika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2170–2179. PMLR, 2019. URLhttp://proceedings.mlr.press/v97/gelada19a.html. Cited on p. 52. 125 [79] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018. Cited on p. 78, 86, 88. [80] Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regulariza- tion of discrete gradient dynamics in linear neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, edi- tors, Advances in Neural Information Processing Systems, volume 32. Curran Asso- ciates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ f39ae9ff3a81f499230c4126e01f421b-Paper.pdf. Cited on p. 14. [81] Paul Glasserman. Monte Carlo methods in financial engineering. Springer, New York, 2004. ISBN 0387004513 9780387004518 1441918221 9781441918222. Cited on p. 81. [82] Robert Gower, Nicolas Le Roux, and Francis Bach. 
Tracking the gradients using the hes- sian: A new look at variance reducing stochastic methods. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artifi- cial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 707–715, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/gower18a.html. Cited on p. 100, 104. [83] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting Gradient-Based Meta-Learning as hierarchical bayes. January 2018. URLhttp://arxiv. org/abs/1801.08930. Cited on p. 9, 32, 86. [84] Simon Guiroy, Vikas Verma, and Christopher Pal. Towards understanding generalization in Gradient-Based Meta-Learning. arXiv preprint arXiv:1907.07287, July 2019. Cited on p. 27. [85] Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In International Conference on Learning Representations, 2021. URLhttps://openreview.net/forum?id=0oabwyZbOu. Cited on p. 51. [86] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intell. Syst., 24(2):8–12, March 2009. URL https://research.google/pubs/ pub35179/. Cited on p. 1. [87] Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. arXiv [cs.LG], March 2022. URL http://arxiv.org/abs/2203. 04955. Cited on p. 51. [88] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. Cited on p. 109, 110. [89] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. Cited on p. 39. 126 [90] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. Cited on p. 87. [91] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. November 2019. URL http://arxiv. org/abs/1911.05722. Cited on p. 66. [92] D Hendrycks, K Gimpel arXiv preprint arXiv:1606.08415, and 2016. Bridging nonlinear- ities and stochastic regularizers with gaussian error linear units. arxiv.org, 2016. URL https://arxiv.org/abs/1606.08415. Cited on p. 54, 56. [93] Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Con- ference on Machine Learning - Volume 70, ICML’17, page 1480–1490. JMLR.org, 2017. Cited on p. 52. [94] Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Vari- ance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pages 2305–2313, 2015. Cited on p. 100. [95] T Hospedales, A Antoniou, P Micaelli, and others. Meta-learning in neural networks: A survey. arXiv preprint arXiv, 2020. URLhttps://arxiv.org/abs/2004.05439. Cited on p. 80. [96] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017. Cited on p. 39. [97] Gary B Huang and Erik Learned-Miller. Labeled faces in the wild: Updates and new reporting procedures. Dept. Comput. Sci. , Univ. Massachusetts Amherst, Amherst, MA, USA, Tech. Rep, pages 14–003, 2014. URL http://vis-www.cs.umass.edu/lfw/ lfw_update.pdf. Cited on p. 40. [98] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network train- ing by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015. Cited on p. 87. [99] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: Mini-batching, aver- aging, and model misspecification. Journal of Machine Learning Research, 18(223):1–42, 2018. URLhttp://jmlr.org/papers/v18/16-595.html. Cited on p. 100, 107. [100] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315– 323, 2013. Cited on p. 100. 127 [101] Tyler B Johnson and Carlos Guestrin. Training deep models faster with robust, approximate importance sampling. In S Bengio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, and R Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7265– 7275. Curran Associates, Inc., 2018. Cited on p. 86. [102] Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. March 2018. URLhttp://arxiv.org/abs/1803. 00942. Cited on p. 77, 86. [103] Ahmed Khalifa, Philip Bontrager, Sam Earle, and Julian Togelius. Pcgrl: Procedural content generation via reinforcement learning. In Proceedings of the Sixteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, AIIDE’20. AAAI Press, 2020. ISBN 978-1-57735-849-7. Cited on p. 52. [104] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. December 2014. URLhttp://arxiv.org/abs/1412.6980. Cited on p. 56. [105] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017. Cited on p. 50, 51. [106] Augustine Kong. A note on importance sampling using standardized weights. University of Chicago, Dept. of Statistics, Tech. Rep, 348, 1992. Cited on p. 83. [107] Jan Koutník, Giuseppe Cuccu, Jürgen Schmidhuber, and Faustino Gomez. Evolving large- scale neural networks for vision-based reinforcement learning. In Proceedings of the 15th annual conference on Genetic and evolutionary computation, GECCO ’13, pages 1061– 1068, New York, NY , USA, July 2013. Association for Computing Machinery. URL https://doi.org/10.1145/2463372.2463509. Cited on p. 51. [108] A Krizhevsky and G Hinton. Learning multiple layers of features from tiny images. 2009. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1. 222.9220&rep=rep1&type=pdf. Cited on p. 34. [109] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009. URL https://www.cs.toronto.edu/~kriz/ learning-features-2009-TR.pdf. Cited on p. 109. [110] Huibert Kwakernaak. Linear optimal control systems, volume 1. Cited on p. 112. [111] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. 
A simpler approach to obtaining an o (1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012. Cited on p. 100. [112] Steinar Laenen and Luca Bertinetto. On episodes, prototypical networks, and few-shot learning. December 2020. URLhttp://arxiv.org/abs/2012.09831. Cited on p. 81, 86. 128 [113] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, Decem- ber 2015. URLhttp://dx.doi.org/10.1126/science.aab3050. Cited on p. 34. [114] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, Decem- ber 2015. Cited on p. 21. [115] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Build- ing machines that learn and think like people. Behavioral and brain sciences, 40, 2017. Cited on p. 77. [116] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. The omniglot challenge: a 3-year progress report. February 2019. URL http://arxiv.org/abs/1902.03477. Cited on p. 34. [117] Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In Hal Daumé III and Aarti Singh, editors, Pro- ceedings of the 37th International Conference on Machine Learning, volume 119 of Pro- ceedings of Machine Learning Research, pages 5639–5650. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/laskin20a.html. Cited on p. 51, 70. [118] Alessandro Lazaric. Transfer in reinforcement learning: A framework and a survey. In Marco Wiering and Martijn van Otterlo, editors, Reinforcement Learning: State-of-the-Art, pages 143–173. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. URL https:// doi.org/10.1007/978-3-642-27645-3_5. Cited on p. 51. [119] Nicolas Le Roux. Anytime tail averaging. arXiv preprint arXiv:1902.05083, 2019. Cited on p. 107, 108. [120] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012. Cited on p. 100, 101. [121] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-Learning with differentiable convex optimization. April 2019. URL http://arxiv.org/abs/ 1904.03758. Cited on p. 33, 39. [122] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019. Cited on p. 27, 81, 86. [123] Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise metric and subspace. volume 80 of Proceedings of Machine Learning Research, pages 2927–2936, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URLhttp: //proceedings.mlr.press/v80/lee18a.html. Cited on p. 32. 129 [124] Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise met- ric and subspace. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th In- ternational Conference on Machine Learning, volume 80 of Proceedings of Machine Learn- ing Research, pages 2927–2936, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URLhttp://proceedings.mlr.press/v80/lee18a.html. Cited on p. 9, 10, 11, 23, 27. [125] Lucas Lehnert, Michael L Littman, and Michael J Frank. 
Reward-predictive representations generalize across tasks in reinforcement learning. PLoS Comput. Biol., 16(10):e1008317, October 2020. URL http://dx.doi.org/10.1371/journal.pcbi.1008317. Cited on p. 52. [126] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for Few-Shot learning. July 2017. URL http://arxiv.org/abs/1707.09835. Cited on p. 32. [127] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for Few-Shot learning. arXiv preprint arXiv:1707.09835, July 2017. Cited on p. 10, 11, 23, 24, 27. [128] Moshe Lichtenstein, Prasanna Sattigeri, Rogerio Feris, Raja Giryes, and Leonid Karlin- sky. TAFSSL: Task-Adaptive feature Sub-Space learning for few-shot classification. March 2020. URLhttp://arxiv.org/abs/2003.06670. Cited on p. 33. [129] Michael Littman and Richard S Sutton. Predictive representations of state. Adv. Neural Inf. Process. Syst., 14, 2001. URLhttps://web.eecs.umich.edu/~baveja/Papers/ psr.pdf. Cited on p. 52. [130] Chenghao Liu, Zhihao Wang, Doyen Sahoo, Yuan Fang, Kun Zhang, and Steven C H Hoi. Adaptive task sampling for Meta-Learning. July 2020. URL http://arxiv.org/abs/ 2007.08735. Cited on p. 77, 87. [131] Guoqing Liu, Chuheng Zhang, Li Zhao, Tao Qin, Jinhua Zhu, Li Jian, Nenghai Yu, and Tie-Yan Liu. Return-based contrastive representation learning for reinforcement learn- ing. In International Conference on Learning Representations, 2021. URL https: //openreview.net/forum?id=_TM6rT7tXke. Cited on p. 52. [132] Jun S Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and computing, 6(2):113–119, 1996. Cited on p. 83. [133] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Sain- ing Xie. A convnet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. Cited on p. 56, 57. [134] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochas- tic gradient, newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017. Cited on p. 101. 130 [135] Dana Adriana Lups , a-T˘ ataru and Radu Lix˘ androiu. YouTube channels, subscribers, uploads and views: A multidimensional analysis of the first 1700 channels from july 2022. Sus- tain. Sci. Pract. Policy, 14(20):13112, October 2022. URL https://www.mdpi.com/ 2071-1050/14/20/13112. Cited on p. 2. [136] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y . Ng, and Christo- pher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Lin- guistics. URLhttp://www.aclweb.org/anthology/P11-1015. Cited on p. 110. [137] David J C MacKay. Information Theory, Inference & Learning Algorithms. Cambridge University Press, USA, 2002. URLhttps://dl.acm.org/citation.cfm?id=971143. Cited on p. 38. [138] Julien Mairal. Optimization with first-order surrogate functions. In Proceedings of The 30th International Conference on Machine Learning, pages 783–791, 2013. Cited on p. 100. [139] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine- Grained visual classification of aircraft. June 2013. URL http://arxiv.org/abs/ 1306.5151. Cited on p. 33. [140] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. 
Fine- grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013. Cited on p. 94. [141] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. Isaac gym: High performance GPU-Based physics simulation for robot learning. August 2021. URLhttp://arxiv.org/abs/2108.10470. Cited on p. 51. [142] Bogdan Mazoure, Remi Tachet des Combes, Thang Long Doan, Philip Bachman, and Devon Hjelm. Deep reinforcement and infomax learning. In NeurIPS 2020. ACM, De- cember 2020. URLhttps://www.microsoft.com/en-us/research/publication/ deep-reinforcement-and-infomax-learning/. Cited on p. 52. [143] Alexey N Medvedev, Renaud Lambiotte, and Jean-Charles Delvenne. The anatomy of red- dit: An overview of academic research. In Dynamics On and Of Complex Networks III, pages 183–204. Springer International Publishing, 2019. URLhttp://dx.doi.org/10. 1007/978-3-030-14683-2_9. Cited on p. 2. [144] E G Miller, N E Matsakis, and P A Viola. Learning from one example through shared densities on transforms. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662), volume 1, pages 464–471 vol.1, June 2000. URLhttp://dx.doi.org/10.1109/CVPR.2000.855856. Cited on p. 29, 31, 85. [145] George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J Miller. Introduction to WordNet: An on-line lexical database *. Int J Lexicography, 3(4): 131 235–244, 1990. URLhttps://academic.oup.com/ijl/article-lookup/doi/10. 1093/ijl/3.4.235. Cited on p. 34. [146] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural atten- tive meta-learner. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1DmUzWAW. Cited on p. 39. [147] V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Ku- maran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. URL http: //dx.doi.org/10.1038/nature14236. Cited on p. 51, 54. [148] V olodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lilli- crap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016. Cited on p. 54. [149] Jesse Mu, Percy Liang, and Noah Goodman. Shaping visual representations with lan- guage for few-shot classification. In Proceedings of the 58th Annual Meeting of the As- sociation for Computational Linguistics, pages 4823–4830, Online, July 2020. Associa- tion for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.436. URL https: //www.aclweb.org/anthology/2020.acl-main.436. Cited on p. 31. [150] Alex Nichol, Joshua Achiam, and John Schulman. On First-Order Meta-Learning algo- rithms. March 2018. URLhttp://arxiv.org/abs/1803.02999. Cited on p. 86. [151] Alex Nichol, Joshua Achiam, and John Schulman. On First-Order Meta-Learning algo- rithms. arXiv preprint arXiv:1803.02999, March 2018. Cited on p. 9, 27. [152] M Nilsback and A Zisserman. A visual vocabulary for flower classification. 
In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1447–1454, June 2006. URL http://dx.doi.org/10.1109/CVPR. 2006.42. Cited on p. 34. [153] M-E Nilsback and Andrew Zisserman. A visual vocabulary for flower classification. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1447–1454. IEEE, 2006. Cited on p. 94. [154] OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub W. Pachocki, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation. CoRR, abs/1808.00177, 2018. URLhttp://arxiv.org/abs/1808.00177. Cited on p. 112. [155] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In S Ben- gio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, and R Garnett, 132 editors, Advances in Neural Information Processing Systems 31, pages 721– 731. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/ 7352-tadam-task-dependent-adaptive-metric-for-improved-few-shot-learning. pdf. Cited on p. 29, 33, 34. [156] Boris N Oreshkin, Pau Rodriguez, and Alexandre Lacoste. Tadam: task dependent adaptive metric for improved few-shot learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 719–729, 2018. Cited on p. 87. [157] Pedro A Ortega, Jane X Wang, Mark Rowland, Tim Genewein, Zeb Kurth-Nelson, Razvan Pascanu, Nicolas Heess, Joel Veness, Alex Pritzel, Pablo Sprechmann, Siddhant M Jayaku- mar, Tom McGrath, Kevin Miller, Mohammad Azar, Ian Osband, Neil Rabinowitz, András György, Silvia Chiappa, Simon Osindero, Yee Whye Teh, Hado van Hasselt, Nando de Fre- itas, Matthew Botvinick, and Shane Legg. Meta-learning of sequential strategies. May 2019. URLhttp://arxiv.org/abs/1905.03030. Cited on p. 32. [158] Brendan O’Donoghue and Emmanuel Candes. Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics, 15(3):715–732, 2015. Cited on p. 147. [159] Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, V Koltun, and D Song. As- sessing generalization in deep reinforcement learning. ArXiv, 2018. URL https://www. semanticscholar.org/paper/caea502325b6a82b1b437c62585992609b5aa542. Cited on p. 52. [160] Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The un- surprising effectiveness of Pre-Trained vision models for control. March 2022. URL http://arxiv.org/abs/2203.03580. Cited on p. 52. [161] Eunbyung Park and Junier B Oliva. Meta-curvature. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ 57c0531e13f40b91b3b0f1a30b529a1d-Paper.pdf. Cited on p. 10, 11, 23, 24, 27. [162] Eunbyung Park and Junier B Oliva. Meta-Curvature. February 2019. URLhttp://arxiv. org/abs/1902.03356. Cited on p. 32, 86. [163] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adver- sarial reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceed- ings of the 34th International Conference on Machine Learning, volume 70 of Proceed- ings of Machine Learning Research, pages 2817–2826. PMLR, 06–11 Aug 2017. 
URL https://proceedings.mlr.press/v70/pinto17a.html. Cited on p. 52. [164] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. Cited on p. 101. 133 [165] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by av- eraging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992. Cited on p. 100. [166] Viraj Prabhu, Anitha Kannan, Murali Ravuri, Manish Chaplain, David Sontag, and Xavier Amatriain. Few-Shot learning for dermatological disease diagnosis. 106:532–552, 2019. URLhttp://proceedings.mlr.press/v106/prabhu19a.html. Cited on p. 31. [167] Hang Qi, Matthew Brown, and David G Lowe. Low-shot learning with imprinted weights. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5822–5830, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/ html/Qi_Low-Shot_Learning_With_CVPR_2018_paper.html. Cited on p. 32. [168] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or fea- ture reuse? towards understanding the effectiveness of maml. In International Conference on Learning Representations, 2019. Cited on p. 78, 80, 86, 88. [169] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or fea- ture reuse? towards understanding the effectiveness of {maml}. In International Confer- ence on Learning Representations, 2020. URL https://openreview.net/forum?id= rkgMkCEtPB. Cited on p. 10, 18, 21, 22, 26, 27, 29, 32, 46. [170] Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. Meta-Learning with implicit gradients. September 2019. URLhttp://arxiv.org/abs/1909.04630. Cited on p. 32, 86. [171] Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off- policy meta-reinforcement learning via probabilistic context variables. volume 97 of Pro- ceedings of Machine Learning Research, pages 5331–5340, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/rakelly19a. html. Cited on p. 32. [172] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017, 2017. URL https:// openreview.net/pdf?id=rJY0-Kcll. Cited on p. 32, 33. [173] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/ forum?id=rJY0-Kcll. Cited on p. 86, 87, 112. [174] Avinash Ravichandran, Rahul Bhotika, and Stefano Soatto. Few-Shot learning with em- bedded class models and Shot-Free meta training. May 2019. URLhttp://arxiv.org/ abs/1905.04398. Cited on p. 86. [175] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: An astounding baseline for recognition. In 2014 IEEE Confer- ence on Computer Vision and Pattern Recognition Workshops, pages 512–519, 2014. doi: 10.1109/CVPRW.2014.131. Cited on p. 49. 134 [176] Mengye Ren, Sachin Ravi, Eleni Triantafillou, Jake Snell, Kevin Swersky, Josh B. Tenen- baum, Hugo Larochelle, and Richard S. Zemel. Meta-learning for semi-supervised few- shot classification. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJcSzz-CZ. Cited on p. 29, 34. 
[177] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenen- baum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations, 2018. Cited on p. 87. [178] Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Stat., 22(3):400–407, September 1951. URL https://projecteuclid.org/euclid.aoms/ 1177729586. Cited on p. 77. [179] Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathe- matical Statistics, 22(3):400–407, 1951. Cited on p. 99. [180] Pau Rodríguez, Issam Laradji, Alexandre Drouin, and Alexandre Lacoste. Embedding prop- agation: Smoother manifold for few-shot classification. European Conference on Computer Vision, 2020. Cited on p. 33. [181] Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy search. In International Conference on Learning Representations, 2019. URLhttps://openreview.net/forum?id=SkxXCi0qFX. Cited on p. 27. [182] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi- heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. Cited on p. 110. [183] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi- heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3): 211–252, December 2015. URL https://doi.org/10.1007/s11263-015-0816-y. Cited on p. 34. [184] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In Interna- tional Conference on Learning Representations, 2019. URLhttps://openreview.net/ forum?id=BJgklhAcK7. Cited on p. 32. [185] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing be- tween capsules. NIPS, 2017. URL https://www.semanticscholar.org/paper/ c4c06578f4870e4b126e6837907929f3c900b99f. Cited on p. 33. [186] Mandana Samiei, Tobias Würfl, Tristan Deleu, Martin Weiss, Francis Dutil, Thomas Fevens, Geneviève Boucher, Sebastien Lemieux, and Joseph Paul Cohen. The TCGA Meta-Dataset 135 clinical benchmark. October 2019. URL http://arxiv.org/abs/1910.08636. Cited on p. 33. [187] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lilli- crap. Meta-Learning with Memory-Augmented neural networks. In Maria Florina Bal- can and Kilian Q Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1842–1850, New York, New York, USA, 2016. PMLR. URL http://proceedings. mlr.press/v48/santoro16.html. Cited on p. 32, 86. [188] Nikunj Saunshi, Yi Zhang, Mikhail Khodak, and Sanjeev Arora. A sample complexity separation between non-convex and convex meta-learning. In International Conference on Machine Learning, pages 8512–8521. PMLR, 2020. Cited on p. 14, 26, 27. [189] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bha- vana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. 
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. Cited on p. iii, 50, 51, 56.
[190] Andrew M Saxe, James L McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proc. Natl. Acad. Sci. U. S. A., 116(23):11537–11546, June 2019. URL http://dx.doi.org/10.1073/pnas.1820226116. Cited on p. 14.
[191] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. November 2015. URL http://arxiv.org/abs/1511.05952. Cited on p. 86.
[192] Juergen Schmidhuber. Evolutionary Principles in Self-Referential Learning. PhD thesis, 1987. URL http://people.idsia.ch/~juergen/diploma1987ocr.pdf. Cited on p. 31, 85.
[193] Juergen Schmidhuber. Evolutionary Principles in Self-Referential Learning. PhD thesis, 1987. Cited on p. 9.
[194] Mark Schmidt. Convergence rate of stochastic gradient with constant step size. 2014. Cited on p. 99.
[195] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems 24, 2011. Cited on p. 101.
[196] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017. Cited on p. 100.
[197] Brigit Schroeder and Yin Cui. FGVCx fungi classification challenge 2018, 2018. URL http://github.com/visipedia/fgvcx_fungi_comp. Cited on p. 34.
[198] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015. Cited on p. 112.
[199] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. July 2017. URL http://arxiv.org/abs/1707.06347. Cited on p. 54, 67.
[200] Max Schwarzer, Ankesh Anand, Rishab Goel, R Devon Hjelm, Aaron Courville, and Philip Bachman. Data-efficient reinforcement learning with self-predictive representations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=uCQfPZwRaUu. Cited on p. xv, 52, 74, 76.
[201] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In S Bengio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, and R Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 527–538. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/432aca3a1e345e339f35a30c8f65edce-Paper.pdf. Cited on p. 33.
[202] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Marco Hutter and Roland Siegwart, editors, Field and Service Robotics, pages 621–635, Cham, 2018. Springer International Publishing. ISBN 978-3-319-67361-5. Cited on p. 51.
[203] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, February 2013. ISSN 1532-4435. Cited on p. 100.
[204] Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112):1–49, 2019. URL http://jmlr.org/papers/v20/18-789.html. Cited on p. 109.
[205] Samuel Sanford Shapiro and Martin B Wilk.
An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591–611, 1965. Cited on p. 89.
[206] Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Claudia Pérez-D'Arpino, Shyamal Buch, Sanjana Srivastava, Lyne Tchapmi, Micael Tchapmi, Kent Vainio, Josiah Wong, Li Fei-Fei, and Silvio Savarese. iGibson 1.0: A simulation environment for interactive tasks in large realistic scenes. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7520–7527, 2021. doi: 10.1109/IROS51168.2021.9636667. Cited on p. 57.
[207] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016. Cited on p. 85.
[208] P J Smith, M Shafi, and Hongsheng Gao. Quick simulation: a review of importance sampling techniques in communications systems. IEEE J. Sel. Areas Commun., 15(4):597–613, May 1997. URL http://dx.doi.org/10.1109/49.585771. Cited on p. 86.
[209] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017. Cited on p. 33, 39.
[210] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4080–4090, 2017. Cited on p. 27, 78, 80, 86, 88.
[211] Milan Sulc, Lukas Picek, Jiri Matas, Thomas Jeppesen, and Jacob Heilmann-Clausen. Fungi recognition: A practical use case. In The IEEE Winter Conference on Applications of Computer Vision, pages 2316–2324, 2020. URL https://openaccess.thecvf.com/content_WACV_2020/papers/Sulc_Fungi_Recognition_A_Practical_Use_Case_WACV_2020_paper.pdf. Cited on p. 34.
[212] Qianru Sun, Yaoyao Liu, Zhaozheng Chen, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning through hard tasks. October 2019. URL http://arxiv.org/abs/1910.03648. Cited on p. 77, 86.
[213] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013. Cited on p. 101.
[214] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015. Cited on p. 39.
[215] Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Training home assistants to rearrange their habitat. In Advances in Neural Information Processing Systems (NeurIPS), 2021. Cited on p. 57.
[216] Remi Tachet des Combes, Philip Bachman, and Harm van Seijen. Learning invariances for policy generalization, 2018. URL https://openreview.net/forum?id=BJHRaK1PG. Cited on p. iii, 50, 53, 54.
[217] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. DeepMind control suite. January 2018.
URL http://arxiv.org/abs/1801.00690. Cited on p. iii, 50, 51, 55.
[218] Sebastian Thrun and Lorien Pratt, editors. Learning to Learn. Kluwer Academic Publishers, Norwell, MA, USA, 1998. URL https://dl.acm.org/citation.cfm?id=296635. Cited on p. 85.
[219] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? March 2020. URL http://arxiv.org/abs/2003.11539. Cited on p. 29, 33, 34, 46, 116.
[220] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, October 2012. URL http://dx.doi.org/10.1109/IROS.2012.6386109. Cited on p. 55.
[221] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-dataset: A dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgAGAVKPr. Cited on p. 29, 33, 34.
[222] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-dataset: A dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgAGAVKPr. Cited on p. 94.
[223] Joaquin Vanschoren. Meta-learning: A survey. October 2018. URL http://arxiv.org/abs/1810.03548. Cited on p. 32.
[224] Joaquin Vanschoren. Meta-learning. In Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, editors, Automated Machine Learning: Methods, Systems, Challenges, pages 35–61. Springer International Publishing, Cham, 2019. URL https://doi.org/10.1007/978-3-030-05318-5_2. Cited on p. 9.
[225] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017. URL https://arxiv.org/pdf/1706.03762.pdf. Cited on p. 116.
[226] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019. Cited on p. 99.
[227] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In D D Lee, M Sugiyama, U V Luxburg, I Guyon, and R Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3630–3638. Curran Associates, Inc., 2016. Cited on p. 9, 21, 29, 33.
[228] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3637–3645, 2016. Cited on p. 86, 87.
[229] Hoi-To Wai, Wei Shi, Angelia Nedic, and Anna Scaglione. Curvature-aided incremental aggregated gradient method. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 526–532. IEEE, 2017. Cited on p. 104.
[230] Che Wang, Xufang Luo, Keith Ross, and Dongsheng Li. VRL3: A data-driven framework for visual deep reinforcement learning. arXiv [cs.CV], February 2022. URL http://arxiv.org/abs/2202.10324. Cited on p. 52.
[231] Yan Wang, Wei-Lun Chao, Kilian Q Weinberger, and Laurens van der Maaten. SimpleShot: Revisiting nearest-neighbor classification for few-shot learning. November 2019. URL http://arxiv.org/abs/1911.04623. Cited on p. 33, 116.
[232] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv., 53(3):1–34, June 2020. URL https://doi.org/10.1145/3386252. Cited on p. 32, 80.
[233] Jason Wei and Kai Zou. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6381–6387, Stroudsburg, PA, USA, 2019. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D19-1670. Cited on p. 32.
[234] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD birds 200. 2010. URL http://www.vision.caltech.edu/visipedia/papers/WelinderEtal10_CUB-200.pdf. Cited on p. 33.
[235] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD birds 200. 2010. Cited on p. 94.
[236] D. H. D. West. Updating mean and variance estimates: An improved method. Commun. ACM, 22(9):532–535, September 1979. ISSN 0001-0782. doi: 10.1145/359146.359153. URL https://doi.org/10.1145/359146.359153. Cited on p. 84.
[237] Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1gX8C4YPr. Cited on p. 57, 58.
[238] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992. Cited on p. 112.
[239] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017. Cited on p. 109.
[240] Sam Witty, Jun K. Lee, Emma Tosch, Akanksha Atrey, Kaleigh Clary, Michael L. Littman, and David Jensen. Measuring and characterizing generalization in deep reinforcement learning. Applied AI Letters, 2(4):e45, 2021. doi: https://doi.org/10.1002/ail2.45. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/ail2.45. Cited on p. 51.
[241] C Wu, R Manmatha, A J Smola, and P Krähenbühl. Sampling matters in deep embedding learning. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2859–2867, October 2017. URL http://dx.doi.org/10.1109/ICCV.2017.309. Cited on p. 78, 85, 86.
[242] Sen Wu, Hongyang R Zhang, and Christopher Ré. Understanding and improving information transfer in multi-task learning. September 2019. URL https://openreview.net/pdf?id=SylzhkBtDB. Cited on p. 43.
[243] Xiaoxia Wu, Ethan Dyer, and Behnam Neyshabur. When do curricula work? December 2020. URL http://arxiv.org/abs/2012.03107. Cited on p. 87.
[244] Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1MczcgR-. Cited on p. 27.
[245] Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1MczcgR-. Cited on p. 116.
[246] Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: Real-world perception for embodied agents. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018. Cited on p. 57.
[247] Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor control. arXiv [cs.CV], March 2022. URL http://arxiv.org/abs/2203.06173. Cited on p. 52.
[248] Chen Xing, Negar Rostamzadeh, Boris Oreshkin, and Pedro O O Pinheiro. Adaptive cross-modal few-shot learning. Advances in Neural Information Processing Systems, 32:4847–4857, 2019. Cited on p. 32.
[249] Jincheng Xu and Qingfeng Du. Learning transferable features in meta-learning for few-shot text classification. Pattern Recognit. Lett., 135:271–278, July 2020. URL http://www.sciencedirect.com/science/article/pii/S016786552030177X. Cited on p. 31.
[250] Jun Yamada, Karl Pertsch, Anisha Gunjal, and Joseph J Lim. Task-induced representation learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=OzyXtIZAzFv. Cited on p. 52.
[251] Mengjiao Yang and Ofir Nachum. Representation matters: Offline pretraining for sequential decision making. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11784–11794. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/yang21h.html. Cited on p. 52.
[252] Denis Yarats, Ilya Kostrikov, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=GY6-6sTvGaf. Cited on p. 55, 71.
[253] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=_SJ-_yyes8. Cited on p. 51, 55, 56, 67, 71.
[254] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8808–8817, 2020. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Ye_Few-Shot_Learning_via_Embedding_Adaptation_With_Set-to-Set_Functions_CVPR_2020_paper.html. Cited on p. 33.
[255] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation with set-to-set functions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8808–8817, 2020. Cited on p. 78, 86, 95.
[256] Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao. Mastering Atari games with limited data. Adv. Neural Inf. Process. Syst., 34, 2021. URL https://proceedings.neurips.cc/paper/2021/file/d5eca8dc3820cad9fe56a3bafda65ca1-Paper.pdf. Cited on p. 51.
[257] Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning.
In S Bengio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, and R Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 7332–7342. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/e1021d43911ca2c1845910d84f40aeae-Paper.pdf. Cited on p. 32, 86.
[258] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems, 27:3320–3328, 2014. Cited on p. 49, 50.
[259] Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. Diverse few-shot text classification with multiple metrics. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1206–1215, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/N18-1109. Cited on p. 31.
[260] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, and Neil Houlsby. A large-scale study of representation learning with the visual task adaptation benchmark. October 2019. URL http://arxiv.org/abs/1910.04867. Cited on p. 34.
[261] Amy Zhang, Nicolas Ballas, and Joelle Pineau. A dissection of overfitting and generalization in continuous reinforcement learning. June 2018. URL http://arxiv.org/abs/1806.07937. Cited on p. 52.
[262] Amy Zhang, Rowan Thomas McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=-2FCwDKRREu. Cited on p. 52.
[263] Cheng Zhang, Cengiz Öztireli, Stephan Mandt, and Giampiero Salvi. Active mini-batch sampling using repulsive point processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5741–5748, 2019. Cited on p. 77.
[264] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. DeepEMD: Differentiable earth mover's distance for few-shot learning. March 2020. URL http://arxiv.org/abs/2003.06777. Cited on p. 33.
[265] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. DeepEMD: Few-shot image classification with differentiable earth mover's distance and structured classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12203–12213, 2020. Cited on p. 81, 86.
[266] Jian Zhang and Ioannis Mitliagkas. YellowFin and the art of momentum tuning. In SysML, 2019. Cited on p. 156, 160.
[267] Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, pages 980–988, 2013. Cited on p. 100.
[268] Wancong Zhang, Anthony GX-Chen, Vlad Sobal, Yann LeCun, and Nicolas Carion. Light-weight probing of unsupervised representations for reinforcement learning. arXiv [cs.LG], August 2022. URL http://arxiv.org/abs/2208.12345. Cited on p. 62.
[269] Peilin Zhao and Tong Zhang. Accelerating minibatch stochastic gradient descent using stratified sampling. May 2014. URL http://arxiv.org/abs/1405.3080. Cited on p. 77.
[270] Allan Zhou, Tom Knowles, and Chelsea Finn. Meta-learning symmetries by reparameterization. July 2020. URL http://arxiv.org/abs/2007.02933. Cited on p. 32.
[271] Yucan Zhou, Yu Wang, Jianfei Cai, Yu Zhou, Qinghua Hu, and Weiping Wang. Expert training: Task hardness aware meta-learning for few-shot classification. July 2020. URL http://arxiv.org/abs/2007.06240. Cited on p. 77.
[272] Luisa Zintgraf, Kyriacos Shiarli, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. Volume 97 of Proceedings of Machine Learning Research, pages 7693–7702, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/zintgraf19a.html. Cited on p. 32.

Appendices

B Proofs for Implicit Gradient Transport

B.1 Transport formula

$$\begin{aligned}
g_t(\theta_t) &= \tfrac{t}{t+1}\, g_{t-1}(\theta_t) + \tfrac{1}{t+1}\, g(\theta_t; x_t) \\
&= \tfrac{t}{t+1}\left( g_{t-1}(\theta_{t-1}) + H(\theta_t - \theta_{t-1}) \right) + \tfrac{1}{t+1}\, g(\theta_t; x_t) && \text{(Quadratic $f$)} \\
&= \tfrac{t}{t+1}\, g_{t-1}(\theta_{t-1}) + \tfrac{1}{t+1}\left( g(\theta_t; x_t) + t H(\theta_t - \theta_{t-1}) \right) \\
&= \tfrac{t}{t+1}\, g_{t-1}(\theta_{t-1}) + \tfrac{1}{t+1}\, g(\theta_t + t(\theta_t - \theta_{t-1}); x_t) && \text{(Identical Hessians)} \\
&\approx \tfrac{t}{t+1}\, \hat{g}_{t-1}(\theta_{t-1}) + \tfrac{1}{t+1}\, g(\theta_t + t(\theta_t - \theta_{t-1}); x_t). && \text{($\hat{g}_{t-1}$ is an approximation)}
\end{aligned}$$

B.2 Proof of Prop. 6.3.1

In this proof, we assume that $g$ is a strongly-convex quadratic function with Hessian $H$. At timestep $t$, we have access to a stochastic gradient $g(\theta; x_t) = g(\theta_t) + \varepsilon_t$ where the $\varepsilon_t$ are i.i.d. with covariance $C \preceq \sigma^2 H$. We first prove a simple lemma:

Lemma B.1. If $v_0 = g(\theta_0) + \varepsilon_0$ and, for $t > 0$, we have
$$v_t = \tfrac{t}{t+1}\, v_{t-1} + \tfrac{1}{t+1}\, g(\theta_t + t(\theta_t - \theta_{t-1})) + \tfrac{1}{t+1}\,\varepsilon_t,$$
then
$$v_t = g(\theta_t) + \tfrac{1}{t+1}\sum_{i=0}^{t}\varepsilon_i.$$

Proof. Per our assumption, this is true for $t = 0$. Now let us prove the result by induction. Assume this is true for $t - 1$. Then we have:
$$\begin{aligned}
v_t &= \tfrac{t}{t+1}\, v_{t-1} + \tfrac{1}{t+1}\, g(\theta_t + t(\theta_t - \theta_{t-1})) + \tfrac{1}{t+1}\,\varepsilon_t \\
&= \tfrac{t}{t+1}\, g(\theta_{t-1}) + \tfrac{1}{t+1}\sum_{i=0}^{t-1}\varepsilon_i + \tfrac{1}{t+1}\, g(\theta_t + t(\theta_t - \theta_{t-1})) + \tfrac{1}{t+1}\,\varepsilon_t && \text{(recurrence assumption)} \\
&= \tfrac{t}{t+1}\, g(\theta_{t-1}) + \tfrac{1}{t+1}\sum_{i=0}^{t-1}\varepsilon_i + g(\theta_t) - \tfrac{t}{t+1}\, g(\theta_{t-1}) + \tfrac{1}{t+1}\,\varepsilon_t && \text{($g$ is quadratic)} \\
&= g(\theta_t) + \tfrac{1}{t+1}\sum_{i=0}^{t}\varepsilon_i.
\end{aligned}$$
This concludes the proof.

Lemma B.2. Let us assume we perform the following iterative updates:
$$v_t = \tfrac{t}{t+1}\, v_{t-1} + \tfrac{1}{t+1}\, g(\theta_t + t(\theta_t - \theta_{t-1})) + \tfrac{1}{t+1}\,\varepsilon_t, \qquad \theta_{t+1} = \theta_t - \alpha v_t,$$
starting from $v_0 = g(\theta_0) + \varepsilon_0$. Then, denoting $\Delta_t = \theta_t - \theta^*$, we have
$$\Delta_t = (I - \alpha H)^t \Delta_0 - \alpha\sum_{i=0}^{t-1} N_{i,t}\,\varepsilon_i$$
with $N_{i,0} = 0$ and $N_{i,t} = (I - \alpha H)\, N_{i,t-1} + \mathbf{1}_{i<t}\,\tfrac{1}{t}\, I$.

Proof. The result is true for $t = 0$. We now prove the result for all $t$ by induction. Let us assume this is true for $t - 1$. Using Lemma B.1, we have $v_{t-1} = g(\theta_{t-1}) + \tfrac{1}{t}\sum_{i=0}^{t-1}\varepsilon_i$ and thus, using $g(\theta_{t-1}) = H\Delta_{t-1}$,
$$\begin{aligned}
\Delta_t &= \Delta_{t-1} - \alpha v_{t-1} \\
&= \Delta_{t-1} - \alpha H\Delta_{t-1} - \tfrac{\alpha}{t}\sum_{i=0}^{t-1}\varepsilon_i \\
&= (I - \alpha H)\,\Delta_{t-1} - \tfrac{\alpha}{t}\sum_{i=0}^{t-1}\varepsilon_i \\
&= (I - \alpha H)^t \Delta_0 - \alpha\sum_{i=0}^{t-2}(I - \alpha H)\, N_{i,t-1}\,\varepsilon_i - \tfrac{\alpha}{t}\sum_{i=0}^{t-1}\varepsilon_i && \text{(recurrence assumption)} \\
&= (I - \alpha H)^t \Delta_0 - \alpha\sum_{i=0}^{t-1} N_{i,t}\,\varepsilon_i
\end{aligned}$$
with $N_{i,t} = (I - \alpha H)\, N_{i,t-1} + \mathbf{1}_{i<t}\,\tfrac{1}{t}\, I$. This concludes the proof.

For the following lemma, we will assume that the Hessian is diagonal and will focus on one dimension with eigenvalue $h$. Indeed, we know that there are no interactions between the eigenspaces and that we can analyze each of them independently [158].

Lemma B.3. Denote $\rho_h = 1 - \alpha h$. We assume $\alpha \le \tfrac{1}{L}$. Then, for any $i$ and any $t$, we have
$$\begin{aligned}
N_{i,t} &\ge 0 && \text{(Positivity)} \\
N_{i,t} &= 0 \ \text{if}\ t \le i && \text{(Zero-start)} \\
N_{i,t} &\le \log\tfrac{2}{i(1-\rho_h)} \ \text{if}\ i < t \le \tfrac{2}{1-\rho_h} && \text{(Constant bound)} \\
N_{i,t} &\le \frac{\max\left\{1+\rho_h,\ 2\log\tfrac{2}{i(1-\rho_h)}\right\}}{t(1-\rho_h)} \ \text{if}\ \tfrac{2}{1-\rho_h} \le t. && \text{(Decreasing bound)}
\end{aligned}$$

Proof. The Zero-start case $t \le i$ is immediate from the recursion of Lemma B.2. The Positivity property of $N_{i,t}$ is also immediate from the recursion since the stepsize $\alpha$ is such that $\rho_h = 1 - \alpha h$ is positive.
We now turn to the Constant bound property. We have, for $t > i$,
$$N_{i,t} = \rho_h N_{i,t-1} + \tfrac{1}{t} \le N_{i,t-1} + \tfrac{1}{t}.$$
Thus, $N_{i,t} - N_{i,t-1} \le \tfrac{1}{t}$. Summing these inequalities, we get a telescopic sum and, finally:
$$N_{i,t} \le \sum_{j=i+1}^{t}\tfrac{1}{j} \le \int_{x=i}^{t}\tfrac{dx}{x} = \log\tfrac{t}{i}.$$
This bound is trivial in the case $i = 0$. In that case, we keep the first term in the sum separate and get $N_{0,t} \le 1 + \log t$. In the remainder, we shall keep the $\log\tfrac{t}{i}$ bound for simplicity.

The upper bound on the right-hand side is increasing with $t$ and its value for $t = \tfrac{2}{1-\rho_h}$ is thus an upper bound for all smaller values of $t$. Replacing $t$ with $\tfrac{2}{1-\rho_h}$ leads to
$$N_{i,\frac{2}{1-\rho_h}} \le \log\left(\frac{2}{1-\rho_h}\cdot\frac{1}{i}\right) = \log\frac{2}{i(1-\rho_h)}.$$
This proves the third inequality.

We shall now prove the Decreasing bound by induction. This bound states that, for $t$ large enough, each $N_{i,t}$ decreases as $O(1/t)$. Using the second and third inequalities, we have
$$N_{i,\frac{2}{1-\rho_h}} \le \log\frac{2}{i(1-\rho_h)} = \frac{2\log\tfrac{2}{i(1-\rho_h)}}{\tfrac{2}{1-\rho_h}(1-\rho_h)} \le \frac{\max\left\{1+\rho_h,\ 2\log\tfrac{2}{i(1-\rho_h)}\right\}}{\tfrac{2}{1-\rho_h}(1-\rho_h)}.$$
The maximum will help us prove the last property. Thus, for $t = \tfrac{2}{1-\rho_h}$, we have
$$N_{i,t} \le \frac{\max\left\{1+\rho_h,\ 2\log\tfrac{2}{i(1-\rho_h)}\right\}}{t(1-\rho_h)} = \frac{\nu_i}{t}, \qquad \text{with}\ \nu_i = \frac{\max\left\{1+\rho_h,\ 2\log\tfrac{2}{i(1-\rho_h)}\right\}}{1-\rho_h}.$$
The Decreasing bound is verified for $t = \tfrac{2}{1-\rho_h}$.

We now show that if, for any $t > \tfrac{2}{1-\rho_h}$, we have $N_{i,t-1} \le \tfrac{\nu_i}{t-1}$, then $N_{i,t} \le \tfrac{\nu_i}{t}$. Assume that there is such a $t$. Then
$$\begin{aligned}
N_{i,t} &= \rho_h N_{i,t-1} + \tfrac{1}{t} \\
&\le \rho_h\,\tfrac{\nu_i}{t-1} + \tfrac{1}{t} \\
&= \frac{\rho_h\, t\,\nu_i + t - 1}{t(t-1)} \\
&= \frac{(t-1)\nu_i + (\rho_h - 1)t\nu_i + \nu_i + t - 1}{t(t-1)} \\
&= \frac{\nu_i}{t} + \frac{(\rho_h - 1)t\nu_i + \nu_i + t - 1}{t(t-1)}.
\end{aligned}$$
We shall now prove that $(\rho_h - 1)t\nu_i + \nu_i + t - 1 = [(\rho_h - 1)\nu_i + 1]\,t + \nu_i - 1$ is negative. First, we have that
$$(\rho_h - 1)\nu_i + 1 = 1 - \max\left\{1+\rho_h,\ 2\log\tfrac{2}{i(1-\rho_h)}\right\} \le 0.$$
Then,
$$[(\rho_h - 1)\nu_i + 1]\,t + \nu_i - 1 \le 0 \iff t \ge \frac{\nu_i - 1}{(1-\rho_h)\nu_i - 1}$$
since $(\rho_h - 1)\nu_i + 1 \le 0$. Thus, the property is true for every $t \ge \frac{\nu_i - 1}{(1-\rho_h)\nu_i - 1}$. In addition, we have
$$\nu_i \ge \frac{1+\rho_h}{1-\rho_h} \implies \nu_i(1-\rho_h) \ge 1+\rho_h \implies 2\nu_i(1-\rho_h) - 2 \ge \nu_i(1-\rho_h) - (1-\rho_h) \implies \frac{2}{1-\rho_h} \ge \frac{\nu_i - 1}{\nu_i(1-\rho_h) - 1},$$
and the property is also true for every $t \ge \tfrac{2}{1-\rho_h}$. This concludes the proof.

Finally, we can prove Proposition 6.3.1:

Proof. The expectation of $\Delta_t$ is immediate using Lemma B.2 and the fact that the $\varepsilon_i$ are independent, zero-mean noises. The variance is equal to $\mathbb{V}[\Delta_t] = \alpha^2 B\sum_{i=0}^{t} N_{i,t}^2$. While our analysis was only along one eigenspace of the Hessian with associated eigenvalue $h$, we must now sum over all dimensions. We will thus define (with $\mu = L/\kappa$ the smallest eigenvalue of $H$)
$$\bar{\nu}_i = \frac{\max\left\{2 - \alpha\mu,\ 2\log\tfrac{2}{i\alpha\mu}\right\}}{\alpha\mu}\ \text{for}\ i > 0, \qquad \bar{\nu}_0 = \frac{2 + 2\log\tfrac{2}{\alpha\mu}}{\alpha\mu},$$
which is, for every $i$, the maximum $\nu_i$ across all dimensions. We get
$$\mathbb{V}[\Delta_t] \le d\alpha^2 B\sum_{i=0}^{t}\frac{\bar{\nu}_i^2}{t^2} \le d\alpha^2 B\sum_{i=0}^{t}\frac{\bar{\nu}_0^2}{t^2}\ \ (\text{since}\ \bar{\nu}_i \ge \bar{\nu}_{i+1}\ \forall i) \le \frac{d\alpha^2 B\,\bar{\nu}_0^2}{t}.$$
Since we have $\mathbb{E}[\theta_t - \theta^*] = (I - \alpha H)^t(\theta_0 - \theta^*)$, we get
$$\begin{aligned}
\mathbb{E}[\|\theta_t - \theta^*\|^2] &= \|\mathbb{E}[\theta_t - \theta^*]\|^2 + \mathbb{V}[\Delta_t] \\
&\le (\theta_0 - \theta^*)^\top (I - \alpha H)^{2t}(\theta_0 - \theta^*) + \frac{d\alpha^2 B\,\bar{\nu}_0^2}{t} \\
&\le \left(1 - \tfrac{1}{\kappa}\right)^{2t}\|\theta_0 - \theta^*\|^2 + \frac{d\alpha^2 B\,\bar{\nu}_0^2}{t}.
\end{aligned}$$
This concludes the proof.

B.3 Proof of Proposition 6.3.2 and Proposition 6.3.3

In this section we list and prove all lemmas used in the proofs of Proposition 6.3.2 and Proposition 6.3.3; all lemmas are stated in the same conditions as the proposition. We start with the following proposition:

Proposition B.4. Let $f$ be a quadratic function with positive definite Hessian $H$ with largest eigenvalue $L$ and condition number $\kappa$, and let the stochastic gradients satisfy $g(\theta; x) = g(\theta) + \varepsilon$ with $\varepsilon$ a random uncorrelated noise with covariance bounded by $BI$. Then, Algorithm 4 leads to iterates $\theta_t$ satisfying
$$\mathbb{E}[\theta_t - \theta^*] = \begin{pmatrix} I & 0 \end{pmatrix} A^t \begin{pmatrix} \mathbb{E}[\theta_1 - \theta^*] \\ \mathbb{E}[\theta_0 - \theta^*] \end{pmatrix} \tag{B.1}$$
where
$$A = \begin{pmatrix} I - \alpha H + \mu I & -\mu I \\ I & 0 \end{pmatrix} \tag{B.2}$$
governs the dynamics of this bias. In particular, when its spectral radius $\rho(A)$ is less than 1, the iterates converge linearly to $\theta^*$. In a similar fashion, the variance dynamics of Heavyball-IGT are governed by the matrix
$$D_i = \begin{pmatrix} (1 - \alpha h_i + \mu)^2 + 2\alpha^2 h_i^2 & \mu^2 & -2\mu(1 - \alpha h_i + \mu) \\ 1 & 0 & 0 \\ 1 - \alpha h_i + \mu & 0 & -\mu \end{pmatrix}.$$
In particular, when its spectral radius,r(A) is less than 1, the iterates converge linearly toq . In a similar fashion, the variance dynamics of Heavyball-IGT are governed by the matrix D i = 0 B B B B @ (1ah i +m) 2 + 2a 2 h 2 i m 2 2m(1ah i +m) 2 1 0 0 1ah i +m 0 m 1 C C C C A 152 If the spectral radius of D i ,r(D i ), is strictly less than 1 or all i, then there exist constants t 0 > 0 and C> 0 for which Var(q t ) 2a 2 dBC log(t) t ; for t> t 0 where B is a bound on the variance of noise variablese i . Lemma B.5 (IGT estimator as true gradient plus noise average). If v 0 = g(q 0 )+e 0 and for t> 0 we have v t = t t+ 1 v t1 + 1 t+ 1 g(q t +t(q t q t1 ))+ 1 t+ 1 e t ; then v t = g(q t )+ 1 t+ 1 t å i=0 e i : This lemma is already proved in the previous section for the IGT estimator (Lemma B.1) and is just repeated here for completeness. We will use this result in the next few lemmas. Lemma B.6 (The IGT gradient estimator is unbiased on quadratics). For the IGT gradient estima- tor, v t , corresponding to parametersq t we have E [v t ]= g( E q t ); where the expectation is over all gradient noise vectorse 0 ;e 1 ;:::;e t . Proof. The proof proceeds by induction. The base case holds as we have E [v 0 ]= E [g 0 +e 0 ]= g(q 0 ): For the inductive case, we can write 153 E[v t ]=E t t+ 1 v t1 + 1 t+ 1 ˆ g(q t +t(q t q t1 )) =E t t+ 1 v t1 + 1 t+ 1 g t + t t+ 1 g t t t+ 1 g t1 + 1 t+ 1 e t = t t+ 1 E[v t1 g t1 ]+E[g t ]+ t t+ 1 E[e t ] =E[g t ]= g(E[q t ]): Where, in the third equality,E[v t1 g t1 ]= 0 by the inductive assumption, and the last equal- ity because the gradient of a quadratic function is linear. Lemma B.7 (Bounding the IGT gradient variance). Let v t be the IGT gradient estimator. Then Var[v t ] 2h 2 Var[q t q ? ]+ 2B t ; where B is the variance of the homoscedastic noisee t . Proof. Var[v t ]= Var " g t + 1 t+ 1 t å i=0 e i # = Var[hq t ]+ Var " 1 t+ 1 t å i=0 e i # + 2Cov " hq t ; 1 t+ 1 t å i=0 e i # 2Var[hq t ]+ 2Var " 1 t+ 1 t å i=0 e i # = 2h 2 Var[q t q ? ]+ 2 B t 154 Now that we have these basic results on the IGT estimator, we can analyze the evolution of the bias and variance of Heavyball-IGT. We use the quadratic assumption to decouple the vector dynamics of Heavyball-IGT into independent scalar dynamics. If the Hessian, H, has eigenvalues L h 1 h 2 ::: h n = L=k, then we can assume without loss of generality that H is diagonal with H ii = h i . Lemma B.8 (Evolution of bias for scalar quadratic). Assume that the Hessian, second derivative, is h. Starting with v 0 = g(q 0 )+e 0 and w 0 = 0, performing the following iterative updates (Heavyball- IGT, Algorithm 4): v t = t t+ 1 v t1 + 1 t+ 1 g(q+t(q t q t1 ))+ 1 t+ 1 e t ; w t+1 =mw t +av t ; q t+1 =q t w t+1 results in D t = A t D 0 a t1 å i=0 N i;t 2 6 4 e i 0 3 7 5 where N j;0 = 0 22 ; N i;t = AN i;t1 + 1 i<t 1 t I, D t = 2 6 4 q t q q t1 q 3 7 5 and A= 0 B @ 1ah+m m 1 0 1 C A : Proof. The proof proceeds by induction. First notice that for t = 0 the equality naturally holds. We make the inductive assumption that it holds for t 1, and start by using Lemma B.5: 155 D t = AD t1 a t t1 å i=0 2 6 4 e i 0 3 7 5 = A(A t1 D 0 a t2 å i=0 N i;t 2 6 4 e i 0 3 7 5 ) a t t1 å i=0 2 6 4 e i 0 3 7 5 (Inductive assumption) = A t D 0 a( t2 å i=0 AN i;t 2 6 4 e i 0 3 7 5 + 1 t t1 å i=0 2 6 4 e i 0 3 7 5 ) = A t D 0 a t1 å i=0 N i;t 2 6 4 e i 0 3 7 5 (Def. of N i;t ) Lemma B.9 (Evolution of variance). 
Let $U_t = \mathrm{Var}[\theta_t]$ and $V_t = \mathrm{Cov}[\theta_t, \theta_{t-1}]$, where $\theta_t$ is the $t$-th iterate of Heavyball-IGT on a 1-dimensional quadratic function with curvature $h$. The following matrix describes the variance dynamics of Heavyball-IGT:
$$D = \begin{pmatrix}(1 - \alpha h + \mu)^2 + 2\alpha^2 h^2 & \mu^2 & -2\mu(1 - \alpha h + \mu) \\ 1 & 0 & 0 \\ 1 - \alpha h + \mu & 0 & -\mu\end{pmatrix} \tag{B.3}$$
If the spectral radius of $D$, $\rho(D)$, is strictly less than 1, then there exist constants $t_0 > 0$ and $C > 0$ for which $\mathrm{Var}(\theta_t) \le 2\alpha^2 BC\,\frac{\log(t)}{t}$, where $B$ is a bound on the variance of the noise.

Proof. The proof (and lemma) is similar to the proof of Lemma 9 in [266]. Writing $\bar{\theta}_t = \mathbb{E}[\theta_t]$ and $\bar{g}_t = \mathbb{E}[g_t]$, we start by expanding $U_{t+1}$ as follows.
$$\begin{aligned}
U_{t+1} &= \mathbb{E}\left[(\theta_{t+1} - \bar{\theta}_{t+1})^2\right] \\
&= \mathbb{E}\left[\left(\theta_t - \alpha v_t + \mu(\theta_t - \theta_{t-1}) - \bar{\theta}_t + \alpha\bar{g}_t - \mu(\bar{\theta}_t - \bar{\theta}_{t-1})\right)^2\right] \\
&= \mathbb{E}\left[\left(\theta_t - \alpha g_t + \mu(\theta_t - \theta_{t-1}) - \bar{\theta}_t + \alpha\bar{g}_t - \mu(\bar{\theta}_t - \bar{\theta}_{t-1}) + \alpha(g_t - v_t)\right)^2\right] \\
&= \mathbb{E}\left[\left((1 - \alpha h + \mu)(\theta_t - \bar{\theta}_t) - \mu(\theta_{t-1} - \bar{\theta}_{t-1})\right)^2\right] + \alpha^2\,\mathbb{E}\left[(g_t - v_t)^2\right] \\
&\le \mathbb{E}\left[\left((1 - \alpha h + \mu)(\theta_t - \bar{\theta}_t) - \mu(\theta_{t-1} - \bar{\theta}_{t-1})\right)^2\right] + \alpha^2\left(2h^2\,\mathbb{E}\left[(\theta_t - \bar{\theta}_t)^2\right] + \frac{2B}{t+1}\right) \\
&\le \left[(1 - \alpha h + \mu)^2 + 2\alpha^2 h^2\right]\mathbb{E}\left[(\theta_t - \bar{\theta}_t)^2\right] - 2\mu(1 - \alpha h + \mu)\,\mathbb{E}\left[(\theta_t - \bar{\theta}_t)(\theta_{t-1} - \bar{\theta}_{t-1})\right] \\
&\qquad + \mu^2\,\mathbb{E}\left[(\theta_{t-1} - \bar{\theta}_{t-1})^2\right] + \alpha^2\,\frac{2B}{t+1},
\end{aligned}$$
where the fourth equality is obtained since we know that the IGT gradient estimator is unbiased, i.e. $\mathbb{E}[g_t - v_t] = 0$. The first inequality stems from Lemma B.7. We similarly expand $V_t$:
$$\begin{aligned}
V_t &= \mathbb{E}\left[(\theta_t - \bar{\theta}_t)(\theta_{t-1} - \bar{\theta}_{t-1})\right] \\
&= \mathbb{E}\left[\left((1 - \alpha h + \mu)(\theta_{t-1} - \bar{\theta}_{t-1}) - \mu(\theta_{t-2} - \bar{\theta}_{t-2}) + \alpha(g_{t-1} - v_{t-1})\right)(\theta_{t-1} - \bar{\theta}_{t-1})\right] \\
&= (1 - \alpha h + \mu)\,\mathbb{E}\left[(\theta_{t-1} - \bar{\theta}_{t-1})^2\right] - \mu\,\mathbb{E}\left[(\theta_{t-1} - \bar{\theta}_{t-1})(\theta_{t-2} - \bar{\theta}_{t-2})\right].
\end{aligned}$$
From the above expressions, we obtain
$$\begin{pmatrix}U_{t+1} \\ U_t \\ V_{t+1}\end{pmatrix} \preceq D\begin{pmatrix}U_t \\ U_{t-1} \\ V_t\end{pmatrix} + \begin{pmatrix}\alpha^2\frac{2B}{t+1} \\ 0 \\ 0\end{pmatrix} \preceq 2\alpha^2 B\sum_{i=0}^{t} D^i\begin{pmatrix}\frac{1}{t+1-i} \\ 0 \\ 0\end{pmatrix} = 2\alpha^2 B\left(\sum_{i=0}^{s-1} D^i\begin{pmatrix}\frac{1}{t+1-i} \\ 0 \\ 0\end{pmatrix} + \sum_{i=s}^{t} D^i\begin{pmatrix}\frac{1}{t+1-i} \\ 0 \\ 0\end{pmatrix}\right)$$
where an inequality of vectors implies the corresponding elementwise inequalities. If the spectral radius of $D$, $\rho(D)$, is strictly less than 1, then there exists a constant $C' > 0$ such that
$$\begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix}^\top\sum_{i=0}^{s-1} D^i\begin{pmatrix}\frac{1}{t+1-i} \\ 0 \\ 0\end{pmatrix} \le C'\sum_{i=0}^{s-1}\frac{1}{t+1-i} \le \frac{C'\, s}{t+2-s}.$$
If the spectral radius of $D$, $\rho(D)$, is strictly less than 1, then there exist a constant $\zeta > 0$ and a constant $C''(\zeta) > 0$ such that $\rho(D) + \zeta < 1$ and
$$\begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix}^\top\sum_{i=s}^{t} D^i\begin{pmatrix}\frac{1}{t+1-i} \\ 0 \\ 0\end{pmatrix} \le \begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix}^\top\sum_{i=s}^{t} D^i\begin{pmatrix}1 \\ 0 \\ 0\end{pmatrix} \le C''\sum_{i=s}^{t}(\rho(D) + \zeta)^s = C''\,(t - s + 1)(\rho(D) + \zeta)^s.$$
Let $\rho' = \rho(D) + \zeta$ and $s = \lceil 2\log_{1/\rho'} t\rceil$. Then $(\rho(D) + \zeta)^s = 1/t^2$, and putting the above two bounds together,
$$U_{t+1} \le 2\alpha^2 B\left(\frac{2C'\log_{1/\rho'} t}{t + 2 - 2\log_{1/\rho'} t} + C''\,\frac{t - 2\log_{1/\rho'} t + 1}{t^2}\right) \le 2\alpha^2 BC\,\frac{\log(t+1)}{t+1},$$
where the last inequality holds for $t > t_0$ for some $t_0$ and some constant $C > 0$.

We can now prove Proposition B.4.

Proof of Proposition B.4. The bias statement of the proposition follows directly from taking an expectation on the bound of Lemma B.8, and the variance statement from summing up the $d$ different variance terms given for each scalar component by Lemma B.9.

B.3.1 Proof of Proposition 6.3.2

This proposition follows from the observation that, in the noiseless case, $\varepsilon_t = 0$ in our model. In that case, Lemma B.6 shows that Heavyball-IGT reduces to the heavy ball, and the rest follows from the optimal tuning of the heavy ball [266].

B.3.2 Proof of Proposition 6.3.3

Proof. Like we did in previous proofs, we can assume without loss of generality that the Hessian, $H$, is diagonal with elements $h_i$.
For a diagonal $H$, matrix $A$ can be permuted to be block diagonal with blocks
$$A_i = \begin{pmatrix}1 - \alpha h_i + \mu & -\mu \\ 1 & 0\end{pmatrix}.$$
To prove that $\rho(A) < 1$ it suffices to prove that $\rho(A_i) < 1$ for all $i$. For the rest of the proof we will focus on the dynamics along a single eigendirection with curvature $h_i$. The rest of this proof uses $D$ to denote $D_i$, $A$ to denote $A_i$ and $h$ to denote $h_i$. To make explicit the dependence of matrices $A$ and $D$ on hyperparameters and curvature, we write $A(\alpha,\mu,h)$ and $D(\alpha,\mu,h)$.

Let $\mu_0 = 0$ and suppose that there exists $\mu > 0$ such that $\rho(A(\alpha,\mu,h)) < 1$, and the spectral radius of $D$ is $\rho(D(\alpha,\mu,h)) < 1$. Then the previous lemma implies that the bias converges linearly, and the variance is $O(\log(t)/t)$. To argue the existence of such a $\mu > 0$, we will perform eigenvalue perturbation analysis using the Bauer-Fike theorem. Note that $A(\alpha,\mu,h) = A(\alpha,\mu_0,h) + \mu\,\Delta_A$ where
$$\Delta_A = \begin{pmatrix}1 & -1 \\ 0 & 0\end{pmatrix}.$$
Similarly, $D(\alpha,\mu,h) \approx D(\alpha,\mu_0,h) + \mu\,\Delta_D$ where
$$\Delta_D = \begin{pmatrix}2(1-\alpha h) & 0 & -2(1-\alpha h) \\ 0 & 0 & 0 \\ 1 & 0 & -1\end{pmatrix}.$$
This last approximate equality is a first-order approximation, in the sense that we are working with arbitrarily small, positive values of $\mu$, and we have kept terms linear in $\mu$ but ignored higher-order terms, like $\mu^2$.

We will apply the Bauer-Fike theorem to bound the eigenvalues of $D(\alpha,\mu,h)$. Consider the eigendecomposition $D(\alpha,\mu_0,h) = V\Lambda V^{-1}$. We can compute
$$V = \begin{pmatrix}0 & 0 & \frac{1 - 2\alpha h + 3\alpha^2 h^2}{1 - \alpha h} \\ 0 & 1 & \frac{1}{1 - \alpha h} \\ 1 & 0 & 1\end{pmatrix} \quad \text{and} \quad V^{-1} = \begin{pmatrix}-\frac{1 - \alpha h}{1 - 2\alpha h + 3\alpha^2 h^2} & 0 & 1 \\ -\frac{1}{1 - 2\alpha h + 3\alpha^2 h^2} & 1 & 0 \\ \frac{1 - \alpha h}{1 - 2\alpha h + 3\alpha^2 h^2} & 0 & 0\end{pmatrix}.$$
Note that because we assume $\alpha < \frac{2}{3h}$ we get $1 - \alpha h > 0$. Also, $1 - 2\alpha h + 3\alpha^2 h^2 > 0$ regardless of the choice of hyperparameters. This means that matrices $V$ and $V^{-1}$ are non-singular and of finite norm. The norm of $\Delta_D$ is also finite. The Bauer-Fike theorem states that, if $\nu$ is an eigenvalue of $D(\alpha,\mu_0,h)$, then there exists an eigenvalue $\lambda$ of $D(\alpha,\mu,h)$ such that
$$|\lambda - \nu| \le \|V\|_p\,\|V^{-1}\|_p\,\|\mu\,\Delta_D\|_p,$$
for any $p$-norm. Since by construction $|\nu| \le \rho(D(\alpha,\mu_0,h)) < 1$, the above means that there exists a sufficiently small, but strictly positive, value of $\mu$ such that $|\lambda| < 1$. By repeating this argument for all pairs of eigenvalues, we get the stated result. The same argument can be repeated to prove the existence of a strictly positive $\mu$ such that $\rho(A(\alpha,\mu,h)) < 1$.
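As a concrete companion to the recursions analyzed in this appendix, the following is a minimal NumPy sketch of the Heavyball-IGT updates from Lemma B.8 run on a synthetic noisy quadratic. It is an illustrative sketch only, not the thesis's reference implementation: the test problem, the noise scale, and the values of the step size and momentum below are assumptions chosen for demonstration.

```python
# Minimal sketch of Heavyball-IGT on a noisy quadratic objective.
# Illustrative assumptions: the Hessian spectrum, sigma, alpha, and mu are
# demonstration values, not tuned settings from the experiments.
import numpy as np

rng = np.random.default_rng(0)
d = 10
H = np.diag(np.linspace(0.1, 1.0, d))   # diagonal Hessian, largest eigenvalue L = 1
theta_star = np.zeros(d)                # minimizer of f(theta) = 1/2 (theta - theta*)' H (theta - theta*)
sigma = 0.1                             # noise scale; Var[eps] = sigma^2 I, i.e. bounded by B I with B = sigma^2

def stoch_grad(theta):
    """Stochastic gradient g(theta) + eps with i.i.d. zero-mean noise."""
    return H @ (theta - theta_star) + sigma * rng.standard_normal(d)

alpha, mu = 0.5, 0.1                    # step size (below 2/(3L)) and a small momentum
theta = rng.standard_normal(d)          # theta_0
theta_prev = theta.copy()
v = stoch_grad(theta)                   # v_0 = g(theta_0) + eps_0
w = np.zeros(d)                         # w_0 = 0

for t in range(1, 2001):
    # Heavy-ball step on the IGT estimate: w_t = mu w_{t-1} + alpha v_{t-1},
    # then theta_t = theta_{t-1} - w_t.
    w = mu * w + alpha * v
    theta_prev, theta = theta, theta - w
    # IGT update: evaluate the gradient at the transported point
    # theta_t + t (theta_t - theta_{t-1}) and average it with weight 1/(t+1).
    transported = theta + t * (theta - theta_prev)
    v = (t / (t + 1)) * v + (1 / (t + 1)) * stoch_grad(transported)

print("final squared error:", float(np.sum((theta - theta_star) ** 2)))
```

For these eigenvalues, $\rho(A(\alpha,\mu,h)) < 1$ along every eigendirection, so the bias contracts linearly, while the $\tfrac{1}{t+1}$ averaging of Lemma B.5 is what drives down the variance of the gradient estimate over the run.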
Abstract
The success of modern machine learning (ML) stems from the unreasonable effectiveness of large data. But what about niche tasks with limited data? Some methods are able to quickly solve those tasks by first pretraining ML models on many generic tasks in a way that lets them quickly adapt to unseen new tasks. Those methods are known to "learn how to learn" and thus fall under the umbrella of meta-learning. While meta-learning can be successful, the inductive biases that enable fast adaptation remain poorly understood.
This thesis takes a first step towards an understanding of meta-learning and reveals a set of guidelines which help design novel and improved methods for fast adaptation. Our core contribution is a study of the solutions found by meta-learning. We uncover the working principles that let them adapt so quickly: their parameters partition into three groups, one to compute task-agnostic features, another for task-specific features, and a third that accelerates adaptation to new tasks. Building on those insights, we introduce several methods to drastically speed up adaptation, which we validate on a wide range of architectures, methods, and datasets.
We hope this thesis can inspire future work on quickly and effectively solving real-world tasks.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Algorithms and systems for continual robot learning
Efficiently learning human preferences for proactive robot assistance in assembly tasks
Identifying and leveraging structure in complex cooperative tasks for multi-agent reinforcement learning
Leveraging prior experience for scalable transfer in robot learning
Leveraging cross-task transfer in sequential decision problems
Towards learning generalization
Advancing robot autonomy for long-horizon tasks
Planning and learning for long-horizon collaborative manipulation tasks
Towards understanding language in perception and embodiment
Modeling, learning, and leveraging similarity
Modeling dyadic synchrony with heterogeneous data: validation in infant-mother and infant-robot interactions
High-throughput methods for simulation and deep reinforcement learning
Practice-inspired trust models and mechanisms for differential privacy
Visual representation learning with structural prior
On virtual, augmented, and mixed reality for socially assistive robotics
Towards socially assistive robot support methods for physical activity behavior change
Closing the reality gap via simulation-based inference and control
Federated and distributed machine learning at scale: from systems to algorithms to applications
Program-guided framework for interpreting and acquiring complex skills with learning robots
AI-driven experimental design for learning of process parameter models for robotic processing applications
Asset Metadata
Creator
Arnold, Sébastien Marc Renato (author)
Core Title
Quickly solving new tasks, with meta-learning and without
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2023-05
Publication Date
05/05/2023
Defense Date
12/05/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
computer vision,fast adaptation,few-shot learning,machine learning,meta-learning,multi-task learning,OAI-PMH Harvest,optimization,robotics,solving new tasks,transfer learning
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Matarić, Maja (committee chair), Sha, Fei (committee chair), Avestimehr, Salman (committee member), Nikolaidis, Stefanos (committee member), Thomason, Jesse (committee member)
Creator Email
smr.arnold@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113099966
Unique identifier
UC113099966
Identifier
etd-ArnoldSbas-11784.pdf (filename)
Legacy Identifier
etd-ArnoldSbas-11784
Document Type
Dissertation
Format
theses (aat)
Rights
Arnold, Sébastien Marc Renato
Internet Media Type
application/pdf
Type
texts
Source
20230505-usctheses-batch-1038 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu